NLP Notes PDF
Document Details

Uploaded by AdaptableChocolate5302
Summary
These notes cover various topics in Natural Language Processing (NLP), ranging from basic concepts like stemming and lemmatization to more advanced techniques like language models and speech recognition. The notes provide definitions, examples, and detailed explanations of different NLP methods.
Full Transcript
## TOPIC 4
- **Stemming** - Process of stripping off affixes to find the basic morphological structure. Reduces the word to a stem or root form.
  - **Examples**: relational → relate, motoring → motor
  - **Stem** - untouchables → untouchable
  - **Base root/stem** - boxes → box
- **Porter Algorithm** (for stemming) - lexicon-free FST stemmer. It uses rewrite rules to transform words.
  - **Examples**:
    - **ational** → **ate** - relational → relate, demonstrational → demonstrate
    - **ing** (removed if the remaining stem contains a vowel) - motoring → motor, monitoring → monitor
    - **ing** (removed even if the remaining stem has fewer than 4 letters) - buying → buy
    - **ing** → **e** (the stem regains its final e) - making → make, mutating → mutate
- **Errors in stemming** (examples are written as word → stemmer output (expected form))
  - **Commission** (inc affix) → FP: the stemmer strips too much and lands on an unrelated word.
    - **Examples**: doing → doe (do), generalization → generic (general), numericals → numerous (numeric), policy → police (policy), organization → organ (organize)
  - **Omission** (exc affix) → FN: the stemmer fails to reduce a word to the form shared with its relatives.
    - **Examples**: European → european (europe), matrices → matric (matrix), noisy → noisi (noise), urgency → urgenc (urgent)
  - **Understemming** (2 related words not stemmed to the same root) → FN
    - **Examples**: adheres → adhere, adhesion → adhes (both should give adhere)
  - **Overstemming** (2 unrelated words stemmed to the same root) → FP
    - **Examples**: numerous → numer (number), numerical → numer (number)
- **Lemma** - canonical/dictionary form
- **Lemmatization** - groups the inflected forms of a word under its lemma, using word sense/meaning and context.
  - **Examples**:
    - pay - paying, paid, pays
    - be - is, was, am
  - **Tools**: `nltk.stem`, WordNet Lemmatizer
- **Word segmentation** - Process of segmenting/tokenizing text into words.
- **Spelling error detection**
  - **Abbreviations**: ED = error detection, EC = error correction
  - **Non-word ED** (spelling errors)
  - **Isolated-word EC**
  - **Context-dependent EC**
  - To correct spelling errors, use a distance metric (choose the most likely word from all possible words).
  - **Minimum edit distance** - non-probabilistic method to find the closest spelling.
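
The minimum edit distance idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: insertion, deletion, and substitution each cost 1 (some textbooks charge 2 for a substitution), and `closest_word` together with its small vocabulary is a hypothetical helper, not something from the notes.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Edit distance with unit costs (assumption: insert = delete = substitute = 1)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


def closest_word(word: str, vocabulary: list[str]) -> str:
    """Hypothetical helper: pick the vocabulary entry nearest to `word`."""
    return min(vocabulary, key=lambda candidate: min_edit_distance(word, candidate))


if __name__ == "__main__":
    print(min_edit_distance("intention", "execution"))
    print(closest_word("graffe", ["graft", "grail", "giraffe"]))
```

Because the method is non-probabilistic, ties between equally distant candidates are broken only by vocabulary order here; a context-dependent corrector would need extra information, such as word frequencies, to choose between them.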
## TOPIC 5
- **Language model** - statistical model that assigns probabilities to sequences of words.
  - **Uses**: speech recognition, OCR & handwriting recognition, machine translation, spelling error detection, augmentative communication, text summarization, image captioning, etc.
- **N-gram models** - assign probabilities to sequences of tokens, based on the history of previous tokens.
  - **Examples**: unigram, bigram, trigram, n-gram
- **Word prediction** - difficult because the input is noisy/ambiguous, so look at the previous words to guess the next word.
- **Unigram Language model** - does not use any history; each word's probability is independent of the preceding words.
  - **Examples**:
    - `<s> my hometown is in Jordan </s>`
    - `<s> i am a computer science student. </s>`
    - `<s> i studied in Gombak. </s>`

## TOPIC 6
- **Types of ML**
  - **Supervised L** - classification
  - **Unsupervised L** - clustering
  - **Semi-supervised L**
- **Sklearn** - better suited to ML than NLTK: floating-point support, supports most ML algorithms, easy to install, handles large datasets.
- **Data Preparation**
  - **Vectorization** (turn text into numerical vectors) → use tf-idf
  - **Tf-idf** (sklearn `TfidfVectorizer`) - Results in a matrix: each row is a document, each column is a word, and every cell holds the tf-idf score of that word in that document. Used to evaluate how important a term is to a document.
    - **Application**: Information Retrieval and Text Mining
- **ML Application w/ NLP**
  - **Personal productivity assistant** (Cortana)
  - **Language translator**
  - **Voice assistant**
  - **Recommendation system**
  - **Self-driving car**
- **Deep Learning NLP**
  - **Transformers** - NN used to help discern word context (e.g. Google BERT)
  - **Word Embeddings** - dense vector representations of words, learned by predicting a word from its context (e.g. word2vec, fastText)
- **Data preprocessing**
  - **Remove stop words**
  - **Remove unnecessary punctuation**
  - **Normalization** (convert to a single case, upper or lower)
  - **Tokenization** (split sentences into words)
  - **Stemming** (strip off affixes)
- **Tf** - measures how frequently a term occurs in a document.
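  - $tf(t) = \frac{\text{number of occurrences of term } t \text{ in the document}}{\text{total number of terms in the document}}$
- **Idf** - measures how rare, and therefore how informative, a term is across the whole document collection.
  - $idf(t) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing term } t}$
- **Tf-idf** - $\text{tf-idf}(t) = tf(t) \times idf(t)$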
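
As a rough worked example of the three formulas, the sketch below computes tf, idf, and tf-idf by hand on a tiny corpus. The corpus, the function names, and the choice to return 0.0 for a term that appears in no document are all assumptions made for illustration. sklearn's `TfidfVectorizer` applies a smoothed idf and normalizes each row by default, so its scores will not match these numbers exactly.

```python
import math
from collections import Counter


def tf(term: str, doc_tokens: list[str]) -> float:
    """Occurrences of `term` divided by the total number of terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term: str, corpus: list[list[str]]) -> float:
    """Natural log of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen term gets score 0 instead of an error
    return math.log(len(corpus) / containing)


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """Product of the two scores above."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus, already lowercased and tokenized.
corpus = [
    "i studied in gombak".split(),
    "i am a computer science student".split(),
    "my hometown is in jordan".split(),
]

if __name__ == "__main__":
    for term in ("i", "gombak"):
        print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "gombak" scores higher than "i": both occur once, but "i" appears in two of the three documents, so its idf is smaller.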