NLP Notes PDF

## TOPIC 4

- **Stemming** - Process of stripping off affixes to find the basic morphological structure. Reduces the word to a stem or root form.
  - **Examples**: relational → relate, motoring → motor
  - **Stem**: untouchables → untouchable
  - **Base root/stem**: boxes → box
- **Porter Algorithm** (for stemming) - lexicon-free FST stemmer. It uses rewrite rules to transform words.
  - **Examples**:
    - **ational → ate**: relational → relate, demonstrational → demonstrate
    - **ing** removed (if the stem contains a vowel): motoring → motor, monitoring → monitor
    - **ing** removed (if the stem is less than 4 letters): buying → buy
    - **ing → e** (if the remaining stem is a short word): making → make, mutating → mutate
- **Errors in stemming** (format: word → stemmer output (correct stem))
  - **Commission** (the stemmer strips or rewrites something it should not) → FP
    - **Examples**:
      - doing → doe (do)
      - generalization → generic (general)
      - numerical → numerous (numeric)
      - policy → police (policy)
      - organization → organ (organize)
  - **Omission** (the stemmer fails to reduce a word to its correct stem) → FN
    - **Examples**:
      - European → european (europe)
      - matrices → matric (matrix)
      - noisy → noisi (noise)
      - urgency → urgenc (urgent)
  - **Understemming** (2 related words are not stemmed to the same root) → FN
    - **Examples**:
      - adheres → adhere, adhesion → adhes (adhere)
  - **Overstemming** (2 different words are stemmed to the same root) → FP
    - **Examples**:
      - numerous → numer (number), numerical → numer (number)
- **Lemma** - canonical/dictionary form of a word.
- **Lemmatization** - groups related word forms together under one lemma, taking word sense/meaning and context into account.
  - **Examples**:
    - pay: paying, paid, pays
    - be: is, was, am
  - **Tools**: `nltk.stem`, WordNet Lemmatizer
- **Word segmentation** - Process of segmenting/tokenizing text into words.
- **Spelling error detection and correction** (ED = error detection, EC = error correction)
  - **Non-word ED** (spelling errors that produce a non-word)
  - **Isolated-word EC**
  - **Context-dependent EC**
  - To correct spelling errors, use a distance metric to choose the most likely word from all candidate words.
  - **Minimum edit distance** - non-probabilistic method to find the closest spelling: the minimum number of insertions, deletions and substitutions needed to turn one string into another.

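
The minimum edit distance idea at the end of TOPIC 4 can be sketched with dynamic programming. This is only an illustration, not part of the original notes: the function names `min_edit_distance` and `closest_word` are hypothetical, the candidate word list is made up, and every insertion, deletion and substitution is assumed to cost 1 (some textbooks charge 2 for a substitution, which gives different numbers).

```python
def min_edit_distance(source: str, target: str) -> int:
    """Edit distance between two strings, assuming unit cost for
    insertion, deletion and substitution (an assumption of this sketch)."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j          # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]


def closest_word(word: str, vocabulary: list[str]) -> str:
    """Pick the candidate with the smallest edit distance to `word`."""
    return min(vocabulary, key=lambda candidate: min_edit_distance(word, candidate))


if __name__ == "__main__":
    print(min_edit_distance("kitten", "sitting"))   # 3
    print(closest_word("speling", ["spelling", "spoiling", "sapling"]))  # spelling
```

A real spelling corrector would usually combine this distance with word probabilities, since several candidates can sit at the same distance from the misspelling.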
## TOPIC 5

- **Language model** - statistical model that assigns probabilities to sequences of words.
  - **Uses**: speech recognition, OCR & handwriting recognition, machine translation, spelling error detection, augmentative communication, text summarization, image captioning, etc.
- **N-gram models** - assign probabilities to sequences of tokens, based on the history of previous tokens.
  - **Examples**: unigram, bigram, trigram, n-gram
- **Word prediction** - difficult because the input is noisy/ambiguous, thus look at the previous word(s) to guess the next word.
- **Unigram language model** - does not use any history.
  - **Examples**:
    - `<s> my hometown is in Jordan </s>`
    - `<s> i am a computer science student. </s>`
    - `<s> i studied in Gombak. </s>`

## TOPIC 6

- **Types of ML**
  - **Supervised L** - classification
  - **Unsupervised L** - clustering
  - **Semi-supervised L**
- **Sklearn** - better suited than NLTK for ML: works on floating-point feature vectors, supports most ML algorithms, is easy to install, and handles large datasets.
- **Data Preparation**
  - **Vectorization** (turn text into numerical vectors) → use tf-idf
  - **Tf-idf** (sklearn `TfidfVectorizer`) - Results in a matrix: each row is a document, each column is a word, and every cell holds the tf-idf score of that word in that document. Used to evaluate how important a term is to a document in a collection.
    - **Application**: Information Retrieval and Text Mining
- **ML Application w/ NLP**
  - **Personal productivity assistant** (Cortana)
  - **Language translator**
  - **Voice assistant**
  - **Recommendation system**
  - **Self-driving car**
- **Deep Learning NLP**
  - **Transformers** - NN used to help discern word context (e.g. Google BERT)
  - **Word Embeddings** - vector representations of words, learned by predicting a word based on its context (e.g. word2vec, fastText)
- **Data preprocessing**
  - **Remove stop words**
  - **Remove unnecessary punctuation**
  - **Normalization** (upper/lowercase)
  - **Tokenization** (split sentences into words)
  - **Stemming** (strip off affixes)
- **Tf** - measures how frequently a term occurs in a document.
  - $tf(t) = \frac{\text{number of times term } t \text{ appears in a document}}{\text{total number of terms in the document}}$
- **Idf** - measures how important a term is across the whole collection (terms that appear in few documents score higher).
  - $idf(t) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing term } t}$
- **Tf-idf**
  - $\text{tf-idf}(t) = tf(t) \times idf(t)$
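
The tf, idf and tf-idf formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names `tf`, `idf` and `tf_idf` are hypothetical, the three-document corpus is reused from the example sentences in TOPIC 5 (lowercased, punctuation dropped), and returning 0 for a term that appears in no document is a choice made here to avoid dividing by zero. sklearn's `TfidfVectorizer` applies a smoothed idf and normalization by default, so its scores will not match these raw values.

```python
import math
from collections import Counter


def tf(term: str, document: list[str]) -> float:
    """Occurrences of `term` in the document / total terms in the document."""
    return Counter(document)[term] / len(document)


def idf(term: str, corpus: list[list[str]]) -> float:
    """Natural log of (number of documents / documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption of this sketch: unseen terms get weight 0
    return math.log(len(corpus) / containing)


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    return tf(term, document) * idf(term, corpus)


# Toy corpus: each document is already tokenized into lowercase words.
corpus = [
    "i am a computer science student".split(),
    "i studied in gombak".split(),
    "my hometown is in jordan".split(),
]

if __name__ == "__main__":
    for term in ("i", "jordan"):
        for index, document in enumerate(corpus):
            print(term, index, round(tf_idf(term, document, corpus), 4))
```

Here "jordan" occurs in only one document, so it gets a higher idf than "i", which occurs in two of the three.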
