Lecture 2 NLP PDF
Document Details
Dr. Belal Badawy Amin
Summary
This lecture introduces Natural Language Processing (NLP). It covers structured versus unstructured data, the main approaches to NLP (rule-based, machine learning, and deep learning), why NLP is important and why it is difficult, syntax and semantic analysis techniques, words and corpora, text pre-processing, tokenization (including subword tokenization and byte-pair encoding), and word normalization.
Full Transcript
Surah Taha, verses 25-26
By Dr. Belal Badawy Amin

Introduction: Structured vs. Unstructured Data

Structured data:
◦ Can be displayed in rows, columns, and a relational database
◦ Numbers, dates, and strings
◦ Estimated 20% of enterprise data
◦ Requires less storage
◦ Easier to manage and protect with legacy solutions

Unstructured data:
◦ Cannot be displayed in rows, columns, and a relational database
◦ Images, audio, video, Word files, e-mail, etc.
◦ Estimated 80% of enterprise data
◦ Requires more storage
◦ More difficult to manage and protect with legacy solutions

Main Approaches in NLP (Timeline)
1. Rule-based approaches (slow):
◦ Regular expressions
◦ Context-free grammars
2. Machine learning / traditional approaches (better performance and accuracy):
◦ Linear classifiers
◦ Likelihood maximization
3. Deep learning (most efficient, best performance):
◦ Convolutional neural networks
◦ Recurrent neural networks

Why is NLP important?
◦ NLP is everywhere, even if we don't realize it.
◦ The majority of human activities are carried out through language.
◦ Social media, messaging apps, and similar channels generate millions of gigabytes of text data every second.
◦ Because of these large volumes of highly unstructured text data, we can no longer rely on manual approaches to understand text; this is where NLP comes in.

Why is NLP difficult?
◦ It is the nature of human language that makes NLP difficult.
◦ Humans have the edge thanks to their communication skills.
◦ There are hundreds of natural languages, each with its own syntax rules, and words can be ambiguous, with meanings that depend on context.
◦ The rules that govern how information is conveyed in natural languages are not easy for computers to understand.

Techniques of NLP
1. Syntax analysis refers to the arrangement of words in a sentence such that they make grammatical sense. It is used to assess how natural language aligns with grammatical rules. Some syntax-analysis techniques (a short toolkit-based sketch of several of these steps appears after the Words and Corpora notes below):
◦ Lemmatization: reducing the various inflected forms of a word to a single form for easy analysis.
◦ Stemming: cutting inflected words down to their root form.
◦ Morphological segmentation: dividing words into individual units called morphemes.
◦ Word segmentation: dividing a large piece of continuous text into distinct units.
◦ Part-of-speech tagging: identifying the part of speech of every word.
◦ Parsing: performing grammatical analysis of a given sentence.
◦ Sentence breaking: placing sentence boundaries in a large span of text.
2. Semantic analysis refers to the meaning conveyed by a text. It is one of the difficult aspects of NLP that has not yet been fully resolved. It involves applying computer algorithms to understand the meaning and interrelations of words and how sentences are structured. Some techniques:
◦ Named entity recognition (NER): identifying the parts of a text that can be categorized into preset groups.
◦ Word sense disambiguation: assigning a meaning to a word based on its context.
◦ Natural language generation: using databases to derive semantic intentions and convert them into human language.

Words and Corpora
◦ A corpus (plural corpora) is a computer-readable collection of text or speech.
◦ Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, then |V| is the size of the vocabulary.
◦ Tokens are the total number N of running words.
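The type/token distinction can be made concrete in a few lines of code. The following is a minimal sketch (my own illustration, not part of the lecture) that counts N and |V| using plain whitespace splitting; the example sentence is arbitrary.

# Minimal sketch: counting tokens (N) and types (|V|) with whitespace splitting.
from collections import Counter

text = "to be or not to be that is the question"

tokens = text.split()        # running words -> the tokens
vocab = Counter(tokens)      # distinct word forms -> the vocabulary V

print("N (tokens):", len(tokens))    # 10
print("|V| (types):", len(vocab))    # 8 ("to" and "be" each occur twice)

A Counter is not strictly necessary (a set would do), but it also gives per-type frequencies, which the corpus statistics below build on.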
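The syntax-analysis techniques listed above (word segmentation, part-of-speech tagging, stemming, lemmatization) are available in off-the-shelf toolkits. Below is the sketch referenced there, using NLTK; the library choice and example sentence are my own, not the lecture's, and it assumes NLTK's 'punkt', 'averaged_perceptron_tagger', and 'wordnet' data packages have been downloaded.

# Minimal NLTK sketch of several syntax-analysis steps (illustrative only).
# Assumes: pip install nltk, plus nltk.download() of 'punkt',
# 'averaged_perceptron_tagger', and 'wordnet'.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "The striped bats were hanging on their feet and studying papers"

tokens = nltk.word_tokenize(sentence)   # word segmentation
tags = nltk.pos_tag(tokens)             # part-of-speech tagging

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, tag in tags:
    print(f"{word:10s} stem={stemmer.stem(word):10s} "
          f"lemma={lemmatizer.lemmatize(word):10s} pos={tag}")

Note the distinction drawn above: the stemmer simply chops suffixes (e.g., "studying" becomes "studi"), while the lemmatizer maps inflected forms to dictionary entries (e.g., "feet" becomes "foot").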
Herdan's Law: |V| = kN^β, where often 0.67 < β < 0.75; i.e., vocabulary size grows with more than the square root of the number of word tokens.

How many words are in a corpus?
Example: "they lay back on the San Francisco grass and looked at the stars and their"
How many? Tokens: __________  Types: __________

Corpora
Words don't appear out of nowhere! A text is produced by a specific writer (or writers), at a specific time, in a specific variety of a specific language, for a specific function. Corpora vary along dimensions like:
◦ Language: 7,097 languages in the world.
◦ Variety, like African American Language varieties; AAE Twitter posts might include forms like "iont" (I don't).
◦ Code switching, e.g., Spanish/English or Hindi/English:
  S/E: "Por primera vez veo a @username actually being hateful! It was beautiful:)" [For the first time I get to see @username actually being hateful! It was beautiful:)]
  H/E: "dost tha or rahega... dont wory... but dherya rakhe" [He was and will remain a friend... don't worry... but have faith]
◦ Genre: newswire, fiction, scientific articles, Wikipedia.
◦ Author demographics: the writer's age, gender, ethnicity, SES.

Pre-processing text data
Cleaning up the text data is necessary to highlight the attributes we want our model to pick up on. Cleaning typically consists of a number of steps (a combined sketch of steps 1-5 appears at the end of this tokenization discussion, just before the subword algorithms):
1. Remove punctuation
2. Convert text to lowercase
3. Tokenization
4. Remove stop words
5. Lemmatization / stemming
6. Vectorization
7. Feature engineering

Tokenization
Space-based tokenization is a very simple way to tokenize:
◦ For languages that use space characters between words (writing systems based on Arabic, Cyrillic, Greek, Latin, etc.)
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization:
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies

Issues in Tokenization
We can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap'n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWEs) be treated as single words?
◦ New York, rock 'n' roll

Tokenization in languages without spaces
Many languages (like Chinese and Japanese) don't use spaces to separate words!
姚明进入总决赛 "Yao Ming reaches the finals"
3 words? 姚明 进入 总决赛 (YaoMing reaches finals)
5 words? 姚 明 进入 总 决赛 (Yao Ming reaches overall finals)

Word tokenization / segmentation
In Chinese it is common to just treat each character (zi) as a token, so the segmentation step is very simple. In other languages (like Thai and Japanese), more complex word segmentation is required; the standard algorithms are neural sequence models trained by supervised machine learning.

Another option for text tokenization
Instead of white-space segmentation or single-character segmentation, use the data to tell us how to tokenize. Subword tokenization is a technique used in natural language processing (NLP) that involves breaking words down into smaller subwords or pieces. For example, "football" might be split into "foot" and "ball".
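Before moving on to subword algorithms, here is the sketch referenced above: it ties the cleaning pipeline (steps 1-5) to the tokenization issues just discussed, using a regex tokenizer that keeps URLs, hashtags, prices, and simple clitics intact instead of blindly stripping punctuation. Everything in it (the regex, the tiny stop-word list, the crude suffix rule) is my own illustrative placeholder; a real pipeline would use a proper stemmer or lemmatizer and stop-word list such as NLTK's.

# Illustrative cleaning pipeline (steps 1-5), not the lecture's code.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "at", "for"}  # tiny illustrative list

# Order matters: try the "special" patterns before the plain-word pattern.
TOKEN_RE = re.compile(r"""
      https?://\S+          # URLs
    | \#\w+                 # hashtags such as #nlproc
    | \$?\d+(?:[./]\d+)*    # prices ($45.55) and dates (01/02/06)
    | \w+(?:'\w+)?          # ordinary words, keeping simple clitics like don't
""", re.VERBOSE)

def preprocess(text):
    text = text.lower()                                        # 2. lowercase
    tokens = TOKEN_RE.findall(text)                            # 1+3. punctuation-aware tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]         # 4. stop-word removal
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # 5. a very crude stand-in for stemming
    return tokens

print(preprocess("Check http://www.stanford.edu for #nlproc prices ($45.55) on 01/02/06, don't worry"))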
Subword tokenization algorithms
Three common algorithms:
◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
All have two parts:
◦ A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens)
◦ A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary

Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters = {A, B, C, D, ..., a, b, c, d, ...}
Repeat:
◦ Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'
Until k merges have been done.
(The slides present the BPE token learner algorithm as pseudocode; a toy Python sketch of the learner and segmenter appears at the end of this transcript.)

Byte Pair Encoding (BPE) addendum
Most subword algorithms are run inside space-separated tokens, so we commonly first add a special end-of-word symbol '__' before each space in the training corpus, and then separate the words into letters.

BPE token segmenter
On the test data, run each merge learned from the training data:
◦ Greedily
◦ In the order we learned them
◦ (Test frequencies don't play a role)
So: merge every "e r" to "er", then merge "er _" to "er_", etc.
Result:
◦ The test string "n e w e r _" would be tokenized as a full word
◦ The test string "l o w e r _" would be two tokens: "low er_"

Properties of BPE tokens
They usually include frequent words and frequent subwords, which are often morphemes like -est or -er. A morpheme is the smallest meaning-bearing unit of a language; "unlikeliest" has three morphemes: un-, likely, and -est.

Word Normalization
Putting words/tokens into a standard format, e.g.:
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are

Case folding
For applications like information retrieval (IR), reduce all letters to lower case:
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence, e.g., General Motors, Fed vs. fed, SAIL vs. sail
For sentiment analysis, machine translation, and information extraction, case is helpful (US versus us is important).
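As a small follow-up to the normalization and case-folding discussion, here is a sketch (my own illustration) of task-dependent case folding plus a toy normalization table; the mapping entries are placeholders, not a standard resource.

# Illustrative word normalization and task-dependent case folding.
NORMALIZATION_MAP = {      # toy mapping of variant forms to a chosen standard form
    "u.s.a.": "USA",
    "uhhuh": "uh-huh",
}

def normalize(token, task="ir"):
    # For IR-style applications, fold everything to lower case;
    # for sentiment analysis, MT, or information extraction, keep case
    # ("US" vs. "us" carries information).
    mapped = NORMALIZATION_MAP.get(token.lower())
    if mapped is not None:
        return mapped
    return token.lower() if task == "ir" else token

print(normalize("U.S.A."))                  # -> USA
print(normalize("Fed", task="ir"))          # -> fed
print(normalize("US", task="sentiment"))    # -> US (case preserved)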
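Finally, here is the toy sketch of the BPE learner and segmenter referenced at the learner pseudocode above. It is a simplified illustration of the procedure described there, not the lecture's or the textbook's implementation: the training corpus, the number of merges k, and the tie-breaking between equally frequent pairs are my own choices, so the exact merges may differ from the slide example.

# Toy BPE-style token learner and segmenter (simplified illustration).
from collections import Counter

def learn_bpe(corpus_words, k):
    """Learn k merges from a list of word tokens."""
    merges = []
    words = [list(w) + ["_"] for w in corpus_words]   # split into letters + end-of-word marker
    for _ in range(k):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequently adjacent pair
        merges.append(best)
        merged = "".join(best)
        # Replace every occurrence of the pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

def segment(word, merges):
    """Tokenize a new word by applying the learned merges greedily, in order."""
    symbols = list(word) + ["_"]
    for pair in merges:
        merged = "".join(pair)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# A tiny corpus in the spirit of the textbook example.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newer"] * 6 + ["wider"] * 3
merges = learn_bpe(corpus, k=8)
print(merges)
print(segment("newer", merges))   # likely a single token such as "newer_"
print(segment("lower", merges))   # likely split into two subword tokens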