Full Transcript

NATURAL LANGUAGE PROCESSING
DR. SHEFALI ARORA CHOUHAN, ASSISTANT PROFESSOR, DEPARTMENT OF CSE

SYLLABUS: COURSE OUTCOMES
- Compose key NLP elements to develop higher-level processing chains; assess and evaluate NLP-based systems
- Choose appropriate solutions for solving typical NLP sub-problems (tokenizing, tagging, parsing)
- Describe the typical problems and processing layers in NLP
- Analyze NLP problems to decompose them into adequate independent components

Structured vs Unstructured Data

Structured data                                | Unstructured data
Well organized in rows and columns             | Lacks a clear structure
Highly accessible; can be retrieved using SQL  | Needs special tools to analyze
Goes into a data warehouse                     | Needs complex storage
Supports quick decision making                 | Takes longer to analyse
Quantitative insights                          | Qualitative insights

Why NLP?
Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. It enables computers and digital devices to recognize, understand and generate text and speech by combining computational linguistics (the rule-based modeling of human language) with statistical modeling, machine learning (ML) and deep learning. NLP plays a growing role in enterprise solutions that help streamline and automate business operations, increase employee productivity and simplify mission-critical business processes.

Phases of NLP
Input -> Morphological processing -> Syntax analysis -> Semantic analysis -> Discourse integration -> Pragmatic analysis

MORPHOLOGICAL ANALYSIS
A lexicon is a collection of the words and phrases in a given language; analysing this collection means splitting it into components, based on what the user sets as parameters: paragraphs, phrases, words, or characters. A morpheme is a basic unit of language construction, a small element of a word that carries meaning. Morphemes can be either free (e.g. walk) or bound (e.g. -ing, -ed); the difference is that a bound morpheme cannot stand on its own to produce a word with meaning and must be attached to a free morpheme.

MORPHOLOGICAL ANALYSIS: Example
- John ate the pizza (input sentence)
- John, ate, the, pizza (tokenization)
- John, ate, pizza (removal of the stopword "the")
- ate -> eat (reduction of a word to its base form; for an irregular verb like this, a lemmatizer rather than a plain stemmer is needed)

N-gram Language Models
An n-gram model captures which word follows which, and how often. N-grams are phrases cut out of a sentence: N consecutive words. A unigram model takes a sentence and gives us all the individual words in that sentence; a bigram takes a sentence and gives us the sets of two consecutive words in the sentence, for instance "ate the" from the sentence above. (A code sketch of this pipeline appears after the syntax notes below.)

Syntax Analysis: Another Example
- S -> NP VP
- VP -> V NP
- NP -> Name
- NP -> Article N
- Name -> John
- V -> ate
- Article -> the
- N -> apple

Syntax Analysis
Syntax analysis, or parsing, is the process of checking grammar and word arrangement: identifying the relationships between words and whether those relationships make sense. It involves examining all words and phrases in a sentence and the structures between them, ensuring that the structure, order and grammar of sentences make sense given the words and phrases that compose them. Syntax analysis also involves tagging words and phrases with part-of-speech (POS) tags. There are two common methods, and multiple approaches, to construct the syntax tree: top-down and bottom-up.
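This toy grammar runs directly in NLTK's chart parser. A minimal sketch, assuming the test sentence "John ate the apple" (the choice of parser and test sentence is illustrative, not from the slides):

    import nltk

    # The slide's grammar, with terminal symbols quoted
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        VP -> V NP
        NP -> Name | Article N
        Name -> 'John'
        V -> 'ate'
        Article -> 'the'
        N -> 'apple'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["John", "ate", "the", "apple"]):
        print(tree)
    # (S (NP (Name John)) (VP (V ate) (NP (Article the) (N apple))))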
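Looking back at the morphological-analysis pipeline and the n-gram slide, here is a minimal NLTK sketch of tokenization, stopword removal, stemming and bigram extraction (the nltk.download calls are one-time setup):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.util import ngrams

    # One-time setup: nltk.download('punkt'); nltk.download('stopwords')
    tokens = nltk.word_tokenize("John ate the pizza")   # ['John', 'ate', 'the', 'pizza']

    # Stopword removal drops function words such as 'the'
    content = [t for t in tokens if t.lower() not in stopwords.words('english')]

    # Stemming crudely strips suffixes; irregular 'ate' stays as is
    # (mapping ate -> eat needs a lemmatizer, not a stemmer)
    stems = [PorterStemmer().stem(t) for t in content]

    # Bigrams: consecutive word pairs
    print(list(ngrams(tokens, 2)))   # [('John', 'ate'), ('ate', 'the'), ('the', 'pizza')]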
Semantic Analysis
Semantic analysis is the third stage in NLP, in which an analysis is performed to understand the meaning of a statement. This type of analysis focuses on uncovering the definitions of words, phrases, and sentences, and on identifying whether the way words are organized in a sentence makes sense semantically. For example, "She drank some books": drink is a verb and book is a noun, and each is valid on its own, but the combination is semantically anomalous.

Discourse Analysis
Discourse integration is the analysis and identification of the larger context for any smaller part of natural language structure (e.g. a phrase, word or sentence). During this phase, it is important to ensure that each phrase, word, and entity mentioned appears in the appropriate context. This analysis involves considering not only sentence structure and semantics, but also how sentences combine and what the text means as a whole.

Pragmatic Analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, it extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves abstracting or extracting meaning from the use of language, and interpreting a text using the knowledge gathered in all the other NLP steps performed beforehand.

Applications of NLP
- Sentiment analysis
- Text classification
- Chatbots and virtual assistants
- Text extraction
- Machine translation
- Text summarization
- Market intelligence
- Auto-correct

TEXT PRE-PROCESSING: Common Terms
Turning raw text into meaningful linguistic parts involves:
- Character encoding identification (ASCII, 8-bit character sets, Unicode)
- Language identification
- Text sectioning (discard images, tables)
- Text segmentation (convert the corpus into words)
- Word segmentation (break up a sequence of characters)
- Sentence segmentation (boundary detection and disambiguation)

Character Set Dependence
Nearly all early texts were encoded in the 7-bit character set ASCII, which allowed only 128 (2^7) characters and included only the Roman (or Latin) alphabet and essential characters for writing English. Eight-bit character sets can encode 256 (2^8) characters using a single 8-bit byte, but most of these 8-bit sets reserve the first 128 characters for the original ASCII characters. A two-byte character set can represent 65,536 (2^16) distinct characters, since 2 bytes contain 16 bits; determining individual characters in two-byte character sets involves grouping pairs of bytes that represent a single character. The Unicode 5.0 standard (Unicode Consortium 2006) seeks to eliminate this character set ambiguity by specifying a Universal Character Set that includes over 100,000 distinct coded characters derived from over 75 supported scripts, representing all the writing systems commonly used today. The Unicode standard is most commonly implemented in the UTF-8 variable-length character encoding.
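A quick way to see the variable-length property of UTF-8 from Python (the sample characters are arbitrary):

    # UTF-8 spends 1 byte on ASCII characters and 2-4 bytes on others
    for ch in ["A", "é", "अ", "文", "😀"]:
        print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
    # A -> 1, é -> 2, अ -> 3, 文 -> 3, 😀 -> 4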
Language Dependence
In many written Amharic texts, for example, both word and sentence boundaries are explicitly marked, while in written Thai texts neither is marked. English employs whitespace between most words and punctuation marks at sentence boundaries, but neither feature is sufficient to segment the text completely and unambiguously. Tibetan and Vietnamese both explicitly mark syllable boundaries, either through layout or by punctuation, but neither marks word boundaries. Written Chinese and Japanese have adopted punctuation marks for sentence boundaries, but neither denotes word boundaries.

Corpus Dependence
The increasing availability of large corpora in multiple languages, encompassing a wide range of data types (e.g., newswire texts, email messages, closed captioning data, Internet news pages, and weblogs), has required the development of robust NLP approaches. It has become increasingly clear that algorithms which rely on input texts being well-formed are much less successful on these other types of text. In many corpora, traditional prescriptive rules are commonly ignored; this fact is particularly important for word and sentence segmentation, both of which depend to a large degree on the regularity of spacing and punctuation. Most existing segmentation algorithms for natural languages are both language-specific and corpus-dependent, developed to handle the predictable ambiguities in a well-formed text. Depending on the origin and purpose of a text, capitalization and punctuation rules may be followed very closely (as in most works of literature), erratically (as in various newspaper texts), or not at all (as in email messages and personal Web pages).

TOKENIZATION
- Corpus -> paragraphs/documents -> sentences; vocabulary -> unique words
- Tokens are generated from the corpus
- Sentence tokenization generates sentences from a paragraph
- Word tokenization gets word tokens from a sentence

Sentence Segmentation: Difficult Cases
- Separation of punctuation marks
- Boundaries of sentences
- Numbers such as 1.2%
- Abbreviations
- Presence of proper nouns before a period
- Parts of speech

Common Techniques for Pre-Processing
- Word normalization: putting words into a standard format, e.g. U.S.A can be written as USA
- Case folding: reduction of letters to lower case; exceptions can apply, such as proper nouns within the text

Stemming & Lemmatization
Both derive a root word from an inflected word: Play, Played, Playing -> Play.

Stemming                            | Lemmatization
The actual word may not be derived  | An actual, meaningful word is derived
Studies -> Studi                    | Studies -> Study

Porter Stemmer Algorithm
- Removes or replaces suffixes to retrieve the original word
- Useful in information retrieval from documents
- Words are treated as consonant/vowel (C/V) sequences, e.g. collection -> CVCCVCCVVC -> CVCVCVVC; conclude -> CVCVCV

What will be the stemmed words? (A quick NLTK check of these answers appears at the end of this section.)
- Busses -> Buss
- Cars -> Car
- Planes -> Plane
- Engines -> Engin
- Dance -> Danc
- Dances -> Danc
- Dancing -> Danc

Give the grammatical form: Trouble, Tree, Oats, Robbery, Private.

Word Normalization & Sentence Segmentation
- Abbreviations such as U.S.A, U.K, Dr., Inc. and numbers such as 4.3 make "." ambiguous and affect sentence boundary detection
- Case folding: change words to the same case (check for proper nouns before punctuation)
- Represent a word by its root form or lemma (as explained under lemmatization)
- Perform tokenization (rule-based or ML-based) or use regular expressions to encode the rules

Language Dependencies
- Need for compound splitters, for instance in German
- No space between words in Chinese, Japanese, Sanskrit; the actual words must be recovered from the character sequence

Word Tokenization for Chinese/Sanskrit (maximum matching)
1. Start a pointer at the beginning of the string
2. Find the largest word in the dictionary that matches the string starting at the pointer
3. Move the pointer past the matched word and repeat
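A minimal sketch of this greedy maximum-matching loop; the vocabulary here is a stand-in set of known words, and practical segmenters layer heuristics and statistics on top:

    def max_match(text, dictionary):
        """Greedy longest-match-first word segmentation."""
        words, i = [], 0
        while i < len(text):
            match = text[i]  # fallback: emit a single character
            # try the longest substring starting at position i first
            for j in range(len(text), i, -1):
                if text[i:j] in dictionary:
                    match = text[i:j]
                    break
            words.append(match)
            i += len(match)
        return words

    vocab = {"we", "can", "only", "see", "a", "short", "distance", "ahead"}
    print(max_match("wecanonlyseeashortdistanceahead", vocab))
    # ['we', 'can', 'only', 'see', 'a', 'short', 'distance', 'ahead']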
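The stemming exercise above can be checked against NLTK's Porter stemmer; the outputs shown are what the slide's answers predict, though exact results can vary slightly between stemmer versions:

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    for w in ["busses", "cars", "planes", "engines", "dance", "dances", "dancing"]:
        print(w, "->", ps.stem(w))
    # busses -> buss, cars -> car, planes -> plane, engines -> engin,
    # dance -> danc, dances -> danc, dancing -> danc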
I love NLP.") for sent in doc.sents: print(sent) doc1=list(doc.sents) Other Approaches to Solve Build a binary classifier to decide if it is the end of sentence or not Make rules or use regex Eg. Decision tree classifier Lemmatization Stemming does not resolve the semantic perspective of a word Often returns incorrect word or spellings Find lemma of a word based on the meaning Returns the dictionary form of the word Performs morphological analysis over words More powerful than stemming Lemmatization Works on the basis of morphological parsing Stem Morphemes-> smallest units of words Affix Morphological parser cats cat s Original word is substituted by root word, also known as lemma Meeting-> meet (Core word extraction) Mice-> mouse (plural to singular) Was-> be (tense conversion) WordNet Lemmatizer Publicly available lexical database of over 200 languages Wordnet links words into semantic relations wn = nltk.WordNetLemmatizer() wn.lemmatize(‘geese’) -> gees ps.stem(‘geese’) ->Geese wn.lemmatize(‘cacti’) -> cactus Applications & Drawbacks slower than stemming - Accuracy is more than stemming - Dictionary based approach - preferred to retain meaning in sentence - Depends heavily on POS tag for finding root word or lemma Applications - Making more generalized document matrix instead of sparse one - Widely used in web search results - Information retrieval Regular expressions for Segmentation Sequence of simple characters Can be case sensitive Used to find substrings in a given string Example, extracting # from tweets or removing stopwords Common patterns Pattern Words retrieved [] Bat, bat /[Bb]at/ - Uppercase letters /[A-Z]/ [^A-Z] Not an uppercase letter [e^] Either e or ^ ? Bat or bats /bats?/ (Optionality of previous letter) /beg.n/ Word that fits eg. begin or begun ^The$ Starts and ends with The End$ Ends with End Red|pink|blue Anyone out of the three Common Regex Patterns Character Description \ Escape character. Any character \s whitespace \S Not white space \d Digit \D Not digit \w Word character \W Not word character ^ Beginning of string $ End of string Common Regex Patterns Pattern Description Abc* Ab followed by 0 or more c Abc+ Ab followed by 1 or more c Abc? Ab followed by 0 or 1 c A(bc)* 0 or more copies of bc Abc{2,} Ab followed by 2 or more c Abc{2,5} Ab followed by 2 and upto 5 c Common Regex Functions Pattern Description strsplit(x,” “) Split x around space grep(“\\$”, x) Sentences using $ sign grep(“sh”,c(“ash”,”bash”) Matches both words Questions (|they('| a)re )colou?rs [bcr]at ^(\d*)[.,](\d+)$ ^[\s]*(.*?)[\s]*$ ^Hello me$ Solutions (|they('| a)re )colou?rs -> they are colours, they’re colors etc [bcr]at -> bat, cat, rat ^(\d*)[.,](\d+)$ -> Numbers such as 12.3, 12,3 ^[\s]*(.*?)[\s]*$ -> Text avoiding spaces ^Hello me$ -> Hello me Solve Construct a regex to 1. identify both cat and cats 2. Abc followed by 0-9 digits 3. That matches gray and grey 4. br followed by single character except for a new line and then 3 5. Matches t.forward 6. Uppercase letter followed by lowercase letter 7. Eight word character 8. Lowercase letters followed by space and then two to four digits Solve Construct a regex to 1. identify both cat and cats: cats? 2. Abc followed by 0-9 digits: abc\d* 3. That matches gray and grey: gr[ae]y 4. br followed by single character except for a new line and then 3: br.3 5. Matches t.forward: t\.forward 6. 
Bound vs Free Morphemes
- Bound morphemes cannot appear by themselves: in happily, the suffix -ily cannot appear on its own, but happy can; happy is a free morpheme
- House is a free morpheme and can be combined with other morphemes, for instance houses
- Stems are free morphemes; affixes (-ing, -ily, un-) are bound morphemes

Relationships Between Words
- Walk + ing = walking is a case of inflectional morphology, which creates new forms of a word (bring, brings, brought) and operates on number, tense, case and gender
- Drive + er = driver is a case of derivational morphology, which creates new words and changes the part of speech, e.g. verb -> noun

Morphological Processes
- Concatenation of affixes: hope + less -> hopeless
- Phonological changes at morpheme boundaries: happy + er -> happier
- Reduplication of a part of a word (in some languages, such as Sanskrit)
- Suppletion, an irregular relation between words: go-went, good-better
- Internal changes, such as sang -> sung
- Compounding and blending: rules for the formation of new words
- Clipping, the shortening of words: doc, lab, etc.

Morphological Analysis: Tasks
- Lemmatization -> lemma
- Tagging -> considers context, e.g. saw -> {see, verb.past}
- Morpheme segmentation, e.g. de-nation-al-iz-ation
Applications: text-to-speech synthesis, machine translation, information retrieval. Issues?

N-gram Modelling
N-gram modelling is important for context-sensitive spelling correction. N-grams are continuous sequences of words, symbols or tokens in a document; in technical terms, they can be defined as the neighboring sequences of items in a document. Here 'n' is simply a variable that can take positive integer values: 1, 2, 3, and so on.

Use Case: Sentiment Analysis
Sentiment analysis on a news dataset (a sketch of the counting steps follows this list):
1. Pre-processing (removal of punctuation and stop words)
2. Feature extraction
3. Train-test split
4. Create unigrams/bigrams/trigrams for each news record belonging to each of the three sentiment categories
5. Store each word and its count in the corresponding dictionary
6. Convert these dictionaries to data frames
7. Fetch the top 10 most frequently used words
8. Visualize the most frequently used words for all 3 categories: positive, negative and neutral
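A minimal sketch of the n-gram counting steps (4, 5 and 7); positive_news is a hypothetical list of headline strings for one sentiment bucket:

    from collections import Counter
    import nltk
    from nltk.util import ngrams

    def top_ngrams(texts, n=1, k=10):
        """Count n-grams over a list of documents; return the k most frequent."""
        counts = Counter()
        for text in texts:
            # crude pre-processing: lowercase, keep alphabetic tokens only
            tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
            counts.update(ngrams(tokens, n))
        return counts.most_common(k)

    positive_news = ["Markets rally as growth beats forecasts"]  # stand-in data
    print(top_ngrams(positive_news, n=2))  # top bigrams for the positive bucket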
