Unit-1 Part-1.pdf

Natural Language Processing
By Dr. Shruti, AIML Department, SIT

Syllabus

Evaluation Scheme

Sr. No.   Theory             Units
1         Unit Test          Unit 1 & 2
2         Research Seminar   Unit 3
3         Quiz               Unit 4 & 5

Mode of Communication: Moodle

Reference Books

Online NLP Tutorials

Best NLP Certificate Courses
1. Natural Language Processing Specialization - Coursera
2. Become a Natural Language Processing Expert - Udacity
3. Natural Language Processing - Coursera
4. Natural Language Processing in TensorFlow - Coursera
5. Introduction to Natural Language Processing in Python - DataCamp
6. Natural Language Processing with Deep Learning in Python - Udemy
7. Learn Natural Language Processing - Codecademy
8. Data Science: Natural Language Processing (NLP) in Python - Udemy
9. NLP - Natural Language Processing with Python - Udemy
10. Natural Language Processing with Python Certification Course - Edureka

Top NLP Research Labs

Apple NLP Engineer Resume

NLP Tasks

Machine Learning, Deep Learning, and NLP

Applications of NLP

Types of NLP

What is a language?
Language is a structured system of communication that involves complex combinations of its constituent components, such as characters, words, sentences, etc. Linguistics is the systematic study of language. In order to study NLP, it is important to understand some concepts from linguistics about how language is structured.

Phases in NLP

Phases in NLP: Pipeline

Phases in NLP: Example

Challenges in NLP

Contextual words and phrases

Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible interpretations.

Lexical ambiguity: a word that could be used as a verb, noun, or adjective.

Semantic ambiguity: the interpretation of a sentence in context. For example: "I saw the boy on the beach with my binoculars." This could mean that I saw a boy through my binoculars or the boy had my binoculars with him.

Syntactic ambiguity: In the sentence above, this is what creates the confusion of meaning.
The phrase "with my binoculars" could modify the verb "saw" or the noun "boy". Even for humans, this sentence alone is difficult to interpret without the context of the surrounding text. POS (part-of-speech) tagging is one NLP solution.

Non-Standardization in Languages

NLP Implementation Pipeline

Text Preprocessing
- Text cleaning: HTML tag removal, emoji handling, spell checking, etc.
- Basic preprocessing: tokenization, stop word removal, digit removal, lowercasing, punctuation removal, removal of hashtags and special characters
- Digression: POS, NER
- Normalization: stemming and lemmatization
- Vectorization: one-hot encoding, BoW, GloVe, TF-IDF

Feature Engineering
- One-Hot Encoder
- Bag of Words (BoW)
- n-grams
- TF-IDF
- Word2vec
- GloVe

Model Building

Heuristics Based   ML Based         DL Based
Lexicon based      Naïve Bayes      CNN
Regex based        SVM              RNN
                   Decision Trees   Transformers
                                    Autoencoders

Punctuation Removal
In this step, all punctuation is removed from the text. Python's string library contains a pre-defined list of punctuation characters, string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Lowering the Text
This is one of the most common text preprocessing steps in Python: the text is converted to a single case, preferably lower case. It is not necessary every time you work on an NLP problem, because for some problems lowercasing can lead to loss of information. For example, if a project deals with the emotions of a person, words written in upper case can be a sign of frustration or excitement.

Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Stemming
Stemming is a text standardization step in which words are reduced to their root/base form. For example, words like 'programmer', 'programming', and 'program' are reduced toward the common stem 'program' (the exact output depends on the stemming algorithm).
But the disadvantage of stemming is that the resulting root form may lose its meaning, or may not be a proper English word.

Lemmatization
Lemmatization also reduces a word to its base form, but makes sure that the word does not lose its meaning. It relies on a pre-defined dictionary that stores the context of words and looks the word up in that dictionary while reducing it.

Methods to Perform Tokenization in Python
- Tokenization using Python's split() function
- Tokenization using Regular Expressions (RegEx)
- Tokenization using NLTK, spaCy, Keras, Gensim

Tokenization using split()
We start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at each run of whitespace. We can change the separator to anything.

Tokenization using Regex
A regular expression is a sequence of characters mainly used to find and replace patterns in a string or file.

The most common uses of regular expressions are:
- Search a string (search and match)
- Find a string (findall)
- Break a string into substrings (split)
- Replace part of a string (sub)

Commonly used regex methods in the "re" library:
1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()

RegEx Functions

Metacharacters

Special Sequences

Sets

Stop Word Removal
Stop words are commonly used words that are removed from the text because they add little or no value to the analysis. These words carry little or no meaning on their own. The NLTK library includes a list of words that are considered stop words for the English language.
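The split() and regex tokenization methods described above, followed by stop word removal, can be sketched roughly as follows. This is a minimal illustration using only the standard library. The function names, the sample sentence, and the small STOP_WORDS set are assumptions made for this example; the set is a hand-picked subset, not the NLTK list.

```python
import re

# Hypothetical hand-picked subset of English stop words for illustration;
# the NLTK list is much longer.
STOP_WORDS = {"i", "we", "my", "so", "can", "just"}


def tokenize_split(text):
    """Whitespace tokenization: punctuation stays attached to the words."""
    return text.split()


def tokenize_findall(text):
    """Regex tokenization: runs of letters/digits, keeping inner apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def tokenize_resplit(text):
    """Split on runs of non-word characters and drop empty strings."""
    return [tok for tok in re.split(r"\W+", text) if tok]


def remove_stop_words(tokens):
    """Lowercase each token and drop those found in STOP_WORDS."""
    return [tok.lower() for tok in tokens if tok.lower() not in STOP_WORDS]


if __name__ == "__main__":
    sample = "I think we can split my text, so tokenization is just a start. It's simple!"
    print(tokenize_split(sample))      # 'text,' and 'start.' keep their punctuation
    print(tokenize_findall(sample))    # punctuation dropped, "It's" kept whole
    print(tokenize_resplit(sample))    # "It's" becomes 'It' and 's'
    print(remove_stop_words(tokenize_findall(sample)))
```

The three tokenizers disagree on punctuation and apostrophes, which is why the choice of tokenizer affects every later step, including which tokens match the stop word list.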
Some of the NLTK English stop words are: [i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't]

Decontracting the Words
English contractions are shortened forms in which letters (often vowels) are dropped, for example "don't" for "do not". Expanding contractions contributes to text standardization and is useful when working with Twitter data or product reviews, where individual words play an important role in sentiment analysis.

POS Tagging
Part-of-speech tagging is a labeling mechanism in which the words in a given corpus are tagged grammatically: each word is labeled as a noun, verb, adjective, adverb, etc. It can also do fine-grained tagging, such as 'noun-plural', and it considers tense while tagging.

POS Tagging Applications:
- Text to speech, speech recognition
- Word sense disambiguation
- Syntactic parsing

Syntactic Analysis or Parsing
The purpose of this phase is to analyze the grammatical structure of the text; drawing the exact (dictionary) meaning from the text is the job of semantic analysis. Syntax analysis checks the text for well-formedness against the rules of a formal grammar. By contrast, a phrase like "hot ice-cream" is grammatically well formed and would be rejected only by a semantic analyzer. In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of a formal grammar.

Concept of Parser
A parser is a software component that takes input data (text) and produces a structural representation of the input after checking it for correct syntax according to a formal grammar. The main roles of the parser include:
- To report any syntax error.
- To recover from commonly occurring errors so that processing of the remainder of the program can continue.
- To create a parse tree.
- To create a symbol table.
- To produce intermediate representations (IR).
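As a rough illustration of the parser concept above, the sketch below checks a token sequence against a tiny hand-written grammar, reports a syntax error when the input does not conform, and otherwise returns a parse tree. The toy grammar, the lexicon, and all names (ToyParser, ParseError, is_grammatical, leaves) are assumptions made for this example. It covers only three of the listed roles (syntax checking, error reporting, parse tree creation) and needs no backtracking because each rule has a single alternative.

```python
# Toy grammar (hypothetical, for illustration only):
#   S  -> NP VP
#   NP -> Det N
#   VP -> V NP
LEXICON = {
    "Det": {"the", "a"},
    "N": {"boy", "dog", "ball"},
    "V": {"saw", "chased"},
}


class ParseError(Exception):
    """Raised when the input does not conform to the toy grammar."""


class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def _word(self, category):
        """Consume one token of the given category or report a syntax error."""
        if self.pos >= len(self.tokens):
            raise ParseError(f"expected {category}, found end of input")
        token = self.tokens[self.pos]
        if token not in LEXICON[category]:
            raise ParseError(f"expected {category} at position {self.pos}, found {token!r}")
        self.pos += 1
        return (category, token)

    def _np(self):
        return ("NP", self._word("Det"), self._word("N"))

    def _vp(self):
        return ("VP", self._word("V"), self._np())

    def parse(self):
        """Start from the start symbol S and expand it to match the input."""
        tree = ("S", self._np(), self._vp())
        if self.pos != len(self.tokens):
            raise ParseError(f"unexpected token {self.tokens[self.pos]!r}")
        return tree


def leaves(tree):
    """Collect the words of a parse tree from left to right."""
    _label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def is_grammatical(tokens):
    """Return True if the tokens can be parsed by the toy grammar."""
    try:
        ToyParser(tokens).parse()
        return True
    except ParseError:
        return False


if __name__ == "__main__":
    tree = ToyParser("the boy saw a dog".split()).parse()
    print(tree)
    print(leaves(tree))
    print(is_grammatical("the boy saw".split()))
```

Reading the leaves of the returned tree from left to right gives back the input tokens, which is a quick sanity check on any parse tree.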
Types of Parsing
Derivation divides parsing into the following two types:

Top-down Parsing
In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol to the input. The most common form of top-down parsing uses recursive procedures to process the input. The main disadvantage of recursive descent parsing is backtracking.

Bottom-up Parsing
In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree up to the start symbol.

Types of Derivation
The type of derivation decides which non-terminal is to be replaced with a production rule at each step.

Left-most Derivation
In the left-most derivation, the sentential form of an input is scanned and replaced from left to right. The sentential form in this case is called the left-sentential form.

Right-most Derivation
In the right-most derivation, the sentential form of an input is scanned and replaced from right to left. The sentential form in this case is called the right-sentential form.

Concept of Parse Tree
A parse tree is the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of a parse tree is that reading its leaves from left to right (an in-order traversal) produces the original input string.

Concept of Grammar
A grammar is essential for describing the syntactic structure of well-formed programs. A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for writing computer languages.

Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P), where:
- N or VN = set of non-terminal symbols, i.e., variables.
- T or Σ = set of terminal symbols.
- S = start symbol, where S ∈ N.
- P denotes the production rules for terminals as well as non-terminals.
Each production rule has the form α → β, where α and β are strings on VN ∪ Σ and at least one symbol of α belongs to VN.

Phrase Structure or Constituency Grammar
Example: "This tree is illustrating the constituency relation"

Constituency Grammar or Phrase Structure

Graphical Representation of the Above Output of Code
To get the graphical representation, we have to run one more line of code, i.e. output.draw()

Semantic Parsing

Stemming
Stemming is a text standardization step in which words are reduced to their root/base form. For example, words like 'programmer', 'programming', and 'program' are reduced toward the common stem 'program' (the exact output depends on the stemming algorithm). But the disadvantage of stemming is that the resulting root form may lose its meaning, or may not be a proper English word.

Lemmatization
Lemmatization also reduces a word to its base form, but makes sure that the word does not lose its meaning. It relies on a pre-defined dictionary that stores the context of words and looks the word up in that dictionary while reducing it.
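The contrast between stemming and lemmatization, together with the earlier cleaning steps (lowercasing, decontraction, punctuation removal), can be sketched as follows. This is a deliberately naive illustration built on the standard library only. The lookup tables, the suffix list, and the function names are assumptions for this example; real stemmers and lemmatizers (for instance those shipped with NLTK or spaCy) use far richer rules and dictionaries.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
LEMMA_DICT = {"studies": "study", "studying": "study", "better": "good"}
SUFFIXES = ("ing", "ies", "es", "er", "s")


def expand_contractions(text):
    """Replace known contractions in lowercased text with their expansions."""
    return re.sub(
        r"[a-z]+'[a-z]+",
        lambda m: CONTRACTIONS.get(m.group(0), m.group(0)),
        text,
    )


def remove_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def preprocess(text):
    """Lowercase, expand contractions, strip punctuation, split on whitespace."""
    text = text.lower()
    text = expand_contractions(text)  # must run before apostrophes are removed
    text = remove_punctuation(text)
    return text.split()


def naive_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Dictionary lookup: return the stored lemma, or the word unchanged."""
    return LEMMA_DICT.get(word, word)


if __name__ == "__main__":
    print(preprocess("I'm studying NLP; it's FUN, don't you think?"))
    for word in ["studies", "studying", "better", "programming"]:
        print(word, "-> stem:", naive_stem(word), "| lemma:", lemmatize(word))
```

With these toy rules, "studies" is stemmed to "stud", which is not an English word, while the dictionary lookup returns "study". That is the trade-off described above: suffix chopping is cheap but can damage the word, whereas a lemmatizer is only as good as its dictionary.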
