Chapter 6: Natural Language Processing PDF

Document Details

Uploaded by WinningZeugma5722

Dr. Feras Al-Obeidat

Tags

natural language processing, NLP, machine learning, artificial intelligence

Summary

This document is a presentation on natural language processing, providing an overview of fundamental NLP concepts, methods, and techniques. It explores supervised and unsupervised learning approaches commonly used in NLP.

Full Transcript


Chapter 6. Natural Language Processing
Dr. Feras Al-Obeidat, INS 646

Introduction

Natural language processing (NLP) is an area of algorithms focused on processing unstructured data. This chapter focuses on unstructured data in natural language text format. Organizations typically hold large corpora of unstructured text data, whether as Word documents, PDFs, email bodies, or web documents.

Needs for Text Processing - NLP

With advances in technology, organizations have come to rely on large volumes of text information. For example, a legal firm holds a great deal of information in the form of bond papers, legal agreements, court orders, law documents, and so on. The goal is to utilize these valuable textual assets and convert the information into knowledge through intelligent machines. NLP for big data uses large amounts of text from various sources to determine relationships and patterns across the content received from those sources.

Types of NLP

NLP has two types of approaches:
Supervised NLP
Unsupervised NLP

Types of NLP - Supervised

The supervised learning NLP approach uses supervised learning algorithms such as Naive Bayes and Random Forests. In these algorithms, models are created based on the expected output provided for a training input set. In other words, supervised approaches are not self-learning; they train and fine-tune models based on the target output provided to them.

Types of NLP - Unsupervised

Unsupervised learning algorithms do not rely on a target output being provided for model training. They draw deductions from the input records given to them through multiple iterations over the data, learning from the output of previous iterations and tuning weights and parameters to optimize results. Recurrent neural networks (RNNs) are among the common unsupervised learning algorithms used in natural language processing.

Topics

Natural language processing basics
Text preprocessing
Feature extraction
Applying NLP techniques
Implementing sentiment analysis

Natural language processing basics - Definition

NLP is a collection of processes, algorithms, and tools used by intelligent systems to interpret text data written in human language and extract actionable insights. NLP is all about interpreting unstructured data.

Natural language processing basics

NLP organizes unstructured text data and uses sophisticated methods to solve a plethora of problems, such as:
Sentiment analysis
Document classification
Text summarization

[Diagram: the basic hierarchical steps involved in NLP; not reproduced in this transcript]

Steps involved in NLP - Type of machine learning

Type of machine learning: NLP can be performed using either supervised or unsupervised learning algorithms. Supervised learning algorithms include Naive Bayes, SVM, and Random Forest. Unsupervised learning algorithms include feed-forward neural networks (multi-layer perceptrons) and recurrent neural networks (RNNs). One important thing to note here is that the preprocessing and feature-extraction steps are the same for both classes of algorithms; what differs is how you train your model. Supervised learning requires labeled output as part of its input, while unsupervised learning predicts the outcome without any labeled output.
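To make the supervised case concrete, here is a minimal sketch of training a Naive Bayes text classifier on a tiny labeled dataset with scikit-learn. The example sentences and labels are invented purely for illustration; the deck itself does not prescribe this library.

```python
# Minimal supervised NLP sketch: Naive Bayes text classification with scikit-learn.
# The tiny labeled dataset below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",                # spam
    "limited offer, click here",           # spam
    "meeting moved to 3 pm",               # ham
    "please review the attached report",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Feature extraction (bag of words) and model training happen inside the pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Predict the label of a new, unseen message.
print(model.predict(["free prize meeting"]))
```

The key point of the sketch is that the model only learns because each training sentence comes with a target label; an unsupervised approach would have to work without the `labels` list.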
Steps involved in NLP - Text preprocessing

Text preprocessing: This step is required because raw natural language text cannot be fed directly into NLP systems; doing so results in poor or inaccurate output. Common text preprocessing steps include removing stop words, converting capitalized words to lower case, and removing special characters. Another common step is part-of-speech tagging, also called annotation. Text normalization in the form of stemming and lemmatization is also applied.

Steps involved in NLP - Feature extraction

Feature extraction: For any ML algorithm to work on text, the text has to be converted into some form of numerical input (text to numeric). Feature extraction employs common techniques for converting input text into numerical input in the form of vectors.

Steps involved in NLP - Model training

Model training: Model training is the process of establishing or finding a mathematical function that can be used to predict the outcome for a given input. Finding this function involves multiple iterations and parameter tuning.

Steps involved in NLP - Model verification

Model verification: This step is the process of verifying the models resulting from model training. Generally, you divide your dataset in an 80:20 ratio: 80% of the data is used for model training and 20% is used for validating the correctness of the model.

Model deployment and APIs

After the models have been verified, you deploy them so that they can be used to predict outcomes in the context of enterprise applications. You can save these models to a storage location where they can be read into memory and applied to a dataset to predict its outcome. In distributed processing, they are generally saved in the Hadoop Distributed File System so that Hadoop batch processes can read and apply them. In the case of web applications, they are stored as Python pickle files, which are read and processed on each prediction request. For applications to use this, API layers need to be exposed on top: these can be RESTful APIs or packaged JARs deployed where the applications are hosted. Once the APIs are exposed, they can be used by a variety of web applications, mobile applications, or analytics and BI engines.

Text preprocessing

Preprocessing the data is the process of cleaning and preparing the text for classification and derivation of meaning. Since our data may contain a lot of noise, uninformative parts such as HTML tags need to be eliminated or re-aligned, and words that do not contribute much to the overall semantics of the text should be removed.

Text preprocessing steps

Text preprocessing involves a few steps, such as:
Reading the corpus
Tokenization
Stop-word removal
Stemming and lemmatization
Converting into numerical form

Corpus

A corpus is the entire collection of text documents. For example, thousands of emails in a collection that we need to process and analyze: this group of emails is known as a corpus because it contains all the text documents.
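As a small illustration of the "reading the corpus" step, here is a minimal Python sketch that loads every plain-text file in a folder into a list of documents. The folder name emails/ is a hypothetical example path, not something specified in the slides.

```python
# Minimal sketch: read a corpus of plain-text documents from a folder.
# The folder name "emails" is a hypothetical example path.
from pathlib import Path

corpus = []
for path in sorted(Path("emails").glob("*.txt")):
    # Each file becomes one document in the corpus.
    corpus.append(path.read_text(encoding="utf-8"))

print(f"Loaded {len(corpus)} documents into the corpus")
```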
Tokenize

Dividing a given sentence, or the collection of words in a text document, into separate individual words is known as tokenization. It also removes unnecessary characters such as punctuation. For example, given the sentence:

Input: He really liked the London City. He is there for two more days.
Tokens: He, really, liked, the, London, City, He, is, there, for, two, more, days

We end up with 13 tokens for the above input sentence.

Text preprocessing

In addition to these, some basic and generic techniques that improve accuracy are converting the text to lower case, removing numbers (depending on the context), removing punctuation, and stripping white space.

Removing stop words

Stop words are words that occur frequently in sentences and make the text heavier while adding little value to the analysis, so they should be excluded from the input. Keeping stop words in your text confuses the algorithm, because they have no contextual meaning and they increase the dimensionality of your term vectors. It is therefore important that stop words be removed for better model accuracy. Examples of stop words are 'I', 'am', 'is', 'this', 'was', 'but', and 'the'. They increase the computational overhead without adding much value or insight, so we drop them from the tokens. In PySpark, we use StopWordsRemover to remove the stop words.

Bag of Words (BOW)

This is the methodology through which we represent text data in numerical form so that it can be used by machine learning or any other analysis. Text data is generally unstructured and varies in length. BOW (Bag of Words) allows us to convert text into a numerical vector form by considering the occurrence of words in the text documents. For example:

Doc 1: The best thing in life is to travel
Doc 2: Travel is the best medicine

Bag of Words, example

Doc 1: The best thing in life is to travel
Doc 2: Travel is the best best medicine
Doc 3: One should travel more often

Vocabulary: the list of unique words appearing across all the documents is known as the vocabulary. In the above example, there are 13 unique words in the vocabulary, so each document can be represented by a vector of fixed size 13:

the, best, thing, in, life, is, to, travel, medicine, one, should, more, often

        the  best  thing  in  life  is  to  travel  medicine  one  should  more  often
Doc 1    1    1     1     1    1    1   1     1        0       0     0      0     0
Doc 2    1    2     0     0    0    1   0     1        1       0     0      0     0
Doc 3    0    0     0     0    0    0   0     1        0       1     1      1     1

BOW limitations

BOW considers neither the order of words in the document nor the semantic meaning of the words, so it is the most basic baseline method for representing text data in numerical form. There are other ways to convert textual data into numerical form, which are described next.

Count Vectorizer

In BOW we saw word occurrences represented simply by 1 or 0, without considering the frequency of the words. The count vectorizer instead takes the count of each word appearing in a particular document. We will use the same text documents that we created earlier for tokenization. The drawback of the count vectorizer method is that it does not consider the co-occurrence of words in other documents.
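As a concrete illustration of the count-based approach just described, here is a minimal PySpark sketch (assuming PySpark is installed) that tokenizes the three example documents, removes stop words with StopWordsRemover, and builds count vectors with CountVectorizer. The column names "text", "tokens", "filtered", and "features" are chosen for illustration, not taken from the slides.

```python
# Minimal PySpark sketch: tokenization, stop-word removal, and count vectorization.
# Column names are assumptions made for this illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer

spark = SparkSession.builder.appName("bow-example").getOrCreate()

df = spark.createDataFrame(
    [("The best thing in life is to travel",),
     ("Travel is the best best medicine",),
     ("One should travel more often",)],
    ["text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")          # split into lower-cased tokens
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")  # drop English stop words
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")

tokens = remover.transform(tokenizer.transform(df))
model = vectorizer.fit(tokens)                  # learns the vocabulary
model.transform(tokens).select("filtered", "features").show(truncate=False)
```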
In simple terms, with plain counts the words that appear more often have a larger impact on the feature vector. Hence, another approach to converting text data into numerical form is used: term frequency - inverse document frequency (TF-IDF).

Stemming

Common stemmers include:
Porter
Snowball
Lancaster

Stemming

Different forms of a word often communicate essentially the same meaning. Consider a search engine where one user searches for "shoe" and another searches for "shoes": the presence of both forms can confuse models, so for better accuracy we need to convert the different forms of a word into its raw (base) format. Stemming is converting a word in a text into this raw format. For example, introduction, introduced, and introducing all become introduce after stemming. The purpose is to remove various suffixes and thereby reduce the number of distinct words.

Stemming - Porter

Stemming also helps the model avoid confusion while being trained. Many stemming algorithms are available. Porter stemming removes suffixes from base words or terms in the English dictionary; the whole purpose of the Porter stemmer is to improve the performance of the NLP model-training exercise. Example rules:
SSESS -> SS: converts an SSESS suffix into SS, e.g., prepossess -> preposs
IES -> I: converts an IES suffix into I, e.g., ties -> ti
SS -> SS: a word ending in SS is left unchanged, e.g., success -> success
S -> (removed): a word ending in S has the suffix removed, e.g., pens -> pen

Stemming - Snowball

Snowball is more accurate than the Porter algorithm. An example Snowball rule:
ied or ies -> replaced by i if preceded by more than one letter, otherwise by ie, e.g., cries -> cri, ties -> tie

Stemming - Lancaster

Lancaster is the fastest algorithm here and will greatly reduce your working set of words. Example rule:
ies -> y: converts an ies suffix into y, e.g., cries -> cry

Lemmatization

Lemmatization is a bit different from stemming. Stemming generally removes characters from the end of a word in the expectation of arriving at the correct base word. Lemmatization tries to overcome this limitation of stemming: it tries to find the base form of the word, called the lemma, using the WordNet lexical knowledge dictionary to get the correct base form of a word. E.g., Playing -> Play, Plays -> Play, Played -> Play.

N-grams

An N-gram is a contiguous sequence of N words or tokens in a given sentence or continuous run of text. N is an integer value starting from 1, so an N-gram can be a uni-gram (N=1), a bi-gram (N=2), or a tri-gram (N=3). N-gram algorithms identify all contiguous adjacent sequences of words among the tokens of a given sentence. Example sentence: This is Big Data AI Book.

[Figure: Uni-Gram, Bi-Gram, and Tri-Gram examples for the sentence above; not reproduced in this transcript]
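Since the figure itself is not reproduced, here is a minimal pure-Python sketch of the uni-gram, bi-gram, and tri-gram construction for the example sentence. The helper function name ngrams is an invention for illustration.

```python
# Minimal sketch: build uni-grams, bi-grams, and tri-grams from a tokenized sentence.
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is Big Data AI Book".split()

print(ngrams(tokens, 1))  # uni-grams: ['This', 'is', 'Big', 'Data', 'AI', 'Book']
print(ngrams(tokens, 2))  # bi-grams:  ['This is', 'is Big', 'Big Data', 'Data AI', 'AI Book']
print(ngrams(tokens, 3))  # tri-grams: ['This is Big', 'is Big Data', 'Big Data AI', 'Data AI Book']
```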
Feature extraction

An NLP system does not understand string values; it needs numerical input to build models, sometimes also called numerical features. Feature extraction in NLP is the conversion of a set of text information into a set of numerical features. Any machine learning algorithm you are going to train needs its features in numerical vector form, because it does not understand strings. There are many ways text can be represented as numerical vectors; some of them are BOW, one-hot encoding, TF-IDF, Word2Vec, and CountVectorizer.

TF-IDF

The TF-IDF method of feature extraction uses the product of term frequency (TF) and inverse document frequency (IDF) to calculate the numerical vector of a token or term. TF-IDF not only calculates the importance of a word in a specific document but also measures its importance relative to the other documents of a corpus; moreover, it tries to normalize (down-weight) any word that is overly frequent in the entire corpus. TF, or term frequency, is a term's occurrence count in a document. We can use the HashingTF library in Spark to compute term frequencies; HashingTF creates a sparse vector for each document representing indices and frequencies.

TF-IDF

TF (term frequency) measures the importance of a word in a particular document only, not with respect to the entire corpus of documents. Moreover, words that are overly frequent in a large document may not be that important with respect to the entire corpus. This is where IDF comes into the picture: it represents the inverse of the share of documents in which the regarded term can be found.

TF-IDF: Mathematical formula

The goal of TF-IDF is to find words of higher relevance. The algorithm tracks the local relevance of a word in a document using the TF calculation and the global relevance of a word in the entire training corpus using the IDF calculation; the two calculations are multiplied to obtain the final weight of a word.

Term frequency (TF) formula:
TF(t, d) = n(t, d) / N(d)
where t is the term or word in a document d, n(t, d) is the count of term t in document d, and N(d) is the count of all terms in document d.

Inverse document frequency (IDF) formula:
IDF(t) = Log[(total number of documents in the corpus) / (number of documents containing the word t)]

TF-IDF

The term frequency - inverse document frequency (tf-idf or tfidf) of term t in document d is:
tfidf(t, d) = tf(t, d) * idf(t)

Measure of relevance with tf-idf: for a query (e.g., "brick, phone"), call up all the documents that contain any of the query terms t_1, ..., t_n and sum the tf-idf of each term:
Relevance(d) = sum over i in [1, n] of tfidf(t_i, d)

TF-IDF, example

The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:
Clean data / preprocessing: clean (standardize) the data, normalize it (all lower case), and lemmatize it (reduce all words to root words)
Tokenize words with their frequency
Find TF for the words
Find IDF for the words
Vectorize the vocabulary

Example - calculate TF

Consider an example of 3 documents:
Document 1: It is going to rain today. (6 words, so each word has TF = 1/6 ≈ 0.16)
Document 2: Today I am not going outside. (6 words, TF = 1/6)
Document 3: I am going to watch the season premiere. (8 words, TF = 1/8 ≈ 0.12)

To find TF-IDF we perform the previous steps:

Step 1: Clean the data and tokenize.

Step 2: Find TF for all documents:
TF = (number of repetitions of the word in a document) / (number of words in the document)

Step 3: Find IDF:
IDF = Log[(number of documents) / (number of documents containing the word)]
For example, a word such as 'going' that appears in all three documents gets IDF = Log(3/3) = 0 (in Excel, LN(3/3)).

Step 4: Build the model by stacking all words next to each other.

[Table: TF and IDF values for each word across the 3 documents; IDF values shown include 0.17 and 0.47]
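The same steps can be written out as a minimal Python sketch for the three example documents. The choice of log base 10 here is an assumption, since the slides give the IDF formula without fixing the base.

```python
# Minimal sketch: compute TF, IDF, and TF-IDF by hand for the three example documents.
# Log base 10 is an assumption; the slides leave the base unspecified.
import math
from collections import Counter

docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.split() for d in docs]

# TF(t, d) = count of t in d / total number of words in d
tf = [{w: c / len(tokens) for w, c in Counter(tokens).items()} for tokens in tokenized]

# IDF(t) = log(number of documents / number of documents containing t)
vocab = {w for tokens in tokenized for w in tokens}
idf = {w: math.log10(len(docs) / sum(w in tokens for tokens in tokenized)) for w in vocab}

# TF-IDF(t, d) = TF(t, d) * IDF(t)
for i, doc_tf in enumerate(tf, start=1):
    weights = {w: round(doc_tf[w] * idf[w], 3) for w in doc_tf}
    print(f"Doc {i}:", weights)
```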
Step 5: Compare results and use the table to ask questions.

The final weight of each word in each document is the product of its TF and IDF values, e.g., 0.17 * 0.16 ≈ 0.03 and 0.17 * 0.12 ≈ 0.02. Remember, the final equation is TF-IDF = TF * IDF.

[Table: resulting TF-IDF weights per word and document; values such as 0.03 and 0.02 appear]

Example, continued - Analysis and outcomes

Using this table, you can easily see that words like 'it', 'is', and 'rain' are important for document 1 but not for documents 2 and 3, which means document 1 differs from documents 2 and 3 with respect to talking about rain. Documents 1 and 2 talk about something happening 'today', and documents 2 and 3 discuss something about the writer because of the word 'I'. This table helps you find similarities and dissimilarities between documents, words, and more, much better than Bag of Words.

Applying NLP techniques

Generally, for any class of NLP problem, you first apply text preprocessing and feature extraction techniques. Once you have reduced the noise in the text and are able to extract features from it, you apply various machine learning algorithms to solve the different classes of NLP problems. We will cover one such problem, called text classification.

Applying NLP techniques: Text classification

Text classification is one of the most common use cases of NLP, e.g., email spam detection, identifying retail product hierarchies, and sentiment analysis. The process is typically a classification problem. Within each data group, multiple topics may be discussed, so it is important to classify the article or textual information into logical groups; text classification techniques help us do that.

Applying NLP techniques: Text classification

Text classification requires a lot of computing power if the data volume is huge, so it is recommended to use a distributed computing framework for text classification. As an example, if we want to classify the legal documents that exist in a knowledge repository on the internet, we can use text classification techniques for the logical separation of the various types of documents.

Thank you
Dr. Feras Al-Obeidat

References

Artificial Intelligence for Big Data - Anand Deshpande and Manish Kumar
Machine Learning with PySpark - Pramod Singh
