COURSE CODE 20AIPC503 - Natural Language Processing
UNIT No. 1: INTRODUCTION

Overview and advantages of NLP - NLP Libraries - Language Modelling: Unigram Language Model - Bigram - Trigram - N-gram - Advanced smoothing for language modelling - Empirical Comparison of Smoothing Techniques - Applications of Language Modelling.

OVERVIEW OF NLP

NLP (Natural Language Processing) is a subfield of Artificial Intelligence (AI). It is a widely used technology behind personal assistants in many business fields and areas. The technology takes the speech provided by the user, breaks it down for proper understanding, and processes it accordingly. It is a recent and effective approach, which is why it is in very high demand in today's market. Natural Language Processing is an upcoming field where many transitions, such as compatibility with smart devices and interactive conversation with humans, have already been made possible.

Knowledge representation, logical reasoning, and constraint satisfaction were the emphasis of early AI applications in NLP; these techniques were first applied to semantics and later to grammar. In the last decade, a significant change in NLP research has resulted in the widespread use of statistical approaches such as machine learning and data mining on a massive scale.

The need for automation is never-ending, given the amount of work required to be done these days. NLP is a very favourable option when it comes to automated applications, and its applications have made it one of the most sought-after ways of implementing machine learning.

Natural Language Processing (NLP) is a field that combines computer science, linguistics, and machine learning to study how computers and humans communicate in natural language. The goal of NLP is for computers to be able to interpret and generate human language. This not only improves the efficiency of work done by humans but also helps in interacting with the machine. NLP bridges the gap of interaction between humans and electronic devices.

Working

The field is divided into three different parts:
1. Speech Recognition - the translation of spoken language into text.
2. Natural Language Understanding (NLU) - the computer's ability to understand what we say.
3. Natural Language Generation (NLG) - the generation of natural language by a computer.

NLU and NLG are the key aspects depicting the working of NLP devices. These two aspects are very different from each other and are achieved using different methods.

Speech Recognition:
First, the computer must take natural language and convert it into artificial language. This is what speech recognition, or speech-to-text, does, and it is the first step of NLU. Hidden Markov Models (HMMs) are used in the majority of voice recognition systems nowadays. These are statistical models that use mathematical calculations to determine what you said in order to convert your speech to text.
HMMs do this by listening to you talk, breaking the speech down into small units (typically 10-20 milliseconds), and comparing each unit to pre-recorded speech to figure out which phoneme you uttered (a phoneme is the smallest unit of speech). The program then examines the sequence of phonemes and uses statistical analysis to determine the most likely words and sentences you were speaking.

Natural Language Understanding (NLU):
The next, and hardest, step of NLP is the understanding part. First, the computer must comprehend the meaning of each word. It tries to figure out whether the word is a noun or a verb, whether it is in the past or present tense, and so on. This is called Part-of-Speech (POS) tagging. A lexicon (a vocabulary) and a set of grammatical rules are also built into NLP systems. The most difficult part of NLP is understanding: the machine should be able to grasp what you said by the conclusion of the process. There are several challenges in accomplishing this, such as words having several meanings (polysemy) or different words having similar meanings (synonymy), but developers encode rules into their NLU systems and train them to learn to apply the rules correctly.

Natural Language Generation (NLG):
NLG is much simpler to accomplish. NLG converts a computer's artificial language into text and can also convert that text into audible speech using text-to-speech technology. First, the NLP system identifies what data should be converted to text. If you asked the computer a question about the weather, it most likely did an online search to find your answer, and from there it decides that the temperature, wind, and humidity are the factors that should be read aloud to you. Then, it organizes the structure of how it is going to say it; this is similar to NLU, except in reverse. An NLG system can construct full sentences using a lexicon and a set of grammar rules. Finally, text-to-speech takes over. The text-to-speech engine uses a prosody model to evaluate the text and identify breaks, duration, and pitch. The engine then combines all the recorded phonemes into one cohesive string of speech using a speech database.

Technologies related to Natural Language Processing

Machine Translation: NLP is used for language translation from one language to another through a computer.
Chatterbots: NLP is used for chatterbots that communicate with other chatbots or humans through auditory or textual methods.
AI Software: NLP is used in question-answering software for knowledge representation, analytical reasoning, and information retrieval.

Applications of Natural Language Processing (NLP):

Spam Filters: One of the most irritating things about email is spam. Gmail uses natural language processing (NLP) to discern which emails are legitimate and which are spam. These spam filters look at the text in all the emails you receive and try to figure out what it means, to decide whether or not it is spam.

Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions. Using NLP, this technology examines news headlines about companies and stocks and attempts to comprehend their meaning in order to determine whether you should buy, sell, or hold certain stocks.

Answering Questions: NLP can be seen in action in Google Search or Siri. A major use of NLP is to make search engines understand the meaning of what we are asking and to generate natural language in return to give us the answers.
Summarizing Information: On the internet there is a lot of information, and much of it comes in the form of long documents or articles. NLP is used to decipher the meaning of the data and then provide shorter summaries of it so that humans can comprehend it more quickly.

Future Scope:

Bots: Chatbots assist clients in getting to the point quickly by answering inquiries and referring them to relevant resources and products at any time of day or night. To be effective, chatbots must be fast, smart, and easy to use. To accomplish this, chatbots employ NLP to understand language, usually over text or voice-recognition interactions.

Supporting Invisible UI: Almost every connection we have with machines involves human communication, both spoken and written. Amazon's Echo is only one illustration of the trend toward putting humans in closer contact with technology in the future. The concept of an invisible or zero user interface will rely on direct communication between the user and the machine, whether by voice, text, or a combination of the two. NLP helps to make this concept a real-world thing.

Smarter Search: NLP's future also includes improved search, something we have been discussing at Expert System for a long time. Smarter search, which allows a chatbot to understand a customer's request, can enable "search like you talk" functionality (much like you could query Siri) rather than focusing on keywords or topics. Google recently announced that NLP capabilities have been added to Google Drive, allowing users to search for documents and content using natural language.

Future Enhancements: Companies like Google are experimenting with Deep Neural Networks (DNNs) to push the limits of NLP and make it possible for human-to-machine interactions to feel just like human-to-human interactions. Basic words can be further subdivided into proper semantics and used in NLP algorithms. NLP algorithms can also be extended to languages that are currently not well supported, such as regional languages or languages spoken in rural areas. Translation of a sentence in one language to the same sentence in another language can be handled at a broader scope.

Natural language processing (NLP) uses artificial intelligence and machine learning to extract meaning from human language while it is spoken. Depending on the natural language programming, the presentation of that meaning could be through pure text, a text-to-speech reading, or a graphical representation or chart. Understanding human language, including its intricacies, can be difficult even for people, let alone algorithms. The challenge is getting the algorithms to understand the words and their underlying meaning. Machine learning is beneficial when you consider the sheer number of variables that need to be accounted for in order for a natural language processing application to be effective.

Advantage of Natural Language Processing

Natural language processing (NLP) is a cutting-edge development for several reasons. Before NLP, businesses were using AI and machine learning for essential insights, but NLP provides the tools to enhance data and analyze both linguistic and statistical data. NLP offers several benefits for companies across different industries.
Enable non-subject-matter experts to find answers to their questions
Analyze data from both structured and unstructured sources
Identify the root causes of your business problems
Discover your most profitable customers and understand the reasons behind it
Identify and address fraudulent claims and behavior
Understand several languages, dialects, slang, and jargon
Identify patterns in customer communication and reduce customer complaints
Analyze and evaluate your competitors' product offerings

Use Cases of Natural Language Processing

Natural language processing is just starting to impact business operations across different industries. Here are some of the top use cases of NLP in various sectors.

Banking and Finance
Banking and financial institutions can use sentiment analysis to analyze market data and use that insight to reduce risks and make better decisions. NLP can also help these institutions identify illegal activities like money laundering and other fraudulent behavior.

Insurance
Insurance companies can use NLP to identify and reject fraudulent claims. Insurers can use machine learning and artificial intelligence to analyze customer communication, identify indicators of fraud, and flag these claims for deeper analysis. Insurance companies can also use these features for competitor research.

Manufacturing
Manufacturers can use NLP to analyze shipment-related information to streamline processes and increase automation. They can quickly identify the areas that need improvement and make changes to drive efficiencies. NLP can also scrape the web for pricing information on different raw materials and labour to optimize costs.

Retail
Retailers can use NLP to analyze customer sentiment about their products and make more informed decisions across their processes, from product design and inventory management to sales and marketing initiatives. NLP analyzes all available customer data and transforms it into actionable insights that can improve the customer experience.

Healthcare
NLP can analyze patient communication from emails, chat applications, and patient helplines and help medical professionals prioritize patients based on their needs, improving patient diagnosis and treatment and driving better outcomes.

NLP Libraries

The fundamental aim of NLP libraries is to simplify text pre-processing. A good NLP library should be able to correctly convert free-text sentences into structured features (for example, cost per hour) that can easily be fed into ML or DL pipelines. An NLP library should also have a simple-to-learn API and be able to implement the latest and greatest algorithms and models efficiently.

1. Natural Language Toolkit (NLTK)
NLTK is one of the leading platforms for building Python programs that can work with human language data. It presents a practical introduction to programming for language processing. NLTK comes with a host of text-processing libraries for sentence detection, tokenization, lemmatization, stemming, parsing, chunking, and POS tagging. NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources. The tool has the essential functionalities required for almost all kinds of natural language processing tasks with Python.
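As a quick, minimal sketch of these basic NLTK facilities (the sample text and the punkt / averaged_perceptron_tagger resource downloads below are illustrative choices, not part of the unit), sentence splitting, word tokenization and POS tagging might look like this:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP bridges the gap between humans and machines. It has many applications."
sentences = sent_tokenize(text)        # split the text into sentences
tokens = word_tokenize(sentences[0])   # split the first sentence into word tokens
print(tokens)
print(nltk.pos_tag(tokens))            # part-of-speech tag for each token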
2. Gensim
Gensim is a Python library designed specifically for "topic modeling, document indexing, and similarity retrieval with large corpora." All algorithms in Gensim are memory-independent with respect to the corpus size, and hence it can process input larger than RAM. With intuitive interfaces, Gensim allows for efficient multicore implementations of popular algorithms, including online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), and word2vec deep learning. Gensim features extensive documentation and Jupyter Notebook tutorials. It largely depends on NumPy and SciPy for scientific computing, so you must install these two Python packages before installing Gensim.

3. CoreNLP
Stanford CoreNLP comprises an assortment of human language technology tools. It aims to make the application of linguistic analysis tools to a piece of text easy and efficient. With CoreNLP, you can extract all kinds of text properties (like named-entity recognition, part-of-speech tagging, etc.) in only a few lines of code. Since CoreNLP is written in Java, it requires Java to be installed on your device. However, it does offer programming interfaces for many popular programming languages, including Python. The tool incorporates numerous Stanford NLP tools, such as the parser, sentiment analysis, bootstrapped pattern learning, the part-of-speech (POS) tagger, the named entity recognizer (NER), and the coreference resolution system, to name a few. Furthermore, CoreNLP supports five languages apart from English: Arabic, Chinese, French, German, and Spanish.

4. spaCy
spaCy is an open-source NLP library in Python. It is designed explicitly for production usage: it lets you develop applications that process and understand huge volumes of text. spaCy can preprocess text for deep learning and can be used to build natural language understanding systems or information extraction systems. spaCy is equipped with pre-trained statistical models and word vectors, and it supports tokenization for over 49 languages. spaCy boasts state-of-the-art speed, parsing, named entity recognition, convolutional neural network models for tagging, and deep learning integration.

5. TextBlob
TextBlob is a Python (2 & 3) library designed for processing textual data. It focuses on providing access to common text-processing operations through familiar interfaces. TextBlob objects can be treated as Python strings that are trained in Natural Language Processing. TextBlob offers a neat API for performing common NLP tasks like part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, language translation, word inflection, parsing, n-grams, and WordNet integration.

6. Pattern
Pattern is a text processing, web mining, natural language processing, machine learning, and network analysis tool for Python. It comes with a host of tools for data mining (Google, Twitter, and Wikipedia APIs, a web crawler, and an HTML DOM parser), NLP (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), ML (vector space model, clustering, SVM), and network analysis by graph centrality and visualization. Pattern can be a powerful tool for both a scientific and a non-scientific audience. It has a simple and straightforward syntax: the function names and parameters are chosen so that the commands are self-explanatory. While Pattern is a highly valuable learning environment for students, it also serves as a rapid development framework for web developers.

7. PyNLPl
Pronounced as 'pineapple,' PyNLPl is a Python library for Natural Language Processing.
It contains a collection of custom-made Python modules for Natural Language Processing tasks. One of the most notable features of PyNLPl is its extensive library for working with FoLiA XML (Format for Linguistic Annotation). PyNLPl is segregated into different modules and packages, each useful for both standard and advanced NLP tasks. While you can use PyNLPl for basic NLP tasks like extraction of n-grams and frequency lists, and to build a simple language model, it also has more complex data types and algorithms for advanced NLP tasks.

Language Modelling

Language models form the backbone of Natural Language Processing. They are a way of transforming qualitative information about text into quantitative information that machines can understand. They have applications in a wide range of industries like tech, finance, healthcare, the military, etc. All of us encounter language models daily, be it the predictive text input on our mobile phones or a simple Google search. Hence, language models form an integral part of any natural language processing application.

N-gram. N-grams are a relatively simple approach to language models. They create a probability distribution for a sequence of n words. The n can be any number, and it defines the size of the "gram", or sequence of words being assigned a probability. For example, if n = 5, a gram might look like this: "can you please call me". The model then assigns probabilities to sequences of size n. Basically, n can be thought of as the amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams, trigrams and so on.

Unigram. The unigram is the simplest type of language model. It does not look at any conditioning context in its calculations; it evaluates each word or term independently. Unigram models commonly handle language processing tasks such as information retrieval. The unigram is the foundation of a more specific model variant called the query likelihood model, which uses information retrieval to examine a pool of documents and match the most relevant one to a specific query.

Unigram language model

In natural language processing, an n-gram is a sequence of n words. For example, "statistics" is a unigram (n = 1), "machine learning" is a bigram (n = 2), and "natural language processing" is a trigram (n = 3). For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part, we will focus only on language models based on unigrams, i.e. single words.

Training the model

A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence "I have a dream", our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence. For simplicity, all words are lower-cased in the language model, and punctuation is ignored. The [END] token marks the end of the sentence and will be explained shortly.

The unigram language model makes the following assumptions:
1. The probability of each word is independent of any words before it.
2. Instead, it only depends on the fraction of time this word appears among all the words in the training text.

In other words, training the model is nothing but calculating these fractions for all unigrams in the training text.
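As a minimal sketch of this counting step (the toy training text, the substitution of '.' by [END], and the variable names below are illustrative):

from collections import Counter

training_text = "i have a dream . i have a plan ."          # toy lower-cased corpus
tokens = training_text.replace(".", "[END]").split()        # mark sentence ends with [END]

counts = Counter(tokens)                                     # raw unigram counts
total = sum(counts.values())                                 # total number of unigrams
unigram_prob = {w: c / total for w, c in counts.items()}     # count fraction for each unigram

print(unigram_prob["dream"])   # count('dream') / total number of words = 1/10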
[Figure: estimated probability of the unigram 'dream' from the training text]

Evaluating the model

After estimating all unigram probabilities, we can apply these estimates to calculate the probability of each sentence in the evaluation text: each sentence probability is the product of its word probabilities. We can go further than this and estimate the probability of the entire evaluation text, such as dev1 or dev2. Under the naive assumption that each sentence in the text is independent of the other sentences, we can decompose this probability as the product of the sentence probabilities, which in turn are nothing but products of word probabilities.

The role of ending symbols

As outlined above, our language model not only assigns probabilities to words, but also probabilities to all sentences in a text. As a result, to ensure that the probabilities of all possible sentences sum to 1, we need to add the symbol [END] to the end of each sentence and estimate its probability as if it were a real word. This is a rather esoteric detail, and you can read more about its rationale here (page 4).

Evaluation metric: average log likelihood

When we take the log on both sides of the above equation for the probability of the evaluation text, the log probability of the text (also called log likelihood) becomes the sum of the log probabilities of each word. Lastly, we divide this log likelihood by the number of words in the evaluation text so that our metric does not depend on the length of the text. For n-gram models, log base 2 is often used due to its link to information theory (see here, page 21). As a result, we end up with the metric of average log likelihood, which is simply the average of the trained log probabilities of each word in our evaluation text. In other words, the better our language model is, the higher the probability it assigns, on average, to each word in the evaluation text. Other common evaluation metrics for language models include cross-entropy and perplexity. However, they still refer to basically the same thing: cross-entropy is the negative of average log likelihood, while perplexity is the exponential of cross-entropy.

Dealing with unknown unigrams

There is a big problem with the above unigram model: for a unigram that appears in the evaluation text but not in the training text, its count in the training text, and hence its probability, will be zero. This will completely implode our unigram model: the log of this zero probability is negative infinity, leading to a negative infinity average log likelihood for the entire model!

Laplace smoothing

To combat this problem, we use a simple technique called Laplace smoothing:
1. We add an artificial unigram called [UNK] to the list of unique unigrams in the training text. This represents all the unknown tokens that the model might encounter during evaluation. Of course, the count for this unigram will be zero in the training set, and the unigram vocabulary size (the number of unique unigrams) will increase by 1 after this new unigram is added.
2. Next, we add a pseudo-count of k to all the unigrams in our vocabulary. The most common value of k is 1, and this goes by the intuitive name of "add-one smoothing".

As a result, for each unigram, the numerator of the probability formula will be the raw count of the unigram plus k, the pseudo-count from Laplace smoothing. The denominator will be the total number of words in the training text plus the unigram vocabulary size times k. This is because each unigram in our vocabulary has k added to its count, which adds a total of (k × vocabulary size) to the total number of unigrams in the training text.
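In the spirit of the earlier toy sketch (the corpus, the choice k = 1, and the names below are again illustrative), the add-k estimate could be written as:

from collections import Counter

tokens = "i have a dream [END] i have a plan [END]".split()   # toy training tokens
counts = Counter(tokens)
total = sum(counts.values())                                   # 10 words in the training text

k = 1                                                          # add-one smoothing
vocab = set(counts) | {"[UNK]"}                                # add the artificial [UNK] unigram
V = len(vocab)                                                 # vocabulary size after adding [UNK] (7 here)

def smoothed_prob(word):
    w = word if word in counts else "[UNK]"                    # unknown words map to [UNK]
    return (counts[w] + k) / (total + k * V)                   # (raw count + k) / (total + k * V)

print(smoothed_prob("dream"))      # (1 + 1) / (10 + 1 * 7)
print(smoothed_prob("nightmare"))  # unseen word: (0 + 1) / (10 + 1 * 7)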
Effect of Laplace smoothing

Because of the additional pseudo-count k given to each unigram, whenever the unigram model encounters an unknown word in the evaluation text, it converts that word to the unigram [UNK]. The latter unigram has a count of zero in the training text but, thanks to the pseudo-count k, now has a non-zero probability. Furthermore, Laplace smoothing also shifts some probability mass from the common tokens to the rare tokens. Imagine two unigrams having counts of 2 and 1, which become 3 and 2 respectively after add-one smoothing. The more common unigram previously had double the probability of the less common unigram, but now has only 1.5 times the probability of the other one.

[Figure: percentages of the top 10 most and least common words, before and after add-one smoothing; the left and right bar charts are not plotted at the same scale]

This can be seen from the estimated probabilities of the 10 most common unigrams and the 10 least common unigrams in the training text: after add-one smoothing, the former lose some of their probability, while the probabilities of the latter increase significantly relative to their original values. In short, this evens out the probability distribution of unigrams, hence the term "smoothing" in the method's name.

Bidirectional. Unlike n-gram models, which analyze text in one direction (backwards), bidirectional models analyze text in both directions, backwards and forwards. These models can predict any word in a sentence or body of text by using every other word in the text. Examining text bidirectionally increases result accuracy. This type is often utilized in machine learning and speech generation applications. For example, Google uses a bidirectional model to process search queries.

Exponential. Also known as maximum entropy models, this type is more complex than n-grams. Simply put, the model evaluates text using an equation that combines feature functions and n-grams. Basically, this type specifies features and parameters of the desired results and, unlike n-grams, leaves analysis parameters more ambiguous: it does not specify individual gram sizes, for example. The model is based on the principle of entropy, which states that the probability distribution with the most entropy is the best choice. In other words, the model with the most chaos, and the least room for assumptions, is the most accurate. Exponential models are designed to maximize cross-entropy, which minimizes the number of statistical assumptions that can be made. This enables users to better trust the results they get from these models.

Continuous space. This type of model represents words as a non-linear combination of weights in a neural network. The process of assigning a weight to a word is also known as word embedding. This type becomes especially useful as data sets get increasingly large, because larger datasets often include more unique words. The presence of a lot of unique or rarely used words can cause problems for a linear model like an n-gram, because the number of possible word sequences increases and the patterns that inform results become weaker. By weighting words in a non-linear, distributed way, this model can "learn" to approximate words and therefore not be misled by unknown values. Its "understanding" of a given word is not as tightly tethered to the immediate surrounding words as it is in n-gram models.
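To make the idea of a continuous-space representation concrete, here is a minimal, hedged sketch using the Gensim library mentioned earlier (the toy sentences and training parameters are illustrative, and the parameter names assume Gensim 4.x):

from gensim.models import Word2Vec

# tiny illustrative corpus: each sentence is a list of tokens
sentences = [["i", "love", "dogs"],
             ["i", "love", "cats"],
             ["dogs", "are", "loyal"],
             ["cats", "are", "independent"]]

# learn a small embedding vector for each word
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=50)

print(model.wv["dogs"])                      # the learned 10-dimensional vector for 'dogs'
print(model.wv.similarity("dogs", "cats"))   # cosine similarity between two word vectors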
In this article, we will be learning how to build unigram, bigram and trigram language models on a raw text corpus and perform next-word prediction using them.

Reading the Raw Text Corpus

We will begin by reading the text corpus, which is an excerpt from Oliver Twist. You can download the text file from here. Once it is downloaded, read the text file and find the total number of characters in it.

file = open("rawCorpus.txt", "r")
rawReadCorpus = file.read()
print("Total no. of characters in read dataset: {}".format(len(rawReadCorpus)))

We need to import the nltk library to perform some basic text processing tasks, which we do with the help of the following code:

import nltk
nltk.download()
from nltk.tokenize import word_tokenize, sent_tokenize

Preprocessing the Raw Text

Firstly, we need to remove all newlines and special characters from the text corpus. We do that with the following code:

import string

string.punctuation = string.punctuation + '“' + '”' + '-' + '’' + '‘' + '—'
string.punctuation = string.punctuation.replace('.', '')

file = open('rawCorpus.txt').read()

# preprocess data to remove newlines and special characters
file_new = ""
for line in file:
    line_new = line.replace("\n", " ")
    file_new += line_new

preprocessedCorpus = "".join([char for char in file_new if char not in string.punctuation])

After removing newlines and special characters, we can break up the corpus to obtain the words and the sentences using sent_tokenize and word_tokenize from nltk.tokenize. Let us print the first 5 sentences and the first 5 words obtained from the corpus:

sentences = sent_tokenize(preprocessedCorpus)
print("1st 5 sentences of preprocessed corpus are : ")
print(sentences[0:5])

words = word_tokenize(preprocessedCorpus)
print("1st 5 words/tokens of preprocessed corpus are : ")
print(words[0:5])

The output looks something like this :

We also need to remove stopwords from the corpus. Stopwords are commonly used words like 'and', 'the', 'at' which do not add any special meaning or significance to a sentence. A list of stopwords is available with nltk, and they can be removed from the corpus using the following code:

nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in words if not w.lower() in stop_words]

Creating Unigram, Bigram and Trigram Language Models

We can create n-grams using the ngrams module from nltk.util. N-grams are sequences of n consecutive words occurring in the corpus. For example, in the sentence "I love dogs", 'I', 'love' and 'dogs' are unigrams, while 'I love' and 'love dogs' are bigrams. 'I love dogs' is itself a trigram, i.e. a contiguous sequence of three words.
We obtain unigrams, bigrams and trigrams from the corpus using the following code:

from collections import Counter
from nltk.util import ngrams

unigrams = []
bigrams = []
trigrams = []

for content in sentences:
    content = content.lower()
    content = word_tokenize(content)
    content = [word for word in content if word != '.']   # drop sentence-final full stops
    for word in content:
        unigrams.append(word)
    bigrams.extend(ngrams(content, 2))
    trigrams.extend(ngrams(content, 3))

print("Sample of n-grams:\n" + "-------------------------")
print("--> UNIGRAMS: \n" + str(unigrams[:5]) + " ...\n")
print("--> BIGRAMS: \n" + str(bigrams[:5]) + " ...\n")
print("--> TRIGRAMS: \n" + str(trigrams[:5]) + " ...\n")

The output looks like this :

Next, we obtain those unigrams, bigrams and trigrams from the corpus which do not have stopwords like articles, prepositions or determiners in them. For example, we remove bigrams like 'in the' and unigrams like 'the', 'a', etc. We use the following code for the removal of stopwords from n-grams.

def stopwords_removal(n, a):
    b = []
    if n == 1:
        for word in a:
            count = 0
            if word in stop_words:
                count = 0
            else:
                count = 1
            if count == 1:
                b.append(word)
        return b
    else:
        for pair in a:
            count = 0
            for word in pair:
                if word in stop_words:
                    count = count or 0
                else:
                    count = count or 1
            if count == 1:
                b.append(pair)
        return b

unigrams_Processed = stopwords_removal(1, unigrams)
bigrams_Processed = stopwords_removal(2, bigrams)
trigrams_Processed = stopwords_removal(3, trigrams)

print("Sample of n-grams after processing:\n" + "-------------------------")
print("--> UNIGRAMS: \n" + str(unigrams_Processed[:5]) + " ...\n")
print("--> BIGRAMS: \n" + str(bigrams_Processed[:5]) + " ...\n")
print("--> TRIGRAMS: \n" + str(trigrams_Processed[:5]) + " ...\n")

The unigrams, bigrams and trigrams obtained in this way look like :

We can obtain the count or frequency of each n-gram appearing in the corpus. This will be useful later when we need to calculate the probabilities of the next possible word based on previous n-grams. We write a function get_ngrams_freqDist which returns the frequency corresponding to each n-gram sent to it. We obtain the frequencies of all unigrams, bigrams and trigrams in this way.

def get_ngrams_freqDist(n, ngramList):
    ngram_freq_dict = {}
    for ngram in ngramList:
        if ngram in ngram_freq_dict:
            ngram_freq_dict[ngram] += 1
        else:
            ngram_freq_dict[ngram] = 1
    return ngram_freq_dict

unigrams_freqDist = get_ngrams_freqDist(1, unigrams)
unigrams_Processed_freqDist = get_ngrams_freqDist(1, unigrams_Processed)
bigrams_freqDist = get_ngrams_freqDist(2, bigrams)
bigrams_Processed_freqDist = get_ngrams_freqDist(2, bigrams_Processed)
trigrams_freqDist = get_ngrams_freqDist(3, trigrams)
trigrams_Processed_freqDist = get_ngrams_freqDist(3, trigrams_Processed)
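As a quick, optional sanity check (purely illustrative, not part of the original walkthrough), one could inspect the most frequent processed bigrams with something like:

# print the five most frequent bigrams after stopword removal
top_bigrams = sorted(bigrams_Processed_freqDist.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top_bigrams)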
Predicting Next Three Words using Bigram and Trigram Models

The chain rule is used to compute the probability of a sentence in a language model. Let w1 w2 ... wn be a sentence, where w1, w2, ..., wn are the individual words. Then the probability of the sentence occurring is given by the following formula:

P(w1 w2 ... wn) = P(w1) * P(w2 | w1) * P(w3 | w1 w2) * ... * P(wn | w1 w2 ... w(n-1))

For example, the probability of the sentence "I love dogs" is given by:

P(I love dogs) = P(I) * P(love | I) * P(dogs | I love)

The individual probabilities can be obtained in the following way:

P(I) = Count('I') / Total no. of words
P(love | I) = Count('I love') / Count('I')
P(dogs | I love) = Count('I love dogs') / Count('I love')

Note that Count('I'), Count('I love') and Count('I love dogs') are the frequencies of the respective unigram, bigram and trigram, which we computed earlier using the get_ngrams_freqDist function.

Now, when we use a bigram model to compute the probabilities, the probability of each new word depends only on its previous word. That is, for the previous example, the probability of the sentence becomes:

P(I love dogs) = P(I) * P(love | I) * P(dogs | love)

Similarly, for a trigram model, the probability is given by:

P(I love dogs) = P(I) * P(love | I) * P(dogs | I love)

since the probability of each new word depends on the previous two words.

[Figure: trigram modelling, where each word is predicted from the previous two words]

However, there is a catch involved in this kind of modelling. Suppose there is some bigram that does not appear in the training set but appears in the test set. Then we will assign a probability of 0 to that bigram, making the overall probability of the test sentence 0, which is undesirable. Smoothing is done to overcome this problem: parameters are smoothed (or regularized) to reassign some probability mass to unseen events. One way of smoothing is add-one or Laplace smoothing, which we will be using in this article. Add-one smoothing is performed by adding 1 to all bigram counts and V (the number of unique words in the corpus) to all unigram counts.

Now that we have understood what smoothed bigram and trigram models are, let us write the code to compute them. We will be using the unprocessed bigrams and trigrams (i.e. the versions from which articles and determiners have not been removed) for prediction.

smoothed_bigrams_probDist = {}
V = len(unigrams_freqDist)
for i in bigrams_freqDist:
    # divide by the count of the first word of the bigram, plus V
    smoothed_bigrams_probDist[i] = (bigrams_freqDist[i] + 1) / (unigrams_freqDist[i[0]] + V)

smoothed_trigrams_probDist = {}
for i in trigrams_freqDist:
    # divide by the count of the first two words of the trigram, plus V
    smoothed_trigrams_probDist[i] = (trigrams_freqDist[i] + 1) / (bigrams_freqDist[i[0:2]] + V)

Next, we try to predict the next three words of three test sentences using the computed smoothed bigram and trigram language models.

testSent1 = "There was a sudden jerk, a terrific convulsion of the limbs; and there he"
testSent2 = "They made room for the stranger, but he sat down"
testSent3 = "The hungry and destitute situation of the infant orphan was duly reported by"

First, we tokenize the test sentences into their component words and obtain the last unigrams and bigrams appearing in them.

token_1 = word_tokenize(testSent1)
token_2 = word_tokenize(testSent2)
token_3 = word_tokenize(testSent3)

ngram_1 = {1: [], 2: []}
ngram_2 = {1: [], 2: []}
ngram_3 = {1: [], 2: []}

for i in range(2):
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]
    ngram_3[i+1] = list(ngrams(token_3, i+1))[-1]

print("Sentence 1: ", ngram_1, "\nSentence 2: ", ngram_2, "\nSentence 3: ", ngram_3)

Next, we write functions to predict the next word and the next 3 words respectively of the three test sentences using the smoothed bigram model.
def predict_next_word(last_word, probDist):
    next_word = {}
    for k in probDist:
        if k[0] == last_word:              # bigrams that start with the given word
            next_word[k] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1)
    return high[0][0][1]                   # second word of the most probable bigram

def predict_next_3_words(token, probDist):
    pred1 = []
    pred2 = []
    next_word = {}
    for i in probDist:
        if i[0] == token[0]:               # token is the last unigram of the test sentence
            next_word[i] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2)                # two most probable continuations
    w1a = high[0][0][1]
    w1b = high[1][0][1]
    w2a = predict_next_word(w1a, probDist)
    w3a = predict_next_word(w2a, probDist)
    w2b = predict_next_word(w1b, probDist)
    w3b = predict_next_word(w2b, probDist)
    pred1.append(w1a)
    pred1.append(w2a)
    pred1.append(w3a)
    pred2.append(w1b)
    pred2.append(w2b)
    pred2.append(w3b)
    return pred1, pred2

print("Predicting next 3 possible word sequences with smoothed bigram model : ")
pred1, pred2 = predict_next_3_words(ngram_1[1], smoothed_bigrams_probDist)
print("1a) " + testSent1 + " " + '\033[1m' + pred1[0] + " " + pred1[1] + " " + pred1[2] + '\033[0m')
print("1b) " + testSent1 + " " + '\033[1m' + pred2[0] + " " + pred2[1] + " " + pred2[2] + '\033[0m')
pred1, pred2 = predict_next_3_words(ngram_2[1], smoothed_bigrams_probDist)
print("2a) " + testSent2 + " " + '\033[1m' + pred1[0] + " " + pred1[1] + " " + pred1[2] + '\033[0m')
print("2b) " + testSent2 + " " + '\033[1m' + pred2[0] + " " + pred2[1] + " " + pred2[2] + '\033[0m')
pred1, pred2 = predict_next_3_words(ngram_3[1], smoothed_bigrams_probDist)
print("3a) " + testSent3 + " " + '\033[1m' + pred1[0] + " " + pred1[1] + " " + pred1[2] + '\033[0m')
print("3b) " + testSent3 + " " + '\033[1m' + pred2[0] + " " + pred2[1] + " " + pred2[2] + '\033[0m')

The predictions from the smoothed bigram model are :

We obtain predictions from the smoothed trigram model similarly.

def predict_next_word(last_bigram, probDist):
    next_word = {}
    for k in probDist:
        if k[0:2] == last_bigram:          # trigrams that start with the given bigram
            next_word[k] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1)
    return high[0][0][2]                   # third word of the most probable trigram

def predict_next_3_words(token, probDist):
    pred = []
    next_word = {}
    for i in probDist:
        if i[0:2] == token:                # token is the last bigram of the test sentence
            next_word[i] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2)
    w1a = high[0][0][2]
    tup = (token[1], w1a)
    w2a = predict_next_word(tup, probDist)
    tup = (w1a, w2a)
    w3a = predict_next_word(tup, probDist)
    pred.append(w1a)
    pred.append(w2a)
    pred.append(w3a)
    return pred

print("Predicting next 3 possible word sequences with smoothed trigram model : ")
pred = predict_next_3_words(ngram_1[2], smoothed_trigrams_probDist)
print("1) " + testSent1 + " " + '\033[1m' + pred[0] + " " + pred[1] + " " + pred[2] + '\033[0m')
pred = predict_next_3_words(ngram_2[2], smoothed_trigrams_probDist)
print("2) " + testSent2 + " " + '\033[1m' + pred[0] + " " + pred[1] + " " + pred[2] + '\033[0m')
pred = predict_next_3_words(ngram_3[2], smoothed_trigrams_probDist)
print("3) " + testSent3 + " " + '\033[1m' + pred[0] + " " + pred[1] + " " + pred[2] + '\033[0m')

The output looks like this :

NGRAM Model

What is an n-gram Model?

In natural language processing, an n-gram is a contiguous sequence of n items generated from a given sample of text, where the items can be characters or words and n can be any number like 1, 2, 3, etc.

For example, consider the line "Either my way or no way". Using the n-gram model we can generate all possible contiguous combinations of length n for the words in the sentence. When n = 1, the n-gram model results in one word in each tuple. When n = 2, it generates 5 combinations of sequences of length 2, and so on. Similarly, for a given word we can generate an n-gram model to create sequential combinations of length n for the characters in the word.
For example, from the sequence of characters "Afham", a 3-gram model will generate "Afh", "fha", "ham", and so on. Due to their frequent use, the n-gram models for n = 1, 2, 3 have the specific names Unigram, Bigram, and Trigram respectively.

Use of n-grams in NLP

N-grams are useful for creating features from a text corpus for machine learning algorithms like SVM, Naive Bayes, etc.
N-grams are useful for creating capabilities like autocorrect, autocompletion of sentences, text summarization, speech recognition, etc.

Generating ngrams in NLTK

We can generate ngrams in NLTK quite easily with the help of the ngrams function present in the nltk.util module. Let us see different examples of this NLTK ngrams function below.

Unigrams or 1-grams

To generate 1-grams we pass the value n=1 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function. As we can see, we get one word in each tuple for the Unigram model.

In :
from nltk.util import ngrams

n = 1
sentence = 'You will face many defeats in life, but never let yourself be defeated.'
unigrams = ngrams(sentence.split(), n)
for item in unigrams:
    print(item)

[Out] :
('You',)
('will',)
('face',)
('many',)
('defeats',)
('in',)
('life,',)
('but',)
('never',)
('let',)
('yourself',)
('be',)
('defeated.',)

Bigrams or 2-grams

For generating 2-grams we pass the value n=2 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function. As we can see, we get two adjacent words in each tuple in our Bigram model.

In :
from nltk.util import ngrams

n = 2
sentence = 'The purpose of our life is to happy'
bigrams = ngrams(sentence.split(), n)
for item in bigrams:
    print(item)

[Out] :
('The', 'purpose')
('purpose', 'of')
('of', 'our')
('our', 'life')
('life', 'is')
('is', 'to')
('to', 'happy')

Trigrams or 3-grams

In the case of 3-grams, we pass the value n=3 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function. As we can see, we get three words in each tuple for the Trigram model.

In :
from nltk.util import ngrams

n = 3
sentence = 'Whoever is happy will make others happy too'
trigrams = ngrams(sentence.split(), n)
for item in trigrams:
    print(item)

[Out] :
('Whoever', 'is', 'happy')
('is', 'happy', 'will')
('happy', 'will', 'make')
('will', 'make', 'others')
('make', 'others', 'happy')
('others', 'happy', 'too')

Generic Example of ngram in NLTK

In the example below, we define a generic function ngram_convertor that takes a sentence and n as arguments and converts the sentence into ngrams.

In :
from nltk.util import ngrams

def ngram_convertor(sentence, n=3):
    ngram_sentence = ngrams(sentence.split(), n)
    for item in ngram_sentence:
        print(item)

In :
sentence = "Life is either a daring adventure or nothing at all"
ngram_convertor(sentence, 3)

[Out] :
('Life', 'is', 'either')
('is', 'either', 'a')
('either', 'a', 'daring')
('a', 'daring', 'adventure')
('daring', 'adventure', 'or')
('adventure', 'or', 'nothing')
('or', 'nothing', 'at')
('nothing', 'at', 'all')

NLTK Everygrams

NLTK provides another function, everygrams, that converts a sentence into unigrams, bigrams, trigrams, and so on up to n-grams, where n is the length of the sentence. In short, this function generates ngrams for all possible values of n. Let us understand everygrams with a simple example below.
We have not provided the value of n, but everygrams has generated every ngram from 1-grams to 5-grams, where 5 is the length of the sentence, hence the name everygram.

In :
from nltk.util import everygrams

message = "who let the dogs out"
msg_split = message.split()
list(everygrams(msg_split))

[Out] :
[('who',),
 ('let',),
 ('the',),
 ('dogs',),
 ('out',),
 ('who', 'let'),
 ('let', 'the'),
 ('the', 'dogs'),
 ('dogs', 'out'),
 ('who', 'let', 'the'),
 ('let', 'the', 'dogs'),
 ('the', 'dogs', 'out'),
 ('who', 'let', 'the', 'dogs'),
 ('let', 'the', 'dogs', 'out'),
 ('who', 'let', 'the', 'dogs', 'out')]

Converting data frames into Trigrams

In this example, we show how to convert a dataframe of text into trigrams using the NLTK ngrams function.

In :
import pandas as pd

df = pd.read_csv('file.csv')
df.head()

[Out] :
                                       headline_text
0  Southern European bond yields hit multi-week lows
1  BRIEF-LG sells its entire stake in unit LG Lif…
2  BRIEF-Golden Wheel Tiandi says unit confirms s…
3  BRIEF-Sunshine 100 China Holdings Dec contract…
4  Euro zone stocks start 2017 with new one-year …

In :
from nltk.util import ngrams

def ngramconvert(df, n=3):
    for item in df.columns:
        df['new' + item] = df[item].apply(lambda sentence: list(ngrams(sentence.split(), n)))
    return df

In :
new_df = ngramconvert(df, 3)
new_df.head()

Advanced smoothing for language modelling

Language is an indispensable part of our lives and is crucial for communicating our thoughts. Natural Language Processing, designed to understand and interpret human language, is an excellent tool: with just a click we can perform many tasks. For example, if we are new to a language, we can simply use Google Translate to check a word's meaning hassle-free, and the benefits go on and on, like handwriting recognition, text summarization, etc. With such a plethora of language options and essential responsibilities, for example in Google translation, our prediction model must perform accurately, because a wrong translation can lead to negative consequences.

What is Smoothing in NLP?

In NLP, we have statistical models to perform tasks like auto-completion of sentences, where we use a probabilistic model. We predict the next words based on training data, which has complete sentences, so that the model can understand the pattern for prediction. Naturally, there are very many possible combinations of words, and it is next to impossible to include all the varieties in the training data so that our model can predict accurately on unseen data. So, here Smoothing comes to the rescue. Smoothing refers to the techniques we use to adjust the probabilities used in the model so that it performs more accurately and can even handle words absent from the training set.

We use Smoothing for the following reasons:
To improve the accuracy of our model.
To handle data sparsity, out-of-vocabulary words, and words that are absent in the training set.

Example

Training set: ["I like coding", "Prakriti likes mathematics", "She likes coding"]

Let's consider bigrams, i.e. groups of two words:

P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1))

So, let's find the probability of "I like mathematics". We insert a start token <s> and an end token </s> at the start and end of each sentence respectively.
P("I like mathematics") = P(I | <s>) * P(like | I) * P(mathematics | like) * P(</s> | mathematics)
= (count(<s> I) / count(<s>)) * (count(I like) / count(I)) * (count(like mathematics) / count(like)) * (count(mathematics </s>) / count(</s>))
= (1/3) * (1/1) * (0/1) * (1/3)
= 0

As you can see, P("I like mathematics") comes out to be 0. It could be a perfectly proper sentence, but due to the limited training data our model did not do well. Now we'll see how smoothing can solve this issue.

Types of Smoothing in NLP

Laplace / Add-1 Smoothing

Here, we simply add 1 to all the counts of words so that we never incur a 0 value.

PLaplace(wi | w(i-1)) = (count(w(i-1) wi) + 1) / (count(w(i-1)) + V)

where V = total number of words in the training set, 9 in our example. So,

P("I like mathematics") = P(I | <s>) * P(like | I) * P(mathematics | like) * P(</s> | mathematics)
= ((1+1) / (3+9)) * ((1+1) / (1+9)) * ((0+1) / (1+9)) * ((1+1) / (3+9))
= 1 / 1800
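As a minimal sketch of the add-one bigram estimate on this toy training set (the <s>/</s> handling, the lower-casing, and the names below are illustrative; V = 9 follows the text above):

from collections import Counter

sentences = [["<s>", "i", "like", "coding", "</s>"],
             ["<s>", "prakriti", "likes", "mathematics", "</s>"],
             ["<s>", "she", "likes", "coding", "</s>"]]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

V = 9   # the value used in the worked example above

def p_add_one(prev, word):
    # (count of the bigram + 1) / (count of the previous word + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add_one("like", "mathematics"))   # (0 + 1) / (1 + 9) = 0.1, no longer zero
print(p_add_one("i", "like"))             # (1 + 1) / (1 + 9) = 0.2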
Additive Smoothing

This is very similar to Laplace smoothing. Instead of 1, we add a value δ:

PAdditive(wi | w(i-1)) = (count(w(i-1) wi) + δ) / (count(w(i-1)) + δ|V|)

Backoff and Interpolation

Backoff:
○ Start with the n-gram.
○ If there are insufficient observations, check the (n-1)-gram.
○ If there are still insufficient observations, check the (n-2)-gram.

Interpolation:
○ Try a mixture of (multiple) n-gram models.

Good-Turing Smoothing

This technique uses the frequency of occurrence of N-grams and reallocates the probability distribution using two criteria. For example, as we saw above, P("like mathematics") equals 0 without smoothing. For unknown bigrams, we use the frequency of bigrams that occurred exactly once and the total number of bigrams:

Punknown(wi | w(i-1)) = (count of bigrams that appeared once) / (count of total bigrams)

For known bigrams like "like coding", we use the frequency of bigrams that occurred one more time than the current bigram's frequency (Nc+1), the frequency of bigrams that occurred the same number of times as the current bigram (Nc), and the total number of bigrams (N):

Pknown(wi | w(i-1)) = c* / N, where c* = (c+1) * N(c+1) / Nc

and c = count of the input bigram, "like coding" in our example.

Kneser-Ney Smoothing

Here we subtract an absolute discounting value d from the counts of observed N-grams and distribute the freed probability mass to unseen N-grams.

Katz Smoothing

Here we combine the Good-Turing technique with interpolation. Feel free to know more about Katz smoothing here.

Church and Gale Smoothing

Here, the Good-Turing technique is combined with bucketing. Every N-gram is added to a bucket according to its frequency, and then Good-Turing is estimated for every bucket.

Applications of language modeling

Language models are the backbone of natural language processing (NLP). Below are some NLP tasks that use language modeling, what they mean, and some applications of those tasks:

Speech recognition -- involves a machine being able to process speech audio. This is commonly used by voice assistants like Siri and Alexa.

Machine translation -- involves the translation of one language to another by a machine. Google Translate and Microsoft Translator are two programs that do this. SDL Government is another, which is used to translate foreign social media feeds in real time for the U.S. government.

Parts-of-speech tagging -- involves the markup and categorization of words by certain grammatical characteristics. This is utilized in the study of linguistics, first and perhaps most famously in the study of the Brown Corpus, a body of text composed of random English prose that was designed to be studied by computers. This corpus has been used to train several important language models, including one used by Google to improve search quality.

Parsing -- involves the analysis of any string of data or sentence that conforms to formal grammar and syntax rules. In language modeling, this may take the form of sentence diagrams that depict each word's relationship to the others. Spell-checking applications use language modeling and parsing.

Sentiment analysis -- involves determining the sentiment behind a given phrase. Specifically, it can be used to understand opinions and attitudes expressed in a text. Businesses can use this to analyze product reviews or general posts about their product, as well as to analyze internal data like employee surveys and customer support chats. Some services that provide sentiment analysis tools are Repustate and HubSpot's ServiceHub. Google's NLP tool, Bidirectional Encoder Representations from Transformers (BERT), is also used for sentiment analysis.

Optical character recognition -- involves the use of a machine to convert images of text into machine-encoded text. The image may be a scanned document or document photo, or a photo with text somewhere in it (on a sign, for example). It is often used in data entry when processing old paper records that need to be digitized. It can also be used to analyze and identify handwriting samples.

Information retrieval -- involves searching within a document for information, searching for documents in general, and searching for metadata that corresponds to a document. Web search engines are the most common information retrieval applications.