NLP Fundamentals - Lec 2.pdf

Document Details

New Ismailia National University

Ahmed Magdy

Tags

natural language processing, NLP, artificial intelligence, computer science

Summary

This document covers the fundamentals of natural language processing (NLP): what NLP is, why it matters, its main components and applications, the techniques it uses, common Python libraries, and the text preprocessing steps (tokenization, stop-word removal, stemming, lemmatization, and vectorization) illustrated with an SMS spam filtering use case.

Full Transcript


Faculty of Engineering – New Ismailia National University
Artificial Intelligence Engineering Program
AIE 265: NLP
LEC. 2-4: NLP FUNDAMENTALS AND PREPROCESSING
A. Prof. Ahmed Magdy | [email protected] | +201227059094

OUTLINES
- NLP fundamentals
- Preprocessing
- Candidate textbooks/references

NLP FUNDAMENTALS

What is NLP? [Natural Language Processing]
- In neuropsychology, linguistics, and the philosophy of language, a natural language or ordinary language is any language that occurs naturally in a human community through use, repetition, and change, without conscious planning or premeditation. It can take different forms, typically either a spoken language or a sign language. Natural languages are distinguished from constructed and formal languages, such as those used to program computers or to study logic.
- Natural language can be broadly defined as distinct from artificial and constructed languages (e.g., computer programming languages) and from non-human communication systems in nature, such as whale and other marine mammal vocalizations or honey bees' waggle dance.

Natural language (human language): Arabic, English, French, etc.
- Humans get their edge from the communication skills they have.
- Roughly 6,500 languages are spoken in the world today.
- Programming languages (Python, C++, Java, etc.) are different.
Processing:
- How computers carry out instructions.
- How to deal with text data.

- Natural Language Processing (NLP) is defined as the automatic manipulation of natural languages, such as speech and text, using software.
- The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.
- It transforms free-form text into structured data and back.
- Most NLP techniques rely on machine learning to derive meaning from human languages.
- As a business tool, NLP helps drive better decision-making by applying computer intelligence.
- It also identifies hot discussion topics and charts consumers' interests.
- For instance, marketers use sentiment analysis for consumer insights regarding brand preference.
- Teaching machines to understand how we communicate is not an easy task.

Structured Data vs. Unstructured Data [figure]
The relationship between the years and the size of the data: unstructured data grows much faster than structured data, reaching about 120 exabytes in 2020.

NLP Components
- The term NLP can be divided into two major components:
  - Natural language understanding (NLU)
  - Natural language generation (NLG)
- Or, in simple terms, NLP consists of turning text into data, then turning data into text.
- Note: if we have time in the next lecture, we can talk about data analytics in detail.

A typical interaction between humans and machines using Natural Language Processing could go as follows:
1. A human talks to the machine.
2. The machine captures the audio.
3. Audio-to-text conversion takes place.
4. The text data is processed.
5. Data-to-audio conversion takes place.
6. The machine responds to the human by playing the audio file.
NLP & AI [figure]

NLP – timeline [figure]

Why is NLP so important?
- NLP is everywhere, even if we don't realize it.
- The majority of activities performed by humans are done through language.
- Millions of gigabytes of data are generated by social media (Facebook, Instagram, Twitter, YouTube, etc.), messaging apps (WhatsApp, WeChat, Telegram, etc.), forums (Quora, Reddit, etc.), blogs, news publishing platforms, Google searches, and many other channels.
- All these channels constantly generate large amounts of text data every second.
- Because of the large volume of text data, and because it is highly unstructured, we can no longer use the common approaches to understand text; this is where NLP comes in.
- NLP produces new and exciting results on a daily basis, and it is a very large field.

- NLP allows companies to track, manage, and analyze billions of ever-changing data points. This way, companies make sense of all this information and use it to make decisions about their businesses.
- NLP helps systems analyze data faster by combining the power of artificial intelligence, computational linguistics, and computer science.
- NLP helps bring semantic understanding to languages: NLP systems help resolve confusing, ambiguous language by adding structure to the data they receive.
- There are several successful implementations of NLP in search engines like Google, social websites like Facebook's news feed, speech assistants like Apple's Siri, and spam filters.

Natural Language Processing Applications
- Machine translation
- Automatic summarization
- Sentiment analysis
- Text classification
- Question answering
- ...

Areas That Leverage NLP Technology

Chatbots
- The use of chatbots in maintaining business workflow is considerable and beneficial for every industry.
- Chatbots can respond to customer queries faster than a human being. Faster responses help build customer trust and more business.
- NLP, when paired with voice recognition technology, can make chatbots smarter.
- Chatbot interactions nowadays can easily be confused with human interactions because they are intelligent and can also recognize human emotions.
- NLP helps chatbots analyze, understand, and prioritize complex questions.
- Gartner has predicted that chatbots will account for 85% of customer interactions in 2024.

E-commerce
- With the exponential growth of multi-channel data such as social or mobile data, businesses need solid technologies in place to assess and evaluate customer sentiment. So far, businesses have been content analyzing customer actions, but in the current competitive climate that type of customer analytics is outdated.
- Businesses now need to analyze and understand customer attitudes, preferences, and even moods, all of which come under the purview of sentiment analytics. Without NLP, business owners would be seriously handicapped in conducting even the most basic sentiment analytics.
- With the help of NLP, machines can easily pick out which phrases and words people generally use when searching for a particular product on any e-commerce website.
- NLP helps customize searches for users of search engines. The system finds what the user is actually searching for by using its understanding of language and sentence structure. It also detects patterns and creates links between messages to discover the meaning of unstructured text.
- Smart product recommendations.

Sentiment Analysis
- A classic example of NLP, sentiment analysis can help estimate how customers feel about a brand when it comes to adjusting sales and marketing strategy.
- This technology is also known as opinion mining and is capable of analyzing news and blogs and assigning a value to the text (positive, negative, or neutral).
- NLP algorithms enable you to identify emotions such as happy, annoyed, angry, and sad. In addition, a sentiment analysis tool increases customer loyalty, drives business changes, and achieves an appropriate return on sales and marketing investments.

Hiring & Recruitment
- By utilizing NLP, HR professionals can significantly speed up candidate searches, filter out relevant resumes, and create bias-proof and gender-neutral job descriptions.
- By using semantic analysis, NLP-based software helps recruiters detect candidates that meet a job's requirements.
- Textio is a real example of using semantic categorization to tweak job descriptions in a way that maximizes the number of job applicants.

Why is NLP so difficult?
- It is the nature of human language that makes NLP difficult.
- Humans get their edge from the communication skills they have.
- There are hundreds of natural languages, each of which has different syntax rules. Words can be ambiguous, with their meaning dependent on their context.
- The rules that govern how information is passed using natural languages are not easy for computers to understand. Some of these rules can be high-level and abstract; for example, when someone uses a sarcastic remark to pass information.
- Comprehensively understanding human language requires understanding both the words and how the concepts are connected to deliver the intended message.
- While humans can easily master a language, the ambiguity and imprecision of natural languages are what make NLP difficult for machines to implement.

What are the techniques used in NLP?

Syntax analysis
Syntax refers to the arrangement of words in a sentence such that they make grammatical sense. In NLP, syntactic analysis is used to assess how the natural language aligns with grammatical rules. Some syntax techniques that can be used (a small illustration follows this list):
- Lemmatization: reducing the various inflected forms of a word to a single form for easy analysis.
- Stemming: cutting inflected words down to their root form.
- Morphological segmentation: dividing words into individual units called morphemes.
- Word segmentation: dividing a large piece of continuous text into distinct units.
- Part-of-speech tagging: identifying the part of speech for every word.
- Parsing: undertaking grammatical analysis of the provided sentence.
- Sentence breaking: placing sentence boundaries in a large piece of text.
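As a minimal sketch of two of the techniques above (sentence breaking and part-of-speech tagging), the following uses NLTK, which is introduced later in this lecture. It assumes NLTK is installed and that the tokenizer and tagger resources can be downloaded; the exact resource names may vary slightly between NLTK versions. The example sentence is made up for illustration.

# Sentence breaking, word segmentation, and POS tagging with NLTK.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

text = "NLP is not easy. Words can be ambiguous, and context matters."

sentences = nltk.sent_tokenize(text)      # sentence breaking
for sentence in sentences:
    words = nltk.word_tokenize(sentence)  # word segmentation
    print(nltk.pos_tag(words))            # part-of-speech tags, e.g. ('NLP', 'NNP')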
Semantics analysis
Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing and has not been fully resolved yet. It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured. Some techniques in semantic analysis:
- Named entity recognition (NER): determining the parts of a text that can be identified and categorized into preset groups, such as names of people and names of places.
- Word sense disambiguation: giving meaning to a word based on its context.
- Natural language generation: using databases to derive semantic intentions and convert them into human language.

Libraries and tools [figure slides; covered in detail after the break]

The Future of NLP
- Natural language processing (NLP) is one of the most exciting components of AI.
- NLP is the voice behind Siri and Alexa; likewise, customer service chatbots harness the power of NLP to drive customized responses in e-commerce, healthcare, and business utilities. Some of the more omnipresent applications of NLP today include virtual assistants, sentiment analysis, customer service, and translation.
- According to many market statistics, data volume is doubling every two years, and in the future this time span may shrink further. The vast portion of this data (about 75 percent) is text data.
- NLP is the sub-branch of data science that attempts to extract insights from text. Thus, NLP is assuming an important role in data science. Industry experts have predicted that demand for NLP experts will grow exponentially in the near future.
- Using natural language processing to create a seamless and interactive interface between humans and machines will continue to be a top priority for today's and tomorrow's increasingly cognitive applications.
- NLP is everywhere; there is potential, opportunities, jobs, and money.

BREAK: RETURN AT 10:10 A.M.

FROM THE PREVIOUS SESSION
1. What is NLP?
2. Structured Data vs. Unstructured Data
3. NLP Components
4. NLP & Artificial Intelligence
5. NLP Timeline
6. Why is NLP so important?
7. Natural Language Processing Applications
8. Areas That Leverage NLP Technology
9. Why is NLP so difficult?
10. What are the techniques used in NLP?
11. Libraries and tools
12. The Future of NLP

BEFORE WE START!!!
What you should know before you start:
1. Understanding of some of the key concepts in natural language processing and machine learning algorithms.
2. Basic knowledge of Python.
3. Some experience using the NumPy, pandas, and scikit-learn libraries.

Environment: install Python (an IDE such as Anaconda) and Jupyter Notebooks, OR set up Google Colab. Colab is a free cloud service based on Jupyter Notebooks for machine learning education and research. It provides a runtime fully configured for deep learning and free-of-charge access to a robust GPU.
1. Free of charge
2. Cloud service
3. Jupyter Notebook environment
4. Zero configuration required
5. Access to GPU/TPU

PYTHON NATURAL LANGUAGE PROCESSING (NLP) LIBRARIES
Natural Language Toolkit (NLTK), TextBlob, CoreNLP, Gensim, spaCy, polyglot, scikit-learn, Pattern.

LIBRARIES AND TOOLS
NLTK
- Small but useful datasets with markup
- Preprocessing tools: tokenization, normalization, ...
- Pre-trained models for POS-tagging, parsing, ...
Stanford parser
spaCy
- Python and Cython library for NLP
Gensim
- Python library for text analysis, e.g. for word embeddings and topic modeling
MALLET
- Java-based library, e.g. for classification, sequence tagging, and topic modeling
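As a small, hedged illustration of named entity recognition (one of the semantic techniques above) with spaCy, one of the libraries listed: this sketch assumes spaCy is installed and that the small English model has been downloaded separately with "python -m spacy download en_core_web_sm". The example sentence is made up.

# Named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ahmed studies NLP at New Ismailia National University in Egypt.")

for ent in doc.ents:
    # Each entity is a text span with a preset category such as PERSON, ORG, or GPE.
    print(ent.text, ent.label_)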
NATURAL LANGUAGE TOOLKIT (NLTK)
- The Natural Language Toolkit, usually called NLTK for short, is the most widely used package for handling natural language processing tasks in Python.
- NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 100 corpora and lexical resources such as WordNet.
- It provides a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
- NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language."
- It is a suite of open-source tools originally created in 2001 at the University of Pennsylvania to make building NLP pipelines in Python easier. The package has been expanded through extensive open-source contributions in the years since its original development.

NLP TOOLKIT INSTALLATION
How to install NLTK on your local machine (assuming Python and Anaconda are both installed):

pip install nltk   # install NLTK
dir(nltk)          # after "import nltk" in Python: list everything available under the package
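After the package is installed, NLTK's corpora and models (such as the stop-word lists and WordNet used later in this lecture) are downloaded separately with nltk.download(). A minimal sketch, assuming an internet connection; the exact resource names can vary slightly between NLTK versions.

# Download the NLTK resources used later in this lecture. Run once after installing NLTK.
import nltk

for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(resource)

print(dir(nltk))  # check what is available under the nltk package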
EMAIL/SMS SPAM FILTERING USE CASE

DATA SCIENCE LIFE CYCLE
1. Business understanding
2. Data collection
3. Data preparation
4. Exploratory data analysis (EDA)
5. Data modelling
6. Model evaluation
7. Model deployment

EMAIL/SMS SPAM FILTERING
Spam filtering is a beginner's example of a document classification task: classifying an email as spam or non-spam (a.k.a. ham). The spam box in your Gmail account is the best example of this. So let's get started building a spam filter on a publicly available mail corpus.

PUBLIC AND PRIVATE DATA
- Private data: data that belongs to an organization and has certain security and privacy concerns attached to it. It is used for the company's internal analysis in order to gain business and growth insights. Examples of such organizational private data are telecom, retail, banking, and medical data.
- Public data: data that is available for public use, offered by many sites such as government websites and public agencies for the purpose of research. Accessing this data does not require any special permission or approval.
- Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright. The goals of the open data movement are similar to those of other "open(-source)" movements such as open-source software.

DATA COLLECTION
- Kaggle: https://www.kaggle.com/datasets
- GitHub: https://github.com/awesomedata/awesome-public-datasets
- Google Dataset Search: https://datasetsearch.research.google.com/
- Papers with Code: https://paperswithcode.com/datasets
- Yahoo: https://webscope.sandbox.yahoo.com/
- UCI Machine Learning Repository: https://archive.ics.uci.edu/datasets

SMS Spam Collection Data Set
Abstract: The SMS Spam Collection is a public set of labeled SMS messages that have been collected for mobile phone spam research.
- Data set characteristics: Multivariate, Text, Domain-Theory
- Number of instances: 5,574
- Area: Computer
- Attribute characteristics: Real
- Number of attributes: N/A
- Date donated: 2012-06-22
- Associated tasks: Classification, Clustering
- Missing values: N/A
- Web hits: 331,230

PREPROCESSING
(This is the data-preparation stage of the data science life cycle.)
Cleaning up the text data is necessary to highlight the attributes that we want our model to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
1. Removing punctuation
2. Converting text to lowercase
3. Tokenization
4. Removing stop-words
5. Lemmatization / stemming
6. Vectorization
7. Feature engineering

TOKENIZATION
- Tokenization is one of the most common tasks when working with text data.
- Tokenization is essentially splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
- It is important because the meaning of the text can be interpreted by analyzing the words present in it.
- Methods to perform tokenization in Python (a short sketch follows this list):
  - Tokenization using Python's split() function
  - Tokenization using regular expressions (RegEx)
  - Tokenization using NLTK
  - Tokenization using other libraries such as spaCy and Gensim
- The final goal of tokenization is creating a vocabulary.
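A minimal sketch of the three tokenization methods listed above. The example message and the regular expression are illustrative choices, and the NLTK tokenizer assumes its models have been downloaded (see the installation section).

# Three ways to tokenize the same message.
import re
import nltk

nltk.download("punkt", quiet=True)  # models used by word_tokenize

message = "Congratulations! You have won a free ticket."

# 1) Python's split(): splits on whitespace only, so punctuation stays attached.
print(message.split())

# 2) Regular expressions: extract runs of word characters.
print(re.findall(r"\w+", message))

# 3) NLTK: language-aware tokenization that separates punctuation into its own tokens.
print(nltk.word_tokenize(message))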
REMOVE STOP-WORDS
- Stop words are common words that are present in the text but generally do not contribute to the meaning of a sentence. They hold almost no importance for the purposes of information retrieval and natural language processing, and they can usually be ignored without sacrificing the meaning of the sentence. Examples: "the" and "a".
- A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
- The NLTK package has a separate collection of stop words that can be downloaded, covering more than 16 languages. Once downloaded, the list can be passed as an argument telling the pipeline to ignore these words (a combined stop-word and stemming sketch follows the stemming section below):

import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))

- If the concern is with the context of the text (e.g. sentiment analysis), it might make sense to treat some of these words differently. For example, "not" is included as a stop word, but negation changes the so-called valence of a text, so it needs to be treated carefully; this is usually not trivial. Consider the example for "not":
  NLTK is a useful tool => NLTK useful tool
  NLTK is not a useful tool => NLTK useful tool

STEMMING
- Stemming is the process of reducing inflected or derived words to their word stem or root. Stemming aims to reduce variations of the same root word.

STEMMING IN EXAMPLE
- If Python sees play, playing, played, and plays as four different, separate things, it has to keep those four separate words in memory. Imagine every variation of every root word when we may have a thousand root words.
- The alternative in this play/playing/played/plays example is applying a stemmer: all the different words become one word, play. Without a stemmer, Python has to look at many more tokens and does not know that these separate tokens are even related. With a stemmer, we are not leaving it up to Python: we are being explicit by replacing similar words with one common root word.
- This reduces the corpus of words the model is exposed to and explicitly correlates words with similar meaning.

STEMMING APPROACH
- Stemming is a rule-based approach because it slices the inflected word at the prefix or suffix as needed, using a set of common prefixes and suffixes such as "-ing", "-ed", "-es", "pre-", etc. It can result in a string that is not actually a word.
- The process of stemming often means crudely chopping off the end of a word to leave only the base. This means taking words with various suffixes and condensing them under the same root word.
- Two main errors occur while performing stemming: over-stemming and under-stemming.
  - Over-stemming occurs when words with different meanings are reduced to the same stem, e.g. universe, university, and universal all being mapped to one stem.
  - Under-stemming occurs when words that should be reduced to the same stem are not, e.g. alumnus, alumni, and alumnae remaining distinct.
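The following is a small sketch combining steps 4 and 5 of the preprocessing list above: removing stop words and stemming with NLTK's Porter stemmer. The example sentence is made up, and the resource downloads assume the names used in current NLTK releases.

# Stop-word removal followed by stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

text = "NLTK is not a useful tool for playing with played and playing words"
tokens = nltk.word_tokenize(text.lower())

# Drop stop words, then reduce the remaining tokens to their stems.
filtered = [word for word in tokens if word not in stop_words]
stems = [stemmer.stem(word) for word in filtered]
print(stems)  # note that "not" is removed too, which matters for sentiment analysis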
DIFFERENT TYPES OF STEMMERS
There are English and non-English stemmers available in the NLTK package:
- Porter stemmer: PorterStemmer is known for its simplicity and speed. It uses suffix stripping to produce stems.
- Lancaster stemmer: LancasterStemmer is simple but aggressive; because of its iterations, over-stemming may occur.
- The Snowball stemmer.
- Regex-based stemmers.

LEMMATIZING
- Lemmatizing is the process of grouping together the inflected forms of a word so they can be analyzed as a single term.
- Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. Lemmatizing uses vocabulary analysis of words to remove inflectional endings and return the dictionary form of a word.
- So again: play, playing, played, and plays would all be simplified down to play, because that is the root of the word; each variation carries the same meaning, just with a slightly different tense.
- We will use the WordNet lemmatizer, probably the most popular lemmatizer. WordNet is a collection of nouns, verbs, adjectives, and adverbs grouped into sets of synonyms, each expressing a distinct concept. The lemmatizer runs off this corpus of synonyms: given a word, it tracks the word to its synonyms and then to the distinct concept that group of words represents.

LEMMATIZING VS STEMMING
- The goal of both is to condense derived words down to their base form, to reduce the corpus of words the model is exposed to, and to explicitly correlate words with similar meaning. There is an accuracy and speed trade-off when you opt for one over the other.
- Stemming takes a cruder approach: it chops off the ending of a word using heuristics, without any understanding of the context in which the word is used. Because of that, stemming may or may not return an actual dictionary word. It is usually less accurate, but the benefit is that it is faster because the rules are simple.
- Lemmatizing leverages more informed analysis to create groups of words with similar meaning based on the context around the word, its part of speech, and other factors. Lemmatizers always return a dictionary word. Because of the additional context considered, lemmatization is typically more accurate, but the downside is that it may be more computationally expensive.
- The selection of stemming or lemmatization depends solely on project requirements. Lemmatization is mandatory for critical projects and projects where sentence structure matters, such as language applications.

VECTORIZING
- Vectorizing is the process we use to convert text into a form that Python and a machine learning model can understand. It is defined as the process of encoding text as integers to create feature vectors.
- A feature vector is an n-dimensional vector of numerical features that represents some object. In our context, that means taking an individual text message and converting it to a numeric vector that represents that text message.
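A minimal sketch of the last preprocessing steps discussed above: lemmatization with NLTK's WordNet lemmatizer and vectorization with scikit-learn's CountVectorizer (scikit-learn is listed among the prerequisites; a recent version is assumed). The two example messages are made up and only stand in for SMS text.

# Lemmatization, then turning messages into count-based feature vectors.
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
# By default the lemmatizer treats words as nouns; pass pos="v" for verbs.
print(lemmatizer.lemmatize("played", pos="v"))  # -> play
print(lemmatizer.lemmatize("geese"))            # -> goose

# Vectorization: each message becomes a row of word counts (a feature vector).
messages = ["free prize call now", "see you at the lecture tomorrow"]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(messages)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(features.toarray())                  # the document-term count matrix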
REFERENCES
- Lyons, John (1991). Natural Language and Universal Grammar. New York: Cambridge University Press. pp. 68-70.
- Norris, Paul F. (25 August 2011). "The Honeybee Waggle Dance – Is it a Language?". AnimalWise. Archived from the original on 20 August 2016. Retrieved 10 April 2019.
- Ahmad Shhadeh, "NLP course".

THANKS! ANY QUESTIONS?
A. Prof. Ahmed Magdy
