Foundation of AI and ML (4351601) PDF
Sir Bhavsinhji Polytechnic Institute
H. P. Jagad
Summary
This document provides an overview of the foundation of AI and ML, focusing on natural language processing (NLP). It covers historical developments, techniques, and applications of NLP, including advantages and disadvantages. The document emphasizes practical applications of NLP in various fields.
Full Transcript
Foundation of AI and ML (4351601) - H. P. Jagad, Lecturer (IT), Sir BPTI Bhavnagar - http://hpjagad.blogspot.com

Unit 4: Natural Language Processing

Introduction to NLP
Language is a method of communication with the help of which we can speak, read and write. "Can human beings communicate with computers in their natural language?" Developing such applications is a challenge because computers need structured data, while human speech is unstructured and ambiguous in nature.
- NLP is the branch of AI that gives machines the ability to interpret, analyze, manipulate and understand human languages.
- Human language can be in text or audio form.
- It helps developers organize knowledge for performing tasks such as translation, automatic summarization, speech recognition and topic segmentation.
- Its goal is to enable computers to analyze and process huge amounts of natural language data.

History of NLP
- 1940s: Experiments with machine translation began during World War II.
- 1948: The first recognizable NLP application was introduced at Birkbeck College, London.
- 1950: Alan Turing published the article "Computing Machinery and Intelligence". The proposed test (the Turing Test) includes a task that involves the automated interpretation and generation of natural language.
- 1950s: Chomsky introduced the idea of generative grammar (rule-based descriptions of syntactic structures).
- 1960-1970: NLP focused on rule-based systems, which used sets of predefined rules and dictionaries to process language.
- 1970s: SHRDLU, a program that understands and responds to natural language queries, demonstrated syntax, semantics and reasoning about the world. LUNAR, one of the largest and most successful question-answering systems using AI techniques, had a separate syntax analyzer and a semantic interpreter.
- 1980-1990: The Hidden Markov Model (HMM) became a popular tool for speech recognition (converting speech to text).
- 2000s: IBM's Model 1 and Model 2 used statistical patterns to improve translation quality.
- 2010 onwards: NLP was reshaped by deep learning and neural networks. Models such as Word2Vec, GloVe and Google's BERT (Bidirectional Encoder Representations from Transformers) were developed for NLP tasks, followed by GPT-3 (Generative Pre-trained Transformer 3).
- Today, modern NLP spans applications such as speech recognition, machine translation, sentiment analysis and machine text reading. Example: Amazon Alexa.
- By 2028, the NLP market is expected to grow from about $20 billion to $127 billion.

Advantages of NLP
- Enhanced user experience: chatbots and virtual assistants interact with users in a natural way.
- Efficient information retrieval: large volumes of text data can be analyzed quickly and accurately.
- Automation of repetitive tasks: text summarization, data extraction and document classification.
- Multi-language capabilities.
- Insight extraction: customer feedback, social media and online reviews are used in decision making.
- Content generation.
- Accessibility: helps disabled persons through text-to-speech and speech-to-text applications.
- Fraud detection: identifying phishing emails or fraudulent financial transactions.
- Market research: analyzing social media conversations, customer reviews and surveys to understand market trends.

Disadvantages of NLP
- Requires vast amounts of high-quality data to train a model, which is time-consuming and expensive.
- Training can take time. If a model must be developed on a new dataset without a pre-trained starting point, achieving good performance can take weeks, depending on the amount of data.
- May require vast computational resources.
- It is difficult to understand how models arrive at their decisions.
- Biases present in the training data may be inherited by NLP models.
- Building a multi-language NLP model can be a challenging task.
- Unpredictable and not 100% reliable: there is always a possibility of errors in predictions and results.
- May require more keystrokes.
- A model can work well on a specific task but may not work effectively on unseen tasks.

Components of NLP

1. Natural Language Understanding (NLU)
- Helps the machine understand and analyze human language by extracting metadata from content, such as concepts, entities, keywords, emotions and relations.
- NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
- Semantic understanding: the meaning of a word, phrase or sentence.
- Contextual analysis and understanding: considering surrounding words to interpret the exact meaning of a word.
- Named Entity Recognition (NER): categorizing named entities such as names of people, organizations, locations, dates, etc.
- Sentiment analysis.
- Relationship extraction.
- Question answering.
- Topic modeling: identifying the main topics within a collection of documents and categorizing them.
- Constructing knowledge graphs for organizing structured information.

2. Natural Language Generation (NLG)
- Generation of human-like text or speech from structured data, information or other non-linguistic input.
- Converts machine-readable language into text, and can also convert text into audible speech using text-to-speech technology.
- Data-to-text generation.
- Text summarization.
- Content/dialog/narrative (story) generation.
- Data reporting: reports and insights from visualization tools.
- Automated translation between languages.
- Techniques used in NLG:
  - Rule-based NLG: predefined rules and templates.
  - Statistical NLG: based on probabilities of word sequences, e.g. Hidden Markov Models (HMMs), Conditional Random Fields (CRFs).
  - Neural NLG: e.g. Generative Pre-trained Transformer (GPT).

NLU vs NLG
- Purpose: NLU understands and interprets human language; NLG generates human-like text or speech from structured data.
- Input: NLU takes natural language as input; NLG takes structured data or templates as input.
- Output: NLU extracts information or meaning from the input; NLG produces natural language text or speech.
- Techniques: NLU uses tokenization, part-of-speech tagging, Named Entity Recognition (NER) and sentiment analysis; NLG uses rule-based approaches, statistical modeling and deep learning.
- Challenge: NLU must handle ambiguity and understand context; NLG must ensure that the generated content is contextually relevant.
- Applications: NLU powers chatbots, voice assistants and sentiment analysis; NLG powers report generation, content creation, dialog systems and storytelling.

NLP Terminology
- Phonology: the study of organizing sound systematically.
- Phoneme: the basic unit of phonology; the smallest unit of sound that may cause a change of meaning within a language, but that has no meaning by itself.
- Morpheme: the smallest meaningful unit of language; if it is altered, the entire meaning of the word can change. Example: "eating" has two morphemes, "eat" and "ing"; redo = re + do.
- Morphology: the study of the construction of words from primitive meaningful units.
- Lexeme: the set of inflected forms taken by a single word. Example: {run, running, ran}.
- Syntax: arranging words to make a sentence.
- Semantics: the meaning of words and how to combine words into meaningful phrases and sentences.
- Pragmatics: understanding sentences in different situations and how the interpretation of a sentence is affected.
- Discourse: how the immediately preceding sentence can affect the interpretation of the next sentence. Example: "That is an elephant. It is running."
- Context: how everything within a language works together to convey a particular meaning.
- Text hierarchy: Corpus -> Document -> Paragraph -> Sentence -> Word/Token.

Phases of NLP
Input Sentence -> 1. Lexical Analysis (uses a Lexicon) -> 2. Syntax Analysis (uses a Grammar) -> 3. Semantic Analysis (uses Semantic Rules) -> 4. Discourse Integration (uses Contextual Information) -> 5. Pragmatic Analysis -> Output

Phase 1: Lexical Analysis / Morphological Processing
- Scans the source text as a stream of characters and converts it into meaningful lexemes. It breaks the whole text down into paragraphs, sentences and words, producing a token list.
- Splits text at every space and removes punctuation marks.
- It is the study of trying to understand the meaning of words, their relation to other words, and the context.
- It is the starting point of an NLP pipeline.
- A token is a sequence of characters that can be considered as one unit in the grammar.
- Approaches used in lexical analysis:
  - Part-of-speech (PoS) tagger: assigns a PoS tag to each word to understand the meaning of the text.
  - Stemming: cuts each word down to its base form.
  - Lemmatization: reduces each word to its meaningful base form. Example: cats -> cat, running -> run.

Phase 2: Syntactic Analysis (Parsing)
- Checks grammar, logical meaning, correctness of sentences and word arrangement, and shows the relationships among the words.
- "Parsing" originates from the Latin word "pars", which means "part": it means breaking a given sentence down into its grammatical constituents.
- Grammatical rules are applied to groups of words, not to individual words.
- Example: "College go a boy" has an incorrect grammatical structure and does not convey a logical meaning, so it is rejected.
- It is the process of analyzing a string of symbols in natural language and confirming it against the rules of a formal grammar. A small parsing sketch follows below.
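As an illustration of how a parser accepts or rejects a sentence, here is a minimal sketch using NLTK's chart parser with a toy context-free grammar. The grammar rules and example sentence are invented for this demo, not part of the original notes.

    import nltk

    # A toy context-free grammar (illustrative only).
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> V NP
        Det -> 'a' | 'the'
        N  -> 'boy' | 'college'
        V  -> 'attends'
    """)

    parser = nltk.ChartParser(grammar)

    # "a boy attends the college" yields a parse tree; an ungrammatical
    # order over the same vocabulary, such as "boy a attends college the",
    # yields none. This is how syntactic analysis rejects ill-formed input.
    for tree in parser.parse("a boy attends the college".split()):
        print(tree)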
Phase 3: Semantic Analysis
- The process of finding the meaning of the text. It tries to understand and interpret sentences, paragraphs or whole documents by analyzing the grammatical structure and identifying the relationships between individual words in a particular context.
- It finds only the dictionary meaning, i.e. the literal meaning of the given text.
- Every word in the content is read to capture the actual meaning of the text. It identifies the text elements and assigns them to their logical and grammatical roles.
- Every sentence has a predicate that conveys the main logic of that sentence.

Phase 4: Discourse Integration
- The process of understanding the relationships between words, sentences, paragraphs and entire documents; it specifies the relations between sentences or clauses.
- It is essential because the meaning of a word depends on the context in which it is used and also on the previous sentences. Example: "bank" can mean a financial institution, a riverbank, or a place to store data.
- It is important for information retrieval, text summarization and information extraction.
- Ways to integrate discourse information in NLP models:
  - Co-reference resolution: the process of identifying words/phrases that refer to the same entity in a text. Example: "she" refers to Aalya.
  - Discourse markers: words that represent relationships between sentences or paragraphs; they provide clues about the overall structure of the text and the main points of the document. Examples: "however", "therefore", "in conclusion".
- Applications of discourse integration in NLP:
  - Sentiment analysis. Example: "The product is good but the customer service is terrible." Here, "but" denotes a change in sentiment.
  - Question answering: by integrating discourse information, a question-answering system can better understand the context of a question and provide accurate answers.

Phase 5: Pragmatic Analysis
- NLP is used to perform tasks like sentiment analysis, text classification and information extraction; the accuracy of the results depends on the quality of the data and the analysis techniques used. Pragmatic analysis can help improve accuracy.
- "What was said" is re-interpreted as what was actually meant. It involves deriving those aspects that require real-world knowledge.
- It is the process of extracting information from text, focusing on the actual meaning behind the structured text. Much of a text's meaning depends on the context in which it was said or written.
- Example: "Open the door" is interpreted as a request, not an order.
- Example: "I am so excited to attend the meeting tomorrow."
- Helpful in sentiment analysis: identifying positive/negative/neutral sentiments.
- It is a challenging, computationally expensive and time-consuming process, as context can vary depending on many factors.

NLP Libraries
Examples: Scikit-learn, NLTK, Pattern, TextBlob, SpaCy.

NLTK (Natural Language ToolKit)
- NLTK is a widely used Python library for working with human language data and text analysis tasks.
- It provides a set of tools, resources, libraries and algorithms for tasks such as tokenization, part-of-speech tagging, parsing, stemming, lemmatization, character count, word count, sentiment analysis, etc.
- Free and open source; supports multiple languages such as English, German and Hindi.
- Installation in Jupyter Notebook or Google Colab (PIP, the Preferred Installer Program, is a package manager for Python packages; press Ctrl+Enter to run a cell):

    !pip install nltk
    import nltk
    nltk.download('all')

Data Preprocessing
The process of cleaning unstructured text data so that it can be used for prediction, analysis and information extraction. Real-world text data is unstructured and inconsistent, so data preprocessing is a necessary step. The various data preprocessing methods are:
1. Tokenization
2. Frequency Distribution of Words
3. Filtering Stop Words
4. Stemming
5. Lemmatization
6. Parts of Speech (POS) Tagging
7. Named Entity Recognition
8. WordNet

1. Tokenization
- The process of breaking text data down into individual tokens (words, sentences, characters). It is the first step in text analytics and is implemented using the tokenize module.
- The punkt sentence tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations and words that start sentences; it identifies sentence boundaries.
- a) Sentence tokenization: text is split into sentences using the sent_tokenize() function from the nltk.tokenize module.
- b) Word tokenization: text is split into individual words using the word_tokenize() function from the nltk.tokenize module. A sketch follows below.
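A minimal sketch of both tokenizers; the sample sentence is invented for illustration.

    import nltk
    nltk.download('punkt')   # model used by the punkt sentence tokenizer
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "NLP is fun. It helps machines understand human language."

    print(sent_tokenize(text))
    # ['NLP is fun.', 'It helps machines understand human language.']

    print(word_tokenize(text))
    # ['NLP', 'is', 'fun', '.', 'It', 'helps', 'machines',
    #  'understand', 'human', 'language', '.']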
2. Frequency Distribution of Words
- Finds out how many times each word is repeated in a given text.
- The frequency distribution of words in a text is generated with the FreqDist() function from the nltk.probability submodule; most_common() prints the most frequent words:

    from nltk.probability import FreqDist
    fd = FreqDist(tokenized_word)

- Matplotlib is a graph-plotting library for Python with a pyplot submodule. Its plot() function draws a line from point to point: parameter 1 is an array containing the points on the x-axis, and parameter 2 is an array containing the points on the y-axis. FreqDist provides its own plot() method:

    import matplotlib.pyplot as plt
    fd.plot()

  (See the first sketch after this section.)

3. Filtering Stop Words
- Useless words are referred to as stop words (noise). It is necessary to filter out words that are repetitive and hold no information, such as {that, these, below, is, are, a, an}.
- NLTK provides a huge list of stop words. The stopwords module is part of the NLTK library and provides collections of commonly used stop words for various languages. It needs to be downloaded and is available in the corpus package.
- A corpus is a collection of authentic text documents or audio organized into datasets; "authentic" means text written or audio spoken by a native speaker of the language.

    nltk.download('stopwords')
    from nltk.corpus import stopwords

- The format() method concatenates elements within an output through positional formatting; {} marks where a variable will be substituted. Example: print('{1} and {0}'.format("Tom", "Jerry")) prints "Jerry and Tom". (See the second sketch after this section.)

4. Stemming
- A text normalization technique that removes prefixes and suffixes to reduce a word to its root word, or stem. It is used by chatbots and search engines to analyze the meaning behind search queries. The stemmers live in the nltk.stem module.
- Three stemming algorithms:
  1. Porter stemmer: one of the oldest and most widely used; it removes common suffixes from words.
  2. Snowball stemmer: also called the Porter2 stemmer; an advanced version of the Porter stemmer in which a few of the stemming issues have been resolved. It supports several languages.
  3. Lancaster stemmer: an aggressive approach, because it over-stems a lot of terms; it reduces each word to the shortest stem possible.
- Stemming is a faster process than lemmatization, as it does not consider the context of the words; due to its aggressive nature, there always remains a possibility of invalid outcomes. (See the third sketch after this section.)

5. Lemmatization
- Also a text normalization technique; it reduces a word to its root word while giving the complete meaning of the word, so the result makes sense. It uses vocabulary and morphological analysis to transform a word into a root word called a lemma.
- Stemming reduces a word to its root form by simply discarding the last few characters, while lemmatization considers the context first and converts the word to its meaningful base form (the lemma).
- This process requires a lot of data to analyze the structure of the language.
- The WordNet lemmatizer is backed by WordNet, a lexical database used by all the major search engines, and provides the lemmatization features:

    from nltk.stem import WordNetLemmatizer

- By default it lemmatizes words as nouns. To provide the part of speech externally, pass one of the characters in this table as the pos argument (see the fourth sketch after this section):

    POS char   Description
    v          verb
    n          noun
    a          adjective
    r          adverb
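First sketch: the frequency-distribution steps end to end. The sample text is invented for illustration.

    from nltk.probability import FreqDist
    from nltk.tokenize import word_tokenize
    import matplotlib.pyplot as plt

    text = "the cat sat on the mat and the cat slept"
    tokenized_word = word_tokenize(text)

    fd = FreqDist(tokenized_word)
    print(fd.most_common(3))   # e.g. [('the', 3), ('cat', 2), ('sat', 1)]

    fd.plot(10)                # line plot of the 10 most frequent tokens
    plt.show()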
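Second sketch: filtering English stop words from a tokenized sentence, using format() for the output. The sample sentence is invented.

    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))

    text = "This is an example showing how stop words are filtered out"
    tokens = word_tokenize(text)

    # Keep only the tokens that are not in the stop-word list.
    filtered = [w for w in tokens if w.lower() not in stop_words]
    print('Tokenized: {}'.format(tokens))
    print('Filtered:  {}'.format(filtered))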
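Third sketch: the three stemmers side by side on the same words (the word list is invented), which makes the Lancaster stemmer's aggressiveness visible.

    from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer('english')   # Snowball needs a language name
    lancaster = LancasterStemmer()

    for word in ['running', 'flies', 'happily', 'generously']:
        print(word, '->', porter.stem(word),
              snowball.stem(word), lancaster.stem(word))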
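Fourth sketch: lemmatization with and without an explicit pos argument, showing the noun default.

    import nltk
    nltk.download('wordnet')   # lexical database behind the lemmatizer
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize('cats'))              # cat (noun by default)
    print(lemmatizer.lemmatize('running'))           # running (as a noun)
    print(lemmatizer.lemmatize('running', pos='v'))  # run
    print(lemmatizer.lemmatize('better', pos='a'))   # good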
6. Parts of Speech (POS) Tagging
- POS tagging assigns a grammatical category (noun, verb, adjective, adverb, etc.) to each word in a sentence; it is the process of identifying the parts of speech of a sentence.
- Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
- Two main types of POS tagging in NLP:
  - Rule-based POS tagging: relies on a predefined set of grammatical rules, a dictionary of words, and their POS tags. Simple to implement and understand, but less accurate.
  - Statistical POS tagging: uses ML algorithms to predict POS tags based on the context of the words in a sentence. It requires a large amount of training data and computational resources, and is therefore more accurate.
- The Averaged Perceptron Tagger is a statistical POS tagger that uses an ML algorithm called the averaged perceptron. It uses the universal POS tagset and is trained on a large corpus of text. The averaged_perceptron_tagger.zip package contains the pre-trained English POS tagger in NLTK.
- Some of the tags:

    Abbreviation   Meaning
    NNP            proper noun, singular
    PRP            personal pronoun (hers, herself, him, himself)
    TO             infinitive marker (to)
    VB             verb
    VBG            verb gerund (-ing form)
    VBP            verb, present tense, not 3rd person singular

- Example: "I'm going to meet Krushna." is tagged as
    [('I', 'PRP'), ("'m", 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('meet', 'VB'), ('Krushna', 'NNP'), ('.', '.')]
  (See the first sketch after this section.)

7. Named Entity Recognition
- Used to identify the names of organizations, people, geographic locations or any real-world value (called a named entity) in the text and tag them in the text. NLTK provides basic NER functionality.
- Chunking: the process of grouping words together into meaningful phrases based on their parts of speech and syntactic structure.
- MaxEnt NE Chunker (Maximum Entropy Named Entity Chunker): a named entity recognition chunker provided by NLTK, based on the maximum entropy classifier and used to identify and classify named entities (such as names of people, organizations and locations) within a text. Named entities are marked with labels like "PERSON", "GPE" (geopolitical entity) and "DATE".
- The words package in NLTK is a dataset that contains a basic list of common words for various languages; it may not include highly specialized or domain-specific vocabulary.
- Example: "GeeksforGeeks is a recognised platform in India". (See the second sketch after this section.)

8. WordNet
- WordNet is a lexical database for the English language that provides information about word meanings and the relationships between words. Words and synsets are linked by conceptual-semantic relations to form the structure of WordNet.
- One word form may connect to many different meanings, so we need senses to act as the unit of word meaning. Example: "bank" has two senses, financial institution and riverbank; bank1 and bank2 are members of two different synsets, although they have the same word form.
- Applications:
  1. Synonyms of words: synsets() returns the synsets of a word; synset() returns a set of synonymous words and takes as input the name of a synset, whose format is lemma.pos.nn, where pos is one of v (verb), n (noun), a (adjective), r (adverb), and nn is a unique identifier for the synset.
  2. lemmas() returns a list of Lemma objects, each of which contains information about a specific lemma, named word.pos.nn.lemma.
  3. Definition of a word: definition().
  4. Examples of a word: examples().
  5. Antonyms of words: antonyms().
  6. Word similarity: wup_similarity() is a method in NLTK that calculates the Wu-Palmer similarity between two synsets. This metric measures how closely two concepts are related within WordNet; the output lies between 0 (no similarity) and 1 (high similarity).
  7. Hypernyms and hyponyms: hypernyms() returns more general or broader concepts; hyponyms() returns more specific or narrower concepts; root_hypernyms() finds the most general concepts or categories that a word or concept belongs to.
  (See the third sketch after this section.)
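First sketch: POS tagging with the averaged perceptron tagger, reproducing the example above.

    import nltk
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("I'm going to meet Krushna.")
    print(nltk.pos_tag(tokens))
    # [('I', 'PRP'), ("'m", 'VBP'), ('going', 'VBG'), ('to', 'TO'),
    #  ('meet', 'VB'), ('Krushna', 'NNP'), ('.', '.')]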
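Second sketch: NER with the MaxEnt chunker on the example sentence above. ne_chunk expects POS-tagged tokens.

    import nltk
    nltk.download('maxent_ne_chunker')
    nltk.download('words')

    sentence = "GeeksforGeeks is a recognised platform in India"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # ne_chunk wraps recognized entities in labeled subtrees,
    # e.g. (GPE India/NNP).
    print(nltk.ne_chunk(tagged))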
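Third sketch: the WordNet lookups listed above. The words "bank", "good", "dog" and "cat" are chosen for illustration.

    import nltk
    nltk.download('wordnet')
    from nltk.corpus import wordnet

    syns = wordnet.synsets('bank')   # all senses of "bank"
    print(syns[0].name())            # a synset name in lemma.pos.nn form
    print(syns[0].definition())
    print(syns[0].examples())

    # Antonyms are looked up on lemmas, not synsets.
    good = wordnet.synset('good.a.01').lemmas()[0]
    print([ant.name() for ant in good.antonyms()])   # ['bad']

    dog = wordnet.synset('dog.n.01')
    cat = wordnet.synset('cat.n.01')
    print(dog.wup_similarity(cat))   # a value between 0 and 1

    print(dog.hypernyms())           # broader concepts
    print(dog.hyponyms()[:3])        # narrower concepts
    print(dog.root_hypernyms())      # [Synset('entity.n.01')]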
Types of Ambiguities in NLP
Ambiguity is the capability of being understood in more than one way.
1. Lexical ambiguity: exists when there are two or more possible meanings for a single word. Two forms:
   - Homonymy: two words that are spelled the same but have different meanings. Examples: bat, bank. Homonymous words are different words, with completely different roots and different meanings.
   - Polysemy: words that can mean different things depending on the context in which they are used and the tone in which they are spoken. Example: "a fine house" vs. "a fine situation!". Polysemic words are the same word with multiple related meanings in common use.
2. Syntactic (structural) ambiguity: multiple interpretations due to ambiguous syntax or sentence structure. Example: "I saw the stranger with the telescope."
3. Semantic ambiguity: Example: "He ate the burnt rice and dal."
   - Word Sense Disambiguation (WSD): determining the correct sense or meaning of a word in a particular context. Example: in "I saw a bat.", WSD is needed to determine whether "bat" is the flying mammal or the sports equipment. (A small WSD sketch follows below.)
   - Referential ambiguity: arises when something is referred to using a pronoun. Example: Kiran went to Sunita. She said, "I am hungry."
4. Pragmatic ambiguity:
   - Conversational ambiguity: "Can you pass the salt?"
   - Assumptions: a sentence may carry assumptions that are not explicitly stated. Example: "After doing submission, you can have your own time."
5. Ambiguity due to noise and errors:
   - Speech recognition errors.
   - Typographical errors.
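NLTK includes a classic Lesk-algorithm implementation for WSD. A minimal sketch follows; the context sentence is invented, and Lesk's dictionary-overlap heuristic will not always pick the sense a human would.

    from nltk.wsd import lesk
    from nltk.tokenize import word_tokenize

    context = word_tokenize("I went to the bank to deposit my money")

    # lesk picks the synset of "bank" whose gloss best overlaps the context.
    sense = lesk(context, 'bank')
    print(sense)
    print(sense.definition())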