Natural Language Processing (NLP) PDF
Yarmouk University
Ahmad T. Al-Taani
Summary
This document is a lecture presentation on Natural Language Processing (NLP). It covers introductory concepts, applications, and different approaches in the field. The presentation discusses the relationship between NLP, Natural Language Understanding (NLU), and Natural Language Generation (NLG).
Full Transcript
AI671 Natural Language Processing (NLP)
Prof. Dr. Ahmad T. Al-Taani
Department of Computer Science
Faculty of IT and CS
Yarmouk University
[email protected]

1. Introduction to Natural Language Processing
1.1 Differences between NLP, NLU, and NLG
1.2 The Study of Language
1.3 Applications of Natural Language Understanding
1.4 Evaluating Language Understanding Systems
1.5 The Different Levels of Language Analysis
1.6 Representations and Understanding
1.7 The Organization of Natural Language Understanding Systems

NLP - Prof. A. T. Al-Taani 1. Introduction to NLP

1.1 NLP vs. NLU vs. NLG

What is NLP?

NLP uses methods from various disciplines, such as computer science, artificial intelligence, linguistics, and data science, to enable computers to understand human language in both written and verbal forms.
NLP uses machine learning and deep learning techniques to complete tasks like language translation or question answering.
NLP takes unstructured data and converts it into a structured data format, for example through named entity recognition (NER) and identification of word patterns. It relies on methods like tokenization, which splits text into units, and stemming and lemmatization, which reduce words to their root forms.

Different approaches have been used for different types of language tasks:
Hidden Markov Models (HMMs) are used for part-of-speech (POS) tagging.
Recurrent Neural Networks (RNNs) help to generate the appropriate sequence of text.
N-gram models, a simple kind of language model (LM), assign probabilities to sentences or phrases to predict the accuracy of a response.
These techniques work together to support popular technology such as chatbots, or speech recognition products like Amazon’s Alexa or Apple’s Siri.

What is NLU?

NLU is a subset of NLP, which uses syntactic and semantic analysis of text and speech to determine the meaning of a sentence.
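As a brief aside on the techniques listed under "What is NLP?" above, the following is a minimal sketch of tokenization feeding a bigram (2-gram) language model that assigns a probability to a sentence. The toy corpus, the function names, the sentence-boundary markers, and the use of add-one smoothing are all assumptions made for illustration; they do not come from the lecture.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s> markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Multiply add-one smoothed bigram probabilities P(word | previous word)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Toy training data, invented for this sketch.
corpus = [
    "the book is in the folder",
    "the report is in the folder",
    "alice is swimming",
]
unigrams, bigrams = train_bigram_model(corpus)

likely = sentence_probability("the report is in the folder", unigrams, bigrams)
unlikely = sentence_probability("folder the in is report the", unigrams, bigrams)
print(likely > unlikely)
```

The word order seen in training receives a higher probability than the scrambled order, which is the sense in which an n-gram model can rank candidate responses.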
NLU also establishes a relevant ontology: a data structure which specifies the relationships between words and phrases.
While humans naturally do this in conversation, the combination of these analyses is required for a machine to understand the intended meaning of different texts.
For example, let’s take the following two sentences:
Alice is swimming against the current.
The current version of the report is in the folder.
The word "current" is a noun in the first sentence and an adjective in the second, so its meaning can only be determined from the rest of the sentence.
Example: Sentiment Analysis.

What is NLG?

NLG is another subset of NLP. While NLU focuses on computer reading comprehension, NLG enables computers to write.
NLG is the process of producing a human language text response based on some data input. This text can also be converted into a speech format through text-to-speech services.
NLG also encompasses text summarization capabilities that generate summaries from input documents while maintaining the integrity of the information.
Initially, NLG systems used templates to generate text. Based on some data or query, an NLG system would fill in the blank, like a game of Mad Libs.
Over time, NLG systems have evolved with the application of HMMs, RNNs, and transformers, enabling more dynamic text generation in real time.

NLP vs. NLU vs. NLG summary

NLP seeks to convert unstructured language data into a structured data format to enable machines to understand speech and text and formulate relevant, contextual responses. Its subtopics include NLU and NLG.
NLU focuses on machine reading comprehension through grammar and context, enabling it to determine the intended meaning of a sentence.
NLG focuses on text generation, or the construction of text in English or other languages, by a machine and based on a given dataset.

1.2 The Study of Language

Language is one of the fundamental aspects of human behavior and is a crucial component of our lives.
In written form it serves as a long-term record of knowledge from one generation to the next.
In spoken form it serves as our primary means of coordinating our day-to-day behavior with others.

Language is studied in several different academic disciplines. Each discipline defines its own set of problems and has its own methods for addressing them.
The linguist, for instance, studies the structure of language itself, considering questions such as why certain combinations of words form sentences but others do not, and why a sentence can have some meanings but not others.
The psycholinguist, on the other hand, studies the processes of human language production and comprehension, considering questions such as how people identify the appropriate structure of a sentence and when they decide on the appropriate meaning for words.
The philosopher considers how words can mean anything at all and how they identify objects in the world. Philosophers also consider what it means to have beliefs, goals, and intentions, and how these cognitive capabilities relate to language.

The goal of the computational linguist is to develop a computational theory of language, using the notions of algorithms and data structures from computer science.
To build a computational model, you must take advantage of what is known from all the other disciplines. Figure 1.2 summarizes these different approaches to studying language.

Figure 1.2 The major disciplines studying language

Discipline: Linguists
Typical Problems: How do words form phrases and sentences? What constrains the possible meanings for a sentence?
Tools: Intuitions about well-formedness and meaning; mathematical models of structure (for example, formal language theory, model theoretic semantics)

Discipline: Psycholinguists
Typical Problems: How do people identify the structure of sentences? How are word meanings identified? When does understanding take place?
Tools: Experimental techniques based on measuring human performance; statistical analysis of observations

Discipline: Philosophers
Typical Problems: What is meaning, and how do words and sentences acquire it? How do words identify objects in the world?
Tools: Natural language argumentation using intuition about counter-examples; mathematical models (for example, logic and model theory)

Discipline: Computational Linguists
Typical Problems: How is the structure of sentences identified? How can knowledge and reasoning be modeled? How can language be used to accomplish specific tasks?
Tools: Algorithms, data structures; formal models of representation and reasoning; AI techniques (search and representation methods)

Motivations for Developing Computational Models

The scientific motivation is to obtain a better understanding of how language works. It recognizes that none of the other traditional disciplines has, on its own, the tools to completely address the problem of how language comprehension and production work.
Computational models may provide very specific predictions about human behavior that can then be explored by the psycholinguist. By continuing in this process, we may eventually acquire a deep understanding of how human language processing occurs.

1.3 Applications of Natural Language Understanding

The applications of NLU can be divided into two major classes: text-based applications and dialogue-based applications.
Text-based applications involve the processing of written text, such as books, newspapers, reports, manuals, e-mail messages, and so on. These are all reading-based tasks.
Dialogue-based applications involve human-machine communication. Most naturally this involves spoken language, but it also includes interaction using keyboards.
Text-based applications include the following:
Finding appropriate documents on certain topics from a database of texts
Extracting information from messages or articles on certain topics
Translating documents from one language to another
Summarizing texts for certain purposes
Story Understanding

Text-based NL Research Areas

Information Retrieval

For example, consider the task of finding newspaper articles on a certain topic in a large database.
Many techniques have been developed that classify documents by the presence of certain keywords in the text. You can then retrieve articles on a certain topic by looking for articles that contain the keywords associated with that topic.

Machine Translation

Some machine translation systems have been built that are based on pattern matching. The translation is accomplished by finding the best set of patterns that match the input and producing the associated output in the other language.
This technique can produce reasonable results in some cases but sometimes produces completely wrong translations because of its inability to use an understanding of content to disambiguate word senses and sentence meanings appropriately.
In contrast, other machine translation systems operate by producing a representation of the meaning of each sentence in one language, and then producing a sentence in the other language that realizes the same meaning.

Dialogue-based Applications

Question-Answering systems, where natural language is used to query a database
Automated Customer Service over the telephone
Tutoring Systems, where the machine interacts with a student
Spoken Language Control of a machine (for example, voice control of a VCR or computer)
General Cooperative Problem-Solving Systems (for example, a system that helps a person plan and schedule freight shipments)
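The keyword-based retrieval described under Information Retrieval above can be sketched as follows. This is a minimal illustration, not a method taken from the lecture: the topic names, keyword sets, article texts, and function names are all hypothetical.

```python
import re

# Hypothetical topic-to-keyword table; a real system would use far larger lists.
TOPIC_KEYWORDS = {
    "finance": {"bank", "stock", "market", "shares"},
    "sports": {"match", "goal", "team", "league"},
}

# Toy article database (title -> text), invented for this sketch.
ARTICLES = {
    "Markets rally": "The stock market rose as bank shares recovered.",
    "Cup final": "The team scored a late goal to win the match.",
    "River walk": "She sat on the bank of the river and watched the boats.",
}


def words_in(text):
    """Return the set of lowercase words that occur in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, min_hits=1):
    """Return every topic with at least min_hits of its keywords present in the text."""
    words = words_in(text)
    return {
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if len(words & keywords) >= min_hits
    }


def retrieve(articles, topic):
    """Return the titles of all articles classified under the given topic."""
    return [title for title, text in articles.items() if topic in classify(text)]


finance_hits = retrieve(ARTICLES, "finance")
sports_hits = retrieve(ARTICLES, "sports")
print(finance_hits)
print(sports_hits)
```

In this toy data the article about a river bank is also retrieved for the finance topic, because matching on keywords alone cannot tell the two senses of "bank" apart. This is the same kind of limitation the lecture attributes to pattern-matching translation systems.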
Some of the problems faced by dialogue systems are quite different from those in text-based systems.
First, the language used is very different, and the system needs to participate actively in order to maintain a natural, smooth-flowing dialogue.
Dialogue requires the use of acknowledgments to verify that things are understood, and an ability to both recognize and generate clarification sub-dialogues when something is not clearly understood.

A Speech Recognition System need not involve any language understanding. For instance, voice-controlled computers and VCRs are entering the market now. These do not involve natural language understanding in any general way. Rather, the words recognized are used as commands, much like the commands you send to a VCR using a remote control.
Speech Recognition is concerned only with identifying the words spoken from a given speech signal, not with understanding how words are used to communicate.
To be an understanding system, the speech recognizer would need to feed its output to a natural language understanding system, producing what is often called a Spoken Language Understanding System.

1.4 Evaluating Language Understanding Systems

How can you tell if a system works?
1. Black Box Evaluation: you evaluate system performance without looking inside to see how it works. You run the program and see how well it performs the task it was designed to do. If the program is meant to answer questions about a database of facts, you might ask it questions to see how good it is at producing the correct answers.
2. Glass Box Evaluation: you look inside at the structure of the system to identify its various subcomponents and then evaluate each one with appropriate tests.
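The two evaluation styles can be contrasted with a small sketch. The question-answering system, its word-splitting subcomponent, the fact table, and the test cases below are all hypothetical stand-ins invented for illustration.

```python
# Toy fact database for a hypothetical question-answering system.
FACTS = {"capital of jordan": "Amman", "capital of france": "Paris"}


def recognize_words(utterance):
    """Hypothetical subcomponent: turn raw input into a list of lowercase words."""
    return utterance.lower().rstrip("?").split()


def answer_question(question):
    """Hypothetical system under test: look up the last three words as a fact key."""
    words = recognize_words(question)
    return FACTS.get(" ".join(words[-3:]), "unknown")


def accuracy(function, cases):
    """Fraction of (input, expected output) pairs that the function gets right."""
    correct = sum(1 for given, expected in cases if function(given) == expected)
    return correct / len(cases)


# Black box: only whole-system inputs and outputs are examined.
black_box_cases = [
    ("What is the capital of Jordan?", "Amman"),
    ("What is the capital of France?", "Paris"),
    ("Which city is France's capital?", "Paris"),
]

# Glass box: one subcomponent is tested in isolation with its own test cases.
glass_box_cases = [
    ("Capital of Jordan?", ["capital", "of", "jordan"]),
    ("What is it?", ["what", "is", "it"]),
]

black_box_score = accuracy(answer_question, black_box_cases)
glass_box_score = accuracy(recognize_words, glass_box_cases)
print(black_box_score, glass_box_score)
```

Here the black box score shows that the whole system fails on a rephrased question, while the glass box score shows that the word-splitting component passes its own tests. The fault is therefore more likely in the lookup step.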
The problem with glass box evaluation is that it requires some consensus on what the various components of an NL system should be.

1.5 The Different Levels of Language Analysis

An NL system must use considerable knowledge about the structure of the language itself, including what the words are, how words combine to form sentences, what the words mean, how word meanings contribute to sentence meanings, and so on.
It must also take into account general world knowledge and reasoning abilities of the kind humans have.
For example, to answer questions or to participate in a conversation, a person not only must know a lot about the structure of the language being used, but also must know about the world in general and the conversational setting in particular.

Phonetic and phonological knowledge - concerns how words are related to the sounds that realize them. Such knowledge is crucial for speech-based systems.
Morphological knowledge - concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the word "friendly" is derivable from the meaning of the noun "friend" and the suffix "-ly", which transforms a noun into an adjective).
Syntactic knowledge - concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases.
Semantic knowledge - concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning - the meaning a sentence has regardless of the context in which it is used.
Pragmatic knowledge - concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
Discourse knowledge - concerns how the immediately preceding sentences affect the interpretation of the next sentence. This information is especially important for interpreting pronouns and for interpreting the temporal aspects of the information conveyed.
World knowledge - includes the general knowledge about the structure of the world that language users must have in order to, for example, maintain a conversation. It includes what each language user must know about the other user’s beliefs and goals.

1.6 Representations and Understanding

A crucial component of understanding involves computing a representation of the meaning of sentences and texts.
Why not simply use the sentence itself as a representation of its meaning?
One reason is that most words have multiple meanings, which we will call senses. The word "cook", for example, has a sense as a verb and a sense as a noun; "dish" has multiple senses as a noun as well as a sense as a verb.

This ambiguity would inhibit the system from making the appropriate inferences needed to model understanding.
The disambiguation problem appears much easier than it is because people do not generally notice ambiguity.
While a person does not seem to consider each of the possible senses of a word when understanding a sentence, a program must explicitly consider them one by one.

To represent meaning, we must have a more precise language. The tools to do this come from mathematics and logic and involve the use of formally specified representation languages.
Formal languages are specified from very simple building blocks. The most fundamental is the notion of an atomic symbol, which is distinguishable from any other atomic symbol simply based on how it is written.
Useful representation languages have the following two properties:
The representation must be precise and unambiguous. You should be able to express every distinct reading of a sentence as a distinct formula in the representation.
The representation should capture the intuitive structure of the natural language sentences that it represents. For example, sentences that appear to be structurally similar should have similar structural representations, and the meanings of two sentences that are paraphrases of each other should be closely related to each other.

Syntax: Representing Sentence Structure

The syntactic structure of a sentence indicates the way that words in the sentence are related to each other.
This structure indicates how the words are grouped together into phrases, what words modify what other words, and what words are of central importance in the sentence.
In addition, this structure may identify the types of relationships that exist between phrases and can store other information about the particular sentence structure that may be needed for later processing.

For example, consider the following sentences:
1. John sold the book to Mary.
2. The book was sold to Mary by John.
These sentences share certain structural properties. In each, the noun phrases are "John", "Mary", and "the book". In other respects, these sentences are significantly different.

For instance, you could only give sentence 1 as an answer to the question "What did John do for Mary?"
Sentence 2 is a much better continuation of a sentence beginning with the phrase "After it fell in the river", as sentences 3 and 4 show.
Following the standard convention in linguistics, we will use an asterisk (*) before any example of an ill-formed or questionable sentence.
3. *After it fell in the river, John sold Mary the book.
4. After it fell in the river, the book was sold to Mary by John.
5. *John are in the corner.
6. *John put the book.
Sentence 5 is ill-formed because the subject and the verb do not agree in number (the subject is singular and the verb is plural), while 6 is ill-formed because the verb "put" requires some modifier that describes where John put the object.

In fact, a robust system should be able to understand ill-formed sentences whenever possible. This might suggest that agreement checks can be ignored, but this is not so.
Agreement checks are essential for eliminating potential ambiguities. Consider sentences 7 and 8, which are identical except for the number feature of the main verb, yet represent two quite distinct interpretations.
7. Flying planes are dangerous.
8. Flying planes is dangerous.
If you did not check subject-verb agreement, these two sentences would be indistinguishable and ambiguous.
Most syntactic representations of language are based on the notion of Context-Free Grammars (CFG), which represent sentence structure in terms of what phrases are subparts of other phrases.

Figure 1.4 shows two different structures for the sentence "Rice flies like sand".
The two structures give further details on the structure of the noun phrase and verb phrase and identify the part of speech for each word.

The Logical Form

The syntactic structure of a sentence does not, by itself, determine its meaning.
For example, the NP "the catch" can have different meanings depending on whether the speaker is talking about a baseball game or a fishing expedition.
Both these interpretations have the same syntactic structure, and the different meanings arise from an ambiguity concerning the sense of the word "catch".
Once the correct sense is identified, say the fishing sense, there still is a problem in determining what fish are being referred to.
The intended meaning of a sentence depends on the situation in which the sentence is produced.

The relevant division is between context-independent meaning and context-dependent meaning.
The fact that "catch" may refer to a baseball move or to the results of a fishing expedition is knowledge about English and is independent of the situation in which the word is used.
On the other hand, the fact that a particular NP "the catch" refers to what Jack caught when fishing yesterday is contextually dependent.
The representation of the context-independent meaning of a sentence is called its logical form.

The logical form encodes possible word senses and identifies the semantic relationships between the words and phrases.
Many of these relationships are captured using an abstract set of semantic relationships between the verb and its NPs.
In particular, in both sentences 1 and 2 previously given, the action described is a selling event, where "John" is the seller, "the book" is the object being sold, and "Mary" is the buyer.
These roles are instances of the abstract semantic roles AGENT, THEME, and TO-POSS (for final possessor), respectively.

Once the semantic relationships are determined, some word senses may be impossible and thus eliminated from consideration. Consider the sentence
9. Jack invited Mary to the Halloween ball.
The word "ball", which by itself is ambiguous between the plaything that bounces and the formal dance event, can only take the latter sense in sentence 9, because the verb "invite" only makes sense with this interpretation.
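The sense elimination in sentence 9 can be sketched with a toy lexicon. The sense labels, the semantic types, and the restriction table below are hypothetical, invented only to illustrate the idea that a verb constrains which senses of its arguments are possible.

```python
# Hypothetical sense lexicon: each sense of a word carries a coarse semantic type.
SENSES = {
    "ball": [
        {"sense": "BALL-TOY", "type": "physical-object"},
        {"sense": "BALL-DANCE", "type": "event"},
    ],
}

# Hypothetical restrictions: which semantic types a verb accepts in a given slot.
RESTRICTIONS = {
    ("invite", "to"): {"event", "location"},
    ("throw", "object"): {"physical-object"},
}


def possible_senses(verb, slot, noun):
    """Return the senses of noun that are compatible with the verb's slot."""
    allowed = RESTRICTIONS.get((verb, slot))
    senses = SENSES[noun]
    if allowed is None:
        # No restriction is known for this verb and slot, so nothing is eliminated.
        return [entry["sense"] for entry in senses]
    return [entry["sense"] for entry in senses if entry["type"] in allowed]


invited_to = possible_senses("invite", "to", "ball")
thrown = possible_senses("throw", "object", "ball")
seen = possible_senses("see", "object", "ball")
print(invited_to, thrown, seen)
```

With "invite" only the dance sense survives, with "throw" only the plaything sense survives, and with a verb that has no restriction in this table both senses remain and would have to be resolved from context.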
One of the key tasks in semantic interpretation is to consider what combinations of the individual word meanings can combine to create coherent sentence meanings.
Exploiting such interconnections between word meanings can greatly reduce the number of possible word senses for each word in a given sentence.

The Final Meaning Representation

The final representation needed is a general knowledge representation (KR), which the system uses to represent and reason about its application domain.
The goal of contextual interpretation is to take a representation of the structure of a sentence and its logical form, and to map this into some expression in the KR that allows the system to perform the appropriate task in the domain.

In a question-answering application, a question might map to a database query; in a story-understanding application, a sentence might map into a set of expressions that represent the situation that the sentence describes.
We will assume that the first-order predicate calculus (FOPC) is the final representation language because it is relatively well known, well studied, and precisely defined.

1.7 The Organization of NLU Systems

Figure 1.5 shows the organization of NLU systems.
There are interpretation processes that map from one representation to another. For instance, the process that maps a sentence to its syntactic structure and logical form is called the parser.
It uses knowledge about words and word meanings (the lexicon) and a set of rules defining the legal structures (the grammar) in order to assign a syntactic structure and a logical form to an input sentence.

An alternative organization could perform syntactic processing first and then perform semantic interpretation on the resulting structures.
Combining the two has considerable advantages because it leads to a reduction in the number of possible interpretations, since every proposed interpretation must simultaneously be syntactically and semantically well formed.
For example, consider the following two sentences:
10. Visiting relatives can be trying.
11. Visiting museums can be trying.
These two sentences have identical syntactic structure, so both are syntactically ambiguous.
In sentence 10, the subject might be relatives who are visiting you or the event of you visiting relatives. Both of these alternatives are semantically valid, and you would need to determine the appropriate sense by using the contextual mechanism.

Sentence 11 has only one possible semantic interpretation, since museums are not objects that can visit other people; rather they must be visited.
In a system with separate syntactic and semantic processing, there would be two syntactic interpretations of sentence 11, one of which the semantic interpreter would eliminate later.
If syntactic and semantic processing are combined, however, the system will be able to detect the semantic anomaly as soon as it interprets the phrase "visiting museums", and thus will never build the incorrect syntactic structure in the first place.

Continuing through Figure 1.5, the process that transforms the syntactic structure and logical form into a final meaning representation is called contextual processing.
This process includes issues such as identifying the objects referred to by NPs such as definite descriptions (for example, "the man") and pronouns, the analysis of the temporal aspects of the new information conveyed by the sentence, the identification of the speaker’s intention (for example, whether "Can you lift that rock" is a yes/no question or a request), as well as all the inferential processing required to interpret the sentence appropriately within the application domain.

Contextual processing uses knowledge of the discourse context (determined by the sentences that preceded the current one) and knowledge of the application to produce a final representation.
The system would then perform whatever reasoning tasks are appropriate for the application.
When this requires a response to the user, the meaning that must be expressed is passed to the generation component of the system.
The generation component uses knowledge of the discourse context, plus information on the grammar and lexicon, to plan the form of an utterance, which then is mapped into words by a realization process.
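Returning to sentences 10 and 11, the difference between separate and combined syntactic and semantic processing can be sketched as follows. The structure format, the set of nouns that can act as visitors, and the function names are hypothetical simplifications; a real parser would work from a grammar and lexicon rather than a fixed pair of candidates.

```python
# Hypothetical semantic knowledge: nouns that can themselves do the visiting.
CAN_VISIT = {"relatives"}


def propose_structures(noun):
    """Yield the two candidate readings of 'Visiting <noun> can be trying'."""
    yield {"subject": "entity", "gloss": f"{noun} who are visiting"}
    yield {"subject": "event", "gloss": f"the act of visiting {noun}"}


def is_semantically_valid(structure, noun):
    """The 'entity' reading requires a noun that is able to visit."""
    if structure["subject"] == "entity":
        return noun in CAN_VISIT
    return True


def separate_processing(noun):
    """Syntax first: keep every candidate, then let semantics filter afterwards."""
    built = list(propose_structures(noun))
    kept = [s for s in built if is_semantically_valid(s, noun)]
    return len(built), kept


def combined_processing(noun):
    """Check each candidate semantically before it is added to the result."""
    built = []
    for candidate in propose_structures(noun):
        if is_semantically_valid(candidate, noun):
            built.append(candidate)
    return len(built), built


for noun in ("relatives", "museums"):
    n_separate, kept = separate_processing(noun)
    n_combined, built = combined_processing(noun)
    print(noun, n_separate, n_combined, [s["gloss"] for s in built])
```

Both organizations end with the same surviving readings. In this simplified sketch the combined version retains only one structure for "museums", whereas the separate version first stores two and discards one later.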