Question Answering
Le Thanh Huong
School of Information and Communication Technology
[email protected]

How can you solve the following problems?
1. Build a QA system based on a set of FAQs / a user manual
2. Build a chatbot for a computer store
3. Build a QA system with a knowledge base drawn from Wikipedia
4. Build a QA system for advising on education enrollment
5. Build a QA system for study counseling

Question Answering
Target: build a system that can automatically answer human questions posed in natural language.
Question (Q) → Answer (A)
Sources: text passages, web documents, knowledge bases, databases, available Q&A sets
Question types: returning a value or not, open/closed domain, simple/complex questions, etc.
Answer types: a few words, a paragraph, a list, yes/no, etc.

Sample TREC questions
Who is the author of the book "The Iron Lady: A Biography of Margaret Thatcher"?
What was the monetary value of the Nobel Peace Prize in 1989?
What does the Peugeot company manufacture?
How much did Mercury spend on advertising in 1993?
Why did David Koresh ask the FBI for a word processor?

People want to ask questions
Examples from the AltaVista query log (late 1990s):
Who invented surf music?
How to make stink bombs
Which english translation of the bible is used in official catholic liturgies?
Examples from the Excite query log (12/1999):
How can i find someone in Texas
Where can i find information on puritan religion?
What vacuum cleaner does Consumers Guide recommend

Online QA examples
LCC: http://www.languagecomputer.com/demos/question_answering/index.html
AnswerBus, an open-domain question answering system: www.answerbus.com
EasyAsk, AnswerLogic, AnswerFriend, Start, Quasm, Mulder, Webclopedia, TextMap, etc.
Google

AskJeeves
…is the most hyped example of QA
…does pattern matching to match your question against its own knowledge base of questions
If that works, you get the human-curated answers to that known question
If that fails, it returns a regular web search
A potentially interesting middle ground, but a weak shadow of real QA

Top performing systems
…can answer ~70% of the questions
Approaches:
Knowledge-rich approaches using many NLP techniques (Harabagiu, Moldovan et al. - SMU/UTD/LCC)
AskMSR: a shallow approach
Middle ground: use a large collection of surface matching patterns (ISI)

AskMSR: a shallow approach
In what year did Abraham Lincoln die?
Ignore hard documents and find easy ones

AskMSR: details
Step 1: Rewrite queries
Intuition: the user's question is often syntactically quite close to sentences that contain the answer
Where is the Louvre Museum located? / The Louvre Museum is located in Paris.
Who created the character of Scrooge? / Charles Dickens created the character of Scrooge.

Query rewriting
Classify the question into 7 categories:
Who is/was/are/were…? When is/did/will/are/were…? Where is/are/were…? …
a) Category-specific transformation rules
E.g., for a Where question, move "is" to all possible positions:
Where is the Louvre Museum located?
→ is the Louvre Museum located?
→ the is Louvre Museum located?
→ the Louvre is Museum located?
→ the Louvre Museum is located?
→ the Louvre Museum located is?
b) Expected answer "datatype" (e.g., Date, Person, Location, …)
When was the French Revolution? → DATE
Hand-crafted classification/rewrite/datatype rules

Query rewriting - weights
Some query rewrites are more reliable than others

Step 2: Query the search engine
Send all rewrites to a Web search engine
Retrieve the top N answers (100?)
Rely just on the search engine's returned words/phrases, not the full text of the actual documents
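To make Step 1 concrete, here is a minimal Python sketch of the "is"-movement rewrite for Where-questions; the function name and the reliability weights are illustrative assumptions, not the values used by AskMSR.

```python
# Minimal sketch of AskMSR-style query rewriting (Step 1).
# Weights are illustrative: some rewrites are treated as more reliable than others.

def rewrite_where_question(question: str):
    """Move 'is' to every possible position in a 'Where is ...?' question."""
    words = question.rstrip("?").split()
    if len(words) < 3 or words[0].lower() != "where" or words[1].lower() != "is":
        return [(question, 1.0)]          # fall back to the literal question
    rest = words[2:]                      # e.g. ['the', 'Louvre', 'Museum', 'located']
    rewrites = []
    for pos in range(len(rest) + 1):
        candidate = " ".join(rest[:pos] + ["is"] + rest[pos:])
        # Placing 'is' just before the last word tends to produce the declarative
        # pattern found in answer sentences ("... Museum is located ..."), so this
        # sketch gives it a higher (assumed) weight.
        weight = 5.0 if pos == len(rest) - 1 else 1.0
        rewrites.append((candidate, weight))
    return rewrites

if __name__ == "__main__":
    for text, w in rewrite_where_question("Where is the Louvre Museum located?"):
        print(f"{w:>4}  {text}")
```

Each weighted rewrite is then what gets sent verbatim to the web search engine in Step 2, keeping only the returned snippets.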
Step 3: Mining n-grams
Unigram, bigram, trigram, …, n-gram: a list of N adjacent terms in a sequence
E.g., "Web Question Answering: Is More Always Better"
Unigram: Web, Question, Answering, Is, More, Always, Better
Bigram: Web Question, Question Answering, Answering Is, Is More, More Always, Always Better
Trigram: …

Mining n-grams
Simple: enumerate all n-grams (N = 1, 2, 3, …) in all retrieved phrases
Use a hash table and other tools to make this efficient
Weight of an n-gram: its occurrence count
E.g., for "Who created the character of Scrooge?":
Dickens - 117
Christmas Carol - 78
Charles Dickens - 75
Disney - 72
Carl Banks - 54
A Christmas - 41
Christmas Carol - 45

Step 4: Filtering n-grams
Each question type is associated with one or more "data-type filters" (regular expressions):
When… → Date
Where… → Location
What… → …
Who… → Person
Boost the score of n-grams that match the regexp
Lower the score of n-grams that do not match the regexp

Step 5: Tiling the answers

Results
Standard TREC contest test-bed: ~1M documents, 900 questions
The technique does not do especially well (but ranked in the top 9 of 30 participants!)
Limitation: it works best only for fact-based questions
Limited range of question categories, answer data types, and query rewriting rules

Surface matching patterns (Ravichandran and Hovy, ISI)
When was X born?
Typical answer sentences: "Mozart was born in 1756." / "Gandhi (1869-1948) …"
Suggested patterns: "<NAME> was born in <DATE>", "<NAME> (<DATE>-"
Use a Q-A pair to query a search engine
Extract patterns and compute their accuracy

Example: INVENTOR patterns
<ANSWER> invents <NAME>
the <NAME> was invented by <ANSWER>
<ANSWER> invented the <NAME> in
<ANSWER>'s invention of the <NAME>
…
Many of these patterns have high accuracy, but they still make some mistakes

Full NLP QA
LCC: Harabagiu, Moldovan et al.
Value from sophisticated NLP - Pasca & Harabagiu (2001)
Good IR is needed: SMART paragraph retrieval
A large taxonomy of question types and expected answer types is crucial
A statistical parser is used to parse questions and relevant text for answers, and to build the KB
Query expansion loops (morphological variants, lexical synonyms, and semantic relations) are important
Answer ranking by a simple ML method

Answer types in state-of-the-art QA systems
Features:
Answer type: labels questions with an answer type based on a taxonomy
Classifies questions (e.g., using a maximum entropy model)
Answer types:
"Who" questions can have organizations as answers: Who sells the most hybrid cars?
"Which" questions can have people as answers: Which president went to war with Mexico?

Keyword selection algorithm
Select all…
Non-stopwords in quotations
NNP words in recognized named entities
Complex nominals with their adjectival modifiers
Other complex nominals
Nouns with adjectival modifiers
Other nouns
Verbs
The answer type word

Passage extraction loop
Passage extraction component: extracts passages that contain all selected keywords
Passage size and start position are dynamic
Passage quality and keyword adjustment:
1st iteration: use the first 6 keyword selection heuristics
If #passages < a threshold → the query is too strict → drop a keyword
If #passages > a threshold → the query is too relaxed → add a keyword

Passage scoring
Involves 3 scores:
#words from the question that are recognized in the same sequence in the window
#words that separate the most distant keywords in the window
#unmatched keywords in the window
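A minimal sketch of how these three window scores could be computed, assuming simple tokenized inputs; the function and variable names are mine, and the real LCC scoring details differ.

```python
# Rough sketch of the three passage-window scores listed above.
# Names and exact scoring choices are illustrative assumptions.

def passage_scores(window_tokens, question_tokens, keywords):
    win = [t.lower() for t in window_tokens]
    q = [t.lower() for t in question_tokens]
    kws = [k.lower() for k in keywords]

    # Score 1: longest run of question words appearing in the same order in the window.
    same_sequence = 0
    for i in range(len(win)):
        for j in range(len(q)):
            length = 0
            while (i + length < len(win) and j + length < len(q)
                   and win[i + length] == q[j + length]):
                length += 1
            same_sequence = max(same_sequence, length)

    # Score 2: number of words separating the most distant matched keywords.
    positions = [i for i, t in enumerate(win) if t in kws]
    span = (max(positions) - min(positions)) if positions else 0

    # Score 3: number of keywords not found in the window at all.
    unmatched = sum(1 for k in kws if k not in win)

    return same_sequence, span, unmatched

if __name__ == "__main__":
    window = "Among them was Christa McAuliffe the first private citizen to fly in space".split()
    question = "Name the first private citizen to fly in space".split()
    keywords = ["private", "citizen", "fly", "space"]
    print(passage_scores(window, question, keywords))
```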
Rank candidate answers in the retrieved passages
Q: Name the first private citizen to fly in space.
Answer type: Person
Text passage: "Among them was Christa McAuliffe, the first private citizen to fly in space. Karen Allen, best known for her starring role in 'Raiders of the Lost Ark', plays McAuliffe. Brian Kerwin is featured as shuttle pilot Mike Smith…"
Best candidate answer: Christa McAuliffe

Named Entity Recognition
The performance of several QA systems is determined by the recognition of named entities. E.g.:
Give me information about some Dell laptops with prices in the range from 18 million to 22 million VND.
Can you recommend some tourist attractions in Hanoi?
Give me some information about houses for rent near HUST with the price under 5 million VND.
Which city has the largest population in Vietnam?
Important features:
Precision of recognition
Coverage of name classes (producer - brand name)
Mapping into concept hierarchies (computer, laptop, iPad, …)
Participation in semantic relations (e.g., predicate-argument structures or frame semantics)

Semantics and reasoning for QA: predicate-argument structure
When was Microsoft established?
"Microsoft plans to establish manufacturing partnerships in Brazil and Mexico in May."
We need to detect sentences in which "Microsoft" is the object of "establish" or a close synonym.
Matching sentence: "Microsoft Corp was founded in the US in 1975, incorporated in 1981, and established in the UK in 1982."
This requires analysis of sentence syntax and semantics.

Semantics and reasoning for QA: syntax to logical forms
Syntactic analysis plus semantics → logical form
Mapping of question and potential-answer logical forms to find the best match
Inference: the system attempts inference to justify an answer (often following lexical chains)
This inference is a middle ground between logic and pattern matching, but very effective: a 30% improvement
Q: When was the internal combustion engine invented?
A: The first internal-combustion engine was built in 1867.
invent → create_mentally → create → build

Neural models for reading comprehension
Stanford Question Answering Dataset (SQuAD): 100k annotated (passage, question, answer) triples
Large-scale supervised datasets are also a key ingredient for training effective neural models for reading comprehension!
Passages are selected from English Wikipedia, usually 100-150 words.
Questions are crowd-sourced.
Each answer is a short segment of text (a span) in the passage.
This is a limitation: not all questions can be answered in this way!
SQuAD was for years the most popular reading comprehension dataset; it is "almost solved" today (though the underlying task is not), and the state of the art exceeds the estimated human performance.
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension of Text

Stanford Question Answering Dataset (SQuAD): evaluation
Evaluation: exact match (0 or 1) and F1 (partial credit).
For the development and test sets, 3 gold answers are collected, because there can be multiple plausible answers.
We compare the predicted answer to each gold answer (a, an, the, and punctuation are removed) and take the max score; finally, we average over all examples for both exact match and F1.
Estimated human performance: EM = 82.3, F1 = 91.2
Q: What did Tesla do in December 1878?
Gold answers: {left Graz, left Graz, left Graz and severed all relations with his family}
Prediction: left Graz and severed
Exact match: max{0, 0, 0} = 0
F1: max{0.67, 0.67, 0.61} = 0.67
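A minimal sketch of this exact-match/F1 scoring, assuming a simplified normalization (the official SQuAD evaluation script differs in small details):

```python
# Sketch of SQuAD-style evaluation: exact match and token-level F1,
# taking the maximum over the gold answers. Helper names are illustrative.
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and the articles a/an/the, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def score(prediction, golds):
    """Return (EM, F1), each taken as the max over the gold answers."""
    return (max(exact_match(prediction, g) for g in golds),
            max(f1(prediction, g) for g in golds))

if __name__ == "__main__":
    golds = ["left Graz", "left Graz",
             "left Graz and severed all relations with his family"]
    print(score("left Graz and severed", golds))   # EM = 0.0, F1 ≈ 0.67
```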
Other question answering datasets
TriviaQA: questions and answers written by trivia enthusiasts, paired with independently collected web paragraphs that contain the answer and appear to discuss the question, but with no human verification that a paragraph actually supports the answer.
Natural Questions: questions drawn from frequently asked Google search queries. Answers come from Wikipedia paragraphs; an answer can be a substring, yes, no, or NOT_PRESENT. Verified by human annotation.
HotpotQA: questions constructed to be answered over the whole of Wikipedia, requiring information from two pages to answer a multi-step query:
Q: Which novel by the author of "Armada" will be adapted as a feature film by Steven Spielberg?
A: Ready Player One

BiDAF: the Bidirectional Attention Flow model
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

BiDAF: encoding
Use a concatenation of word embeddings (GloVe) and character embeddings (CNNs over character embeddings) for each word in the context and the query
Then use two bidirectional LSTMs separately to produce contextual embeddings for both the context and the query

BiDAF: attention
First, compute a similarity score for every pair (c_i, q_j)
Context-to-query attention: for each context word c_i, choose the most relevant words from the query (which query words are most relevant to c_i)
Query-to-context attention: choose the context words that are most relevant to one of the query words
(Slides adapted from Minjoon Seo)

BiDAF: modeling and output layers
Modeling layer: pass the representations to another two layers of bidirectional LSTMs
The attention layer models interactions between the query and the context
The modeling layer models interactions within the context words
Output layer: two classifiers predicting the start and end positions of the answer

BiDAF: performance on SQuAD
F1 = 77.3 on SQuAD v1.1
Without context-to-query attention: 67.7 F1
Without query-to-context attention: 73.7 F1
Without character embeddings: 75.4 F1
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension
Attention visualization

BERT for reading comprehension
BERT is a deep bidirectional Transformer encoder pre-trained on large amounts of text (Wikipedia + BooksCorpus)
BERT is pre-trained on two training objectives:
Masked language model (MLM)
Next sentence prediction (NSP)
BERT-base has 12 layers and 110M parameters; BERT-large has 24 layers and 330M parameters

Transformer
A deep learning model that transforms an input sequence into an output sequence using an encoder-decoder architecture
Multi-head attention
Feed-forward layers
Positional embeddings
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04

BERT for reading comprehension
Question = Segment A
Passage = Segment B
Answer = predicting two endpoints (start and end) in Segment B
Image credit: https://mccormickml.com/

BERT for reading comprehension: results on SQuAD (dev set, except for human performance)
                    F1     EM
Human performance   91.2   82.3
BiDAF               77.3   67.7
BERT-base           88.5   80.8
BERT-large          90.9   84.1
XLNet               94.5   89.0
RoBERTa             94.6   88.9
ALBERT              94.8   89.3
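For hands-on experimentation, extractive QA with a BERT-style model can be run through the Hugging Face transformers question-answering pipeline; the checkpoint named below is only an example choice, not the model from the slides, and the snippet assumes the transformers package (and a backend such as PyTorch) is installed.

```python
# Sketch: extractive reading comprehension with a BERT-style encoder via the
# Hugging Face transformers pipeline. The checkpoint is an example assumption.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("Nikola Tesla left Graz and severed all relations with his family "
           "in December 1878.")
result = qa(question="What did Tesla do in December 1878?", context=context)

# The pipeline returns the predicted answer span, its character offsets in the
# context, and a confidence score.
print(result["answer"], result["start"], result["end"], result["score"])
```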
ChatGPT
Launched in November 2022 by OpenAI
GPT stands for Generative Pre-trained Transformer
ChatGPT is powered by an LLM (Large Language Model), so it can produce human-like responses
Data: drawn from online sources, including books, web texts, Wikipedia, articles, and other online literature
Uses: chatbots, virtual assistants, and other applications requiring a high level of natural language processing ability
Anyone can use ChatGPT by integrating it into their own apps or by using one of the many prebuilt chatbot platforms that incorporate the technology

Comparing generative AI
Google Bard AI                          ChatGPT
Answers real-time queries               Answers are based on data recorded up to 2021
Returns regular Google search results   Text-only responses
Based on LaMDA                          Based on GPT
No plagiarism detector                  Has a plagiarism detector
Free for now                            ChatGPT Plus is a paid plan

Exercise
How can you solve the following problems?
1. Build a chatbot for a computer store
2. Build a QA system with a knowledge base drawn from Wikipedia

Chatbot for a computer store
Example query: "Give me information about some Dell laptops with prices in the range from 18 million to 22 million VND"
Laptop attributes:
Brand name: Dell
Processor: Core i5 1135G7 2.4GHz
RAM: 8GB DDR4 3200
Hard drive: 256GB SSD
Video card: VGA onboard - Intel Iris Xe Graphics
Screen size: about 14 inches
Operating system: NoOS
Price: 20 million VND
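One way to ground the computer-store example in code is a small slot-filling sketch: extract the brand and the price range from the request, then filter a catalog. Everything here (catalog entries, regexes, attribute and function names) is an illustrative assumption, not a production design.

```python
# Sketch of the slot-filling idea behind the computer-store chatbot:
# pull the brand and price range out of the user's request and filter a catalog.
import re

CATALOG = [
    {"brand": "Dell", "model": "Inspiron 14", "price_vnd": 20_000_000},
    {"brand": "Dell", "model": "Vostro 15",   "price_vnd": 24_000_000},
    {"brand": "HP",   "model": "Pavilion 14", "price_vnd": 19_000_000},
]

def parse_request(text):
    """Extract brand and price-range slots with simple (assumed) regexes."""
    brand = re.search(r"\b(Dell|HP|Asus|Lenovo|Acer)\b", text, re.IGNORECASE)
    price = re.search(r"from\s+(\d+)\s+million\s+to\s+(\d+)\s+million", text, re.IGNORECASE)
    slots = {"brand": brand.group(1) if brand else None}
    if price:
        slots["min_vnd"] = int(price.group(1)) * 1_000_000
        slots["max_vnd"] = int(price.group(2)) * 1_000_000
    return slots

def answer(text):
    """Return catalog entries matching the extracted slots."""
    slots = parse_request(text)
    return [p for p in CATALOG
            if (slots.get("brand") is None or p["brand"].lower() == slots["brand"].lower())
            and (slots.get("min_vnd") is None
                 or slots["min_vnd"] <= p["price_vnd"] <= slots["max_vnd"])]

if __name__ == "__main__":
    query = ("Give me information about some Dell laptops with prices "
             "in the range from 18 million to 22 million VND")
    print(answer(query))
```

The same skeleton extends naturally: a named entity recognizer or intent classifier replaces the regexes, and the catalog lookup becomes a database query.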
