Natural Language Processing: Foundations - NUS Computing

Document Details

Uploaded by CapableCopernicium

National University of Singapore

Tags

natural language processing, nlp, computer science, artificial intelligence

Summary

This document is a lecture handout on natural language processing foundations. It provides an overview of NLP concepts and applications.

Full Transcript

Natural Language Processing: Foundations
Section 1 — NLP in Everyday Life

NLP Applications
NLP applications in almost daily use:
- Machine translation
- Conversational agents (e.g., chatbots)
- Text summarization
- Text generation (e.g., autocomplete)
Applications powered by NLP:
- Social media
- Search engines
- Writing assistants (e.g., grammar checking)

Machine Translation
(Figure: translation example.)

Conversational Agents
Core components:
- Speech recognition
- Language analysis
- Dialogue processing
- Information retrieval
- Text-to-speech

Conversational Agents — Question Answering
(Figure: question-answering example.)

Text Summarization
Example: "Google's cloud unit looked into using artificial intelligence to help a financial firm decide whom to lend money to. It turned down the client's idea after weeks of internal discussions, deeming the project too ethically dicey. Google has also blocked new AI features analysing emotions, fearing cultural insensitivity. Microsoft restricted software mimicking voices and IBM rejected a client request for an advanced facial-recognition system."

Text Generation
Example: Autocomplete. Given the first words of a sentence, predict the next most likely word.
Example: Image Captioning ➜ "A man riding a red bicycle."

Other Applications
- Spelling correction
- Document clustering
- Document classification, e.g.: spam detection, sentiment analysis, authorship attribution

Natural Language Processing: Foundations
Section 1 — What is Language? What is NLP?

What is Natural Language?
Natural language:
- Means of communicating thoughts, feelings, opinions, ideas, etc.
- Not formal, yet systematic: rules can emerge that were not previously defined, and many rules can be bent up to a breaking point.
- Characteristics: ambiguous, redundant, changing, unbounded, imprecise, etc.
Text / writing:
- Visual representation of verbal communication (i.e., natural language)
- Writing system: agreed meaning behind the sets of characters that make up a text (most importantly: letters, digits, punctuation, white-space characters: spaces, tabs, new lines, etc.)

Core Building Blocks of (Written) Language
- Character: basic symbol of written language (letter, numeral, punctuation mark, etc.). Example: r, e, a, c, t, i, o, n
- Morpheme (1..n characters): smallest meaning-bearing unit in a language. Example: re-act-ion
- Word (1..n morphemes): single independent unit of language that can be represented. Example: reaction
- Phrase (1..n words): group of words expressing a particular idea or meaning. Example: his quick reaction
- Clause (1..n phrases): phrase with a subject and verb. Example: his quick reaction saved him
- Sentence (1..n clauses): expresses an independent statement, question, request, exclamation, etc. Example: His quick reaction saved him from the oncoming traffic.
- Paragraph (1..n sentences): self-contained unit of discourse in writing dealing with a particular point or idea. Example: Bob lost control of his car. His quick reaction saved him from the oncoming traffic. Luckily nobody was hurt and the damage to the car was minimal.
- (Text) Document (1..n paragraphs): written representation of thought.
- Corpus (1..n documents): collection of writings (i.e., written texts).
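The 1..n composition above forms a simple containment hierarchy: a corpus contains documents, which contain paragraphs, sentences, words, and characters. Below is a minimal Python sketch of walking that hierarchy; the splitting rules (blank lines, end-of-sentence punctuation, "\w+" words) are naive choices for illustration only, and the toy corpus simply reuses the example paragraph above.

import re

# Toy corpus: a list of documents, each a plain-text string (assumption).
corpus = [
    "Bob lost control of his car. His quick reaction saved him from the "
    "oncoming traffic.\n\nLuckily nobody was hurt and the damage to the car "
    "was minimal.",
]

for document in corpus:
    paragraphs = document.split("\n\n")                # document -> paragraphs
    for paragraph in paragraphs:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        for sentence in sentences:
            words = re.findall(r"\w+", sentence)       # sentence -> words (crude)
            characters = list(sentence)                # sentence -> characters
            print(f"{len(words):2d} words, {len(characters):3d} chars: {sentence}")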
Morphemes
Morpheme: smallest meaning-bearing unit in a language ➜ word = 1..n morphemes
Example: prefixes and suffixes
- Change the semantic meaning or the part of speech of the affected word: un-happy, de-frost-er, hope-less
- Assign a particular grammatical property to that word (e.g., tense, number, possession, comparison): walk-ed, elephant-s, Bob-'s, fast-er

Examples (prefixes, stem, suffixes)
- dogs = dog + -s
- walked = walk + -ed
- imperfection = im- + perfect + -ion
- hopelessness = hope + -less + -ness
- undesirability = un- + desire + -able + -ity
- unpremeditated = un- + pre- + meditate + -ed
- antidisestablishmentarianism = anti- + dis- + establish + -ment + -arian + -ism
Examples with multiple stems: daydream-ing, paycheck-s, skydive-er

Using Language
Before ~1950:
- Verbal communication between people: day-to-day conversations, oral history
- Written communication for people: stone tablets, scrolls, books, etc. (permanent record of written language)
Source: Wiki Commons (CC BY-SA 4.0): Rosetta Stone

Since 1950: Communication with Machines
- ~1950s-70s: basic symbolic languages (e.g., punch cards)
- ~1980s: formal languages (e.g., programming languages)
- Today: natural language (e.g., conversational agents / chatbots)
Source: Wiki Commons (CC BY-SA 4.0): punch cards, programming

Communication with Machines
Humans use natural language; machines work with some abstract internal representation / model of language and the world.
- Analysis: natural language ➜ internal representation
- Generation: internal representation ➜ natural language
Source: Wiki Commons (CC BY-SA 4.0): gpu

NLP in One Slide
From "shallower" to "deeper" analysis:
- Lexical analysis (understanding structure and meaning of words); operates on characters and morphemes: tokenization, stemming, normalization, lemmatization
- Syntactic analysis (organization of words into sentences); operates on words: part-of-speech tagging, syntactic parsing (constituents, dependencies)
- Semantic analysis (meaning of words and sentences); operates on phrases, clauses, and sentences: word sense disambiguation, named entity recognition, semantic role labeling
- Discourse analysis (meaning of sentences in documents); operates on paragraphs and documents: coreference / anaphora resolution, ellipsis resolution, stance detection
- Pragmatic analysis (understanding and interpreting language in context); draws on world knowledge and common sense: textual entailment, intent recognition

What is NLP? — The Big Picture
NLP sits at the intersection of linguistics (human language: speech, writing), computer science (algorithms, e.g., indexing / search, pattern matching), and artificial intelligence (machine learning, deep learning).

Natural Language Processing: Foundations
Section 1 — Important NLP Tasks

Important NLP Tasks
- Lower-level ("shallower") tasks, e.g.: text preprocessing (e.g., tokenization, normalization), part-of-speech tagging
- Mid-level tasks (rely on lower-level tasks), e.g.: word sense disambiguation, named entity recognition
- Higher-level ("deeper") tasks (rely on mid-level tasks), e.g.: coreference / anaphora resolution, intent recognition

Lexical Analysis — Tokenization
Tokenization: splitting a sentence or text into meaningful / useful units. Different levels of granularity are applied in practice (see the sketch after this slide):
- Character-based: S h e ' s   d r i v i n g   f a s t e r   t h a n   a l l o w e d .
- Subword-based: She 's driv ing fast er than allow ed.
- Word-based: She's driving faster than allowed.
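As a small illustration of the word- and character-based granularities above, here is a sketch that tokenizes the example sentence with a hand-written regular expression; the regex is an illustrative choice rather than a reference implementation, and subword tokenization (e.g., byte-pair encoding) is not shown because it requires a learned vocabulary.

import re

sentence = "She's driving faster than allowed."

# Word-level: keep contractions together, split off punctuation.
word_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)
print(word_tokens)   # ["She's", 'driving', 'faster', 'than', 'allowed', '.']

# Character-level: every non-whitespace character becomes its own token.
char_tokens = [ch for ch in sentence if not ch.isspace()]
print(char_tokens)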
Syntactic Analysis — Part-of-Speech Tagging
Part-of-speech (POS) tagging: labeling each word in a text with its corresponding part of speech.
Basic POS tags: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection. (An off-the-shelf tagger is sketched at the end of this section.)
Example:
- Bob/NNP (proper noun, singular)
- walked/VBD (verb, past tense)
- slowly/RB (adverb)
- because/IN of/IN (preposition or subordinating conjunction)
- his/PRP$ (possessive pronoun)
- swollen/JJ (adjective)
- ankle/NN (noun, singular or mass)
- ./. (punctuation)

Syntactic Analysis — Syntactic Parsing
Dependency parsing: analyze the grammatical structure of a sentence; find related words and the type of the relationship between them.
Example: dependency graph (figure).

Semantic Analysis — Word Sense Disambiguation
Word sense disambiguation (WSD): identification of the right sense of a word among all possible senses.
Semantic ambiguity: many words have multiple meanings (i.e., senses).
Example: "She heard a loud shot from the bank during the time of the robbery."
- bank: sloping land, depository financial institution, arrangement of similar objects, ...
- shot: a consecutive series of pictures (film), the act of firing a projectile, an attempt to score in a game, ...

Semantic Analysis — Named Entity Recognition
Named entity recognition (NER): identification of named entities, i.e., terms that represent real-world objects.
Examples: persons, locations, organizations, time, money, etc.
Example: Chris [PERSON] booked a Singapore Airlines [ORGANIZATION] flight to Germany [LOCATION] for S$1,200 [MONEY].

Semantic Analysis — Semantic Role Labeling
Semantic role labeling (SRL): identification of the semantic roles of words or phrases in sentences, expressed as predicate-argument structures (Who did What to Whom, What exactly, When).
Example: The teacher [Who] sent [did What] the class [to Whom] the assignment [What exactly] last week [When].

Discourse Analysis — Coreference Resolution
Coreference resolution: identification of expressions that refer to the same entity in a text. Entities can be referred to by named entities, noun phrases, pronouns, etc.
Example: "Mr Smith didn't see the car. Then it hit him." ➜ "Mr Smith didn't see the car. Then the car hit Mr Smith."

Discourse Analysis — Ellipsis Resolution
Ellipsis resolution: inference of ellipses using the surrounding context.
Ellipsis: omission of a word or phrase in a sentence.
Examples:
- "He studied at NUS, his brother at NTU." ➜ "He studied at NUS, his brother studied at NTU."
- "She's very funny. Her sister is not." ➜ "She's very funny. Her sister is not very funny."

Pragmatic Analysis — Textual Entailment
Textual entailment: determining the inference relation between two short, ordered texts.
Given a text t and a hypothesis h, "t entails h" (t ⇒ h) ➜ someone reading t would infer that h is most likely true.
Example (t ⇒ h):
- t: A mixed choir is performing at the National Day parade.
- h: The anthem is sung by a group of men and women.
Required world knowledge: a mixed choir has male and female members; singing a song is a performance; "anthem" typically refers to "national anthem".

Pragmatic Analysis — Intent Recognition
Intent recognition: classification of an utterance based on what the speaker/writer is trying to achieve. A core component of sophisticated chatbots.
Example: "I'm hungry!"
- Intent: the writer is looking for a place to eat
- Additional context: the writer is vegetarian; the writer is near VivoCity; it's 1pm (lunch time)
- Action: search for vegetarian restaurants in and around VivoCity that are open
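Several of the lower- and mid-level tasks above (tokenization, POS tagging, dependency parsing, NER) are available in off-the-shelf libraries. The sketch below uses spaCy and assumes the spacy package and its small English model en_core_web_sm are installed; it is one possible toolkit rather than the course's reference implementation, and the exact tags and entity labels it produces may differ slightly from the slides.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Chris booked a Singapore Airlines flight to Germany for S$1,200.")

# Part-of-speech tag, dependency relation, and syntactic head for each token.
for token in doc:
    print(f"{token.text:12} {token.tag_:6} {token.dep_:10} head={token.head.text}")

# Named entities with their labels (e.g., PERSON, ORG, GPE, MONEY).
for ent in doc.ents:
    print(ent.text, ent.label_)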
Natural Language Processing: Foundations
Section 1 — What Makes NLP so Hard?

What Makes NLP so Hard?
Challenges of language ➜ challenges of NLP: ambiguous, expressive, redundant, sparse, changing, variable, unbounded, "large-scale", imprecise, ...

Ambiguity
Ambiguity at different levels, e.g.:
- Word senses: bank (financial institute or edge of a river?), cancer (disease or zodiac sign?)
- Part of speech: run (verb or noun?), fast (verb, noun, adjective, or adverb?)
- Syntactic structure: "I see the man with a telescope" ➜ affects semantics! (I have the telescope vs. the man has the telescope)

Ambiguity
Anaphoric ambiguity: ambiguous resolution of anaphora / coreferences (without additional context).
- "Alice and Sarah went for dinner. She invited her." Who are "she" and "her" referring to? Useful context: It was Sarah's birthday.
- "The box didn't fit in the car because it was too big." vs. "The box didn't fit in the car because it was too small." What is "it" referring to in each case? Resolution requires understanding that objects can contain other objects, the physical size of objects, and the physical limitations due to size.

Ambiguity
Winograd Schema (Challenge): a pair of sentences differing in only one or two words and containing an ambiguity that is resolved in opposite ways. Resolution requires the use of world knowledge and reasoning.
Example:
- "I poured water from the bottle into the cup until it was full." vs.
- "I poured water from the bottle into the cup until it was empty." What is "it" referring to in each case?

Ambiguity
Pragmatic ambiguity: unclear semantics if the context is unknown.
Example: "Do you know what time it is?"
- "Yes!" (interpreted as a yes/no question)
- "It's 4.30 pm." (interpreted as a question asking for the time)
- No answer expected (interpreted as a rhetorical question, e.g., the lecturer scolding a student for being late for the lecture)

Expressivity
In general, the same meaning can be expressed with different forms:
- Alice gave Bob the book. vs. Alice gave the book to Bob.
- This burger is very delicious. vs. This burger is a banger!
- Please stop talking and pay close attention to what I want to tell you! vs. Shut up and listen to me!

Expressivity
- Idioms: It's raining cats and dogs today. He was over the moon to see her.
- Neologisms (may be added to the dictionary over time): selfie, retweet, photobomb, staycation, binge-watching, crowdfunding, adulting, chillax, noob, kudos, etc.
- Literary devices, e.g., humor, sarcasm, irony, satire, exaggeration: "Oh yeah...studying NLP 24/7 is reeeally my favorite way to spend a weekend!"

Variation
No one-size-fits-all NLP solutions:
- Differences in the underlying task (tokenizing, stemming, syntax parsing, part-of-speech tagging, entity recognition, etc.)
- ~6,500 languages and ~150 language families (different phonetics/phonology, morphology, syntax, grammar)
- Different domains: news articles, social media, scientific papers, ancient literature, etc. (particularly: different vocabularies, formal vs. informal language (e.g., slang), narrative vs. dialogue)
- Cultural differences and biases (example: "I'm over 40 and live alone." Perceived sentiment is affected by cultural background.)

Sparsity
Sparsity in text corpora: word frequencies are inversely proportional to their rank ➜ Zipf's Law (see the sketch below).
Example: "On the Origin of Species" (Charles Darwin, 1859; 212k+ words)

Rank   Word        Freq.
1      the         14,767
2      of          10,567
3      and          5,920
4      in           5,477
5      to           4,837
6      a            3,460
7      that         2,764
8      as           2,242
9      have         2,121
10     be           2,116
...
101    mr             263
102    parts          260
103    often          260
104    period         259
105    common         256
...
1001   increasing      25
1002   expected        25
1003   egg             25
1004   fly             25
1005   aquatic         25
...
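Zipf's law can be checked empirically on any sizeable plain-text corpus: if frequency is roughly proportional to 1/rank, then rank times frequency stays roughly constant. A minimal sketch, assuming a local plain-text file named corpus.txt (a hypothetical filename; any public-domain book works):

import re
from collections import Counter

# Assumption: corpus.txt is a plain-text file in the working directory.
with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(words)

# For Zipf-like data, rank * frequency is roughly constant across ranks.
for rank, (word, freq) in enumerate(counts.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(f"rank {rank:4d}  {word:12s}  freq {freq:6d}  rank*freq {rank * freq:,}")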
Sparsity
- The Hound of the Baskervilles (63k+ words)
- On the Origin of Species (212k+ words)
- 100 MB Wikipedia dump (14.4M+ words)
➜ Regardless of the size and domain of a corpus, there will be a lot of infrequent words!

Scale
- ~6,500 languages and ~150 language families
- Number of words (e.g., in English): dictionary: ~470,000; web corpus: > 1,000,000
Source: Wiki Commons (CC0 1.0): Stockholm Public Library

Unmodeled Representation
The meaning / interpretation of a sentence often depends on the current context or situation and on a shared understanding about the world. ➜ How to capture this in NLP models?
- "I killed all the children." Serial killer or Linux administrator? (On Linux, processes are "killed" and their subprocesses are "children".)
- "I slipped and fell hard on the floor." Arguably a negative sentiment, but WHY?

Natural Language Processing: Foundations
Section 1 — When NLP Goes Wrong

NLP — Ethical Questions & Challenges
"If you torture the data long enough, it will confess to anything." (Ronald Coase; 1981, paraphrased)
"With great power comes great responsibility!" (Spider-Man's Uncle Ben)
"Your scientists were so preoccupied with whether they could, they didn't stop to think if they should." (Ian Malcolm; Jurassic Park, 1993)
"Could" vs. "should", e.g.:
- Should we build a classifier to identify whether social media users suffer from depression based on their posts?
- Should we organize users' news feeds based on their interests and likings to maximize user engagement?
- Should we build chatbots that can perfectly mimic humans?
Fundamental challenges in NLP:
- Most NLP techniques rely on statistical models (never 100% correct, and errors are often difficult to quantify).
- Most NLP techniques learn from data (Is the data representative? Is the data biased?).

NLP in the Press — For the Wrong Reasons
Source: https://www.channelnewsasia.com/singapore/moh-ask-jamie-covid-19-query-social-media-2222571

Natural Language Processing: Foundations
Section 1 — The Big Picture

What is NLP? — The Big Picture
NLP as machine learning:
- Symbolic, probabilistic, and connectionist ML have found their way into NLP.
- Good ML needs bias and assumptions ➜ in NLP: linguistic theory and representations.
NLP as linguistics:
- NLP must contend with natural language data as found in the world.
- NLP ≈ computational linguistics
- Linguists now use tools that originated from NLP!

What is NLP? — The Big Picture
Fields with connections to NLP: cognitive science, information theory, data science, political science, psychology, economics, education, ethics.
"Language shapes the way we think, and determines what we can think about." (Benjamin Lee Whorf)
"Knowledge of languages is the doorway to wisdom." (Roger Bacon)
"Language is the road map of a culture. It tells you where its people come from and where they are going." (Rita Mae Brown)
"We should learn languages because language is the only thing worth knowing even poorly." (Kató Lomb)

Desiderata of NLP Models
What makes good NLP?
- Sensitivity to a wide range of phenomena and constraints in language
- Generality across languages, modalities, genres, styles
- Strong formal guarantees (e.g., convergence, statistical efficiency, consistency)
- High accuracy when judged against expert annotations or test data
- Computational efficiency during training and testing (construction and production)
- Explainability to human users ➜ transparency
- Ethical considerations
In practice, these are often conflicting goals (e.g., accuracy vs. explainability).

NLP is Changing
- Increases in computing power: deep learning = matrix operations ➜ game changer: GPUs
- The rise of the Web, then the social web: more "food" for data-hungry algorithms; user-generated content = informal, natural, lively text
- Advances in machine learning: continuously growing model zoo (LSTM/GRU, CNN, VAE, Transformers, etc.)
- Advances in understanding of language in social context

Course Meta Topics
Linguistic issues:
- What is the range of language phenomena?
- What are the knowledge sources that let us disambiguate?
- What representations are appropriate?
- How do you know what to model and what not to model?
Statistical modeling methods:
- Increasingly complex model structures
- Learning and parameter estimation
- Efficient inference: dynamic programming, search
- Deep neural networks for NLP: LSTM, CNN, seq2seq

Section 1 — Summary
Main takeaway messages:
- NLP is important! Communication with machines, not only with other people; NLP-powered applications are more and more commonplace; more and more written language will be generated by machines.
- NLP is challenging! NLP is challenging because natural language is challenging: ambiguous, redundant, changing, unbounded, imprecise, expressive, etc. NLP methods need to be flexible, robust, efficient, accurate, etc.
- NLP is interdisciplinary! NLP covers many fields; the main fields are linguistics and computer science, with many connected fields (e.g., ethics, psychology, education).
