Introduction to NLP and the Web - Lecture 1

Summary

This document is the first lecture of the course "NLP and the Web" from Technische Universität Darmstadt. The lecture provides an introduction to NLP, outlines course goals and discusses research areas. Challenges within the field of natural language processing are presented, including the need for robust systems to cope with information overload.

Full Transcript

NLP and the Web – WS 2024/2025
Lecture 1: Introduction
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab
Technische Universität Darmstadt
WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold

Introduction: Teaching Staff
▪ Dr. Thomas Arnold: Lectures
▪ Hovhannes Tamoyan: Practice Class
▪ Kexin Wang: Practice Class

Outline
▪ UKP Lab: profile and projects
▪ Administrative course issues
▪ NLP 4 Web Introduction
▪ NLP Basics / Linguistic Analysis

Who Are We?
▪ 1 Professor, ~5 Postdocs, ~35 Doctoral Researchers
▪ We mainly work in natural language processing (NLP)
▪ Research areas (growing every day!)
  ▪ Deep Learning for NLP
  ▪ Knowledge Graphs
  ▪ Argument Mining
  ▪ Interactive AI and NLP
  ▪ Content Analytics for the Social Good
  ▪ Writing Assistance and Language Learning

Teaching Concept – UKP (Lectures)
Lectures across the winter term and summer term, by level:
▪ Introductory: Information Management
▪ Application Oriented: NLP and the Web, Ethics in NLP
▪ Advanced: Deep Learning for NLP

Teaching Concept – UKP (Seminars & Projects)
▪ Software Project (irregular schedule): Data Analysis Software Project for Natural Language
  ▪ Winter 2023/24: Various Projects
  ▪ Winter 2024/25: Various Projects
▪ Regular Seminar: Text Analytics / Large Language Models
  ▪ Winter 2023/24: Generative AI
  ▪ Summer 2024: LLMs for Mental Health
  ▪ Winter 2024/25: Understanding LLMs

Complementary Lectures and Seminars
▪ Machine Learning
  ▪ Einführung in die künstliche Intelligenz (Kersting)
  ▪ Data Mining und maschinelles Lernen (Kersting)
  ▪ Deep Learning (Kersting)
▪ Computer Vision
  ▪ Computer Vision 1 and 2 (Roth)
▪ Natural Language Processing
  ▪ Deep Learning for NLP
  ▪ Ethics in NLP

Teaching Concept – UKP (PhD)
▪ Get involved early (HiWi, B.Sc. thesis, M.Sc. thesis)

More information
▪ Website: www.ukp.tu-darmstadt.de
▪ GitHub: www.github.com/UKPLab
▪ Social Media: @UKPLab

Course Goals
▪ Learn the basic principles underlying NLP systems
▪ Two big NLP topics:
  ▪ Information Retrieval (IR)
  ▪ Large Language Model (LLM) Applications
▪ Gain insight into open research problems in natural language processing

Why Care?
▪ Information Overload
▪ Business Intelligence
▪ Need for Robust, Intelligent Systems

Textbook
Constantly updated:
▪ Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Daniel Jurafsky and James H. Martin. 3rd edition, 2023 (draft).
▪ https://web.stanford.edu/~jurafsky/slp3/

General Information
▪ All lectures and practice classes will be in person
  ▪ Lectures: Tuesdays 13:30 – 15:10, S306 / 051
  ▪ Practice Class: Thursdays 16:15 – 17:55, S103 / 221
▪ All slides, handouts, readings etc. can be found on the Moodle e-Learning platform
▪ We also use Moodle as a central point for announcements and questions
▪ Please use the Moodle forum!

General Information – Practice Class
▪ In the practice classes, you will work on programming exercises
  ▪ Programming language is Python
  ▪ First practice session will include a brief introduction to Python
▪ This will give you some practical experience in NLP
▪ Practice class topics are relevant for the exam! (including Python)
▪ In addition, there are homework assignments for an exam bonus:
  ▪ Assignments will be bi-weekly, 6 exercises in total
  ▪ Each assignment is worth a maximum of 20 points
  ▪ If you get >= 75% of the points (>= 90 of 120 points), you get a bonus
  ▪ The bonus improves your grade by 0.3/0.4 if and only if you pass the exam without the bonus
▪ First class: October 24th (no practice class this week)
▪ Details will be announced in Moodle
▪ If you need additional help regarding the practice class, use the Moodle forum
▪ The assignments will require a significant amount of time, so start earlier than the day before submission.

Final exam
▪ Tuesday, 25.02.2025, 15:00
▪ More info will be announced in Moodle
▪ Allowed: Non-programmable calculator, no other material
▪ Content: lecture, readings, practice class

Syllabus (tentative)
Nr.  Lecture
01   Introduction / NLP basics
02   Foundations of Text Classification
03   IR – Introduction, Evaluation
04   IR – Word Representation, Data Collection
05   IR – Re-Ranking Methods
06   IR – Language Domain Shifts, Dense / Sparse Retrieval
07   LLM – Language Modeling Foundations
08   LLM – Neural LLM, Tokenization
09   LLM – Transformers, Self-Attention
10   LLM – Adaptation, LoRA, Prompting
11   LLM – Alignment, Instruction Tuning
12   LLM – Long Contexts, RAG
13   LLM – Scaling, Computation Cost
14   Review & Preparation for the Exam

Warm up
Now it is your turn: Which degree programme are you studying?
▪ Computer Science?
  ▪ Bachelor?
  ▪ Master?
▪ Other disciplines?

Warm up
Now it is your turn: Which other UKP courses did you already attend?
▪ FoLT
▪ Ethics in Natural Language Processing
▪ Deep Learning for NLP
▪ Data Analysis Software Project
▪ Text Analytics / LLM Seminar

NLP in the Web
▪ Search Engines
▪ Spelling Correction
▪ Question Answering
▪ Machine Translation
▪ Speech Recognition
▪ Plagiarism Detection (http://de.guttenplag.wikia.com/)
▪ Summarization
▪ Diachronic Analysis
▪ Text Generators

Natural Language Processing and the Web
▪ The web is an application area for NLP, e.g.:
  ▪ Information retrieval: search engines, question answering, news aggregation, recommender systems, chatbots…
▪ The web is a resource to improve the quality of NLP, e.g.:
  ▪ Web as a corpus
  ▪ Analyzing web-based knowledge repositories (Wikipedia, Wiktionary)
  ▪ Recognizing synonyms, paraphrases and the like

Challenges for NLP
▪ How to remove noise, e.g. duplicates?
▪ How to assess the quality of content?
▪ How to integrate the content of heterogeneous and scattered nature?
▪ How to deal with errors, e.g. spelling or grammar errors?
▪ How to “clean” the data?

Data Cleansing is Necessary
▪ User-generated content contains errors, smileys, abbreviations, etc.
▪ Example: “Hi Micheal, have u seen my posting,last week u said that u will look in to my problem thsi week.can i ask u now?”
[Figure: data import followed by data cleansing]

Analysis Levels in Language Understanding
▪ Phonetics and Phonology
▪ Segmentation
▪ Morphology
▪ Syntax
▪ Semantics
▪ Pragmatics and Discourse

Phonetics and Phonology
▪ Homophones: “night” and “knight” are both pronounced /naɪt/
[Figure (c) David Groome, 2006]

Segmentation
[Figure (c) David Groome, 2006]

Tokenization
▪ Segmenting an input stream into an ordered sequence of units is called tokenization.
▪ A token can correspond to an inflected word form or sub-word units, and may be subject to a subsequent morphological analysis.
▪ Tokens include punctuation!
▪ A system which splits texts into tokens is called a tokenizer.
A very simple example:
▪ Input text: John likes Mary and Mary likes John.
▪ Tokens: {"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}

Tokenization: English Example
▪ Mr. Sherwood said, reaction to Sea Containers’ proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
▪ Where could there be problems for a tokenizer?
▪ Split at whitespace characters? That leaves punctuation attached to tokens: cents. / said, / positive.” / $62.625,

Tokenization Ambiguities
Period
▪ In most cases: sentence-final punctuation symbol
▪ Part of an abbreviation, e.g. F.D.P.
▪ Numbers: ordinal numbers, e.g. 21., and numbers with fractions, e.g. 1.543
▪ References to resource locators, e.g. www.apple.com
▪ To complicate things, if a sentence ends with an abbreviation which ends with a period, only one period is written: “I go to Apple, Inc.”
▪ …
Whitespace character
▪ Part of numbers, e.g. “1 543”
▪ No segmentation character in multi-word expressions, e.g. “New York”

Ambiguities
Comma
▪ Part of numbers, e.g. 1,543
Single quote
▪ Within tokens to mark contractions and elisions, e.g. English: don’t, won’t, you’ve, James’ new hat; German: Ich hab’s!
▪ Part of a token in French, e.g. aujourd’hui
▪ But in most cases: enclosing quoted groups of words
Dash
▪ A delimiter, if it connects strings of digits, e.g. “see pages 100-101”
▪ In French: signals a close connection between two tokens, e.g. verb and personal pronoun: donne-le
▪ In most cases, however, it is part of the token, e.g. multi-word

Tokenization in Other Languages
Chinese: 爱国人
▪ No spaces
▪ Two possible segmentations, both of them syntactically and semantically correct
▪ Disambiguation can only be done with contextual information
  ▪ 爱国 / 人: country-loving person
  ▪ 爱 / 国人: love country-person
(Bird et al., NLP with Python, p. 113)

German Compounds
German: STAUBECKEN
▪ No spaces within noun compounds
▪ Two possible segmentations, both of them syntactically and semantically correct
▪ Disambiguation can only be done with contextual information
  ▪ STAU + BECKEN: water reservoir
  ▪ STAUB + ECKEN: dusty corners

Morphology
▪ Morphology is the branch of linguistics that studies word forms and word formation
▪ Words are composed of morphemes
▪ Morphemes are the smallest meaning-bearing units
▪ Words can be further decomposed into smaller units: “pneumonoultramicroscopicsilicovolcanoconiosis” (lung disease caused by the inhalation of very fine silica dust found in volcanoes)
[Figure (c) David Groome, 2006]
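
The tokenization ambiguities above (periods, commas, quotes, whitespace) and the segmentation ambiguity in 爱国人 and STAUBECKEN can be made concrete with a small Python sketch. It is an illustration only and not part of the lecture: the regular expression, the tiny abbreviation list, the function names, and the toy lexicons are all assumptions chosen for these examples, and a real tokenizer needs far more rules or a trained model.

```python
import re

# Hypothetical token pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(r"""
    (?:[A-Z]\.)+             # dotted abbreviations such as F.D.P.
  | (?:Mr|Mrs|Dr|Inc)\.      # tiny hand-picked abbreviation list (assumption)
  | \$?\d+(?:[.,]\d+)*       # numbers and amounts: 62.5 or 1,543 or $62.625
  | \w+(?:['’-]\w+)*         # words, including contractions and hyphenated forms
  | \S                       # any other single character, e.g. punctuation
""", re.VERBOSE)


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


text = ("Mr. Sherwood said, reaction to Sea Containers' proposal has been "
        "\"very positive.\" In New York Stock Exchange composite trading "
        "yesterday, Sea Containers closed at $62.625, up 62.5 cents.")

print(text.split())    # whitespace split: punctuation stays glued to words
print(tokenize(text))  # regex split: "Mr." and "$62.625" stay whole

# Toy lexicons (assumptions) that make both readings available.
print(segmentations("STAUBECKEN", {"STAU", "STAUB", "BECKEN", "ECKEN"}))
print(segmentations("爱国人", {"爱", "爱国", "人", "国人"}))
```

Both calls to `segmentations` return two analyses, which is the point of the slides: the string alone does not decide between them, so a system needs context to choose.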

Bases and Affixes
Remember: Morphemes are the smallest meaning-bearing units
Examples:
▪ cats → cat (noun) + s (plural)
▪ unknowingly → un + know + ing + ly
▪ bedenken → be + denk + en
▪ Both cat and cats can be uttered in isolation, but s cannot: -s is a bound morpheme
▪ Minimal free morphemes = stems
  ▪ cat is a free morpheme
▪ Stems carry the main meaning of the word
▪ Affixes are bound morphemes

Types of Affixes
▪ Suffixes: appear after the base
  ▪ cat + s, nice + ly
▪ Prefixes: appear before the base
  ▪ un + true
▪ Infixes: appear inside the base
  ▪ fan + bloody + tastic
▪ Circumfixes: appear on both sides of the base
  ▪ ge + sag + t

Morphological Normalization
▪ Morphological normalization consists in identifying a single canonical representative for morphologically related word forms
Methods
▪ Stemming
▪ Lemmatization

Stemming
▪ Stemming is an algorithmic approach to strip off the endings of words
  ▪ sitting → sitt
  ▪ anarchism, anarchy, anarchistic → anarchi
▪ Objective: group words belonging to the same morphological family by transforming them into the same stemmed representation
  ▪ Stemming does not distinguish between inflection and derivation
  ▪ The stems obtained do not necessarily correspond to a real word form
▪ Well-known stemming algorithms for English have been developed by Lovins and Porter

Algorithmic Stemming Method
Stemming is rule-based. Example rules from Porter:
▪ *ATIONAL → *ATE (relational → relate)
▪ *[> 0 vowels] + ING → * (monitoring → monitor)
▪ *SSES → *SS (grasses → grass)
Rule-based stemming methods are hard to create and often yield arbitrary distinctions, but can be executed very quickly at runtime.
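
The three example rules above can be sketched in a few lines of Python. This is a deliberately simplified toy, not Porter's algorithm: the function name `toy_stem` is hypothetical, only the three rules quoted on the slide are implemented, and the real algorithm applies many more rules in ordered steps with additional conditions on the remaining stem.

```python
import re


def toy_stem(word):
    """Apply three suffix rules in the spirit of the Porter examples."""
    w = word.lower()
    # *ATIONAL -> *ATE
    if w.endswith("ational"):
        return w[:-len("ational")] + "ate"
    # *SSES -> *SS
    if w.endswith("sses"):
        return w[:-len("sses")] + "ss"
    # *ING -> * only if the part before -ing contains at least one vowel
    if w.endswith("ing") and re.search(r"[aeiou]", w[:-3]):
        return w[:-3]
    return w


for word in ["relational", "grasses", "monitoring", "sitting", "sing"]:
    print(word, "->", toy_stem(word))
```

The vowel condition is what keeps "sing" intact while "sitting" is cut to "sitt", which matches the slide's observation that a stem need not be a real word form.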

Porter's Stemmer
Original Word    Stemmed Word
vision           vision
visible          visibl
visibility       visibl
visionary        visionari
visioner         vision
visual           visual

Stemming Errors
▪ Under-stemming: remove too little
  ▪ adhere → adher
  ▪ adhesion → adhes
▪ Over-stemming: remove too much
  ▪ appendicitis → append
  ▪ append → append

Problem with Stemming: Syntactic Ambiguity
▪ Homographs: words which have the same spelling but different meanings
▪ Example: “I saw the saw”: past form of the verb SEE ≠ singular form of the noun SAW
▪ Such cases cannot be properly dealt with by stemming only; the word's grammatical category also has to be identified

Lemmatization
▪ “Undo” the inflectional changes to recover the base form
▪ Usually needs lexical resources and part-of-speech tagging
  ▪ cats (NOUN) → cat
  ▪ left (VERB) → leave
  ▪ left (ADJ) → left
▪ Has to deal with irregularities
  ▪ sing, sang, sung → sing
  ▪ indices → index
  ▪ Bäume → Baum

Stemming vs. Lemmatization
Original        Stemmed    Lemmatized
visibilities    visibl     visibility
adhere          adher      adhere
adhesion        adhes      adhesion
appendicitis    append     appendicitis
oxen            oxen       ox
indices         indic      index
swum            swum       swim

Syntax
▪ Syntax refers to the way words are arranged together
▪ “Syntax is the study of the regularities and constraints of word order and phrase structure” (Manning & Schütze, 2003, p. 93)
▪ There is an infinite number of ways in which words can be arranged together to form sentences
▪ Yet, we can understand sentences we have never heard or read before

POS Tagging
▪ The process of assigning a part of speech or lexical class marker to each word in a corpus
▪ The input to a tagging algorithm is a sequence of words and a tagset, and the output is a sequence of tags, a single best tag for each word
[Figure: example sentence labeled Determiner, Noun, Verb, Pronoun, Adjective. (c) David Groome, 2006]

Parts of Speech
▪ In English we traditionally have 8 parts of speech
  ▪ N     Noun          chair, bandwidth, pacing
  ▪ V     Verb          study, debate, munch
  ▪ ADJ   Adjective     purple, tall, ridiculous
  ▪ ADV   Adverb        unfortunately, slowly
  ▪ P     Preposition   of, by, to
  ▪ PRO   Pronoun       I, me, mine
  ▪ DET   Determiner    the, a, that, those
  ▪ INTJ  Interjection  oh!, m-hm, huh?

Penn Treebank Tagset
 1. CC    Coordinating conjunction
 2. CD    Cardinal number
 3. DT    Determiner
 4. EX    Existential there
 5. FW    Foreign word
 6. IN    Preposition / subordinating conjunction
 7. JJ    Adjective
 8. JJR   Adjective, comparative
 9. JJS   Adjective, superlative
10. LS    List item marker
11. MD    Modal
12. NN    Noun, singular or mass
13. NNS   Noun, plural
14. NNP   Proper noun, singular
15. NNPS  Proper noun, plural
16. PDT   Predeterminer
17. POS   Possessive ending
18. PRP   Personal pronoun
19. PRP$  Possessive pronoun
20. RB    Adverb
21. RBR   Adverb, comparative
22. RBS   Adverb, superlative
23. RP    Particle
24. SYM   Symbol
25. TO    to
26. UH    Interjection
27. VB    Verb, base form
28. VBD   Verb, past tense
29. VBG   Verb, gerund / present participle
30. VBN   Verb, past participle
31. VBP   Verb, non-3rd person singular present
32. VBZ   Verb, 3rd person singular present
33. WDT   wh-determiner
34. WP    wh-pronoun
35. WP$   Possessive wh-pronoun
36. WRB   wh-adverb
37. #     Pound sign
38. $     Dollar sign
39. .     Sentence-final punctuation
40. ,     Comma
41. :     Colon, semi-colon
42. (     Left bracket character
43. )     Right bracket character
44. "     Straight double quote
45. ‘     Left open single quote
46. “     Left open double quote
47. ’     Right close single quote
48. ”     Right close double quote

Tagset sizes by language (Hajič, 2000):
Language     Tagset Size
English      139
Czech        970
Estonian     476
Hungarian    401
Romanian     486
Slovene      1033

An Example
WORD      LEMMA     TAG
the       the       +DET
host      host      +NOUN
kissed    kiss      +VPAST
the       the       +DET
friend    friend    +NOUN
on        on        +PREP
the       the       +DET
cheek     cheek     +NOUN

Ambiguities
▪ POS tagging is a disambiguation task
▪ Words are ambiguous: they have more than one possible part of speech
▪ The word “book”:
  ▪ book that flight: verb
  ▪ hand me that book: noun
▪ The word “that”:
  ▪ Does that flight serve dinner?: determiner
  ▪ I thought that your flight was earlier: complementizer
▪ POS tagging resolves ambiguities, choosing the proper tag for the context
▪ Baseline: Most Frequent Class (accuracy 92.34% [Jurafsky & Martin])
▪ Outdated: rule-based tagging, probabilistic tagging
▪ State of the art: neural approaches, accuracy ~98%

Parsing
▪ The process of determining the grammatical structure with respect to a given grammar.
[Figure (c) David Groome, 2006]

Alternative representations
▪ Bracketed notation: [S [NP [Det the] [N dog] ] [VP [V ate] [NP [Det a] [N cookie] ] ] ]
▪ Parenthesized notation: (S (NP (Det the) (N dog) ) (VP (V ate) (NP (Det a) (N cookie))))
[Figure: parse tree of “the dog ate a cookie”]

Syntactic Ambiguity
▪ “If you love money problems show up”
  ▪ If you love, money problems show up.
  ▪ If you love money, problems show up.
  ▪ If you love money problems, show up.
▪ “I made her duck.”
▪ “We're eating grandpa!” vs. “We're eating, grandpa!”
▪ “Weil er drei Monate verfallene Medikamente nahm,...” (German; either he took expired medication for three months, or the medication he took had expired three months earlier)
▪ Different interpretations are mainly caused by syntactic ambiguity.
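
The parenthesized notation from the slide on alternative representations can be read into a nested data structure with a short recursive routine. The following Python sketch is an illustration and not course material: the function names `parse_brackets` and `leaves` and the `(label, children)` tuple layout are assumptions made here, and the reader does no error recovery on malformed input.

```python
def parse_brackets(text):
    """Parse '(S (NP (Det the) (N dog)) ...)' into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word at the leaf
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(tree):
    """Collect the words of a tree from left to right."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_brackets(
    "(S (NP (Det the) (N dog) ) (VP (V ate) (NP (Det a) (N cookie))))"
)
print(tree[0])       # label of the root node
print(leaves(tree))  # the original sentence, word by word
```

An ambiguous sentence simply corresponds to two different trees over the same leaves, so comparing `leaves` of both trees gives the same word list while the nesting differs.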

Syntactic Ambiguities: Two Possible Parses
“I saw the man with a telescope.”
▪ Reading 1: “with a telescope” attaches to “saw” (the telescope is the instrument of seeing)
▪ Reading 2: “with a telescope” attaches to “the man” (the man has the telescope)

Definition
Semantics:
▪ Study of the meaning of words, phrases, sentences, or documents
Lexical Semantics:
▪ Study of the meaning of lexical units, i.e. words.

Lexical Ambiguity
▪ He hit the ball with the bat.
▪ Chuck Norris can hit a bat with a ball.
▪ Different interpretations are caused by lexical ambiguity (“bat” as sports equipment or as an animal).

Pragmatics
What is the purpose of an utterance? The stressed word is shown in [brackets].
▪ “I [never] said she stole my money”: I simply didn't ever say it.
▪ “[I] never said she stole my money”: Someone else said it, but I didn't.
▪ “I never [said] she stole my money”: I might have implied it in some way, but I never explicitly said it.
▪ “I never said [she] stole my money”: I said someone took it; I didn't say it was she.
▪ “I never said she [stole] my money”: I just said she probably borrowed it.
▪ “I never said she stole [my] money”: I said she stole someone else's money.
▪ “I never said she stole my [money]”: I said she stole something of mine, but not my money.
(Example from Wikipedia)

Pragmatics
What is the purpose of an utterance?
▪ Utterance: “Is it cold in here or is it just me?”
  Intended meaning: “Please close the window!”
▪ Utterance: “Oh, great! Another meeting.”
  Intended meaning: The speaker likely means the opposite of what they are literally saying; meetings might be something they dislike, despite the positive tone.

Summary – Linguistic Analysis Levels
Example: “Elementary, my dear Watson”
▪ Phonetics and Phonology: [ɛlɪˈmɛntəri, maɪ dɪə ˈwɒtsən]
▪ Segmentation: ["Elementary", ",", "my", "dear", "Watson"]
▪ Morphology: Base: Element, Suffix: -ary
▪ Syntax: ADJ , PRP$ ADJ NNP
▪ Semantics: Watson: Dr. John H. Watson (not IBM)
▪ Pragmatics: "You are so stupid…"

Take-Home-Messages
▪ Natural language processing is an interesting topic ☺
▪ There are a lot of challenges
▪ Typical preprocessing steps:
  ▪ Tokenization for splitting texts into tokens
  ▪ Stemming / Lemmatization to normalize tokens
▪ PoS-Tagging and parsing analyze syntactic features
  ▪ PoS-tags roughly represent word classes
  ▪ Phrases group words to function as a single unit
▪ Ambiguity in language makes analysis a hard problem

Next Lecture
▪ Text Classification
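
As a recap of why ambiguity makes tagging hard, here is a minimal Python sketch of the most-frequent-class baseline mentioned in the POS tagging part of the lecture. The toy corpus, its tag assignments, the function names, and the NN fallback for unknown words are all assumptions for illustration; the accuracy figure quoted in the lecture refers to a real annotated corpus, not to this toy.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words, model, default="NN"):
    """Tag every word independently of its context."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Made-up training data: "book" occurs twice as a noun and once as a verb.
corpus = [
    [("hand", "VB"), ("me", "PRP"), ("that", "DT"), ("book", "NN")],
    [("read", "VB"), ("that", "DT"), ("book", "NN")],
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
]

model = train_most_frequent_tag(corpus)
print(tag(["book", "that", "flight"], model))
```

Because the baseline ignores context, it labels "book" as a noun even in "book that flight", where it is a verb. Closing that gap is what context-aware taggers are for.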
