Introduction to Text Mining

Full Transcript

Introduction to Text Mining
Prof. Dr. Erich Schubert
Data Mining, Artificial Intelligence, TU Dortmund

Natural Language is Difficult

Hamsterkäufe – buying hamsters in a pet shop? A hamster going shopping? In English: panic buying, hoarding – but why would we “buy panic”? Hamster-like buying behavior?
Image credit: own montage using images under the Pixabay license and CC0

Why is Text Mining Difficult? Example: Homonyms

Apples have become more expensive. The fruit, or Apple computers?
There are many homonyms:
Word: the Microsoft product, or the linguistic unit?
Bayern: the state of Bavaria, or the soccer club FC Bayern?
Jam: traffic jam, or jelly?
A duck, or to duck?
A bat (the animal), or a baseball bat?
A rock, to rock a cradle, to rock a party, “you rock”?
Light: referring to brightness, or to weight?
(A WordNet sketch later in this transcript makes this ambiguity concrete.)
Image: SDXL, PD

Why is Text Mining Difficult? Example: Negation, Sarcasm and Irony

This phone may be great, but I fail to see why.
This actor has never been so entertaining.
The least offensive way possible.
Colloquial: ChatGPT is the shit!
Sarcasm: Tell me something I don’t know.
Irony: This cushion is soft like a brick.
Image: SDXL, PD

Why is Text Mining Difficult? Example: Background Knowledge

The trophy doesn’t fit into the brown suitcase because it is too big. (German: Der Pokal passte nicht in den braunen Koffer, weil er zu groß war.) What does “it” refer to?
The trophy doesn’t fit into the brown suitcase because it is too small. (German: Der Pokal passte nicht in den braunen Koffer, weil er zu klein war.)
➜ This coreference cannot be resolved syntactically; it requires logical reasoning and the background knowledge that small things fit into larger things, not the other way around. This example is a Winograd schema (Wikipedia).
Simple statistical approaches will also likely fail: “the suitcase is too big” is more frequent than “the trophy is too big”. But modern large language models can solve this example (which appears in their training data many times).

Why is Text Mining Difficult? Example: Errors, Mistakes, Abbreviations

People are lazy and make mistakes, in particular on social media.
Let’s eat, grandma. (German: Komm, wir essen, Oma.)
I like cooking, my family, and my pets. (German: Ich mag meine Familie, Kochen und meinen Hund.)
They’re there with their books. You’re going too fast with your car.
I need food. I am so hungary. There is norway I can get food now. Let’s grab some bear.
I’m hangry. (hungry + angry = hangry)
Next time u r on fb check ur events.
Image credit: Alexas_Fotos (Pixabay License)

Recent Successes are Impressive, but not Universal

We have seen some major recent successes:
AI assistants like Google Assistant, Siri, Cortana, and Alexa
Machine translation like Google, Skype, DeepL, and the KIT Lecture Translator
DALL-E, Stable Diffusion, Midjourney, etc.
ChatGPT, GPT-4, Bard, PaLM (earlier: Meta Galactica, LaMDA, BlenderBot)
But that does not mean this approach works everywhere:
These systems require massive training data, and labeled data for tuning/alignment.
They work for the easier 95% of texts, but still fail on the 5% of advanced texts.
Some functionality is command-based, with a fallback to web search: e.g., “take a picture” is a defined command, not something “understood” by the AI.
Machine translation is an example: the EU translates millions of pages per year, much of which is publicly available for training translation systems. Unsupervised text mining is much harder!
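To make the homonym examples above concrete: lexical resources such as WordNet enumerate the distinct senses of a word, and even short everyday words carry many. The following is a minimal sketch using NLTK’s WordNet interface (NLTK and its wordnet data are assumed to be installed; the queried words are taken from the homonym slide):

    import nltk
    nltk.download("wordnet", quiet=True)  # fetch the WordNet data once
    from nltk.corpus import wordnet as wn

    # Each homonym from the slide maps to several unrelated senses.
    for word in ("bat", "duck", "jam", "rock"):
        senses = wn.synsets(word)
        print(f"{word}: {len(senses)} senses")
        for s in senses[:3]:  # show only the first few senses
            print(f"  {s.name()}  {s.definition()}")

A sense inventory alone does not disambiguate anything; it only shows how much ambiguity a text-mining system has to resolve from context.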
How Intelligent is ChatGPT?

ChatGPT is very good at guessing – but not really intelligent:
Prompt: Spell backwards: “Lollipop”
The word “Lollipop” spelled backwards is “poppilol.”
The word “Lollipop” spelled backward is “pillopoL.”
The word “Lollipop” spelled backwards is “poppilloL.”
ChatGPT does not “understand” how to spell backwards, but it guesses very well – almost right! A 6-year-old can probably do better – but on the other hand, a lot of people misspell “ChatGTP”.

Problems: Tokenization

Stanford CoreNLP: the standard solution for NLP [MSBF14] before GPT.
NLP is still hard, even just sentence splitting. Example (from a song list):
Input: All About That Bass by Scott Bradlee’s Postmodern Jukebox feat. Kate Davis
Sentence 1: All About That Bass by Scott Bradlee’s Postmodern Jukebox feat.
Sentence 2: Kate Davis
Named entity: Postmodern Jukebox feat
Accuracy is nevertheless up to 97% on news (and …)
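The splitting failure above is not specific to CoreNLP. The following is a minimal sketch using NLTK’s punkt sentence splitter as a stand-in (an assumption: the slide used CoreNLP; punkt is substituted here because it is easy to run). Whether the wrong split after the abbreviation “feat.” is reproduced depends on the abbreviations the punkt model has learned:

    import nltk
    nltk.download("punkt", quiet=True)  # statistical sentence-splitter model
    from nltk.tokenize import sent_tokenize

    # Song-list example from the slide: "feat." ends in a period and is
    # followed by a capitalized word, the classic signature of a sentence end.
    text = ("All About That Bass by Scott Bradlee's "
            "Postmodern Jukebox feat. Kate Davis")
    for i, sentence in enumerate(sent_tokenize(text), 1):
        print(f"Sentence {i}: {sentence}")

The general point stands either way: period-based splitting needs a model of abbreviations, and any such model will miss some.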

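The backwards-spelling failure shown earlier plausibly connects to tokenization as well: GPT-style models operate on subword tokens, not on individual characters, so the letter sequence of “Lollipop” is never directly visible to the model. A minimal sketch, assuming the tiktoken package (which exposes the BPE vocabularies used by OpenAI models), contrasts the model’s view with the trivial character-level reversal:

    import tiktoken  # pip install tiktoken

    word = "Lollipop"
    print(word[::-1])  # character-level reversal: popilloL

    # The model never sees these characters; it sees BPE subword tokens:
    enc = tiktoken.get_encoding("cl100k_base")
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(pieces)  # the subword pieces hide the letter sequence

None of the three answers quoted above matches the correct reversal “popilloL”, which is consistent with a model guessing from token-level patterns rather than reading letters.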