Introduction to Natural Language Processing (NLP)
Jacopo Urbani
Vrije Universiteit Amsterdam
2024/2025
Summary
These lecture notes cover Natural Language Processing (NLP) pre-processing: tokenization, stemming and lemmatization, stop-word removal, POS tagging, parsing, and other NLP tasks. They belong to the Web Data Processing Systems course at Vrije Universiteit Amsterdam, 2024/2025.
Full Transcript
Introduction to Natural Language Processing (NLP)
Web Data Processing Systems
Jacopo Urbani
Department of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
2024/2025

Typical Extraction Pipeline (from text)
Text (HTML, Tweets, ...) → NLP Pre-processing → Refined text → Information Extraction → Entities and Relationships → Reasoning → Knowledge Bases

NLP pre-processing
[Slide: the title repeated in dozens of languages and scripts (Հայերեն, العربية, 中文简体, Ελληνικά, עברית, हिन्दी, ไทย, Tiếng Việt, ...), a reminder that text arrives in many languages and writing systems.]

Overview
Before we can use some text, we must (or can) pre-process it.
[Slide: screenshot of the lipsum.com "Lorem Ipsum" page, shown as an example of raw web text.]

Standard tasks:
1. Tokenization
2. Stemming or Lemmatization
3. Stop-word removal
4. POS tagging
5. Parsing

NLP pre-processing: Tokenization
Task: Given a character sequence, split it into tokens.

Example
Input: "Mr. O'Neill thinks rumors about Chile's capital aren't amusing"
Output: mr | o | neill | thinks | rumors | about | chile | s | capital | aren | t | amusing
(A sketch of the naive strategy that produces this output follows at the end of this section.)

Warning: Simple strategies do not always work!
- What about names? "Los Angeles", "San Francisco"
- What about hyphens? "Co-education", "drag-and-drop"
- What about non-English languages? "Lebensversicherungsgesellschaftsangestellter" vs. "life insurance company employee"

Important! You must use the same tokenization strategy to process both queries and documents!
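For illustration, here is a minimal sketch (not from the slides) of such a naive tokenizer: lowercase everything and split on non-alphanumeric characters, exactly the kind of "simple strategy" the warnings refer to.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    # Lowercase, then split on every run of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

print(naive_tokenize("Mr. O'Neill thinks rumors about Chile's capital aren't amusing"))
# ['mr', 'o', 'neill', 'thinks', 'rumors', 'about', 'chile', 's',
#  'capital', 'aren', 't', 'amusing']
```

Note how "O'Neill", "Chile's", and "aren't" are torn apart, and a name like "Los Angeles" would end up as two unrelated tokens.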
Byte Pair Encoding (BPE) (I)
Idea: Instead of white-space segmentation or single-character segmentation, use the data to tell us how to tokenize.
Subword tokenization (because tokens can be parts of words as well as whole words).
Three main algorithms:
- Byte-Pair Encoding (BPE) (2016)
- Unigram language modeling tokenization (2018)
- WordPiece (2012)

Byte Pair Encoding (BPE) (II)
All have 2 parts:
- A token learner that induces a vocabulary (a set of tokens) from a raw training corpus
- A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary

Example
Let the vocabulary be the set of all individual characters = {A, B, C, D, ..., a, b, c, d, ...}
Repeat:
1. Choose the two symbols that are most frequently adjacent in the training corpus (e.g., 'A', 'B')
2. Add a new merged symbol 'AB' to the vocabulary
3. Replace every adjacent 'A' 'B' in the corpus with 'AB'
until k merges have been done.
(A toy implementation of this merge loop is sketched after the tool list below.)

NLP pre-processing: Tokenization
Some "classic" tokenizers:
- Stanford Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml)
- Apache OpenNLP (https://opennlp.apache.org/)
- NLTK (http://www.nltk.org)
- ...
BPE implementations:
- Google's SentencePiece (https://github.com/google/sentencepiece)
- Hugging Face's tokenizers (https://github.com/huggingface/tokenizers)
- fastBPE (C++, very fast, https://github.com/glample/fastBPE)
- ...
LLaMA (Large Language Model Meta AI) uses SentencePiece's implementation of BPE.
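The merge loop above fits in a few lines of Python. The following is a toy sketch of the token learner only (my own illustration, not the lecture's code); real implementations add details such as end-of-word markers and corpus pre-tokenization.

```python
from collections import Counter

def learn_bpe(corpus: str, k: int):
    """Toy BPE token learner: perform k merges of the most frequent
    adjacent pair of symbols, starting from single characters."""
    words = [list(w) for w in corpus.split()]  # each word = list of symbols
    merges = []
    for _ in range(k):
        # 1. Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (ties: first seen)
        merges.append(best)
        merged = "".join(best)            # 2. the new vocabulary symbol
        # 3. Replace every adjacent occurrence of the pair with the merged symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(learn_bpe("low low low lower lowest", k=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Frequent words such as "low" become single tokens after two merges, while rarer suffixes stay split into smaller pieces.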
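In practice one would use a library from the list above instead. As a rough sketch of what that looks like with Hugging Face's tokenizers library (the corpus file corpus.txt and the vocabulary size are placeholders made up for the example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Token learner: induce a subword vocabulary from a raw text corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus file

# Token segmenter: tokenize new text with the learned vocabulary.
print(tokenizer.encode("Mr. O'Neill thinks rumors aren't amusing").tokens)
```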
NLP pre-processing: Stemming or Lemmatization
Stemming: Reduce terms to their roots.
Example: "are" → "ar"; "automate, automates, automatic, automation" → "automat"
Lemmatization: Reduce inflectional forms to the base form.
Example: "am, are, is" → "be"; "car, cars, car's" → "car"

Some stemmers:
- Porter's algorithm (http://tartarus.org/martin/PorterStemmer/)
- Snowball (http://snowballstem.org/demo.html)
- ...
Some lemmatizers:
- spaCy lemmatizer (https://spacy.io)
- Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/lemma.html)
- ...
Which one? There is no golden rule; you have to try it out. Modern language models use neither stemming nor lemmatization. (Why?)

Stop-word removal (I)
Stop words:
- have little semantic content
- are extremely frequent
- occur in almost every document, i.e., are not discriminative
Example of a stop-word list: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Stop-word removal (II)
Idea: Based on a stop list, remove all stop words in the list.
- Saves a lot of memory
- Makes query processing much faster
Trend: Do not perform stop-word removal. Why?
- There are good compression techniques
- There are good query optimization techniques
- In some cases, stop words are needed: "King of Norway", "Let it be", "To be or not to be"

NLP pre-processing: Part-of-Speech (POS) tagging (I)
Task: Assign to each token a label that indicates what it is.
Distinction between function words (which make the sentence grammatically correct) and content words (which carry the meaning):
- Function words: auxiliary verbs, prepositions, determiners, pronouns, ...
- Content words: nouns, verbs, adjectives, ...
The list of labels depends on the language. The Penn Treebank tagset (36 labels) is one of the most used for English:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

NLP pre-processing: Part-of-Speech (POS) tagging (II)
POS tagging is important because it helps us predict the next word!
Example: What is the likelihood that "an amazing" is followed by "goalkeeper"? That it is followed by "scored"? It is more likely that "determiner adjective" is followed by a noun than by a verb.
Is it hard? Most words in English are not ambiguous, but the most frequently occurring words are. Today's taggers have an accuracy of 97% (with some important exceptions); however, a simple baseline (pick the most frequent tag) already reaches 90%.

NLP pre-processing: Parsing
Task: Build a tree of the syntactic structure of a string.
- Constituency parsing breaks the phrase into sub-phrases.
- Dependency parsing connects the words according to their relationships.
[Slide: two parses of "John hit the ball": a constituency tree (S → NP VP, NP → N, VP → V NP, NP → D N) and a dependency graph with "hit" as root, "John" as its subject, "ball" as its object, and "the" as determiner of "ball".]
(Small code sketches for stemming/lemmatization, stop-word removal, POS tagging, and dependency parsing follow below.)
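To make the stemming/lemmatization contrast concrete, here is a small sketch (not from the slides) that runs the examples above through NLTK's Porter stemmer and spaCy's lemmatizer; it assumes both libraries and the en_core_web_sm English model are installed.

```python
import spacy
from nltk.stem import PorterStemmer

# Stemming: crude suffix stripping; roots need not be real words.
stemmer = PorterStemmer()
for w in ["are", "automate", "automates", "automatic", "automation"]:
    print(w, "->", stemmer.stem(w))  # 'ar', then 'automat' four times

# Lemmatization: map inflected forms to dictionary base forms.
nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
print([tok.lemma_ for tok in nlp("I am selling the cars")])
# e.g. ['I', 'be', 'sell', 'the', 'car']
```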
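A stop list makes removal a one-liner. The sketch below uses NLTK's English stop list (downloaded on first use) and also shows why the trend is to keep stop words: the query "to be or not to be" disappears entirely.

```python
import nltk
nltk.download("stopwords", quiet=True)  # fetch the stop list once
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list.
    return [t for t in tokens if t not in stops]

print(remove_stop_words("the king of norway".split()))  # ['king', 'norway']
print(remove_stop_words("to be or not to be".split()))  # [] -- query destroyed
```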
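A quick way to see Penn Treebank tags in practice is spaCy's tag_ attribute; a sketch, again assuming the en_core_web_sm model (exact tags may vary slightly across model versions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
for tok in nlp("an amazing goalkeeper scored"):
    print(tok.text, tok.tag_)  # tag_ holds the fine-grained Penn Treebank tag
# an DT
# amazing JJ
# goalkeeper NN
# scored VBD
```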
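Dependency parses like the one on the slide are easy to inspect programmatically; a sketch with spaCy, under the same model assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
for tok in nlp("John hit the ball"):
    # Every token points at its syntactic head via a labeled arc.
    print(f"{tok.text:5s} --{tok.dep_}--> {tok.head.text}")
# John  --nsubj--> hit
# hit   --ROOT--> hit
# the   --det--> ball
# ball  --dobj--> hit
```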
Other NLP tasks
Sentence Boundary Detection (Segmentation): Detecting the start and end of sentences, which is important for paragraph-level analysis and useful in summarization, translation, and conversational analysis.
Text Normalization: Standardizing text to ensure consistency. Common tasks include expanding contractions (e.g., "isn't" to "is not"), converting accented characters to ASCII, and correcting spelling. Text normalization helps create a uniform dataset, especially when dealing with noisy text such as social media data.
Co-reference Resolution: Finding all expressions that refer to the same entity in a text.
Example: "I voted for Nader because he was most aligned with my values," she said → "he" refers to "Nader"; "I", "my", and "she" all refer to the same person, the speaker.

NLP pre-processing: In practice
What should I do if I want to execute NLP tasks?
1. Use NLP frameworks/libraries (easy, decent performance):
   - spaCy (https://spacy.io/)
   - Stanford NLP (https://nlp.stanford.edu/)
   - Apache OpenNLP (https://opennlp.apache.org/)
   - NLTK (https://www.nltk.org/)
   - ...
2. Use the code of papers (hard, best performance). Typically, you need to know Python and have access to a good GPU.

References
- T. Kudo. "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 66–75.
- M. Schuster and K. Nakajima. "Japanese and Korean Voice Search". In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2012, pp. 5149–5152.
- R. Sennrich, B. Haddow, and A. Birch. "Neural Machine Translation of Rare Words with Subword Units". In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
- H. Touvron et al. "LLaMA: Open and Efficient Foundation Language Models". In: arXiv preprint arXiv:2302.13971 (2023).