Natural Language Processing Course Notes PDF

Document Details

Tags

natural language processing, nlp, machine learning, computer science

Summary

These course notes provide an introduction to Natural Language Processing (NLP), including its historical development, challenges, and key applications like machine translation and information retrieval. The document covers topics such as tokenization and linguistic analysis.

Full Transcript


Natural Language Processing
Week 1: Introduction to NLP

Natural Language Processing
○ Definition: A field that addresses methods and approaches with which computers can process, understand, and generate natural language
○ Why study NLP? We use language to read, write, and speak, but also to make plans and decisions, to learn, to dream, and much more. If we want to build intelligent systems, they should have similar abilities
○ What will we cover? A wide range of NLP techniques and applications: theoretical knowledge + practical skills. By the end of the course, you'll be able to implement your own NLP project end-to-end

Outline
01 Overview of NLP applications
02 Building blocks of NLP applications
03 Implementation of a simple NLP application
04 Introduction to text tokenization
05 Introduction to linguistic analysis
06 Course logistics

Overview of NLP applications

A bit of history
○ The field was established in the 1950s, starting with the Georgetown-IBM experiment
○ The experiment was concerned with the implementation of an early fully automated machine translation system, specifically with translating Russian scientific text into English
○ The task was deemed easy enough to be solved within several years
○ Do you think they succeeded? Why?

Development of approaches
○ The field started off with rule-based approaches and templates. For a machine translation system, this means trying to translate word for word. Do you think this works well?
○ Around the 1980s, statistical approaches were introduced and machine learning algorithms were developed. Pros: they do not make rigid assumptions and can learn flexibly. Cons: they rely on the availability of large amounts of high-quality, representative data
○ Around the 2010s, advances in compute power allowed researchers to apply deep learning techniques

NLP timeline
○ Note that this does not mean that DL algorithms are the solution to all problems: different tasks use different types of solutions, including rule-based approaches

Example: ELIZA chatbot
○ Works by applying templates to "parrot" back what the user is saying (a minimal sketch follows below): https://web.njit.edu/~ronkowit/eliza.html
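To make the template idea concrete, here is a minimal ELIZA-style sketch in Python. The patterns and responses are invented for illustration; the real ELIZA used a much larger script of ranked decomposition and reassembly rules (including pronoun reflection).

```python
import re

# A few invented pattern -> response templates, tried in order.
TEMPLATES = [
    (r"I need (.*)", "Why do you need {0}?"),
    (r"I am (.*)", "How long have you been {0}?"),
    (r"My (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),  # catch-all fallback
]

def eliza_reply(utterance: str) -> str:
    """Return the response for the first template whose pattern matches."""
    for pattern, response in TEMPLATES:
        match = re.match(pattern, utterance.rstrip(".!?"), re.IGNORECASE)
        if match:
            return response.format(*match.groups())
    return "Please go on."

print(eliza_reply("I need a holiday"))  # Why do you need a holiday?
print(eliza_reply("I am tired"))        # How long have you been tired?
```

Try the web demo linked above: the replies can feel surprisingly conversational even though no "understanding" is involved.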
What NLP applications do you know of?

Machine Translation: Task

Machine Translation: Challenges
○ Human language is creative ⇒ it's impossible to come up with a generalizable set of rules
○ What is a word?
  ○ Kraftfahrzeug-Haftpflichtversicherung (DE) = motor-vehicle-liability-insurance (EN)
  ○ Auf Wiedersehen (DE) = goodbye (EN), where Wiedersehen = see again

Where things get even more complicated:
○ Different grammatical categories:
  ○ langage naturel (FR) = natural language (EN)
  ○ catastrophe naturelle (FR) = natural disaster (EN)
  ○ ressources naturelles (FR) = natural resources (EN)
○ Different word order:
  ○ EN: I (subject) read (verb) a book (object) ⇒ SVO
  ○ JA: I (subject) a book (object) read (verb) ⇒ SOV
  ○ GA (Irish): read (verb) I (subject) a book (object) ⇒ VSO
  ○ RU: flexible in terms of word order
○ Words that only appear to mean the same thing:
  ○ I'll book a trip ≠ I like this book

Information Search: Task

Information Search: Challenges
○ Suppose you have thousands of documents, each containing hundreds of pages, and you are looking for information on a specific concept, e.g., reinforcement learning. How would you perform such a search? How would you judge whether the results are relevant?
○ Queries may be incomplete, inaccurate, ungrammatical, etc. They may also be ambiguous: when you type in "book", what do you mean?
○ Some words matter more than others: I'm looking for accommodation in Bath
○ Some words mean similar things: I'm looking for accommodation in Bath ⇔ There are flats in Widcombe
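One classical answer to "some words matter more than others" is TF-IDF weighting. Below is a minimal sketch of vector-based ranking with scikit-learn; the toy documents and the query are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection (invented for illustration).
docs = [
    "Hotels and other accommodation options in Bath",
    "Reinforcement learning is a branch of machine learning",
    "Flats to rent in Widcombe, close to the city centre",
]

# TF-IDF downweights words that occur everywhere ("in", "the") and
# upweights distinctive ones ("accommodation", "Widcombe").
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

query_vector = vectorizer.transform(["accommodation in Bath"])

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```

Note that this purely word-based model cannot tell that the Widcombe flats are also relevant to the query: capturing that "some words mean similar things" calls for the semantic models and word embeddings discussed later in the course.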
Spam Filtering: Task & Challenges
○ Identification of potentially malicious, unsafe, or dangerous content
○ Why is this challenging? Some emails may contain clear "red flags" (unusual formatting, unknown sender, mass emailing, etc.); others can only be identified by their content

Text prediction: Task & Challenges
○ Widely used in predictive keyboards on smartphones, in browsers, in email clients (e.g., Smart Reply), etc. The most likely continuation is suggested based on the beginning of a word or a phrase
○ Google's Smart Reply may even compose whole (short) emails like "Monday works for me" or "Sounds good" on your behalf
○ Why is this challenging? The range of possible natural sentences is practically infinite. An ability to predict what comes next in language brings machine intelligence one step closer to human intelligence
○ http://www.wired.com/2015/06/google-made-chatbot-debates-meaning-life

NLP in relation to other fields
(The slide shows a diagram placing NLP among its neighbouring fields: Computer Science, Artificial Intelligence, Machine Learning, Statistics, Logic, Electrical Engineering, Computational Linguistics, Cognitive Science, Neuroscience, and Psycholinguistics.)
○ Computer Science contributes algorithms, software & hardware
○ Artificial Intelligence sets up the environment for intelligent machines
○ Machine Learning algorithms are widely used in NLP
○ Statistics helps in coming up with theoretical models and probabilistic interpretations of language phenomena
○ Logic helps in making sure the world described with NLP models makes sense
○ Electrical engineering techniques help with specific tasks (e.g., speech processing)
○ Computational linguistics provides expert knowledge about how language works
○ The other fields account for human factors, brain processes, etc.

Building blocks of NLP applications

Concepts and methods
○ Machine learning methods are widely used
○ If the goal is to predict a label selected from a finite set of categories, this is classification or categorization. Examples: spam filtering – binary (spam / ham); topic classification – multi-class (finance / sports / arts / science)
○ Annotated datasets are often available in open access for such purposes. If you have labelled data, you can apply supervised machine learning techniques: the ML algorithm tries to learn a function mapping the characteristics of the data from each class or category to the respective label
○ Data labelling is expensive and time-consuming, so labelled data is not always readily available. If labelled data is not available, unsupervised machine learning approaches can be used: e.g., clustering to identify groups of similar documents based on their content, or Latent Dirichlet Allocation (LDA) to detect topics in unlabelled data (a sketch follows at the end of this list)
○ Certain language phenomena are best described as sequences of events rather than as individual occurrences: e.g., the way characters follow each other in a word, or words follow each other in a sentence, is not random. Sequence modelling approaches help to address such tasks as part-of-speech tagging and language modelling, among others
○ Finally, a number of NLP applications rely on vector-based models. Such vectors may encode word occurrences across documents (as in information retrieval) or aspects of meaning across the vocabulary (as in semantic models and word vectors / embeddings)
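As a concrete instance of the unsupervised route mentioned above, here is a minimal LDA sketch with scikit-learn; the four-document corpus and the choice of two topics are invented for illustration, and real applications use far larger corpora.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy unlabelled corpus (invented for illustration): two latent topics.
docs = [
    "the match ended with a late goal and a penalty",
    "the striker scored twice before the final whistle",
    "shares fell as the bank reported quarterly losses",
    "investors sold their stocks after the market report",
]

# LDA is defined over raw word counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# The number of topics is a modelling choice; assume two here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the highest-weighted words for each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```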
Levels of linguistic analysis
○ Raw text processing: a computer doesn't have an idea of what a "word" is – text is a single stream of symbols. How should we split this stream into units?
○ Morphology: the sub-word level of linguistic analysis
  ○ book (singular) vs. books (plural)
  ○ (will) book (tomorrow) vs. (is) booking (now) vs. booked (yesterday)
  ○ (I) book vs. (he) books
  ○ More challenging still: am / is / are / was / were / been / … = be
○ Word-level analysis: identification of word types – parts of speech. E.g., interesting book (noun) vs. to book a ticket (verb)
○ Syntax: deals with the way words are grouped together in sentences to express meaning. E.g., Police help dog bite victim – who did what to whom?
○ Semantics: addresses questions related to the meaning of words and other language units. E.g., plot of land ≠ plot of a story

Implementation of a simple NLP application

Pipeline overview
(The slide shows a five-step pipeline: 1. analysis of the task; 2. analysis and preprocessing of the data; 3. definition and extraction of the relevant information; 4. implementation of the algorithm; 5. testing and evaluation.)
○ Any task can be thought of in terms of this pipeline. Let's walk through the five steps for spam filtering; a minimal end-to-end sketch follows after Step 5

Step 1: Analysis of the task
○ Define what exactly the task involves: e.g., ask yourself how you would solve it yourself (without ML)
○ In spam filtering, you probably pay attention to certain characteristics (sender, fonts, format, how many recipients the email has, etc.). You may also pay attention to the content: "lottery", "click on this link", "your account is blocked", and similar
○ Most probably, you classify emails into two types – normal emails and spam ⇒ a binary classification task

Step 2: Analysis and preprocessing of the data
○ Given the "red flags" (words and phrases), you may attempt using templates
○ For machine learning, define what the relevant data is and how to prepare it: you need access to labelled data of the two classes. What is the distribution of classes? Are you going to use only textual features? Are there any other significant differences (e.g., spam emails being considerably shorter)?

Step 3: Definition and extraction of the relevant information
○ Identify the relevant signal in the data: Is it single words ("lottery", "blocked") or phrases ("click on this link")? Are you going to learn from misspellings? From different ways to spell words (e.g., "Now", "now", "NOW")? From word occurrences or from word distribution? Will you apply any other normalisation techniques?
○ These points refer to feature selection, feature representation, and feature weighting

Step 4: Implementation of the algorithm
○ No algorithm can be considered absolutely the best for all tasks and all datasets (the "no free lunch" theorem). Analyse the task to identify which algorithm suits best in each particular case

Step 5: Testing and evaluation
○ It is important to understand how your current algorithm performs and what you could do better. For classification tasks, e.g., you can measure accuracy, precision, recall, and F1
○ Arguably, it is better to let some annoying spam messages slip through than to send important "normal" emails to the spam box – so is precision or recall more important here?
○ It is advisable to set up a baseline: What is the majority class distribution? How would the simplest algorithm perform? Are you really doing better with a more sophisticated approach?
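Putting the five steps together, here is a minimal end-to-end sketch with scikit-learn. The toy emails and labels are invented for illustration; a real filter would need much more data, richer features (Step 3), and a proper held-out evaluation (Step 5).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 2: toy labelled data for the two classes (invented for illustration).
emails = [
    "You have won the lottery, click on this link now",
    "Your account is blocked, verify your details immediately",
    "Are we still meeting for lunch on Thursday?",
    "Here are the lecture notes from last week",
]
labels = ["spam", "spam", "ham", "ham"]

# Steps 3-4: bag-of-words features plus a Naive Bayes classifier,
# a common simple baseline for binary text classification.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Step 5: in practice, measure precision/recall on held-out data;
# here we just classify one unseen message.
print(model.predict(["Click this link to claim your lottery prize"]))  # expected: ['spam']
```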
Introduction to text tokenization

Tokenization
○ For a machine, text comes in as a sequence of symbols, so it does not have an idea of what a word is
○ Tokenization, or word segmentation, is the task of separating out (tokenizing) the words in raw text
○ First of all, how would you define a word? The most straightforward solution is to split by whitespace. Are there any problems with this solution? E.g., sequences of characters not containing whitespace that should be split into multiple words, or single "words" containing whitespace?
○ It might be desirable to treat New York, rock 'n' roll, etc. as single units ("words"): New York ≠ New + York
○ Sequences such as I'm, we've, etc. should be split (I am, we have, etc.)
○ Not all languages delimit words with whitespace

Tokenization step by step
Example text: Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents. "I said, 'what're you? Crazy?'" said Sadowsky. "I can't afford to do that."

○ Let's define patterns and use regular expressions to split this text into words. What patterns can you define? Are there any challenges?
○ re.split('\s+', s): splitting by whitespace alone keeps Containers', "very, positive.", etc. unsplit. What else should be taken into account?
○ re.split('([\s.,:;!?\'"])+', s): splitting by whitespace and punctuation marks solves the problem of, e.g., Crazy?' and positive.", BUT it also splits Mr. (as well as Ph.D., U.S.A., etc.) into multiple words. This is undesirable. Can it be avoided? Are there any other problematic cases?
○ Apostrophes are ambiguous: can't and Containers' should be split, but cap'n and o'clock shouldn't
○ Number expressions are challenging: 62.5, $62.625, as well as 50,500 and 50 550,500, should not be split despite the whitespace and punctuation marks
○ Dates (11/12/21 as well as 11.12.21), URLs (google.com), email addresses ([email protected]), and company names (AT&T) should be kept as single "words"
○ Emoticons (:)) and hashtags (#nlp) should also be kept as single "words"
○ Expressions like New York Times should be identified as a single unit – these are called multi-word expressions and are dealt with using specialised algorithms

Tokenization in practice
○ Tokenization is the first preprocessing step in many NLP applications: it has to be applied before any other tools, and it has to run fast and be accurate
○ In practice, you don't need to invent your own tokenizer, since all NLP (and ML) libraries and toolkits include highly optimized tokenizers. These typically rely on ML algorithms combined with regular expressions, lists of common abbreviations, and other techniques
○ This week's homework: implement your own simple tokenization algorithm based on regular expressions and compare it to one of the available ones from the Natural Language Toolkit (NLTK); a starting-point sketch follows below
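As a hedged starting point for the homework (not a model solution), the sketch below contrasts the two naive regex splits from the slides with NLTK's word_tokenize on a fragment of the example text.

```python
import re
import nltk

# word_tokenize needs the Punkt models ("punkt_tab" on newer NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import word_tokenize

s = 'Mr. Sherwood said reaction to Sea Containers\' proposal has been "very positive."'

# Naive split on whitespace: leaves Containers' and positive." unsplit.
print(re.split(r"\s+", s))

# Splitting on whitespace and punctuation fixes those cases,
# but now Mr. is wrongly broken up as well.
tokens = re.split(r"""([\s.,:;!?'"])+""", s)
print([t for t in tokens if t and not t.isspace()])

# NLTK's tokenizer keeps Mr. intact and handles quotes and clitics sensibly.
print(word_tokenize(s))
```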
Introduction to linguistic analysis

Frequency analysis
○ Observation: words in language are not distributed evenly. The few most frequent words may cover the vast majority of all word occurrences: for example, the 135 most frequent vocabulary items in English account for 50% of all word usage according to the Brown Corpus of American English. Can you guess what the most frequent words in English are?
○ Observation: language is a dynamic system. Words get added to it (invented or borrowed) all the time; other words may go out of fashion ⇒ it is hard to estimate at any particular point how many words a language actually contains

Word distribution
○ Observation: given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (the most frequent word being at rank 1)
○ E.g., in the Brown Corpus, the most frequent word, the, accounts for 7% of all word occurrences in the corpus (69,971 out of ~1 million); the second-ranked word, of, for around 3.5% (36,411 occurrences); followed by and (28,852); etc.

Zipf's law
○ Named after the American linguist George Kingsley Zipf
○ In a population of N elements (e.g., N individual words in a corpus), the normalized frequency of the element of rank k (the fraction of the time the k-th most frequent word occurs) is defined as:

f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}

○ where s is the exponent: for English, s = 1. The exponent differs for different languages and populations (e.g., s = 1.07 when predicting the population of cities from their size rank)

Implications
○ If the most frequent words are the, of, and, a, an, for, in, at, do, are, and the like, and together they account for the vast majority of word usage, what implications does this have for NLP tasks?
○ These words help "glue" other, more meaningful words together, but by themselves they don't express much meaning: e.g., a book vs the book, stay in town vs stay out of town
○ In many contexts, such words are considered stopwords, and in many applications they are filtered out (Steps 2-3 in the NLP pipeline); the sketch below illustrates both the skew and the filtering
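To see the skew and the effect of stopword filtering first-hand, here is a minimal sketch over NLTK's copy of the Brown Corpus; exact counts may vary slightly across NLTK versions.

```python
from collections import Counter

import nltk

nltk.download("brown", quiet=True)
nltk.download("stopwords", quiet=True)
from nltk.corpus import brown, stopwords

# Case-fold the ~1M tokens of the Brown Corpus and count them.
words = [w.lower() for w in brown.words()]
counts = Counter(words)
total = len(words)

# The top-ranked words cover a disproportionate share of all tokens (Zipf).
for rank, (word, count) in enumerate(counts.most_common(5), start=1):
    print(f"rank {rank}: {word!r} covers {count / total:.1%} of the corpus")

# Filtering stopwords (and punctuation) leaves the content-bearing words.
stops = set(stopwords.words("english"))
content = [w for w in words if w.isalpha() and w not in stops]
print(Counter(content).most_common(5))
```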
Course logistics

Course objectives
○ This course will help you acquire theoretical knowledge of the fundamental NLP concepts and techniques, as well as practical skills
○ We will look into a variety of NLP applications, and each concept will be explained from the perspective of its use in a particular application
○ By the end of this course, you will be able to implement your own NLP application in an end-to-end manner

Course format
○ ~20 lectures this semester (one 2-hour lecture every Thursday)
○ Programming exercises will be provided every week after the lectures for you to get familiar with the techniques and their implementation. Exercises are left as your homework and are not assessed; solutions will be made available on Mondays
○ Lab sessions are optional: you can use this time to work on the practical exercises, especially if you prefer to use University resources

Assessment
○ 30% for coursework (mini-project)
○ 70% for the exam

Coursework (mini-project)
○ Your task is to build a sentiment analysis application that can automatically detect whether a movie review is positive or negative. Reviews are extracted from the IMDb database
○ The dataset for this project and the task description will be released tomorrow, Friday, October 7
○ You will be required to submit a report detailing your implementation steps. The report should be included in your Jupyter notebook together with the accompanying code (if you want to split your implementation across multiple notebooks, you can submit a .zip file, but one of the notebooks has to contain the main report)
○ Your report should detail all the steps of the NLP pipeline and explain the decisions you've made. You will be assessed on the basis of your report, not just on the results achieved by your algorithm
○ Submission deadline: Monday, December 12, 8pm

Course resources
○ Lecture slides and recordings; handouts will be released after the lectures
○ You are expected to work individually: e.g., the handouts contain suggested activities. These activities are not obligatory and are not assessed; however, you are encouraged to attempt the tasks and exchange your observations on the Moodle forum
○ Programming exercises are your homework – they are not assessed
○ Problem sets (tasks of the type you may get in the exam) will be released during the revision week

Books
○ Kochmar, E. (2021). Getting Started with Natural Language Processing – available online via the University of Bath Library
○ Jurafsky, D. and Martin, J.H. (2009). Speech and Language Processing – available at https://github.com/rain1024/slp2-pdf/tree/master/chapter-wise-pdf and https://web.stanford.edu/~jurafsky/slp3/
○ Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python – available at http://www.nltk.org/book/

NLP libraries and toolkits
○ NLTK, spaCy, Gensim, Matplotlib, NumPy, scikit-learn
○ Installation instructions are available on Moodle. You can either install the libraries on your own computer or run the code in Google's Colab
