Introduction, Tokens & Document Similarity
Why do computers need to work with text?
Sean MacAvaney & Jake Lever
Text As Data

Welcome! Today we will talk about:
When do computers need to work with text?
Overview of the course
○ What will we learn in this course
○ Course practicalities (e.g. labs, assessments, etc)
How can computers find similar documents?

Text As Data lecturers this year
Dr Sean MacAvaney: teaching MSc (5106); research: Information Retrieval & Natural Language Processing.
Dr Jake Lever: teaching Honours (4074) / MSci (5096); research: Biomedical Text Mining.

Why make computers work with text?
Humans interact with one another through language. Language can be represented as text. This makes text a convenient interface between humans and computers.
Computers may use text as input, as output, or as both input and output.
[Diagram: text flowing into a system (as input), out of a system (as output), or both.]

Text data is ever growing and we need to work with it
https://www.worldwidewebsize.com/
Every minute:
Email users send ~200M messages (https://www.statista.com/statistics/456500/daily-number-of-e-mails-worldwide/)
Google receives ~6M search queries (https://seo.ai/blog/how-many-people-use-google)

Text data is unstructured, so hard to process!
Structured data: think tables or databases, where column headings provide labels on the meaning of each cell.
Unstructured text: a section of text with no labels to provide meaning.
Unstructured data (example): "With 67 million people, France is the second most populous country in the European Union. Its capital is the beautiful city of Paris and since 2002, the country has used the Euro. Germany also uses…"
Structured data (example):
Country | Capital | Currency | Population
France | Paris | Euro | 67,000,000
Germany | Berlin | Euro | 83,000,000
Canada | Ottawa | Canadian Dollar | 38,000,000

Think-Pair-Share: Comparing structured/unstructured
(Using the same structured and unstructured examples as above.) In your group, discuss the following and add your notes to the Padlet.
You are building a search system.
○ Why might you prefer to have structured data?
○ Why might you prefer to have unstructured data?
https://padlet.com/macavaney/tasd2024_1

Names of this field
Just so you are aware: this area of computer science has a few overlapping names:
Natural language processing (NLP)
Computational linguistics
Text analytics

Language follows rules. Can't we decode those rules?
"Dealing with text can't be that hard. Language follows rules that I learnt in school. I'll write those rules with some pattern matching and job done!"
No, language is more than just rules (though they help). It uses prior knowledge and reasoning.

Language is beautiful and also a massive pain
"I saw the elephant with my telescope." Who has the telescope?

Example applications using Text as Input
News aggregation, search tools, email suggestions.

Examples of text inputs
Documents, tweets, voice commands, search queries, web pages, medical records, books, and so much more.

Example Task: Document similarity
Can we find other Reddit posts that are similar to the one below?
Example applications using Text as Input and Output
"Où est la bibliothèque?" ("Where is the library?")
Assistants (e.g. Siri, Alexa), machine translation, email suggestions.
We will touch on text as output, but the main focus of this course is taking text as input!

Basic text output: Generating numbers with rules
Some data (e.g. numbers) can be turned into its textual form with a number of rules:
3 → "three"
43 → "forty-three"
143 → "one hundred and forty-three"
(A toy code sketch of this idea appears at the end of this section.)

Very advanced text output: Creative writing
https://www.ign.com/articles/stargate-google-ai-script
https://medium.com/swlh/i-wrote-a-book-with-gpt-3-ai-in-24-hours-and-got-it-published-93cf3c96f120

Examples of text output: An AI-written text adventure game
https://play.aidungeon.io

Text as Data has a long exciting history as a research area
Building on linguistics research: How does language work? How do we learn language?
Tied to computational performance: new CPUs and GPUs have enabled new advances.
The Internet provides an incredible source of example text.
Deep learning is changing the whole approach!
The methods are sufficiently mature for high-profile products like ChatGPT, Alexa, etc. You chose the right time to start studying it!
https://commons.wikimedia.org/wiki/File:Internet_Archive_book_scanner_1.jpg

New amazing abilities with language
New language systems are trained by asking them to complete a sentence:
"It showed 9 o'clock on my __________"

Text researchers became obsessed with Muppet characters
The ELMo and BERT approaches introduced a new style of deep learning model that could succeed at several different problems.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT, 2019.

The GPT-4 Hype
OpenAI's GPT-4 deep learning model showed incredible new abilities.
Examples at: https://beta.openai.com/examples

GitHub Copilot
Language models can work with code as well as written language. See https://copilot.github.com/

This is a fast moving field - BERT was in 2019!
In the past, bigger (and more complex) models haven't translated into better performance.
With transformer-based models, bigger (e.g. more parameters & data) seems to be better, for now.
But bigger models come with huge costs: training, computational, data, environmental, etc.
https://s10251.pcdn.co/pdf/2023-Alan-D-Thompson-AI-Bubbles-Rev-7b.pdf

There is a lot of hype about AI "understanding" text
Be skeptical of claims of human-level abilities.
Most systems may be very good at a specific task, but research on reliable generalisation is ongoing.
Models that appear to generalise are often already trained on the task (among many others).
https://fortune.com/2021/12/08/deepmind-gopher-nlp-ultra-large-language-model-beats-gpt-3/
https://www.bbc.co.uk/news/technology-27762088

Playing with Text Generation
https://chat.openai.com
https://www.bbc.co.uk/news/technology-63861322

Summary of Text as Data Introduction
Computers are now frequently using text as input and output.
○ Inputs can be search queries, documents to process, voice commands, etc.
○ Outputs can be text summaries, creative writing and more!
The amount of text data is growing at a substantial rate every day.
Deep learning has had a big recent impact on the field with transformers.
These new systems have impressive abilities both in interpreting and generating text.
This field is VERY fast moving.

Breather!
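As an aside before moving on to course details: the "generating numbers with rules" slide above can be made concrete with a toy sketch. This is my own illustration (not from the lecture) and only handles the range 0 to 999.

```python
# Toy rule-based number verbalisation for 0-999, just to illustrate
# that some text output can be produced with simple rules.
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        return TENS[tens] + ("-" + UNITS[unit] if unit else "")
    hundreds, rest = divmod(n, 100)
    words = UNITS[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")

print(number_to_words(3))    # three
print(number_to_words(43))   # forty-three
print(number_to_words(143))  # one hundred and forty-three
```

Rules like these work for highly regular data; the rest of the lecture is about text where no such neat rules exist.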
Course Details

Syllabus Overview
Representing text: traditional methods using sparse vectors
Clustering: how do you find a document similar to another?
Principles of language modelling: the probability of one word following another
Classifying text: for example, sentiment analysis
Representing text with dense numerical vectors: word2vec and contextual methods
Principles of transformers and deep learning
Named entity recognition & linking: finding words that describe specific people, objects, locations, etc
Extracting relations: how are the 2+ entities in the text related?
Coreference: making links between mentions that refer to the same thing
Ethical & societal considerations: are we okay with the applications and implications of these technologies?

Technical Content
Python
Google Colab (a hosted flavour of Jupyter Notebook)
Specific libraries:
○ spaCy
○ scikit-learn
○ transformers (from HuggingFace)

The Moodle
Will contain links to:
○ Lecture slides
○ Recordings (soon after each lecture)
○ Labs
○ Coursework and exam information (in the future)
MSc: https://moodle.gla.ac.uk/course/view.php?id=39020

Teams
All course announcements will be made using Teams.
Feel free to post questions to the Teams channel.
Various resources will be shared through it, so keep an eye on it.
If you are not a member, contact me and I can add you.

Textbook and Other Materials
No required textbook for this course. Optionally, you can read more in:
Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/
The scikit-learn documentation is very good: https://scikit-learn.org/stable/user_guide.html

Grading
Weight | Assessment
30% | Coursework
70% | Exam
Coursework:
○ Due in early March
○ Main coursework project (20%)
○ Literature survey (10%)
Exam:
○ In May

Labs
Location: Boyd Orr 1028
Time: Tuesday mornings (9.00-10.00 or 10.00-11.00; check your enrollment)
There is a mini-lab (Lab 0) online to get used to Colab and the data.
Material in labs is examinable, including knowledge of basic libraries.
Labs are not marked, but are helpful for understanding the course content.
https://universityofglasgowlibrary.wordpress.com/2015/02/20/defending-the-boyd-orr/

Office Hours
Office hours will be announced in the second week of the course.
Come to my office (Lilybank Gardens S172; directions in the Moodle) or message me on Teams to chat online.
I'm also happy to chat after the lecture for a short time.
I will not be able to respond quickly to questions outside lecture, lab times and office hours.

Labs: Reddit Data
A dataset of Reddit posts will be used for most of the labs.
Reddit:
○ "Front page of the internet"
○ A somewhat social platform
○ Some posts have text content
We will focus on the main posts with text content only:
○ No replies
○ No "title-only" posts with images/video/links
Data retrieved with the Reddit API.

Important Idea: Different approaches are good for different problems
This course teaches a number of different approaches to text problems.
No approach is the best for all problems. Each approach has strengths and weaknesses, often accuracy versus computational cost.
To know which is best for your problem, you need to set up an experiment so you can compare methods.

Learning Outcomes
Describe classical models for textual representations such as the one-hot encoding, bag-of-words models, and sequences with language modelling.
Identify potential applications of text analytics in practice.
Describe various common techniques for classification, clustering and topic modelling, and select the appropriate machine learning task for a potential document processing application.
Represent data as features to serve as input to machine learning models.
Deploy unsupervised and machine learned approaches for document/text analytics tasks.
Critically analyze and critique recent developments in natural language and text processing academic literature.
Assess machine learning model quality in terms of relevant error metrics for document processing tasks, in an appropriate experimental design.
Evaluate and explain the appropriate application of recent research developments to real-world problems.
https://citt.ufl.edu/resources/the-learning-process/designing-the-learning-experience/blooms-taxonomy/

Organisation Summary
Lectures on Tuesday from noon to 2pm.
Labs on Tuesday mornings: using Reddit data, on Google Colab, not marked.
Coursework due in early March. Exam in May. Office hours TBA.

BREAK!

Document Similarity

Example Task: Document similarity
Can we find other Reddit posts that are similar to the one below?

Think-Pair-Share: Approaches to document similarity
1. Which of the two posts (A or B) is closest to our post of interest? Why?
2. How might you code it?
[Figure: the post of interest shown alongside Option A and Option B.]

Possible Solution
If two posts have overlap, we hypothesise that they are talking about a similar subject. The more overlap, the higher the similarity.

Data cleaning is often necessary
Text documents often contain a lot of formatting, links and other content that needs to be removed or reformatted (depending on what you need).
Bold, italics, etc may need to be removed.
URLs in links may need to be removed, or could be valuable information depending on what you want to do.
Tables could be removed in whole, or just the formatting.

How do we measure overlap between two documents/posts?
1. Individual letters?
2. Short sequences of letters?
3. Words?

1. Individual letters?
Computers represent text as raw numbers for each character. Do two posts with similar underlying data have similar meaning?
ASCII representation of "IRN BRU": I=73, R=82, N=78, space=32, B=66, R=82, U=85
Not really. The order and grouping of letters is key. After all, "forty five" and "over fifty" share the same letters but have different meanings.

Text Encodings: ASCII and Unicode
To be experts in text data, you should be aware of ASCII and Unicode.
ASCII: 128 possible characters (in one byte): A-Z, a-z, 0-9, standard punctuation, and various control characters (e.g. new line).
But what about: á, λ and 😀? Enter: Unicode!
Unicode: a character set with variable-byte encodings (for example, UTF-8 uses 1-4 bytes per character). It covers the alphabets of many languages, plus emojis, maths symbols and lots more, and is frequently updated with new characters (v15.1 released in Sept 2023).
Other encodings exist, but ASCII and Unicode are the important ones.

Unicode issues can cause problems
Being unaware of Unicode can lead to odd character errors (known as mojibake):
"The Mona Lisa doesn't have eyebrows."
Python has the "encoding" parameter when loading files. UTF-8 (the normal Unicode option) is the default on Mac and Linux, but not on Windows.
A detailed Unicode explainer for when Unicode causes you problems: https://docs.python.org/3/howto/unicode.html

Neat use of individual letters: Detecting other languages
Different languages use different letters at different frequencies. This property can be used to detect other languages (though there are better ways 😀).
Frequencies of pairs/triples (e.g. n-grams) of letters can provide more data.
However, individual letters are not typically used for text analysis.
https://en.wikipedia.org/wiki/Letter_frequency
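The letter-frequency idea is easy to try out. Below is a minimal sketch (my own illustration, not lecture code) that counts individual letters and letter pairs using only the Python standard library. The German snippet is a rough, made-up translation of the slide's example sentence, used purely to show that the frequency profiles differ.

```python
from collections import Counter

def char_ngram_counts(text: str, n: int = 1) -> Counter:
    """Count character n-grams; n=1 gives single-letter frequencies."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

english = "it showed nine o'clock on my watch"
german = "es zeigte neun uhr auf meiner uhr"   # made-up parallel snippet

# Single-letter frequencies already differ between the two languages...
print(char_ngram_counts(english, n=1).most_common(5))
print(char_ngram_counts(german, n=1).most_common(5))

# ...and pair (bigram) frequencies are even more distinctive.
print(char_ngram_counts(english, n=2).most_common(5))
print(char_ngram_counts(german, n=2).most_common(5))
```

Real language-identification tools use the same idea at a much larger scale, typically with smoothed n-gram statistics learned from large corpora.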
2. Sequences of letters? (also known as character n-grams)
Individual letters aren't very useful, but some sequences can be quite specific, e.g. "irn".
Bi-grams (n=2) are pairs of neighbouring characters, e.g. "ir", "rn", "n ", " b", "br", "ru".
Tri-grams (n=3) are triples of neighbouring characters, e.g. "irn", "rn ", "n b", " br", "bru".
You can extract n-grams by running a moving window across the text.
[Figure: a sliding window moving across "It's a Scottish drink and it's…", showing character 4-grams.]

Strengths & Weaknesses of Character N-grams
Strengths:
More specific than individual letters, e.g. tri-grams can capture short words.
Easy to implement.
Weaknesses:
Using small n (e.g. bigrams) won't be specific enough for many words.
Large n may create n-grams that are very rare.
The same n-gram can come from different words with different meanings, e.g. "ong" is in both "wrong" and "strong".

3. Words
Given a section of text, how do we split it up into "words"?

Tokenization
Splitting a section of text into individual tokens.
A token (or term) is the technical name for a meaningful sequence of characters.
A token isn't exactly a word:
○ A word may be broken into separate tokens if it makes sense (e.g. "don't" is often split into "do" and "n't").
○ Tokens may also be punctuation, numbers, some whitespace or other spans of text.
Other languages can have very different tokenization challenges, e.g. 授人以鱼不如授人以渔。 (written Chinese does not put spaces between words).
How would you tokenize text in English?

Tokenizing with some handmade rules
Example sentence: "John's father didn't have £100."
Idea 1: Splitting on whitespace (spaces, tabs, etc):
John's | father | didn't | have | £100.
Idea 2: Splitting on whitespace + punctuation (,.;!@#$%, etc):
John | ' | s | father | didn | ' | t | have | £ | 100 | .
Ideal tokenization requires some handcrafted rules to get:
John | 's | father | did | n't | have | £ | 100 | .

Tokenizing with Spacy
Good tokenization requires a lot of hand-crafted rules. Fortunately there are many good implementations out there. In the labs, we will use spaCy!
Other popular toolkits: Stanford CoreNLP, NLTK.
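Here is a minimal tokenization sketch with spaCy, assuming only that the spacy package is installed. Using spacy.blank("en") gives just the rule-based English tokenizer without downloading a trained model; the labs may load a fuller pipeline, and the exact tokens can vary slightly between spaCy versions.

```python
import spacy

# A blank English pipeline includes the rule-based tokenizer only.
nlp = spacy.blank("en")

doc = nlp("John's father didn't have £100.")

# Punctuation and the currency symbol become tokens of their own,
# and "didn't" is split into "did" + "n't" by tokenizer exception rules.
print([token.text for token in doc])
# Expected (roughly): ['John', "'s", 'father', 'did', "n't", 'have', '£', '100', '.']
```

The point of the example is that the "ideal tokenization" from the previous slide comes for free once someone has encoded the hand-crafted rules in a library.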
Summary of Data for Measuring Overlap
What should we measure the overlap between posts with?
1. Individual letters? No. Letters don't provide enough information.
2. Sequences of letters (character n-grams)? Maybe. They have some problems but can be very efficient.
3. Words (or really tokens)? Yes. Words capture meaning better than character n-grams. You could use word n-grams to capture extra information.

What is our post about?
Some words are more important for representing the subject of the post:
"Anyone tried Irn Bru? It's a Scottish drink and it's banned some countries and I was wondering if anyone here has tried it. It has quite a unique taste and it's not something you'd forget quickly. You either love it or hate it I think."

What is our post NOT about?
Some common words do not convey much meaning by themselves (the same post, with its common words highlighted on the slide).

Filtering using stopwords
A common practice is to remove words with minimal meaning. These words are known as "stopwords". There are many published lists, but none are definitive.
BUT: some important phrases contain common words, e.g. "The Who", "to be or not to be".
Fun history: the first standardized list of stopwords was proposed by C.J. van Rijsbergen in ~1975. He founded the Information Retrieval group at the University of Glasgow.
C. J. van Rijsbergen, Information Retrieval, 2nd Edition, Butterworths, London, 1979.

Two words only map together if they match exactly!
Problem: the two sentences below don't overlap at all by individual words:
"I loved IRN BRU"
"He loves irn bru"
Case ("IRN" and "irn"): a common practice is to make text case insensitive by converting all text to lowercase. But: the case of a word may give important information.
Word forms ("loved" and "loves"): we need to process words into a canonical version by removing suffixes so that they match (e.g. loved -> love). But: you may lose some meaning.

Breaking words into stems and affixes
Stems: the core meaning-bearing units (e.g. love). For verbs, think of the infinitive (to love).
Affixes: bits and pieces that adhere to stems (e.g. -d or -s). Often grammatical additions for conjugation or tense.
Stems and affixes are known broadly as morphemes. Morpheme: a small meaningful unit that makes up words.
We want the stems. We need STEMMING!

Stemming
Definition: a process for reducing inflected words to their stem or root form.
Often rule-based; removes common suffixes (e.g. -d, -ing, -s, etc). Generally works on single words with no context.
Can make mistakes, but often they aren't that bad (e.g. computing -> comput).

The Porter Stemmer - a popular stemming algorithm
Step 1a: sses → ss (caresses → caress); ies → i (ponies → poni); ss → ss (caress → caress); s → ø (cats → cat)
Step 1b: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster); …
Step 2 (for long stems): ational → ate (relational → relate); izer → ize (digitizer → digitize); ator → ate (operator → operate); …
Step 3 (for longer stems): al → ø (revival → reviv); able → ø (adjustable → adjust); ate → ø (activate → activ); …

Lemmas
Two words have the same lemma if they have:
1. The same stem
2. The same part-of-speech (e.g. noun, verb, etc)
3. Essentially the same meaning
Difference between lemmas and stems: lemmas take in the context of the sentence.

Context of words matters
Words don't work independently, and neighbouring words can change their meaning. "really like" and "don't like" have different meanings! How can we capture that?

Word n-grams can capture additional meaning
You can go beyond individual words into pairs or triples of words.
Word bi-grams could capture "irn bru" as two words instead of "irn" and "bru" individually.
Word bi-grams and tri-grams are fairly powerful but can be computationally expensive. (There are about ~7 million words in the English Wiktionary.)
[Figure: a sliding window moving across "It's a Scottish drink and it's banned some…", showing word 3-grams.]
(A code sketch combining these normalization steps follows below.)
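To tie the last few slides together, here is a rough sketch of lowercasing, stopword removal, stemming and word bi-gram extraction. It is my own illustration, not lab code: it assumes NLTK is installed for its Porter stemmer (NLTK is one of the toolkits mentioned above), and the tiny stopword list is made up for the example; real lists are much longer.

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()

# A tiny, made-up stopword list for illustration only.
STOPWORDS = {"a", "and", "he", "i", "it", "its", "the", "was", "if", "has"}

def normalise(tokens):
    """Lowercase, drop stopwords, then stem each remaining token."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOPWORDS]
    return [stemmer.stem(t) for t in kept]

doc_a = ["I", "loved", "IRN", "BRU"]
doc_b = ["He", "loves", "irn", "bru"]

print(normalise(doc_a))  # ['love', 'irn', 'bru']
print(normalise(doc_b))  # ['love', 'irn', 'bru']  -- now the two documents match

# Word bi-grams over the normalised tokens:
norm = normalise(doc_a)
bigrams = list(zip(norm, norm[1:]))
print(bigrams)           # [('love', 'irn'), ('irn', 'bru')]
```

After normalization, "I loved IRN BRU" and "He loves irn bru" overlap completely, which is exactly the behaviour the exact-match problem above was asking for.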
Summary of problems and solutions
1. Individual characters aren't very helpful.
○ Use words.
2. Splitting text into words is actually quite challenging to do cleanly.
○ Tokenizers use a set of rules to split on whitespace, punctuation and other patterns.
○ Use the spaCy parser (generally good).
3. Some words do not add useful information.
○ Use stopwords to filter out common words.
○ Use other metrics to give stronger weight to "likely important" rarer words.
4. Words with the same meaning may not match exactly (e.g. swim != swimming).
○ Use stemming and lemmatization.
5. Some words only keep their meaning with the context of neighbouring word(s).
○ Use bi-grams, tri-grams, etc.

Summary of a standard pipeline
1. Data cleaning, e.g. converting a web page to plain text (by removing links, tables, etc).
2. Tokenizing (which may also separate sentences).
3. Stopword removal.
4. Normalizing the words, using stemming/lemmatization to get a normalized form.

Document similarity using set similarity

Problem Statement (again) and overlap idea
Can we find other Reddit posts that are similar to the one below? If two posts have substantial overlap of words, we hypothesise that they are talking about a similar subject.

Comparing documents with sets of tokens
A set is a group of unique items. A document can be represented as a set of the tokens in it (the tokens could be stemmed, etc, as in the previous section).
Some characteristics of sets:
No concept of item frequency (so a word that occurs many times in a document is treated the same as one that appears once).
Very efficient computation when comparing sets.
Python (and most other languages) has sets as a standard data structure.

Set operations
|X| = number of elements in a set X
X ∩ Y = intersection of sets X and Y: items that are in X and Y
X ∪ Y = union of sets X and Y: items that are in X or Y
X – Y = set difference of X and Y: items that are in X but not Y
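These set operations map directly onto Python's built-in set type. A quick sketch, using small token sets loosely based on the running IRN BRU example (the sets themselves are made up for illustration):

```python
# Two documents represented as sets of (already normalised) tokens.
X = {"irn", "bru", "is", "a", "scottish", "drink"}
Y = {"irn", "bru", "is", "banned", "in", "some", "countries"}

print(len(X))    # |X| = number of elements in X -> 6
print(X & Y)     # X ∩ Y: tokens in both documents
print(X | Y)     # X ∪ Y: tokens in either document
print(X - Y)     # X – Y: tokens in X but not in Y
```

Because set membership tests and intersections are cheap, these operations stay fast even when comparing one post against many others.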
Similarity measures
sim(X,Y) is a function that calculates a similarity measure, where X and Y are the sets of tokens for the two documents being compared.
sim(X,Y) should be high when X and Y are similar, and low when X and Y are dissimilar.
There is not one definitive similarity measure. Let's explore a few!

Number of overlapping words
sim(X,Y) = |X ∩ Y|
where X and Y are sets of words for two documents, ∩ is the intersection and |.| is the count of elements in a set.
One can argue that two documents are similar if they contain many of the same tokens.
But: this needs to be normalized (i.e. so that sim(X,Y) is between 0 and 1). A long document that contains a large selection of words would match with lots of documents, so we need to factor in the number of tokens in each token set.

Overlap Coefficient
sim(X,Y) = |X ∩ Y| / min(|X|, |Y|)
Meaning: the percentage of unique tokens in the smaller document that appear in the larger document.
Improvement: now using the number of unique tokens in X and Y to normalize the measure.
Bounded between zero and one: 0 means no overlapping tokens; 1 means that one set is a subset of the other.

Sørensen–Dice Coefficient
sim(X,Y) = 2|X ∩ Y| / (|X| + |Y|)
Properties: bounded between 0 and 1; 0 means no overlap and 1 means perfect overlap.
Difference with the Overlap Coefficient: it will only be 1 if the two sets match exactly, not if one is a subset of the other.

Jaccard Similarity
sim(X,Y) = |X ∩ Y| / |X ∪ Y|
The number of overlapping tokens divided by the number of unique tokens across the two documents.
Properties: bounded between 0 and 1 (like Sørensen–Dice). Very popular.
Difference with the Sørensen–Dice Coefficient: the Jaccard distance, 1 - sim(X,Y), satisfies the triangle inequality:
○ dist(X,Z) ≤ dist(X,Y) + dist(Y,Z)
This makes Jaccard a metric. Sørensen–Dice does not satisfy this, making it a semi-metric.

Properties of a metric
Distance between X and X is zero: d(X,X) = 0
Positive: d(X,Y) > 0 where X ≠ Y
Symmetric: d(X,Y) = d(Y,X)
Triangle inequality holds: d(X,Z) ≤ d(X,Y) + d(Y,Z)

Think-Pair-Share: Calculating Jaccard Similarity
Calculate the Jaccard Similarity, sim(X,Y) = |X ∩ Y| / |X ∪ Y|, for the two "documents" below (assume tokenizing on whitespace, lowercasing and ignoring punctuation):
1. IRN BRU rocks! I drink IRN BRU daily!
2. irn bru is bad. sprite is better
Use the code #2017312 on Slido.com

Example using Jaccard Similarity
X (unique tokens in the first document): { "irn", "bru", "rocks", "i", "drink", "daily" }
Y (unique tokens in the second document): { "irn", "bru", "is", "bad", "sprite", "better" }
Jaccard Similarity:
|X ∩ Y| / |X ∪ Y| = |{"irn", "bru"}| / |{"irn", "bru", "rocks", "i", "drink", "daily", "is", "bad", "sprite", "better"}| = 2/10 = 1/5

Summary of measures of set similarity
Documents can be represented as sets of tokens.
Set similarity measures can be used to calculate whether two documents are similar or dissimilar.
Measures explored:
○ Overlap Coefficient
○ Sørensen–Dice Coefficient
○ Jaccard Similarity
Weaknesses:
○ All words are treated equally, but some should be more important.
○ Word frequencies are ignored.
○ Two different words with similar meaning cannot match.

Next Lab: Tokenizing text and document similarity
The next lab will involve tokenizing Reddit posts and using set similarity measures to find similar posts.
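Since the next lab covers exactly this, here is a minimal sketch of the three measures from this section, checked against the worked IRN BRU example above. The function names are my own illustration, not the lab's API.

```python
def overlap_coefficient(X: set, Y: set) -> float:
    # |X ∩ Y| / min(|X|, |Y|)
    return len(X & Y) / min(len(X), len(Y))

def sorensen_dice(X: set, Y: set) -> float:
    # 2|X ∩ Y| / (|X| + |Y|)
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X: set, Y: set) -> float:
    # |X ∩ Y| / |X ∪ Y|
    return len(X & Y) / len(X | Y)

# Token sets from the worked example (lowercased, punctuation removed).
X = {"irn", "bru", "rocks", "i", "drink", "daily"}
Y = {"irn", "bru", "is", "bad", "sprite", "better"}

print(jaccard(X, Y))              # 2 / 10 = 0.2
print(overlap_coefficient(X, Y))  # 2 / 6  ≈ 0.33
print(sorensen_dice(X, Y))        # 4 / 12 ≈ 0.33
```

Note that the three measures disagree on the exact score but agree on the ranking here; which one to use depends on the problem, which is why comparing methods experimentally (as discussed earlier in the course details) matters.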