Full Transcript

Intuitions on Text Data
CSMODEL

Natural Language Processing

Natural Language Processing (NLP) is a branch of computer science that deals with the interaction between computers and humans through natural (human) language. To accomplish tasks in NLP, there must be a way to represent or model human language in computer systems.

Challenges

Human language is ambiguous.
Language is constantly evolving.
Words can have different meanings based on context.
There are figures of speech, sarcasm, etc.

Example: Stolen painting found by tree.
This sentence is ambiguous, as it could mean two things: the stolen painting was found beside a tree, or a tree found the stolen painting. Humans will understand it, but computers will have a hard time understanding it based on the grammatical structure alone.

Example: That was fire bro, W move
Language is constantly evolving. For example, many slang terms that are common on the internet now were rarely used a few years ago. If human language is modeled with hardcoded rules, these rules need to be updated as the language evolves.

Example: The number of jobs in PH has drastically decreased. The number of COVID19 cases has drastically decreased.
The term “drastically decreased” can have a positive or negative connotation, depending on the context.

Example: It’s been many years since I’ve had such an exemplary vegetable. ~ Jane Austen, Pride and Prejudice
A term that is usually used to express praise is used in a sarcastic way, thus changing its meaning.

General Framework

Pre-processing, then Data Modelling, then Testing / Evaluation.

Preprocessing

First, some pre-processing steps are done to prepare the natural language text for modelling. The main goal is to extract the individual components from natural language text while removing the irrelevant parts.
Irrelevant characters, such as non-alphanumeric characters, are also removed from the text. Sometimes random characters appear in the text as a result of the method of collection (e.g., scraping reviews from the web).
Example: The movie was amazing and □ □ □ I will watch it again. □ □ □
The □ characters are removed.

Tokenization involves splitting the text into individual words and, in some cases, removing the punctuation marks.
Example:
Input: Hi, how are you doing today?
Output: Hi how are you doing today

Some words may be irrelevant to the modelling process, depending on the goals, and can be removed. For example, hashtags or Twitter mentions that are part of a sentence might not be useful in certain applications.
Example: Congratulations, I am so proud of you my friends! #SummaCumLaude @juandelacruz123 @mariaperez123

Words may be merged when they are equivalent, such as words that differ only in capitalization:
hello, Hello, HELLO become hello
Words with common alternative spellings may also be merged:
flavor, flavour become flavor

Stemming refers to chopping off parts of a word to extract the root word, in order to reduce inflection.

Original       Stemmed
Connect        Connect
Connected      Connect
Connection     Connect
Connections    Connect
Connects       Connect

Stemming algorithms sometimes lead to strange results, depending on the algorithm used.

Original       Stemmed
Trouble        Troubl
Troubled       Troubl
Troubles       Troubl
Troublesome    Troublesom

Instead of just “chopping off” parts of the word, lemmatization uses a dictionary like WordNet or special rule-based approaches, which generally leads to better results.

Original       Lemmatized
Trouble        Trouble
Troubled       Trouble
Troubles       Trouble
Troubling      Trouble
Goose          Goose
Geese          Goose
Tooth          Tooth
Teeth          Tooth
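The cleaning, filtering, tokenization, and case-merging steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `preprocess` and `naive_stem`, the regular expressions, and the suffix list are hypothetical choices made for this example, not a standard algorithm. Real stemmers (for example, the Porter stemmer) use far more elaborate rules.

```python
import re


def preprocess(text: str) -> list[str]:
    """Clean a raw string and return lowercase word tokens."""
    # Drop hashtags and @mentions (assumed irrelevant for this task).
    text = re.sub(r"[#@]\w+", " ", text)
    # Replace anything that is not a letter, digit, apostrophe, or whitespace.
    text = re.sub(r"[^A-Za-z0-9'\s]", " ", text)
    # Merge case variants (hello / Hello / HELLO) and split on whitespace.
    return text.lower().split()


def naive_stem(word: str, suffixes=("ions", "ion", "ed", "s")) -> str:
    """Chop off the first matching suffix (hypothetical toy rule set)."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


tokens = preprocess("Hi, how are you doing today? #SummaCumLaude @juandelacruz123")
print(tokens)

stems = [naive_stem(w) for w in ["connect", "connected", "connection", "connections", "connects"]]
print(stems)
```

With these toy rules, every word in the "Connect" family reduces to the same stem, which is the effect the stemming table describes.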
Stopwords are words that do not add much value to the meaning of natural language text. Depending on the intention, these might be removed. These words include: the, to, are, is, at, in, and others.
Before: when the snows fall and the white winds blow the lone wolf dies but the pack survives
After: snows fall white winds blow lone wolf dies pack survives

Existing Part-of-Speech (POS) taggers identify the part of speech of each word.
Example: in “the waiter cleared the plates”, “the” is a determiner, “cleared” is a verb, and “waiter” and “plates” are nouns.

Data Modelling

At this point, each observation / instance in the dataset may look like a list of words, each associated with some information. Now, apply the appropriate data modelling approach to represent the observations.

Word embeddings are a mathematical way to represent each word in the document. Words must be represented mathematically so that mathematical operations, such as regression, can be performed on them.
Types of word embeddings:
Frequency-based encoding
Prediction-based encoding

Naïve Word Embedding

Create a “dictionary” containing the unique words in the document.
Example: The dog ate the hotdog.

Index    0      1      2      3
Word     the    dog    ate    hotdog

Each word can be represented as a one-hot encoded vector, where the 1 marks the word's location in the dictionary.
the: {1, 0, 0, 0}
dog: {0, 1, 0, 0}
ate: {0, 0, 1, 0}
the: {1, 0, 0, 0}
hotdog: {0, 0, 0, 1}

Frequency based

A corpus can contain many unique words, so sometimes only the top n most frequently occurring words are selected. In some cases, the presence of a word (1 or 0), instead of its count, might be more useful.
      the  dog  is  cute  cat  also  red  but  blue  not
D1    2    1    2   2     1    1     0    0    0     0
D2    2    0    2   1     2    0     1    1    1     1

Term Frequency–Inverse Document Frequency (TF-IDF) is a measure of the importance of each word in the document:

$$w_{i,j} = tf_{i,j} \times \log \frac{N}{df_i}$$

The first factor is the term frequency (TF) and the second is the inverse document frequency (IDF), where:
$tf_{i,j}$: number of occurrences of $i$ in $j$
$N$: total number of documents
$df_i$: number of documents containing $i$
$i$: some word
$j$: some document

Prediction based

Word Vectors: a way to represent a word’s “meaning” or “idea” through a numerical vector.
General idea: represent each word as a numerical vector, where each element represents a certain property.

This is similar to a personality test, where each person is scored across several criteria:

Criteria                  Score
Openness to experience    79 out of 100
Agreeableness             75 out of 100
Conscientiousness         42 out of 100
Negative emotionality     50 out of 100
Extraversion              58 out of 100

[Figure: visual representation of a word vector]

The words “woman” and “girl” are similar in some elements of their encoding, and so are the words “man” and “boy”.
The words “girl” and “boy” are similar in some elements of their encoding, and so are the words “woman” and “man”.
Every word in this vocabulary refers to humans except “water”.
One of the elements might be an encoding for “royalty.”

The exact meaning behind each element in the vector is usually opaque to humans, since the vectors are trained using machine learning algorithms.

How are word vectors learned?
General idea: look at the ordering of the words in the collection of documents (training data).
Main intuition: if certain words frequently appear in close proximity to each other, then they are probably related in some way.
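The TF-IDF weight defined earlier, w(i,j) = tf(i,j) × log(N / df(i)), can be computed directly from token lists. The sketch below makes some assumptions: the function name `tf_idf` is hypothetical, the natural logarithm is used because the slides do not state a base, and the two example sentences are made up so that their word counts match the D1 and D2 rows of the frequency table.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: weight} mapping per document."""
    n_docs = len(docs)  # N: total number of documents
    # df[w]: number of documents that contain word w at least once.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[w]: occurrences of w in this document
        weights.append(
            {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}
        )
    return weights


# Made-up sentences whose counts match rows D1 and D2 of the table.
d1 = "the dog is cute the cat is also cute".split()
d2 = "the cat is cute but the cat is red not blue".split()

for name, doc_weights in zip(("D1", "D2"), tf_idf([d1, d2])):
    print(name, doc_weights)
```

Words that appear in every document, such as "the" here, get a weight of zero because log(N / df) = log(1) = 0, while words unique to one document get the highest weight per occurrence.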
N-Grams

Example sentence: quick brown fox jumped over the lazy dog

N-grams are contiguous sequences of n words in a document. Use a different n depending on the task.

Look only at the previous words:

Input 1    Input 2    Output
quick      brown      fox
brown      fox        jumped
fox        jumped     over

Look both ways:

Input      Output
brown      quick
brown      fox
fox        brown
fox        jumped
jumped     fox
jumped     over

These relationships are used to train the word embeddings using machine learning approaches, with pairs labeled 1 = related, 0 = unrelated.

Dependency Parsing

Dependency parsing shows the relationship of each word with other words by representing the words as a graph. This can be achieved using machine learning techniques.

Applications

Machine translation: translating from one language to another.
Sentiment analysis: determining whether an opinionated text has a positive or negative attitude.
Document classification: classifying a document according to some categories.
Summarization: automatically creating a summary from a text document.
Chatbots: automated systems that can answer questions and respond to certain statements using natural language.

Summary

To perform text data modelling, pre-processing techniques should be applied to extract individual words from the document. Then, convert those words into a model that can be used to describe the data or be used in machine learning algorithms.
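As a companion to the N-Grams examples earlier in this transcript, the following sketch generates the same kinds of training examples from a token list. The function names (`ngrams`, `previous_word_examples`, `both_ways_pairs`) and the `context` and `window` parameters are hypothetical names chosen for this illustration. The sketch only produces the input/output pairs and does not train any embeddings.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def previous_word_examples(tokens: list[str], context: int = 2):
    """(previous words, next word) examples: look only at the previous words."""
    return [
        (tuple(tokens[i - context:i]), tokens[i])
        for i in range(context, len(tokens))
    ]


def both_ways_pairs(tokens: list[str], window: int = 1):
    """(input word, neighbouring word) pairs: look both ways."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "quick brown fox jumped over the lazy dog".split()
print(ngrams(tokens, 3))
print(previous_word_examples(tokens)[:3])
print(both_ways_pairs(tokens))
```

A larger `window` pairs each word with more distant neighbours, which is one way to vary how "close proximity" is defined for a given task.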
