NLP, Text Mining, and Semantic Analysis
IE School of Science and Technology
2024
Alex Martínez-Mingo
Summary
This presentation covers the history of Natural Language Processing (NLP), from early models such as the Turing Machine and the Perceptron, through significant advances including information theory, to more contemporary models such as LSTMs and Transformers. It also touches on the role of statistical methods and large language models.
Full Transcript
NLP, TEXT MINING AND SEMANTIC ANALYSIS
Alex Martínez-Mingo
IE School of Science and Technology
Compulsory Subject, Academic Year 2024/25
Session II: The Dawn of Computational Linguistics

What is Computational Linguistics?
Computational Linguistics is the study of using automated computational methods to analyze, interpret, and generate human language.

Early Stages and Foundational Theories
The field began to take shape in the 1950s, driven by the advent of modern computers, but several earlier developments marked its evolution.

The Turing Machine (Alan Turing, 1936)
⬡ A theoretical device that manipulates symbols on a strip of tape according to a set of rules.
⬡ Considered a foundational model of computation, able to simulate any computer algorithm.
⬡ Provided the basic concept of algorithmic processing, crucial for the development of NLP.
⬡ Turing's work during World War II focused on decoding the Enigma machine. This task required understanding the structure and patterns of language, making it one of the first major linguistic challenges solved computationally.

The Artificial Neuron Model (McCulloch and Pitts, 1943)
⬡ The first conceptual model to represent a neuron with a simple mathematical model.
⬡ A pioneering effort in cognitive science and computational neuroscience that bridged the gap between biological processes and computational models.
⬡ Introduced the idea of neural networks, a fundamental concept in modern NLP.
⬡ Modern deep learning techniques in NLP, such as recurrent neural networks (RNNs) and Transformers, are an evolution of these early concepts.

Information Theory (Claude Shannon, 1948)
⬡ Introduced concepts like entropy, information content, and redundancy in communication systems.
⬡ Marked the beginning of the digital communication era and fundamentally changed the understanding of language as a form of information transfer.
⬡ Enabled the quantification of information in language, facilitating computational analysis.
⬡ Concepts like entropy are still useful in NLP for tasks like language modeling and text classification.

The N-Gram Model (Shannon and Markov)
⬡ Shannon's idea of entropy is crucial in language modeling, where the goal is to predict the probability of a sequence of words.
⬡ The concept of N-grams was not directly invented by Shannon but was influenced by his work; N-grams emerged as a practical application of Information Theory to language modeling.
⬡ An N-gram model predicts the probability of a word based on the occurrence of its preceding N-1 words (see the sketch after the Georgetown Experiment below).
⬡ It is a form of Markov model (1913), where the prediction of the next item in a sequence is based on a fixed number of preceding items.

The Georgetown Experiment (1954), using N-Grams
⬡ One of the earliest applications of n-grams was automating translation.
⬡ Developed by IBM and Georgetown University.
⬡ Russian-to-English translation.
⬡ Used approximately 250 words and six grammatical rules.
⬡ Accurately translated 60 sentences.
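To make the N-gram idea concrete, here is a minimal sketch (not part of the original slides) of a bigram model estimated by maximum likelihood; the toy corpus and the function name are hypothetical choices for illustration.

```python
from collections import Counter

# Tiny toy corpus (hypothetical, not from the slides).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Count unigrams and bigrams, padding each sentence with start/end markers.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# A bigram model (N = 2) predicts each word from its single preceding word (N - 1 = 1).
print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times, "the cat" once
```

In practice N-gram models add smoothing (e.g., Laplace or Kneser-Ney) so that unseen word pairs do not receive zero probability.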
The Perceptron (Rosenblatt, 1958)
⬡ One of the earliest models in the field of artificial intelligence.
⬡ A form of artificial neuron designed to simulate the decision-making process of the human brain.
⬡ Operates by weighing input signals, summing them, and passing them through a non-linear function to produce an output.
⬡ Provided a basic model for understanding how machines can process and classify linguistic data.
⬡ Concepts from the perceptron are still relevant in current NLP methodologies.

The Linguistic Wars (Chomsky and Skinner)
⬡ A significant intellectual debate within 20th-century linguistics between generative linguists, led by Noam Chomsky, and behaviorist linguists, led by B.F. Skinner, centered on the nature of language and the processes of language acquisition and understanding.
⬡ The Linguistic Wars did not have a clear "winner" in the traditional sense. Chomsky's theories of Generative Grammar and Universal Grammar gained significant influence and reshaped much of linguistic theory. However, opposing views, especially from empirical and cognitive linguistics, continued to contribute valuable insights.

The Multi-Layer Perceptron (Minsky and Papert)
⬡ Proposed by Marvin Minsky and Seymour Papert as an extension of Rosenblatt's perceptron with multiple layers of neurons.
⬡ Each layer can learn complex patterns by combining outputs from the previous layer.
⬡ Influenced the development of deeper neural network architectures, crucial for advanced NLP tasks. Today, multi-layer perceptrons are a core component of many advanced NLP systems.
⬡ Minsky and Papert used the XOR (exclusive or) logical function to critique the limitations of single-layer perceptrons. Their work, particularly the book "Perceptrons" (1969), highlighted significant limitations of early neural networks.
⬡ This led to disillusionment in the AI community and a reduction in funding and interest, marking the onset of the first AI winter.

The XOR Problem
Single-layer perceptrons cannot solve problems where the data is not linearly separable (a minimal sketch follows this slide).
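As an illustration (not from the slides) of the XOR limitation, the sketch below builds perceptron units with hand-chosen weights: no single thresholded unit can compute XOR, but stacking two layers of such units can. The weights here are assumed for the example, not learned.

```python
import numpy as np

def step(z):
    """Heaviside step activation, as in the classic perceptron."""
    return (z >= 0).astype(int)

def perceptron(x, w, b):
    """A single perceptron unit: weighted sum of inputs plus bias, then threshold."""
    return step(x @ w + b)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: one unit computing OR, one computing AND (hand-chosen weights).
h_or = perceptron(X, np.array([1, 1]), -0.5)    # fires if at least one input is 1
h_and = perceptron(X, np.array([1, 1]), -1.5)   # fires only if both inputs are 1
H = np.stack([h_or, h_and], axis=1)

# Output layer: OR AND NOT(AND) is exactly XOR.
xor = perceptron(H, np.array([1, -1]), -0.5)
print(xor)  # [0 1 1 0] -- not achievable with any single-layer perceptron
```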
The First AI Winter
⬡ During the 1960s and 1970s, there was a kind of "golden age" for rule-based Natural Language Processing (NLP).
⬡ Regular Expressions (RegEx).
⬡ The ELIZA (1966) and SHRDLU (1972) programs: the first AIs.
⬡ Also, during the first AI winter, other approaches tried to explain how human language works and how to implement it algorithmically.

The Semantic Network
⬡ Proposed by M. Ross Quillian in the 1960s.
⬡ A graph-based model representing knowledge as a network of interconnected nodes (concepts) and links (relationships).
⬡ Quillian demonstrated how information retrieval can be enhanced through networked structures.
⬡ Highly influential in the creation of knowledge graphs and ontology-based NLP systems.

The Semantic Memory
⬡ Proposed by Endel Tulving in the 1970s as a system that stores general world knowledge. Unlike episodic memory, it does not store personal experiences.
⬡ Provides a theoretical basis for understanding how knowledge and language are stored and retrieved in the human brain, and influences the design of knowledge representation systems in NLP.

The Prototype Theory
⬡ Developed by Eleanor Rosch in the 1970s.
⬡ Challenges classical categorization theory by suggesting that categories are centered around "prototypes", or typical examples, rather than a set of necessary and sufficient characteristics. A prototype is the "best" or "most typical" example of a category.
⬡ Shifted the understanding of how concepts are organized and categorized in the human mind, and influences the way algorithms are designed for categorization and clustering in NLP.

The Renaissance of Connectionist Models
During the 1980s, more specifically in 1986, something happened that would bring connectionist models back into the spotlight of AI and NLP.

The Backpropagation Algorithm (Rumelhart, Hinton and Williams, 1986)
⬡ Backpropagation enables the efficient training of MLPs: it adjusts weights not just in the output layer but also in the hidden layers, based on the error gradient.
⬡ Allows MLPs to learn complex patterns, including non-linear separations like the XOR.
⬡ Backpropagation consists of two main steps: forward propagation of the input and backward propagation of the error.
○ Forward Propagation: computes the output of the network for the given inputs.
○ Backward Propagation: calculates and propagates the error back, updating weights to minimize the loss.

Feedforward Models
⬡ Thanks to the backpropagation algorithm, scientists could develop feedforward network models: neural networks whose connections between nodes do not form cycles, commonly used for classification and regression tasks in NLP.
⬡ Advantages over N-gram models:
○ Can capture more complex patterns in language data.
○ Not limited by the fixed context size of n-grams, allowing a more flexible approach to language modeling.
○ Better at generalizing from training data, reducing the issue of data sparsity.
⬡ Feedforward models struggle to capture long-term dependencies in sequential data, given that they lack an internal memory mechanism to remember past inputs for future predictions.

Recurrent Neural Networks (Jeffrey Elman, 1990)
⬡ A type of neural network designed to process sequences of inputs by maintaining an internal state (or memory).
⬡ Ideal for sequential data, or any data where order is important: the model holds information about previous inputs, allowing for context understanding.
⬡ Essential in the development of generative language models and other sequence-dependent tasks.
⬡ The major problem of these models is the "vanishing gradient" problem.

The Vanishing Gradient
⬡ RNNs are typically trained using a method called Backpropagation Through Time (BPTT), which involves unrolling the RNN through time and then applying the standard backpropagation algorithm. This unrolling leads to deep networks for long sequences.
⬡ During BPTT, gradients of the loss function are propagated backward through each time step. As they are propagated, these gradients are multiplied by the weight matrix at each step.
⬡ When the weights are small, repeated multiplication of these small numbers during backpropagation results in exponentially smaller gradients; the gradients can become very small, effectively vanishing (a small numeric illustration follows this slide).
⬡ The vanishing gradient problem makes it difficult for the RNN to learn and retain information from inputs that appeared many steps earlier in the sequence.
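A small numeric sketch (not from the slides) of the vanishing-gradient effect described above: a single scalar recurrent weight stands in for the full weight matrix and activation derivatives, and the numbers are purely illustrative.

```python
# During BPTT the gradient reaching time step t is scaled roughly by the
# recurrent weight (times the activation derivative) once for every step it
# travels back. With a factor below 1 the signal shrinks exponentially.
recurrent_weight = 0.5

for steps_back in [1, 5, 10, 20, 50]:
    gradient_scale = recurrent_weight ** steps_back
    print(f"{steps_back:2d} steps back -> gradient scaled by {gradient_scale:.2e}")

# By 20 steps the scale is already ~9.5e-07, so inputs that far back barely
# influence the weight updates; with a factor above 1 the opposite problem,
# exploding gradients, appears instead.
```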
The Second AI Winter
⬡ AI technologies could not meet the overly ambitious expectations set in the 1980s by connectionist models.
⬡ Hardware limitations restricted the complexity of models and the size of datasets that could be processed.
⬡ Disappointment in AI progress led to reduced funding and support from governments and investors.
⬡ Research focus shifted to more feasible rule-based models and statistical methods.

Corpus-Based Linguistics
Although the Brown Corpus began to be developed during the 1960s and was completed during the 1970s and 1980s, it was not until the 1990s that the British National Corpus (BNC) was developed, consisting of 100 million words and made available to the public (Burnard, 2000), giving researchers in the field a vast amount of data to work with. This represented a huge boost for statistical methods applied to linguistics.

Statistical Methods and Machine Learning
During the second AI winter (1990s), statistical methods experienced significant growth, establishing themselves as the only viable means of making predictions about text. This later led to the application of machine learning to enhance the performance of these algorithms.

Naive Bayes:
⬡ Based on Bayes' Theorem, popularized in NLP for text classification and text categorization.
⬡ Widely used for spam detection and document categorization.
⬡ Valued for its simplicity and efficiency in handling large datasets.

Logistic Regression:
⬡ An older statistical method, but its use in NLP surged for binary classification tasks.
⬡ Effective in scenarios where features (words, phrases) and categories exhibit more linear, less complex relationships.

The Geometry of Meaning
⬡ During the same decade (the 1990s), the first spatial models of language began to be developed.
⬡ Deerwester et al. (1990) introduced the Latent Semantic Analysis (LSA) model, which starts from a term-document matrix and applies matrix decomposition using the Singular Value Decomposition (SVD) method. This defines an orthogonal vector space model (VSM) that allows both terms and documents to be represented from the original count matrix (a worked sketch follows the example below).
⬡ Lund and Burgess (1996), on the other hand, used a term co-occurrence matrix over various contexts to define their Hyperspace Analogue to Language (HAL) model. This model can optionally employ a dimensionality reduction technique (such as PCA) to represent, in this case, only the terms from the original matrix in a new vector space.

LSA example
Sentences:
1. "Data analysis involves processing large datasets."
2. "Machine learning models process data to make predictions."
3. "Neural networks are a type of machine learning model."
4. "Statistical methods in data science are essential for analysis."
5. "Predictive models in machine learning use statistical techniques."

Term-Document Matrix (excerpt):
Term        Sentence 1  Sentence 2  Sentence 3  Sentence 4  Sentence 5
analysis    1           0           0           1           0
data        1           1           0           1           0
datasets    1           0           0           0           0
essential   0           0           0           1           0
involves    1           0           0           0           0
…           …           …           …           …           …
use         0           0           0           0           1

LSA Terms Matrix, 2D Terms Representation (excerpt):
Term        Component 1  Component 2
analysis    0.461        1.296
data        1.076        1.247
datasets    0.193        0.679
essential   0.268        0.617
involves    0.193        0.679
…           …            …
use         0.591        -0.231
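As an illustration of the LSA pipeline just shown (not from the slides), here is a minimal scikit-learn sketch on the same five sentences. The exact component values will differ from the slide, since tokenization choices and the sign and scaling of SVD components are implementation-dependent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "Data analysis involves processing large datasets.",
    "Machine learning models process data to make predictions.",
    "Neural networks are a type of machine learning model.",
    "Statistical methods in data science are essential for analysis.",
    "Predictive models in machine learning use statistical techniques.",
]

# Build the term-document matrix (CountVectorizer yields documents x terms,
# so we transpose to get terms x documents, as on the slide).
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(sentences).T

# Truncated SVD projects each term into a 2-dimensional latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(term_doc)

for term, vec in zip(vectorizer.get_feature_names_out(), term_vectors):
    print(f"{term:12s} {vec[0]:+.3f} {vec[1]:+.3f}")
```

Cosine similarity between the resulting term vectors is the usual way to quantify semantic relatedness in such a space.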
The Geometry of Meaning
⬡ In the 2010s, with the advent of Big Data and a substantial increase in computational capacity, static spatial models improved qualitatively thanks to the efforts of a group of researchers at Google.
⬡ In 2013, Mikolov et al. (2013a, 2013b) proposed two architectures (CBoW and Skip-Gram) for creating dense word vectors through the application of neural network models that had until then been largely underexploited in NLP. The name of the model: Word2Vec.
⬡ Finally, it is essential to highlight GloVe (Global Vectors for Word Representation) as the last static VSM (non-contextual vectors) for language representation. Developed at Stanford University by Pennington, Socher, and Manning in 2014, GloVe's architecture is based on the idea that the relationships between words should be encoded from a co-occurrence matrix of those words across different context windows.
⬡ GloVe combines the benefits of matrix factorization methods (like those used in LSA) with the context-based learning of Word2Vec, aiming to leverage global word co-occurrence statistics.

The Last Connectionist Wave
These new spatial models opened the door to revisiting the use of more complex neural network models capable of capturing the subtleties of human language. From the late 2010s to the present, we are experiencing a third wave of connectionist models, and this time they are not disappointing.

Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
⬡ To address the vanishing gradient problem of RNNs, Hochreiter and Schmidhuber proposed Long Short-Term Memory (LSTM) models in 1997. These include memory cells capable of retaining long-term information, and they manage to forget or maintain relevant context information through a set of gates (input, output, forget).
⬡ Although this model was theoretically very powerful, it was not until the development of applied computational models like ELMo (Embeddings from Language Models) by Peters et al. (2018) that this architecture could truly be tested. ELMo is considered the first Contextual Embeddings model in history, featuring a bidirectional LSTM architecture (which we will explore in depth later in the course).
⬡ The major problem with these models is their computational cost, as the network training process is not parallelizable.

Transformers (Vaswani et al., 2017)
⬡ The Transformer model was introduced by Vaswani et al. in the landmark paper "Attention Is All You Need" in 2017.
⬡ Unlike LSTMs, Transformers handle long-range dependencies more efficiently thanks to the self-attention mechanism, which allows the model to weigh the importance of different parts of the input data regardless of their distance in the sequence (see the sketch after this slide).
⬡ Transformer architectures facilitate greater parallelization during training, significantly reducing training times compared to LSTMs.
⬡ Transformers scale better with the amount of data and computational resources, making them more effective for large-scale NLP tasks.
⬡ Examples: BERT, GPT.
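A minimal numpy sketch (not from the slides) of the scaled dot-product self-attention at the heart of the Transformer, simplified to a single head with no masking or multi-head projections; the toy dimensions and random matrices are assumptions for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # each output mixes information from all positions

# Toy example: 4 token embeddings of dimension 8 and random projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

Because every position attends to every other position in a single matrix multiplication, the whole sequence can be processed in parallel, which is the source of the training-time advantage over LSTMs mentioned above.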
Large Language Models
⬡ Large Language Models (LLMs) are advanced AI models trained on massive corpora encompassing a diverse range of text sources.
⬡ The scale of the training data and the number of model parameters (billions of weights) are key to their performance.
⬡ They require significant computational power, typically involving high-performance GPUs or TPUs. Training can take weeks or months, consuming substantial energy resources.
⬡ LLMs improve with scale: larger models tend to exhibit better performance and generalization capabilities.

Large Language Models
⬡ Large Language Models represent a significant leap in AI's ability to process and generate human language. The scaling process is integral to their success, leading to the emergence of sophisticated behaviors like In-Context Learning and Step-by-Step Reasoning.
⬡ In-Context Learning: LLMs excel at understanding and maintaining context over longer passages. They are capable of referring back to previously mentioned information and maintaining coherence in dialogue or narrative.
⬡ Step-by-Step Reasoning: some LLMs can mimic step-by-step reasoning, a vital aspect of problem-solving and decision-making tasks. This includes mathematical problem solving, logical reasoning, and technical troubleshooting.

Assignment: In-Depth Exploration of NLP Models or Techniques
Objective: Choose one of the models or techniques discussed in class and conduct an independent, in-depth investigation into it.
Instructions:
Selection of Topic:
○ Select a model or technique from our NLP course. This could be anything from N-Grams to Transformer models.
○ Ensure the topic is one that you find engaging and are willing to explore in detail.
Research:
○ Independently research your chosen topic. Use a variety of sources to gather comprehensive information.
○ You are encouraged to use ChatGPT as a research tool to answer questions, clarify doubts, or find resources.
Writing the Essay:
○ Write a one-page essay delving into your chosen topic.
○ Your essay should not just summarize the model or technique but also provide deeper insights or perspectives.
○ Discuss aspects like the development, underlying principles, applications, strengths, and limitations of the topic.
Submission:
○ You must submit the one-page essay via Turnitin as a PDF before the due date (on campus).

THANK YOU!
Contact: [email protected]