Web and Text Analytics 2024-2025 Week 11 PDF

University of Macedonia

Evangelos Kalampokis


Summary

This document contains lecture notes for a course on web and text analytics. It covers RNN encoder-decoder architectures, machine translation and sequence-to-sequence models with their use cases, Transformer models and attention mechanisms, the architecture of the World Wide Web, Linked Data and knowledge graphs, and machine learning on graphs.

Full Transcript


Web and Text Analytics 2024-25, Week 11
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr

RNN and encoder-decoder architecture
▪ Just as with convolutional neural networks, there has been a tremendous amount of innovation in RNN architectures, culminating in several complex designs that have proven successful in practice.
▪ In general sequence-to-sequence problems such as machine translation, inputs and outputs are of varying lengths that are unaligned.
▪ The standard approach to handling this sort of data is to design an encoder–decoder architecture consisting of two major components:
– an encoder that takes a variable-length sequence as input, and
– a decoder that acts as a conditional language model, taking in the encoded input and the leftwards context of the target sequence and predicting the subsequent token in the target sequence.

Machine Translation
▪ Let's take machine translation from English to French as an example.
▪ Given an input sequence in English: "They", "are", "watching", ".", this encoder–decoder architecture first encodes the variable-length input into a state, then decodes the state to generate the translated sequence, token by token, as output: "Ils", "regardent", ".".
▪ The encoder–decoder architecture forms the basis of the different sequence-to-sequence models in the following sections.

Sequence-to-sequence (seq2seq) models
▪ Seq2Seq (sequence-to-sequence) models are a type of neural network, a specialized Recurrent Neural Network architecture, designed to transform one data sequence into another.
▪ They are handy for tasks where the input and output are sequences of varying lengths, which traditional neural networks struggle to handle, such as machine translation, question answering, chatbots, and text summarization.

Use Cases of Sequence-to-Sequence Models
▪ Machine Translation: one of the most prominent applications of Seq2Seq models is translating text from one language to another (see the sketch after this list).
▪ Text Summarization: Seq2Seq models can generate concise summaries of longer documents, capturing the essential information while omitting less relevant details.
▪ Speech Recognition: converting spoken language into written text. Seq2Seq models can be trained to map audio signals (sequences of sound) to their corresponding transcriptions (sequences of words).
▪ Chatbots and Conversational AI: these models can generate human-like responses in a conversation, taking the previous sequence of user inputs and generating appropriate replies.
▪ Image Captioning: Seq2Seq models can describe the content of an image in natural language. The encoder processes the image (often using convolutional neural networks, CNNs) to produce a context vector, which the decoder converts into a descriptive sentence.
▪ Video Captioning: similar to image captioning but with videos, Seq2Seq models generate descriptive texts for video content, capturing the sequence of actions and scenes.
▪ Time Series Prediction: predicting the future values of a sequence based on past observations.
▪ Code Generation: generating code snippets or entire programs from natural language descriptions.
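To make the machine-translation use case above concrete, here is a minimal sketch using the Hugging Face transformers pipeline; the task name and the Helsinki-NLP/opus-mt-en-fr checkpoint are illustrative choices for this example, not something prescribed by the lecture notes.

```python
# Minimal sketch: English-to-French translation with a pretrained seq2seq model.
# Assumes `pip install transformers sentencepiece`; the checkpoint below is an
# illustrative choice, not one specified in the lecture notes.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

result = translator("They are watching.")
print(result[0]["translation_text"])  # expected output close to: "Ils regardent."
```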
Encoder-Decoder Architecture
▪ The most common architecture used to build Seq2Seq models is the encoder-decoder architecture.
▪ We can demonstrate the application of an encoder–decoder architecture, where both the encoder and the decoder are implemented as RNNs, to the task of machine translation.
▪ Here, the encoder RNN will take a variable-length sequence as input and transform it into a fixed-shape hidden state.
▪ Later, we will introduce attention mechanisms, which allow us to access encoded inputs without having to compress the entire input into a single fixed-length representation.

Encoder-Decoder Architecture
▪ Then, to generate the output sequence one token at a time, the decoder, consisting of a separate RNN, will predict each successive target token given both
– the input sequence and
– the preceding tokens in the output.
▪ The special "<eos>" (end-of-sequence) token marks the end of the sequence.
▪ Our model can stop making predictions once this token is generated.

Encoder-Decoder Architecture
▪ At the initial time step of the RNN decoder, there are two special design decisions to be aware of:
– First, we begin every input with a special beginning-of-sequence "<bos>" token.
– Second, we may feed the final hidden state of the encoder into the decoder at every single decoding time step. In some other designs, the final hidden state of the RNN encoder is used to initialize the hidden state of the decoder only at the first decoding step.

Encoder-Decoder Architecture
▪ During training, the decoder will typically be conditioned upon the preceding tokens in the official "ground truth" label (teacher forcing).
▪ However, at test time, we will want to condition each output of the decoder on the tokens already predicted.
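As a rough illustration of the architecture just described, the following is a minimal PyTorch sketch (not the lecture's own code) of a GRU-based encoder and decoder, where the decoder is initialized with the encoder's final hidden state and also receives it at every decoding step; all layer sizes and names are assumptions made for this example.

```python
# Minimal sketch of an RNN encoder-decoder (assumed PyTorch implementation,
# not the lecture's own code). Sizes and names are illustrative.
import torch
import torch.nn as nn

class Seq2SeqRNN(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        # Encoder: variable-length source sequence -> fixed-shape hidden state.
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Decoder: previous target token + encoder context -> next hidden state.
        self.decoder = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence; keep only the final hidden state.
        _, enc_state = self.encoder(self.src_embed(src_ids))                 # (1, B, H)
        # Feed the encoder's final state to the decoder at every time step.
        context = enc_state[-1].unsqueeze(1).repeat(1, tgt_ids.size(1), 1)   # (B, T, H)
        dec_in = torch.cat([self.tgt_embed(tgt_ids), context], dim=-1)
        dec_out, _ = self.decoder(dec_in, enc_state)
        return self.out(dec_out)  # (B, T, tgt_vocab): scores for the next token

# During training we would feed the ground-truth target prefix ("<bos>", y1, y2, ...)
# as tgt_ids (teacher forcing); at test time we would instead feed back the tokens
# already predicted, stopping once "<eos>" is generated.
model = Seq2SeqRNN(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```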
Transformers
▪ The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks.
– https://arxiv.org/abs/1706.03762
▪ This was followed by the introduction of several influential models, including:
– June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
– October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences
– February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
– October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT's performance
– October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
– May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)

Transformer models
▪ Broadly, they can be grouped into three categories:
– GPT-like (also called auto-regressive Transformer models)
– BERT-like (also called auto-encoding Transformer models)
– BART/T5-like (also called sequence-to-sequence Transformer models)
https://huggingface.co/learn/nlp-course/chapter1/4

General Architecture
▪ The model is primarily composed of two blocks:
– Encoder (left): the encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
– Decoder (right): the decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

General Architecture
▪ Each of these parts can be used independently, depending on the task:
– Encoder-only models: good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
– Decoder-only models: good for generative tasks such as text generation.
– Encoder-decoder models (or sequence-to-sequence models): good for generative tasks that require an input, such as translation or summarization.

Attention layers
▪ A key feature of Transformer models is that they are built with special layers called attention layers.
▪ In fact, the title of the paper introducing the Transformer architecture was "Attention Is All You Need"!
▪ This layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

Attention mechanism (Translation)
▪ To put this into context, consider the task of translating text from English to French.
▪ Given the input "You like this course", a translation model will need to also attend to the adjacent word "You" to get the proper translation for the word "like", because in French the verb "like" is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word.
▪ In the same vein, when translating "this" the model will also need to pay attention to the word "course", because "this" translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of "this".
▪ With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

Attention mechanism
▪ The same concept applies to any task associated with natural language:
– a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.

Architectures vs. checkpoints
▪ These terms have slightly different meanings:
– Architecture: this is the skeleton of the model, the definition of each layer and each operation that happens within the model.
– Checkpoint: these are the weights that will be loaded in a given architecture.
– Model: this is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both. This course will specify architecture or checkpoint when it matters, to reduce ambiguity.
▪ For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say "the BERT model" and "the bert-base-cased model."
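As a concrete illustration of the architecture/checkpoint distinction, the following sketch (assuming the Hugging Face transformers library linked above) loads the bert-base-cased checkpoint into the BERT architecture.

```python
# Minimal sketch (assumed use of the Hugging Face `transformers` library):
# "BERT" is the architecture, "bert-base-cased" is a checkpoint for it.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")  # downloads the pretrained weights

inputs = tokenizer("Attention is all you need.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size 768)
```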
Encoder models
▪ Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention, and are often called auto-encoding models.
▪ The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
▪ Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
▪ Representatives of this family of models include: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa.

Decoder models
▪ Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
▪ The pretraining of decoder models usually revolves around predicting the next word in the sentence.
▪ These models are best suited for tasks involving text generation.
▪ Representatives of this family of models include: CTRL, GPT, GPT-2, Transformer XL.

Sequence-to-sequence models
▪ Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
▪ The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
▪ Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
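To connect the three model families above to practice, here is a small hedged sketch using Hugging Face pipelines; the checkpoint names are illustrative choices for this example, not part of the lecture material.

```python
# Minimal sketch (assumed Hugging Face `transformers` pipelines); the checkpoint
# choices are illustrative, not prescribed by the lecture notes.
from transformers import pipeline

# Encoder-only (auto-encoding) model: reconstruct a masked word.
fill = pipeline("fill-mask", model="bert-base-cased")
print(fill("The Transformer architecture was introduced in [MASK].")[0]["token_str"])

# Decoder-only (auto-regressive) model: continue a prompt one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("Attention layers allow a model to", max_new_tokens=20)[0]["generated_text"])

# Encoder-decoder (sequence-to-sequence) model: generate output conditioned on an input.
summarize = pipeline("summarization", model="t5-small")
print(summarize("Seq2Seq models transform one data sequence into another ...")[0]["summary_text"])
```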
Types of attention mechanism
▪ Soft attention provides a differentiable mechanism that allows models to focus on different parts of the input with varying degrees of emphasis, smoothly and without making hard decisions.
▪ Hard attention, in contrast to soft attention, makes a discrete choice about which part of the input to attend to. It is like turning a spotlight on one area while ignoring the others completely.
▪ Self-attention layers allow elements in an entire sequence to attend to all other elements in the same sequence.
– For example, in a sentence, the meaning of a word can depend on the other words around it, not just the ones directly adjacent. Self-attention is a core component of the Transformer architecture, which has been revolutionary in fields like natural language processing.
▪ Multi-head attention is an extension of self-attention where the mechanism is applied several times in parallel.

How Does the Attention Mechanism Work?
▪ The attention mechanism typically involves three main components: queries, keys, and values, which are context vectors (lists of numbers). In a translation task, for example, these can be representations of words in the sentence.
– Query: this is related to the current word or part of the output sequence. For example, if the model is trying to translate the English word "apple" into French, the representation of "apple" would be the query element.
– Key: keys are representations of the input elements that the model should pay attention to. Each word in the input sentence has an associated key.
– Value: each key has a corresponding value, which is what the model should focus on if it decides that the associated key is important.

Attention Scores
▪ The model calculates an attention score by comparing the query with each key.
– This alignment score determines how much attention to pay to the corresponding value.
– The comparison is often done using a dot product, which is a way of measuring how similar two vectors are.
▪ The alignment scores are typically passed through a softmax function, which converts them into a set of probabilities (between 0 and 1). These probabilities sum up to 1 and determine the weight of each value in the output.
▪ The model produces a context vector by taking the weighted combination of the values, using the softmax probabilities as weights. This weighted sum is the output of the attention mechanism, providing a focused blend of the input elements based on the context provided by the query.
▪ This output is then used in the next steps of the model. In our translation example, the attention mechanism helps the model focus on the relevant parts of the input sentence when translating a specific word, improving the accuracy and context-awareness of the translation.
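The slide above describes, in words, the standard scaled dot-product attention of "Attention Is All You Need", Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that computation; the toy shapes are chosen purely for illustration.

```python
# Minimal NumPy sketch of scaled dot-product attention; shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare each query with each key (dot product)
    weights = softmax(scores, axis=-1)  # probabilities over keys, summing to 1 per query
    return weights @ V, weights         # context vectors: weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries (e.g., target words), dimension 4
K = rng.normal(size=(5, 4))   # 5 keys, one per input word
V = rng.normal(size=(5, 4))   # 5 values, aligned with the keys
context, weights = attention(Q, K, V)
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
print(context.shape)          # (2, 4): one context vector per query
```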
The World Wide Web
http://info.cern.ch/Proposal.html
https://www.flickr.com/photos/hansel5569/7991125444

The World Wide Web
▪ The World Wide Web (WWW), commonly known as the Web, is an information system where documents and other web resources are identified by Uniform Resource Locators (URLs, such as https://example.com/), which may be interlinked by hyperlinks, and are accessible over the Internet.
▪ The resources of the Web are transferred via the Hypertext Transfer Protocol (HTTP), may be accessed by users through a software application called a web browser, and are published by a software application called a web server.

Web vs Internet
▪ The World Wide Web is not synonymous with the Internet, which pre-dated the Web in some form by over two decades and upon whose technologies the Web is built.
▪ The Internet is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices.
▪ The Internet provides the communication infrastructure; the Web is one of the applications of this infrastructure, among others.
[Photos: Tim Berners-Lee and Vinton Cerf]

W3C
▪ The World Wide Web Consortium (W3C) is an international community where Member organizations, a full-time staff, and the public work together to develop Web standards.
▪ The standards we will use in this course have been created inside the W3C.

The Web as originally envisioned by Tim Berners-Lee (March 1989)

The Web of Documents
▪ In the web architecture, two important parts exist: the web client, known as the browser, and the web server, which serves documents and data to the client whenever they are requested.
▪ For these two to work together, there are three components in the web architecture:
– addresses (URIs) that allow us to identify and locate documents on the Web,
– a communication protocol (HTTP) that allows the client to connect to the server, send a request and get an answer, and
– a representation language (HTML) that allows us to describe the content of the pages, the documents, that are going to be transferred.

The Web of Documents
▪ The document Web is built on a small set of simple standards:
– Uniform Resource Identifiers (URIs) as a globally unique identification mechanism,
– Hypertext Transfer Protocol (HTTP) as a universal access mechanism,
– Hypertext Markup Language (HTML) as a widely used content format.
– In addition, the Web is built on the idea of setting hyperlinks between Web documents that may reside on different Web servers.

But what about linked information and data? (March 1989)

Historical Video
▪ A talk given at the First International Conference on the World-Wide Web (CERN, 25-27 May 1994).
https://videos.cern.ch/record/2671957

Historical e-mail
▪ Tim Berners-Lee's message to a mailing list about the WWW project:
https://web.archive.org/web/20180826220707/https://groups.google.com/forum/message/raw?msg=alt.hypertext/eCTkkOoWTAY/bJGhZyooXzkJ

The Semantic Web
▪ In early 2000 the vision of the Semantic Web was proposed to fulfil the 1989 idea of "linked information":
"The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila: The Semantic Web, Scientific American, 284(5), pp. 34-43 (2001)

The Semantic Web Stack
▪ The Semantic Web related technologies and concepts were too complex to enable wide adoption.

The Linked Data Principles
▪ Tim Berners-Lee introduced the Linked Data principles in 2006 in order to provide a basic recipe for publishing and connecting data using the infrastructure of the Web.
http://www.w3.org/DesignIssues/LinkedData.html

Linked Data Principles
▪ What do we need?
– Uniquely define things: real-world things (e.g. TimBL) and concepts and terms (e.g. "is a professor").
– A way to get the data.
– A way to model and structure the data.
– A way to link the data.
▪ The Linked Data principles:
– Use URIs as names of things.
– Use HTTP URIs so that people can look up those names.
– When someone looks up a URI, provide useful information, using the standards.
– Include links to other URIs.

The linked data Web
▪ Based on these principles, the Web of Linked Data emerged.
▪ Each bubble is a database published using the Linked Data principles.
▪ Links between bubbles denote links between data.
[Diagram of the Linked Data cloud; the highlighted bubble is the linked data version of Wikipedia.]

Linked Data Web (2020)
[Diagram of the Linked Data cloud in 2020; the highlighted bubble is the linked data version of Wikipedia.]

Knowledge Graphs
▪ Recently, the term "Knowledge Graph" has been more widely accepted to describe Linked Data.
▪ A knowledge graph represents a collection of interlinked descriptions of entities: objects, events or concepts.
▪ Knowledge graphs put data in context via linking and semantic metadata and in this way provide a framework for data integration, unification, analytics and sharing.
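As a small illustration of the Linked Data principles and of interlinked entity descriptions, the following sketch uses the rdflib library (an assumption of this example, not a tool prescribed in the notes) to describe a resource with an HTTP URI and link it to another URI; all URIs are made up for illustration.

```python
# Minimal sketch of the Linked Data principles using rdflib (assumed library);
# all URIs below are illustrative examples, not real published resources.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/id/")   # principles 1-2: HTTP URIs as names of things
g = Graph()
g.bind("foaf", FOAF)

timbl = EX["TimBL"]
g.add((timbl, RDF.type, FOAF.Person))      # principle 3: provide useful information (RDF)
g.add((timbl, FOAF.name, Literal("Tim Berners-Lee")))
g.add((timbl, FOAF.knows, EX["VintonCerf"]))  # principle 4: include links to other URIs

print(g.serialize(format="turtle"))        # what a client could get when looking up the URI
```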
Knowledge graphs

Knowledge Graph
▪ A knowledge graph is a directed labeled graph in which the labels have well-defined meanings.
▪ A directed labeled graph consists of nodes, edges, and labels.
▪ Anything can act as a node, for example people, companies, etc.
▪ An edge connects a pair of nodes and captures the relationship of interest between them, for example a friendship relationship between two people or a customer relationship between a company and a person.
▪ The labels capture the meaning of the relationship, for example the friendship relationship between two people.

Knowledge Graph
▪ The knowledge graph (KG) represents a collection of interlinked descriptions of entities, real-world objects and events, or abstract concepts (e.g., documents), where:
– descriptions have formal semantics that allow both people and computers to process them in an efficient and unambiguous manner;
– entity descriptions contribute to one another, forming a network, where each entity represents part of the description of the entities related to it, and provides context for their interpretation.

Examples
▪ Google Knowledge Graph: Google made this term popular with the announcement of its knowledge graph in 2012. However, there are very few technical details about its organization, coverage and size. There are also very limited means for using this knowledge graph outside Google's own projects.
▪ DBpedia: this project leverages the structure inherent in the infoboxes of Wikipedia to create an enormous dataset of 4.58 million things (https://wiki.dbpedia.org/about) and an ontology that has encyclopedic coverage of entities such as people, places, films, books, organizations, species, diseases, etc. This dataset is at the heart of the Linked Open Data movement.

Graph
▪ A graph represents the relations (edges) between a collection of entities (nodes).
▪ Vertex (or node) attributes: e.g., node identity, number of neighbors.
▪ Edge (or link) attributes and directions: e.g., edge identity, edge weight.
▪ Global (or master node) attributes: e.g., number of nodes, longest path.

Graph example: Social networks
▪ A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors.

Graph example: Images as graphs
▪ Each pixel represents a node and is connected via an edge to adjacent pixels.

Graph example: Natural Language Processing
▪ Graphs are being used as a target output representation for natural language processing.
▪ Entity extraction and relation extraction from text are two fundamental tasks in natural language processing.
▪ The information extracted from multiple portions of the text needs to be correlated, and knowledge graphs provide a natural medium to accomplish such a goal.

Graph example: Computer vision
▪ For example, from the image shown below, an image understanding system should produce the knowledge graph shown to the right.
▪ The nodes in the knowledge graph are the outputs of an object detector.
▪ Current research in computer vision focuses on developing techniques that can correctly infer the relationships between the objects, such as "man holding a bucket" and "horse feeding from the bucket".
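A directed labeled graph like the one in the computer-vision example can be sketched in a few lines with the networkx library (an assumed choice for illustration, not part of the lecture material); the nodes and relationship labels below mirror that example.

```python
# Minimal sketch of a directed labeled graph (knowledge graph) with networkx;
# nodes and relationship labels mirror the computer-vision example above.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("man", "bucket", label="holding")         # edge + label = a labeled relationship
kg.add_edge("horse", "bucket", label="feeding from")
kg.add_node("man", type="person")                     # nodes can carry attributes too

for subj, obj, data in kg.edges(data=True):
    print(subj, data["label"], obj)   # man holding bucket / horse feeding from bucket
```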
Graphs as input to Machine Learning
▪ Graph Neural Networks: from niche to one of the hottest fields of AI research.

Open Graph Benchmark
▪ The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs.
– https://ogb.stanford.edu
▪ It covers three fundamental graph machine learning task categories: predicting the properties of nodes, links, and graphs.

Node Property Prediction
▪ The task is to predict properties of single nodes.
▪ Amazon product co-purchasing network:
– Nodes represent products sold on Amazon, and an edge between two products indicates that the products are purchased together.
– Node features are generated by extracting bag-of-words features from the product descriptions.
– The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used as target labels.

Link Property Prediction
▪ The task is to predict properties of edges (pairs of nodes).
▪ The DrugBank database:
– Each node represents an FDA-approved or experimental drug.
– Edges represent interactions between drugs and can be interpreted as a phenomenon where the joint effect of taking the two drugs together is considerably different from the expected effect in which the drugs act independently of each other.
– The task is to predict drug-drug interactions given information on already known drug-drug interactions.
▪ Wikidata:
– Different types of relations between entities in the world, e.g., (Canada, citizen, Hinton).
– The task is to predict new triplet edges given the training edges.

Graph Property Prediction
▪ The task is to predict properties of entire graphs or subgraphs.
▪ MoleculeNet:
– Each graph represents a molecule, where nodes are atoms and edges are chemical bonds.
– Input node features are 9-dimensional, containing the atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring or not.
– The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not.
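For completeness, here is a minimal sketch of loading one of these benchmarks with the ogb package; the dataset name "ogbn-products" is, to the best of my knowledge, the OGB identifier for the Amazon product co-purchasing task described above, and the exact return format is an assumption of this example.

```python
# Minimal sketch of loading an OGB node-property-prediction dataset (assumed use
# of the `ogb` package); "ogbn-products" is the Amazon co-purchasing benchmark.
# Note: the first call downloads a large dataset file.
from ogb.nodeproppred import NodePropPredDataset

dataset = NodePropPredDataset(name="ogbn-products")
graph, labels = dataset[0]              # a single large graph plus one label per node
split = dataset.get_idx_split()         # predefined train/valid/test node indices

print(graph["num_nodes"], graph["node_feat"].shape)   # nodes and their product-description features
print(labels.shape)                     # one of the 47 top-level product categories per node
print(len(split["train"]), len(split["valid"]), len(split["test"]))
```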
