Web and Text Analytics 2024-25 Week 1 PDF
Document Details
University of Macedonia
Evangelos Kalampokis
Summary
This document covers the introduction to web analytics, text analytics, and the semantic web. It discusses the concepts, processes, and applications of each topic in detail. The document is part of a course on web and text analytics for 2024-25.
Full Transcript
Web and Text Analytics 2024-25
Week 1
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr
© Information Systems Lab

The course
▪ Web Analytics
– Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage
▪ Text Analytics
– Text Analytics is the process of drawing meaning out of written communication
– Aka text mining
– Natural Language Processing
▪ Semantic Web
– The term was coined by Tim Berners-Lee for a web of data (or data web) that can be processed by machines—that is, one in which much of the meaning is machine-readable
E. Kalampokis 2024-25

Web Analytics
▪ Analytics platforms measure activity and behavior on a website, for example:
– how many users visit,
– how long they stay,
– how many pages they visit,
– which pages they visit, and
– whether they arrive by following a link or not.

or…digital analytics?
▪ In March 2012 the Web Analytics Association changed its name to the Digital Analytics Association
▪ "Web Analytics" companies like WebTrends, Omniture (now Adobe), Google Analytics etc. transformed from Web Analytics tools to Digital Analytics tools.

Or…marketing analytics?
▪ Google Marketing Platform – https://marketingplatform.google.com/
▪ Google Analytics is a web analytics service offered by Google that tracks and reports website traffic, currently as a platform inside the Google Marketing Platform brand.

Google Analytics
▪ Google Analytics is a page tag solution that tracks visitors by using cookies
Text Analytics
▪ Text mining, also known as text analysis, is the process of transforming unstructured text into meaningful and actionable information
▪ Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.

Natural Language Processing
▪ Natural Language Understanding helps machines "read" text (or another input such as speech) by simulating the human ability to understand a natural language such as English, Spanish or Chinese.
▪ Natural Language Processing includes both Natural Language Understanding and Natural Language Generation, which simulates the human ability to create natural language text, e.g., to summarize information or take part in a dialogue

The World Wide Web
https://www.flickr.com/photos/hansel5569/7991125444

The Web as originally envisioned by Tim Berners-Lee
March 1989

But what about linked information and data?
March 1989

Linked Data Web aka the Semantic Web

The linked data Web (2019)

The course 2023-24
▪ Web Analytics → Marketing Analytics
– Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage
▪ Text Analytics
– Text Analytics is the process of drawing meaning out of written communication
– Aka text mining
– Natural Language Processing
▪ Semantic Web → Introduction to Data Management
– The term was coined by Tim Berners-Lee for a web of data (or data web) that can be processed by machines—that is, one in which much of the meaning is machine-readable
The course 2023-24
▪ In Web and Text Analytics 2023-24 we will focus on Text Mining and Natural Language Processing
▪ Text Analytics
– Text Analytics is the process of drawing meaning out of written communication
– Aka text mining
– Natural Language Processing

The launch of ChatGPT is a milestone in AI
▪ ChatGPT, which stands for Chat Generative Pre-trained Transformer, is a large language model-based chatbot developed by OpenAI and launched on November 30, 2022

Large Language Models (LLM)
▪ A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation.
▪ LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.
▪ LLMs are artificial neural networks (mainly Transformers) and are (pre-)trained using self-supervised learning and semi-supervised learning.
▪ Notable examples include OpenAI's GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT), Google's PaLM (used in Bard), and Meta's LLaMA, as well as BLOOM, Ernie 3.0 Titan, and Anthropic's Claude 2.

Large Language Models

Google's BERT
▪ BERT is Google's open-source machine learning framework for natural language processing
▪ A massive dataset of 3.3 billion words has contributed to BERT's continued success.
▪ BERT's training was made possible thanks to the novel Transformer architecture and sped up by using TPUs—64 TPUs trained BERT over the course of 4 days.

Training Cost
https://www.semianalysis.com/p/the-ai-brick-wall-a-practical-limit

Train and Deploy Cost
▪ Large language models have a keen appetite for electricity. The energy used to train OpenAI's GPT-4 model could have powered 50 American homes for a century.
▪ By one estimate, today's biggest models cost $100m to train; the next generation could cost $1bn, and the following one $10bn.
▪ On top of this, asking a model to answer a query comes at a computational cost—anything from $2,400 to $223,000 to summarise the financial reports of the world's 58,000 public companies. In time such "inference" costs, when added up, can exceed the cost of training.
https://www.economist.com/leaders/2024/09/19/the-breakthrough-ai-needs

Can we fine-tune a pre-trained LLM?
▪ Fine-tuning is the process of retraining a foundation model on new data.
▪ Any machine learning model might require fine-tuning or retraining on different occasions.
▪ This can happen if, for example, we are using an LLM for a medical application, but its training data did not contain any medical literature.
▪ Different fine-tuning techniques
– Repurposing vs full fine-tuning
– Unsupervised vs supervised fine-tuning (instruction fine-tuning)
– Reinforcement learning from human feedback (RLHF)
– Parameter-efficient fine-tuning (PEFT)

Prompt engineering
▪ Prompt engineering and fine-tuning are both important means of optimizing AI performance and output. However, there are several key differences between the two techniques:
– Prompt engineering focuses on eliciting better output for users, whereas fine-tuning focuses on enhancing model performance on certain tasks.
– Prompt engineering aims to improve output by creating more detailed and effective inputs, whereas fine-tuning involves training a model on new data to improve knowledge in specific areas.
– Prompt engineering offers more precise control over an AI system's actions and outputs, whereas fine-tuning can add detail and depth to relevant topic areas.
– Prompt engineering demands almost no computing resources, as prompts are created by people. In contrast, the additional training and data used for fine-tuning can demand significant computing resources.
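To put the inference-cost estimates quoted above in perspective, a quick back-of-the-envelope division gives the cost per company. The totals are from the Economist article cited above; the per-company figures are derived here by simple arithmetic, not taken from the source:

```python
# Back-of-the-envelope cost of summarising financial reports with an LLM,
# using the total-cost estimates quoted above.
TOTAL_COST_LOW = 2_400       # USD, cheapest estimate for all companies
TOTAL_COST_HIGH = 223_000    # USD, most expensive estimate
NUM_COMPANIES = 58_000       # public companies worldwide

per_company_low = TOTAL_COST_LOW / NUM_COMPANIES
per_company_high = TOTAL_COST_HIGH / NUM_COMPANIES

print(f"Per-company cost: ${per_company_low:.2f} to ${per_company_high:.2f}")
# → roughly $0.04 to $3.84 per company; small per query, large in aggregate
```

The point of the exercise: each individual query is cheap, but repeated over tens of thousands of documents the aggregate inference bill can rival training cost.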
Retrieval Augmented Generation
▪ In some cases, LLM fine-tuning is not possible or not useful
– Some models are only available through application programming interfaces (APIs), e.g., ChatGPT
– The data in the application might change frequently
– The application might be dynamic and context-sensitive
– For example, if you're creating a chatbot that customizes its output for each user, we cannot fine-tune the model on user data.
▪ In such cases, we can use in-context learning or retrieval augmentation

Applications using LLMs
▪ We need to come up with applications that exploit LLM models.

The Evolution of NLP

Twitter mood predicts the stock market
▪ Tweets from February to December 2008
▪ 10M tweets from 2.7M users
▪ OpinionFinder (positive or negative)
▪ GPOMS (Calm, Alert, Sure, Vital, Kind and Happy)
Bollen, J., Mao, H. and Zeng, X.J. (2011), "Twitter mood predicts the stock market", Journal of Computational Science, Vol. 2, No. 1, pp. 1-8.

Earthquake shakes Twitter users
▪ Detect earthquakes in Japan (2009)
▪ The users function as sensors.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes twitter users: real-time event detection by social sensors. Proceedings of the 19th International Conference on World Wide Web (WWW'10), (pp. 851-860), New York: ACM Press.

Spatio-temporal detection of an earthquake
▪ Location estimation of an earthquake on August 11, 2009 based on Tweets in Japan

The Evolution of NLP
▪ Rules-Based NLP (1950s to 1970s)
▪ Early Statistical NLP Models (1980s to 1990s)
▪ Lexicon-Based and Topic Modeling (1990s to early 2000s)
▪ Embedding-Based NLP (2000s to 2010s)
▪ Deep Learning-Based Approaches to NLP (2020s)
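The retrieval-augmentation idea above can be sketched in a few lines: retrieve the most relevant document for a question, then prepend it to the prompt sent to the LLM. The document store, the example question and the word-overlap scoring below are all illustrative stand-ins; real systems use embedding-based retrieval:

```python
import re

def words(text: str) -> set[str]:
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(question: str, document: str) -> int:
    """Crude relevance score: number of words shared with the question."""
    return len(words(question) & words(document))

def build_prompt(question: str, documents: list[str]) -> str:
    """Retrieve the best-matching document and prepend it as context."""
    best = max(documents, key=lambda d: score(question, d))
    return f"Context: {best}\n\nQuestion: {question}\nAnswer:"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping normally takes 3 to 5 business days.",
]
prompt = build_prompt("What is your refund policy for a purchase?", docs)
print(prompt)  # the refund-policy document is selected as context
```

Because the retrieved text travels inside the prompt, the model's knowledge can be updated by changing the document store, with no retraining — which is exactly why this suits API-only models and frequently changing data.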
Rules-Based NLP (1950s to 1970s)
▪ Early NLP research focused on rules-based approaches, including grammar-based parsers and dictionary-based information extraction systems.
▪ Such parsing and pattern matching was used by early chatbots like ELIZA (created in 1966 at MIT)
https://psych.fullerton.edu/mbirnbaum/psych101/eliza.htm

Lexicon-Based and Topic Modeling (1990s to early 2000s)
▪ An important step forward in NLP modeling was the combination of the dictionary and rules-based approach with statistical modeling
– Lexicon-based models: they rely on the use of predefined dictionaries of words and phrases to analyze and understand natural language text.
– Topic models: they use statistical techniques, e.g., Latent Dirichlet Allocation (LDA), to identify the main topics discussed in a document or collection of documents.

Example of lexicon-based NLP
▪ Explore the Expression of Emotions in 20th Century Books
▪ Google's Ngram database (http://books.google.com/ngrams/datasets)
– 4% of all books ever printed
– 1-grams, 2-grams, 3-grams, 4-grams and 5-grams
▪ They used the WordNet affect lexicon that includes "mood" words for
– Anger,
– Disgust,
– Fear,
– Joy,
– Sadness,
– Surprise

The Expression of Emotions in 20th Century Books
Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059030

Embedding-Based NLP (2000s to 2010s)
▪ Word embeddings are a type of NLP technique that represents words as continuous-valued vectors in a high-dimensional mathematical space.
▪ Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
▪ Two broad families of methods are used to learn word embeddings:
– Count-based methods: these use the frequency of words in a given corpus of text (including TF-IDF and bag-of-words models).
Dimensionality reduction can then be applied to count-based embeddings to obtain word embeddings.
– Predictive methods: these methods, such as GloVe and word2vec, use a predictive objective to learn word representations.

Deep Learning-Based Approaches to NLP (2020s)
▪ Most recently, with the joint exponential growth in data and computational infrastructure, large "deep learning" based language models have been developed and used widely for NLP applications.
▪ Initially, recurrent neural networks (RNNs) were developed for the analysis of sequential data, and their use extended to NLP tasks.
▪ Along with RNNs, convolutional neural networks (CNNs) were used for text classification.
▪ Over time, RNNs were modified for use in applications like neural machine translation, for which sequence-to-sequence models were developed
▪ The "ImageNet moment" for NLP—referring to the development of generic models that can be used for multiple tasks—happened with the development of transformer-based large language models.

Transformer Architecture
▪ The transformer architecture, proposed in the 2017 paper "Attention is all you need", reinvented the concepts of encoders and decoders in the context of natural language machine learning models.
▪ The transformer owes its performance to the attention mechanism that allows for the capturing of contextual information in a sentence. In this way all token relationships can be mapped efficiently.

Encoder - decoder
▪ The encoder is responsible for transforming natural language text into a high-dimensional vector representation that has captured the semantic meaning of each token in context.
▪ The decoder on the other hand performs the inverse operation of turning token vector representations into tokens.
– For this reason, decoders are used in generative applications.
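The attention mechanism behind the transformer can be sketched concretely. Below is an illustrative pure-Python version of scaled dot-product attention over toy 2-dimensional token vectors (the vectors and dimensions are invented for the example; real models use learned projections and hundreds of dimensions):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q.K^T / sqrt(d)) . V"""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # output is a weighted mixture of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy token vectors; each token attends over all tokens (self-attention),
# which is how contextual information about every other token is captured.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(tokens, tokens, tokens))
```

Each output row mixes all token vectors, weighted by query-key similarity, which is the sense in which "all token relationships can be mapped" in one step rather than sequentially as in an RNN.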
Transformers described in the Economist
https://www.economist.com/interactive/science-and-technology/2023/04/22/large-creative-ai-models-will-transform-how-we-live-and-work

Common NLP tasks
▪ Text/document classification
▪ Sentiment Analysis
▪ Information Retrieval
▪ Part-of-speech tagging
▪ Language Detection and Machine Translation
▪ Conversational agents
▪ Knowledge Graph and QA systems
▪ Text summarization
▪ Topic Modelling
▪ Text Generation
▪ Spell checking and Grammar correction
▪ Text parsing
▪ Speech to text

Text Classification
▪ Text classification is the process of assigning categories (tags) to unstructured text data. This essential task of Natural Language Processing (NLP) makes it easy to organize and structure complex text, turning it into meaningful data.
▪ For example, use of clinical notes to predict residual disease in women with advanced epithelial ovarian cancer following cytoreductive surgery
– RoBERTa model
– 555 cases of EOC cytoreduction performed by eight surgeons between January 2014 and December 2019
– AUROC 0.86; AUPRC 0.87; precision, recall and F1 score of 0.77; and accuracy of 0.81
– Outperformed models that used discrete clinical and engineered features
A. Laios, E. Kalampokis, M. Mamalis, C. Tarabanis, D. Nugent, A. Thangavelu, G. Theophilou, D. De Jong (2023) RoBERTa-Assisted Outcome Prediction in Ovarian Cancer Cytoreductive Surgery using Operative Notes, Cancer Control. [Accepted for publication]

Sentiment Analysis
▪ Sentiment analysis consists of analyzing the emotions that underlie any given text.
– Sentiment analysis helps you understand the opinion and feelings in a text, and classify them as positive, negative or neutral.
– Sentiment analysis has a lot of useful applications in business, from analyzing social media posts to going through reviews or support tickets.
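A minimal sentiment classifier in the lexicon-based style discussed earlier makes the positive/negative/neutral idea concrete. The word lists here are tiny and invented for illustration; real systems use large lexicons such as OpinionFinder or WordNet-Affect, or trained models:

```python
import re

# Tiny illustrative polarity lexicons (not a real resource).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text: str) -> str:
    """Classify text as positive, negative or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is excellent"))  # positive
print(sentiment("Terrible support, I hate waiting"))      # negative
```

This is exactly the kind of scoring a brand-monitoring tool applies at scale to social media posts, reviews and support tickets.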
Sentiment in Tweets
http://www.flickr.com/photos/stuckincustoms/4286568923

Sentiment in Tweets
http://www.flickr.com/photos/13709576@N03/2552190638

Brand reputation management tools
▪ Monitoring public sentiment about a brand
– https://awario.com

Text Extraction
▪ Text extraction is a text analysis technique that extracts specific pieces of data from a text, like keywords, entity names, addresses, emails, etc. By using text extraction, companies can avoid all the hassle of sorting through their data manually to pull out key information.
– Keyword Extraction: keywords are the most relevant terms within a text and can be used to summarize its content.
– Named Entity Recognition: allows you to identify and extract the names of companies, organizations or persons from a text.
– Feature Extraction: helps identify specific characteristics of a product or service in a set of data.

Named Entity Recognition
(Reuters) - Research In Motion Ltd said on Tuesday its subscriber base has risen to 80 million from the 78 million it reported earlier this year, surprising many on Wall Street and sending its shares up more than 3 percent. Most analysts had expected RIM, for the first time in its history, to begin losing subscribers in the recently completed quarter as it has rapidly lost market share in North America to Apple's snazzier iPhone and Samsung's Galaxy devices.

The 2010 UK Elections
E. Kalampokis, A. Karamanou, E. Tambouris, and K. Tarabanis (2017) On Predicting Election Results using Twitter and Linked Open Data: The Case of the UK 2010 Election, Journal of Universal Computer Science, Vol. 23, No. 3, pp. 280-303

Text summarization
▪ Text summarization uses NLP techniques to digest huge volumes of digital text and create summaries and synopses for indexes, research databases, or busy readers who don't have time to read full text.
▪ The best text summarization applications use semantic reasoning and natural language generation (NLG) to add useful context and conclusions to summaries.

Machine translation
▪ Machine translation: Google Translate is an example of widely available NLP technology at work.
▪ Truly useful machine translation involves more than replacing words in one language with words of another.
▪ Effective translation has to capture accurately the meaning and tone of the input language and translate it to text with the same meaning and desired impact in the output language.
▪ Machine translation tools are making good progress in terms of accuracy.
▪ A great way to test any machine translation tool is to translate text to one language and then back to the original.

Rephrasing
▪ In this cross-sectional study, a public and nonidentifiable database of questions from a public social media forum (Reddit's r/AskDocs) was used to randomly draw 195 exchanges from October 2022 where a verified physician responded to a public question.
▪ Chatbot responses were generated by entering the original question into a fresh session (without prior questions having been asked in the session) on December 22 and 23, 2022.
Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum, JAMA Intern Med, 2023

How to apply NLP in practice
▪ Noise Removal
– Remove whitespaces, HTML tags
– Convert accented characters
– Remove special characters, stop words
▪ Normalization
– Lowercasing characters
– Numeric words to numbers
– Stemming / Lemmatization
▪ Tokenization
▪ Vocabulary
▪ Vectorization - Word Embeddings
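The preprocessing steps listed above can be sketched end to end. This is a minimal pure-Python pipeline covering noise removal, normalization, tokenization, vocabulary building and bag-of-words vectorization; the cleaning rules, stop-word list and example sentences are illustrative only (real pipelines typically use libraries such as NLTK or spaCy, and stemming/lemmatization is omitted here):

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to"}

def preprocess(text: str) -> list[str]:
    """Noise removal, normalization and tokenization."""
    text = re.sub(r"<[^>]+>", " ", text)     # strip HTML tags
    text = text.lower()                      # lowercase characters
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenize, drop special chars
    return [t for t in tokens if t not in STOP_WORDS]

def vectorize(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Bag-of-words vector: token counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["<p>The web is full of TEXT data.</p>",
        "Text analytics turns text into data."]
tokenized = [preprocess(d) for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
vectors = [vectorize(doc, vocabulary) for doc in tokenized]
print(vocabulary)
print(vectors)
```

The resulting count vectors are exactly the count-based representation mentioned in the embeddings section; TF-IDF weighting or dimensionality reduction would be applied on top of them.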
DataCamp
▪ Introduction to Natural Language Processing in Python
https://app.datacamp.com/learn/courses/introduction-to-natural-language-processing-in-python
– Regular expressions & word tokenization
– Simple topic identification (bag of words, tf-idf)
– Named-entity recognition
▪ Sentiment Analysis in Python
https://app.datacamp.com/learn/courses/sentiment-analysis-in-python
▪ Recurrent Neural Networks (RNN) for Language Modeling in Python
https://app.datacamp.com/learn/courses/recurrent-neural-networks-rnn-for-language-modeling-in-python
– Sequence to sequence models, the embedding layer, word2vec
▪ Large Language Models (LLMs) Concepts
https://app.datacamp.com/learn/courses/large-language-models-llms-concepts
▪ Large Language Models for Business
https://app.datacamp.com/learn/courses/large-language-models-for-business