Document Details


Uploaded by HardWorkingErudition4314

University of Applied Sciences and Arts Northwestern Switzerland

Dr. Stephan Jüngling

Tags

information extraction, natural language processing, big data, AI

Summary

This document is part of a course on Big Data, specifically focusing on information extraction techniques within the field of natural language processing (NLP) and the use of Large Language Models (LLMs) such as BLOOM and GPT-4. It details various topics such as information retrieval, semantic retrieval, and text mining, along with a discussion of the role of NLP-based tools in different applications.

Full Transcript


Big Data – Semi- and Unstructured Data – Part II
VARIETY: Working with Semi- and Unstructured Text
Lecturer: Dr. Stephan Jüngling, Peter Merian-Strasse 86, 4002 Basel, [email protected]
CAS Data & AI for Business - HS24
Big Data, Structured and Unstructured Data - BDSUD HS24

Content Overview – Part II – Variety
L06 – Information Retrieval – How do Machines «Read»?
– Introduction to Natural Language Processing (NLP)
– A short glimpse at Large Language Models (LLMs) (BLOOM, GPT-4)
– Text preprocessing for information retrieval
– Understand principles of information extraction
L07 – Semantic Retrieval (preliminary version!)
– Information retrieval and ranking
– Information extraction (metadata, entities, relationships, topics, etc.)
– Named Entity Recognition (NER)
– Understand how to use ChatPDF and ChatGPT for generating specific and global insights
– Understand in-context learning
L08 – Text Mining
– Text analysis
– Analyse restaurant reviews with ChatPDF and ChatGPT

Content Overview – Part II
L09 – Text Mining Applications
– Sentiment analysis
– Different tools and frameworks (e.g. Orange, LangChain, etc.)
– Further NLP tasks (e.g. topic modelling)
L10 – Generative AI
– Retrieval-Augmented Generation (RAG)
Group Work Presentations – A "conference track"
– Getting insights into different challenges and solutions for the analysis of large text corpora

Additional Resources
– Getting Started with Natural Language Processing
– web.stanford.edu/~jurafsky/slp3/ed3bookaug20_2024.pdf

Getting Started with Natural Language Processing - Ekaterina Kochmar
Getting Started with Natural Language Processing is an enjoyable and understandable guide that helps you engineer your first NLP algorithms. Your tutor is Dr. Ekaterina Kochmar, lecturer at the University of Bath, who has helped thousands of students take their first steps with NLP.
Full of Python code and hands-on projects, each chapter provides a concrete example with practical techniques that you can put into practice right away. If you're a beginner to NLP and want to upgrade your applications with functions and features like information extraction, user profiling, and automatic topic labeling, this is the book for you.

Why NLP?
– Data analysis: NLP allows for the extraction of meaningful insights from large volumes of unstructured text data, which is valuable for businesses and researchers.
– Language understanding: learning NLP helps in understanding the complexities of human language and how it can be processed by machines.
– Automation: NLP can automate repetitive and time-consuming tasks such as document summarization, e-mail filtering, and content generation, increasing efficiency and productivity.
– Enhancing communication: NLP technologies improve human-computer interaction, making it easier to develop applications like chatbots, virtual assistants, and automated customer services.
– Innovation: NLP is at the forefront of AI research with the application of Large Language Models, driving innovations in machine translation, sentiment analysis, and the application of generative AI in many diverse application domains.
NLP and LLMs are changing the way we work and interact with machines!
– What are your observations? What exactly is the transition we are currently in?

Group Work Task
– Is in Moodle in the Materials section
– Stay in the same groups as for GW part I
– You all need to be part of the presentation, which will be held in the form of a small NLP conference
Be creative, try out new things, have fun and share your insights!
Big Data – Semi- and Unstructured Data – L06 Variety – Information Extraction

Content and Learning Objectives
Content
– Introducing Natural Language Processing
– Preprocessing text for IR and the beginning of the NLP pipeline
– A short glimpse at Large Language Models (LLMs) (BLOOM, GPT-4)
– Explaining ways machines represent words and understand their meaning
– Text preprocessing (stop word removal, stemming, tokenization, text prediction and n-grams)
Learning Objectives
– Getting a first glimpse of NLP
– Understand principles of information retrieval
– Be able to use Colab and Gemini for generating Jupyter Notebooks
– Find and discuss your case for the Group Work Part II
Exercises and Homework
– Information Extraction quiz
– Hands-on in Google Colab – text preprocessing
– Chapter 1 of Getting Started with NLP

Introduction to Natural Language Processing (NLP)
What is NLP?
– Field addressing how computers deal with human language
– Gained the spotlight due to intelligent machines understanding and producing natural language
Why Learn NLP?
– Useful for programmers, machine learning practitioners, and anyone interested in language processing
– Essential for tasks involving textual information (e.g. documents, websites, emails)
NLP's Importance:
– Integral to human intelligence and recent AI advancements
– Enhances the ability to extract and learn from large text data

Overview of NLP Development
Approach Evolution:
– Rule-based: rigid, expert-driven
– Statistical: data-driven, flexible
– Deep Learning: efficient with large data
– Current state: combination of all approaches
Key Historical Points:
– 1950s: Georgetown-IBM experiment aimed at machine translation
– Early approaches: rule-based and template-based systems (example: ELIZA chatbot)
– 1980s: introduction of statistical approaches and machine learning
– 1990s: World Wide Web enabled access to large data sets
– 2010s: advances in hardware led to deep learning

NLP adopts techniques from many different fields
https://cs-114.org/wp-content/uploads/2019/01/1-Intro-to-CL.pdf
– Information Retrieval, Computer Science, AI, ML, Statistics, Logic, Electrical Engineering, Computational Linguistics, Cognitive Science

Connecting to the first part
What did you learn?
– Volume, Velocity, Variety
– Hadoop
– …
Why is it important?
– Current trends of using cloud-based NLP
– Privacy issues are among the most important bottlenecks
– What are current challenges?
Source: P. Strengholt – Data Management at Scale

Important NLP «pipelines» (true/false questions)
Upstream tasks (training the models):
– Tokenization: breaking text into individual words or tokens
– Part-of-speech (POS) tagging: assigning a part of speech to each word/token in a sentence (e.g. verbs, nouns, adjectives, adverbs)
– Named entity recognition (NER): identifying named entities such as people, organizations, and locations in text
– Dependency parsing: identifying the grammatical relationships between words in a sentence
Downstream tasks (applying the models):
– Sentiment analysis: determining the sentiment (positive, negative, neutral) of a piece of text
– Text classification: categorizing a piece of text into one or more predefined categories
– Machine translation: translating text from one language to another
– Question answering: answering a question posed in natural language based on a given context
– Text generation: generating text based on a given prompt or input
– NER as an application of pre-trained LLMs

Searching for Information – Information Retrieval (also maybe important)
Submit Query:
– Query: "management meeting"
– Enter your question or information request.
Receive List of Results:
– Computer or search engine provides answers or related results.
Relevance Ordering:
– Search engines list websites/documents by relevance.
– Most relevant results appear first.
How do you determine the relevance?

Asking Questions: ChatGPT or Google?
Do the experiment:
– "What temperature does water boil at?"
– "What is the number of inhabitants of the capital of France?"
Do machines get real language understanding and intelligence?
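The upstream tasks listed above (tokenization, POS tagging) can be illustrated with a minimal, dependency-free sketch. The tag lexicon below is made up for illustration; real pipelines use trained taggers (e.g. from NLTK or spaCy):

```python
# Toy upstream pipeline: tokenize a sentence, then look up a
# part-of-speech tag per token from a tiny hand-written lexicon.
TAGS = {"what": "WP", "temperature": "NN", "does": "VB",
        "water": "NN", "boil": "VB", "at": "IN"}

def tokenize(sentence):
    # lowercase, drop trailing punctuation, split on whitespace
    return sentence.lower().rstrip("?!.").split()

def pos_tag(tokens):
    # unknown words get the placeholder tag "UNK"
    return [(t, TAGS.get(t, "UNK")) for t in tokens]

print(pos_tag(tokenize("What temperature does water boil at?")))
```

This also shows why upstream tasks matter for retrieval: once tokens carry tags, content words (NN, VB) can be separated from function words such as "at" (IN).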
Information Retrieval & Information Extraction Processes (could be important too)
NLP Task: Answering questions
– Understand key words: content words {temperature, water, boil}
– Function words or stop words {what, do, at}
– Filtering by stop word removal
– Part-of-speech (POS) tagging: identify that words like water do the action, while words like boil denote the action itself
– Stemmers: reduce words to their base form by removing suffixes
– Lemmatizers: reduce words to their root form (lemmas)

Word Counts in Documents (important)

Finding Similar Documents (important)
– Searching documents based on term frequencies

Intelligent Personal Assistants
Alexa, Google Assistant, or Cortana
– Different scenarios and use cases than search:
– Involve actions: «calling someone»
– Language generation
– Processing pipeline of an intelligent virtual assistant
– Text prediction (e.g. completion assistant)
– Respond to e-mails? (based on your own experience)

Text Prediction and Probabilities in Text (important)
– An n-gram is a continuous sequence of n symbols (e.g. characters, words)
– Context is crucial for accurate prediction. Without context, predicting a word or character is nearly impossible.

NLP Applications (important)
– Virtual assistants (e.g. Siri, Alexa, Google Assistant)
– Spam filtering
– Spell- and grammar checking
– Sentiment analysis: determining the sentiment or emotion behind a text (e.g. customer reviews, social media posts, feedback forms, customer service to gauge public opinion).
– Chatbots
– Machine translation
– Text summarization: creating concise summaries of longer texts, such as articles, research papers, or reports, helping users quickly grasp the main points
– Named Entity Recognition (NER): identifies and classifies key information (entities) in text, such as names of people, organizations, locations, dates, and more. It's useful in information extraction and organizing large datasets

How do Humans Read Text? - Can you read it?
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

Text Processing - Tokenization
Tokenization is a crucial step in natural language processing (NLP)
– as it helps to break down text into manageable pieces for analysis and processing
– Tokens can be words, characters, or subwords depending on the tokenization method used
Word tokenization: splits the text into individual words
– Input: "Tokenization is fun!"
– Output: ["Tokenization", "is", "fun", "!"]
Character tokenization: splits the text into individual characters
– Input: "Tokenization is fun!"
– Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n", " ", "i", "s", " ", "f", "u", "n", "!"]
Subword tokenization: splits the text into meaningful subwords, often used in models like BERT or GPT
https://www.hackingchinese.com/36-samples-of-chines-handwriting-from-students-and-native-speakers/
– Input: "Tokenization is fun!"
– Output: ["Token", "ization", " is", " fun", "!"]

Machine-based Information Extraction - Tokenization
Initial tasks & preprocessing
– NLTK tokenizers (what's it used for?)
– Token normalization:
– Stemming: remove or replace suffixes
https://www.youtube.com/watch?v=nxhCyeRR75Q

Word Tokenizers – Combining Keywords and Embeddings
Retrieving relevant information from the document based on the user's query using a combination of retrieval methods:

    from nltk.tokenize import word_tokenize

    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    keyword_retriever = BM25Retriever.from_documents(pages, preprocess_func=word_tokenize)
    keyword_retriever.k = 5

    ensemble_retriever = EnsembleRetriever(
        retrievers=[vectorstore_retriever, keyword_retriever], weights=[0.7, 0.3])

– Vector store retriever (vectorstore_retriever): leverages embeddings to find semantically similar chunks of text to the query within the vectorstore, which was created from the document.
– Keyword retriever (keyword_retriever): uses the BM25 algorithm to identify relevant chunks based on keyword matching between the query and document.
– Ensemble retriever (ensemble_retriever): combines the results of both retrievers, giving more weight to the vectorstore_retriever (0.7) than the keyword_retriever (0.3).
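The 0.7/0.3 weighting can be illustrated with a small, dependency-free sketch. The document ids and rankings below are made up, and weighted reciprocal-rank fusion is only one common way to combine ranked lists (the exact formula used internally by an ensemble retriever may differ):

```python
# Weighted rank fusion: each retriever returns a ranked list of doc ids;
# a document's fused score is the weighted sum of reciprocal-rank
# contributions across all retrievers.
def fuse(rankings, weights, k=60):
    """rankings: list of ranked doc-id lists; weights: one weight per list."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank + 1)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a vector-store retriever and a BM25 retriever:
semantic = ["d2", "d1", "d3"]
keyword = ["d3", "d2", "d4"]

print(fuse([semantic, keyword], weights=[0.7, 0.3]))
```

A document ranked highly by both retrievers (here "d2") beats one favoured by only a single retriever, which is the intuition behind combining keyword and embedding search.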
Stemming (important)
– Stemming is a text normalization technique used in NLP to reduce words to their base or root form, called a stem
– Uses a set of rules or heuristics to identify and remove common suffixes and prefixes
– Does not necessarily produce valid dictionary words, but the resulting stems are usually similar enough to be grouped together for analysis
– Can sometimes over-stem or under-stem words, leading to loss of information
Purpose:
– Reduce the variations of words in a text
– IR: find similar documents containing different forms of the same word
– Text mining: topic modeling, sentiment analysis, and document classification
Different algorithms exist. Examples:
– "running", "runs", "runner" would all be stemmed to "run"
– "playing", "played", "plays" would all be stemmed to "play"

QUIZ: text preprocessing for IR (do the quizzes!)
– Additional hands-on in Colab!

Hands-On
– Getting started with Colab
– Create your environment in Google Colab

GW & Discussion and Agreement of Group Work Coaching
– Coaching sessions

Homework
Finalize your topic for the second Group Work
– Determine use cases and stakeholders
– Determine scope and text corpus
– Determine information need
– Be ready to present it in two weeks!
Further References and Reading
– Chapter 1 of Getting Started with NLP

QUIZ – text preprocessing for IR
– Do the quiz (15') and use ChatPDF for answering and recapitulation of metadata extraction
– Terms to understand (K2/K3): NLP, NER, POS, B-I-O notation, stemmer, overstemming, tokenization
– Getting hints for the theory in the script: e.g.
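The suffix-stripping idea from the stemming slide can be sketched in a few lines. This toy stemmer only covers the slide's examples; real stemmers such as NLTK's PorterStemmer use a much larger rule set:

```python
# Toy rule-based stemmer: strip one common suffix, then undo the
# consonant doubling left over from -ing/-ed forms ("runn" -> "run").
def stem(word):
    word = word.lower()
    for suffix in ("ing", "ed", "er", "s"):
        # only strip if a stem of at least 3 characters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print([stem(w) for w in ["running", "runs", "runner", "playing", "played", "plays"]])
```

Note how crude the rules are: this is exactly where over-stemming and under-stemming come from, as mentioned on the slide.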
Text Generation – Hello Transformers
GPT = Generative Pretrained Transformer
Source: 1. Hello Transformers | Natural Language Processing with Transformers, Revised Edition (oreilly.com)

Large Language Models (LLM)
https://research.aimultiple.com/future-of-nlp/

Transformer Architectures
– Recurrent neural networks (RNNs): use the internal state to process a sequence of inputs ("build an internal memory")
– Architecture of transformers: in models like BERT, the encoder is used to create rich contextual embeddings, while in models like GPT, the decoder generates text based on these embeddings
– 2017 Google paper: A. Vaswani et al., "Attention Is All You Need" (2017)
– Allows the decoder to have access to all of the encoder's hidden states
– Attention: prioritize which states to use by introducing a weight for each of the hidden states of the encoder
Source: 1. Hello Transformers | Natural Language Processing with Transformers, Revised Edition (oreilly.com)

Prompt Based Learning
UMass CS685 S22 (Advanced NLP) #12: Prompt-based learning - YouTube

In-Context Learning (ICL)
– LLMs are capable of identifying and learning named entities with only a few examples
– Prompt: a list of input-output pairs that demonstrate the task
– The model "learns temporarily" from a prompt
– Does not extend the model from one conversation to the other
– Tagset: a set of symbols or labels that are used to annotate parts of speech (POS)
– Example: Treebank tagset, which uses NN for noun, VB for verb, JJ for adjective, etc.
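An in-context-learning prompt is just the demonstrations followed by the new input. The sketch below assembles such a prompt; the sentences, the entity-tag format, and the function name are made up for illustration, and the actual model call (e.g. to a Hugging Face endpoint) is deliberately omitted:

```python
# Build a few-shot NER prompt: task instruction, then input/output
# demonstration pairs, then the new input the model should complete.
def build_icl_prompt(demonstrations, query):
    lines = ["Tag the named entities in each sentence."]
    for text, tagged in demonstrations:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {tagged}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

demos = [
    ("Anna works at FHNW in Basel.",
     "[Anna|PER] works at [FHNW|ORG] in [Basel|LOC]."),
    ("Google opened an office in Zurich.",
     "[Google|ORG] opened an office in [Zurich|LOC]."),
]
print(build_icl_prompt(demos, "Stephan teaches in Basel."))
```

The "learning" lives entirely in this string: nothing about the model's weights changes, which is why the effect does not carry over to the next conversation.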
In-Context Learning – Exercise 06-IE-ICL
– Create an account at Deepnote – https://deepnote.com/
– Create an account at Hugging Face – copy your access token to be reused for the API calls to Hugging Face services – https://huggingface.co/settings/tokens
– Deepnote: create a new project
– Upload the Jupyter Notebook from Moodle to your Deepnote

In-Context Learning – Exercise 06-IE-ICL - Solution Hints
– Indicate some samples to learn from
– Add test cases
– Check output

Further In-Context Learning Experiments: ChatGPT

Different Metrics to Evaluate Classification Results
– Accuracy: single metric – the fraction of examples classified correctly
– Precision & recall
– F-score: the F1 score is the harmonic mean of precision and recall
– Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification
– https://en.wikipedia.org/wiki/F-score

What is also possible …

Eat Now Data Sources (in Materials Part II)
Getting a first impression! Specific vs. Global Insights

Data Processing & Data Representations
– csv → xls
– In Excel: File / Convert – Text to Columns – separator: ,
– Handle outliers – by hand
– Save as PDF
– Get a first impression with ChatPDF and ChatGPT!

Note Sheet – Follow-Up
Ideas for further investigation:
– Statistical comparisons between different suppliers (CI)
– Environmental impact of the delivery (CI)
– Quality of the food?
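The metrics above can be computed directly from true and predicted labels. The labels below are toy data for illustration:

```python
# Precision, recall and F1 for a binary classifier, from label lists.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```

Unlike accuracy, this pair of metrics stays informative when the classes are imbalanced, which is the usual case in tasks like NER or spam filtering.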
Summary and Review Questions
– What are regular expressions in text strings? Which ones exist?
– What are prompts?
– What is ICL?
– What are LLMs and what are the principles behind them?
– What are transformers? Can you give an example?

Exercise Solutions – Possible Solution

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')

    def tokenize(text):
        return word_tokenize(text)

Concordance
– Concordance finds the queried word in a text and displays the context in which this word is used.

Information Retrieval: How to rank? – An example
[Figure: three sample documents from the areas "Information Systems", "Biology" and "Databases", with lengths of 100, 2,000 and 10,000 words, each containing the query terms "system" and "database" with different frequencies. Which is the best document? Why? Collect criteria (that a machine can evaluate)!]
Source: ECM-IR: Frieder Witschel

Ranking Principles: Document Weight Vectors & Dot Product
– TF: more occurrences of a query term → higher ranking
– TF convex: the increase of the tf weight should be smaller the greater tf is (can be realised e.g. by using tf/(tf+1))
– IDF: favour documents with many occurrences of rare query terms
– Length: longer document with the same number of query terms → lower ranking
– Dot product: query vector * document vector = dot product; measures how well each document matches the query terms, with higher scores indicating a better match
Source: ECM-IR: Frieder Witschel

Information from the Script
Dot Product
Quiz

Additional Slides
Further References
– Speech and Language Processing – Dan Jurafsky - Home Page

Tools and Libraries for NLP
Poppler-utils is a collection of command-line utilities for working with PDF documents. It's based on the Poppler library, which is a free and open-source PDF rendering library.
Functionality: Poppler-utils provides tools to perform various operations on PDF files, including:
– pdfinfo: extracts metadata and information about a PDF file
– pdftotext: converts a PDF file to plain text
– pdfimages: extracts images from a PDF file
– pdftoppm: converts a PDF file to a series of image files (PPM format)
– pdffonts: lists the fonts used in a PDF file

Tools and Libraries for NLP
Sentence-Transformers is a Python library for generating sentence and text embeddings.
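The ranking principles above (convex TF weighting tf/(tf+1), IDF, length penalty, dot product) can be combined in a small sketch. The corpus, query, and the particular length normalisation below are toy choices for illustration, not the exact scheme from the slides:

```python
import math

# Toy corpus echoing the ranking example: which document best matches
# the query "database system"?
docs = {
    "d1": "system database system query",
    "d2": "biology cell cell membrane",
    "d3": "database system system system system",
}
query = "database system"

def tf_weight(tf):
    return tf / (tf + 1)  # convex: extra occurrences help less and less

def idf(term):
    n = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / n) if n else 0.0

def score(doc_text, query):
    words = doc_text.split()
    s = sum(tf_weight(words.count(t)) * idf(t) for t in query.split())
    return s / math.sqrt(len(words))  # penalise longer documents

ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranked)
```

Note how the convex TF weight keeps d3's four occurrences of "system" from dominating: without it, simply repeating a query term many times would buy a proportionally higher rank.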
It's built on top of transformer models, a type of deep learning model known for its ability to understand and represent text effectively.
Functionality:
– Generates sentence embeddings: takes sentences or paragraphs as input and produces dense vector representations (embeddings) that capture their semantic meaning
– Similarity comparison: these embeddings can be used to compare the similarity between sentences or paragraphs; similar sentences will have embeddings that are closer together in vector space
– Semantic search: can be used to build semantic search engines, where you can search for information based on meaning rather than just keywords
– Clustering: the generated embeddings can be used to cluster similar sentences or documents together
– Paraphrase detection: can identify sentences that are paraphrases of each other

History of LLMs
Source: (1) (PDF) A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges (researchgate.net)

Orange

Solr Installation
– Cmd tool
– cd solr\bin
– solr start
– java -Durl=http://localhost:8983/solr/solrinaction/update -Dauto -jar example

Exercise: Word-Tokenization and Indexing
Consider the following two short news stories:
– D1: "Founder of the new tech company FUIT announces headquarter-to-be: New York"
– D2: "FUIT have changed their headquarter from New York to Zurich"
For tokenising documents, an information retrieval system relies on the following set of rules:
– Non-word characters should split the text
– Non-word characters are all characters that are not in this list of word characters: ['a', …, 'z', 'A', …, 'Z', '-']
– Any string resulting from the split operation and consisting of at least 2 word characters is an index term
MC Questions:
– Which of the two documents above will be returned when the user enters the query "headquarter"?
1. Both, D1 and D2
2. None
3. D1
4. D2
– Which of the two documents from the last question should have been retrieved in order to satisfy the user's (most likely) information need?
Task: Write a small program that prints the tokens of D1 and D2 using Colab.
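A possible sketch of the exercise's tokenizer, directly encoding the stated rules (split on everything that is not a letter or a hyphen; keep only fragments of at least 2 word characters):

```python
import re

def index_terms(text):
    # split on runs of non-word characters (word chars: letters and '-')
    fragments = re.split(r"[^a-zA-Z\-]+", text)
    # keep only fragments with at least 2 word characters as index terms
    return [f for f in fragments if len(f) >= 2]

d1 = "Founder of the new tech company FUIT announces headquarter-to-be: New York"
d2 = "FUIT have changed their headquarter from New York to Zurich"

print(index_terms(d1))
print(index_terms(d2))
```

Because the hyphen counts as a word character, "headquarter-to-be" stays a single token in D1, so an exact-match query for "headquarter" finds that term only in D2's index.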
