L07 Semantic Retrieval Past Paper PDF
Document Details
University of Applied Sciences and Arts Northwestern Switzerland
2024
Dr. Stephan Jüngling
Summary
This document is a past paper from the University of Applied Sciences and Arts Northwestern Switzerland, HS2024, for the course 'Big, Semi and Unstructured Data, L07'. It contains learning objectives, exercises, quizzes, typical NLP examples, and ranking principles.
Full Transcript
Big, Semi and Unstructured Data L07 – Semantic Retrieval, Classification and Ranking
Dr. Stephan Jüngling - HS 2024

Content and Learning Objectives
Learning Objectives
– Understand how to determine features of documents for ranking
– Understand the principles behind searching and ranking
– Understand the metrics for evaluating search results (separating bad from good results)
– Understand how to separate relevant from irrelevant documents
Content
– Term & document frequency (simple TF, convex TF, IDF, BM25)
– Metrics in IR: precision, recall, and the confusion matrix
– Document weight vectors and vector space models
Reference chapters in the book (Chapters 2 & 3):
– https://github.com/ekochmar/Getting-Started-with-NLP/blob/master/Chapter3.ipynb
– Data: cisi.zip
Exercises
– Quiz: VSM
– Quiz: Ranking criteria
– Hands-on example: Spam filter
Big Data, Semi and Unstructured Data - BDSUD HS24

Typical NLP Examples
– Multiclass classification
– Searching documents
– Binary classification

From «Boolean Search» to more complex Semantics (IMPORTANT)
Simple term frequency:
– Too many results …
– What additional ranking criteria could be useful?
– What about using additional meta-data?
Priors:
– Priors can be used to incorporate domain knowledge into the retrieval process.
– In ranking algorithms, priors can help adjust relevance scores based on factors like document age, author expertise, or publication source.
– Examples: favour more recent documents, or documents from authoritative sources.
Fielded ranking:
– Additional criteria from meta-data fields (e.g. from the structure of the entire documentation, the storage location, information from different websites, the number of referring sites)
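As a sketch of how a prior could be combined with a term-frequency score, the snippet below multiplies a simple TF count by an exponential recency prior. The half-life constant, the toy documents, and the combination by multiplication are all illustrative assumptions, not a method prescribed by the slides.

```python
# Sketch: combining a term-frequency score with a recency prior.
# The decay constant and the toy documents are illustrative assumptions.
import math

def recency_prior(age_days, half_life=365):
    """Exponential decay: newer documents get a prior closer to 1."""
    return math.exp(-math.log(2) * age_days / half_life)

def ranked(docs, query_term):
    scored = []
    for name, text, age in docs:
        tf_score = text.split().count(query_term)   # simple term frequency
        scored.append((name, tf_score * recency_prior(age)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

docs = [
    ("d1", "bike bike bike repair", 730),   # older document, many hits
    ("d2", "bike repair shop", 30),         # recent document, fewer hits
]
print(ranked(docs, "bike"))
```

With these numbers the recent document d2 outranks the older d1 despite fewer term occurrences, which is exactly the effect a recency prior is meant to produce.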
Example: Text Classification (e.g. Spam Filtering)
Text classification steps
– Data collection: gather text data relevant to the task.
– Preprocessing: clean and prepare the text (e.g. tokenization, removing stop words).
– Feature extraction: convert text into numerical features (e.g. TF-IDF, word embeddings). Use data as well as metadata!
– Model training: train a classifier using labeled data.
– Evaluation: assess the model's performance using metrics like accuracy, precision, recall, and F1-score.

Feature Extraction Process
Spam detection:
– You estimate P(spam | email content) and P(ham | email content), or generally P(outcome | condition).

Hands-On Example (IMPORTANT)
All code of the book is available on GitHub (Chapter 2)
– Dataset size = 5172 emails
– Apply a Naive Bayes classifier (supervised learning)
– Separate into training data and test data
– Do manual classification of the training data
– Iterate to improve the models

First NLP Examples
– Getting-Started-with-NLP/Chapter2.ipynb at master · ekochmar/Getting-Started-with-NLP · GitHub
– Copy it to your local Jupyter Notebook environment, or to Colab
– All book examples are valid templates for the group work task – adapt them to your cases!

Hands-On Example: How good is your classifier?
What kind of metrics would you apply?
– What is the accuracy of your classifier on this small dataset?
– Is this a good accuracy? That is, does it suggest that the classifier performs well? What if you know that the ratio of ham to spam emails in your set is 50:50? What if it is 60% ham and 40% spam – does it change your assessment of how well the classifier performs?
– Does it perform better at identifying ham emails or spam emails?
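The classification steps and the evaluation questions above can be sketched with scikit-learn. The tiny inline dataset is a toy assumption (the course uses the book's 5172-email corpus), and the exact split and features differ from the book's notebook; this is a minimal sketch, not the book's code.

```python
# Sketch of a Naive Bayes spam classifier with evaluation metrics.
# The tiny inline dataset is illustrative only; the course works with
# a corpus of 5172 emails from the book's Chapter 2 materials.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

emails = [
    "win a free prize now", "cheap meds limited offer",
    "meeting agenda for monday", "lunch tomorrow?",
    "free lottery winner claim", "project status update",
    "claim your free reward today", "notes from the seminar",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

# Bag-of-words features; the book also discusses richer features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
print("f1       :", f1_score(y_test, y_pred, zero_division=0))
```

Note how accuracy alone can mislead on an imbalanced 60/40 ham-to-spam set, which is why precision, recall and F1 are printed alongside it.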
Different Metrics to Evaluate Classification Results (IMPORTANT, CHEAT SHEET)
– Accuracy: a single metric – the fraction of examples the classifier labeled correctly
– Precision: TP / (TP + FP)
– Recall: TP / (TP + FN)
– F-score: the F1 score is the harmonic mean of precision and recall, F1 = 2 · P · R / (P + R)
– Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.
– https://en.wikipedia.org/wiki/F-score

Ranking Criteria – Swiss Bikes (QUIZ)
Priors for ranking criteria – LTR (learning to rank):
– Pointwise: regression model
– Pairwise: a triplet (q, di, dj) of the query and two results as a training instance
– Observed preferences from query logs and clickthrough data, e.g. preferences d3 > d1 and d3 > d2
– Train a classifier on these preferences
– Pre-retrieval of a small candidate set for LTR
Source: File:MLR-search-engine-example.png - Wikimedia Commons

QUIZ

Information Retrieval: How to rank? – An example
[Figure: three documents about "Information Systems", "Biology" and "Databases", with occurrences of the query terms "system" and "database" highlighted; the document lengths are 100, 2000 and 10,000 words.]
– Which is the best document? Why?
– Collect criteria (that a machine can evaluate)!
Source: Frieder Witschel

Text Similarity Measures (IMPORTANT)
– Techniques used to quantify the similarity between two pieces of text
– Applications: information retrieval, document clustering, plagiarism detection, etc.
Term Frequency (TF)
– Measures how frequently a term appears in a document
– Simple and intuitive, but doesn't account for the importance of terms across the entire corpus
Inverse Document Frequency (IDF)
– Measures how special (rare) the term is
Term Frequency-Inverse Document Frequency (TF-IDF)
– Combines TF with IDF to weigh terms by their importance
– Measures the relative concentration of a term in a given piece of text

BM25 (Best Matching 25) – Probabilistic IR (NOT relevant)
– An advanced ranking function used in information retrieval, based on the probabilistic retrieval framework
– More sophisticated than TF-IDF: accounts for term frequency saturation and document length normalization
– score(d, q) = Σ over query terms t of IDF(t) · tf(t, d) · (k1 + 1) / (tf(t, d) + k1 · (1 − b + b · |d| / avgdl))
– (q): query terms; (d): document
– (k_1): term frequency saturation parameter (usually 1.2 to 2.0)
– (b): length normalization parameter (usually 0.75)
– (|d|): length of the document; (avgdl): average document length
Hint for group work: you might explore the different methods in your group work (e.g. showing different ranking results based on different algorithms).
BM25 The Next Generation of Lucene Relevance - OpenSource Connections

Inverted Index
– An index data structure storing a map of content, such as words or numbers, to the documents of a particular set in which it occurs.
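The BM25 parameters listed above can be turned into a short scoring function. This is an assumption-level sketch of the standard Okapi BM25 formula (using the common smoothed IDF variant, which differs from the slides' log10(1 + N/DF) form); the toy corpus and query are my own, not course data.

```python
# Minimal BM25 scoring sketch with the parameters from the slide:
# k1 (term frequency saturation) and b (length normalization).
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against query terms with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N      # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)    # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(t)                        # term frequency in doc
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    "the database system stores data".split(),
    "biology of marine systems".split(),
    "database database queries".split(),
]
query = ["database", "system"]
scores = [bm25_score(query, d, corpus) for d in corpus]
print(scores)  # the two database documents outrank the biology one
```

The saturation term means the second and third occurrence of "database" add less than the first did, which is the key difference from plain TF weighting.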
(IMPORTANT)
– It contains two components:
– Term dictionary: the sorted list of all distinct terms in the collection
– Postings: for each term, the list of documents in which it occurs
– The query keywords are typically much shorter than the documents
– The query is also represented in the vector space model
– TF-IDF is used to generate the ranking

RSV (Retrieval Status Value) & Term-Document Matrix (IMPORTANT, CHEAT SHEET)
Relevance ranking models:
– BIM (binary independence model)
– VSM – vector space models, such as the BM25 model:
– avdl (average document length)
– tf (term frequency)

Core Idea: Inverted List
– RSV (retrieval status value) = sum of term frequencies
– 3 documents

Ranking Principles: Document Weight Vectors & Dot Product
– TF: more occurrences of a query term → higher ranking
– CTF (convex TF): the increase of the tf weight element should be smaller the greater tf is, e.g. using tf / (tf + 1)
– IDF: favour documents with many occurrences of rare query terms
– Length: longer document with the same number of query terms → lower ranking
– Dot Product: query vector · document vector = dot product; it measures how well each document matches the query terms, with higher scores indicating a better match
Source: ECM-IR: Frieder Witschel

Example: effect of plain Term Frequencies (TF)
– Effect: finds documents with many occurrences of any query term (here: mostly «information»)

Example: effect of convex TF
– Effect: better balanced results («more purple»)

Example: effect of convex TF + IDF
– Effect: bias towards documents with more occurrences of informative terms (even more purple)
– Hint: N is > 4; the values of IDF are assumptions

Ranking Principles: Document Weight Vectors & Dot Product (IMPORTANT, CHEAT SHEET)
– tf: number of occurrences of term t1 divided by the total number of terms
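The contrast between plain TF and convex TF weighting described above can be reproduced numerically. The three toy documents and the query are assumptions; the IDF follows the log10(1 + N/DF) form used in these slides.

```python
# Sketch of plain TF vs convex TF weighting with dot-product ranking.
# The toy documents and query are illustrative assumptions.
import math

docs = [
    ["information", "retrieval", "information", "systems"],
    ["database", "systems", "database", "database"],
    ["information", "database", "retrieval"],
]
query = ["information", "retrieval"]
N = len(docs)

def tf(term, doc):
    """Plain relative term frequency."""
    return doc.count(term) / len(doc)

def convex_tf(term, doc):
    """Convex TF: f/(f+1) grows ever more slowly as f increases."""
    f = doc.count(term)
    return f / (f + 1)

def idf(term):
    df = sum(1 for d in docs if term in d)   # document frequency
    return math.log10(1 + N / df)

def score(doc, weight):
    """Dot product of query weight vector and document weight vector."""
    return sum(weight(t, doc) * idf(t) for t in set(query))

plain = [score(d, tf) for d in docs]
convex = [score(d, convex_tf) for d in docs]
print("plain TF :", [round(s, 3) for s in plain])
print("convex TF:", [round(s, 3) for s in convex])
```

With convex TF, the second occurrence of "information" in the first document contributes less than the first occurrence did, producing the "better balanced" effect the slide describes.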
– tf convex: the increase of the tf weight element should be smaller the greater tf is, e.g. by using tf / (tf + 1)
– IDF: favour documents with many occurrences of rare query terms; IDF = log10(1 + N/DF)
– Length: longer document with the same number of query terms → lower ranking
– Dot Product: query vector · document vector = dot product; it measures how well each document matches the query terms, with higher scores indicating a better match
Source: ECM-IR: Frieder Witschel

Example: how results differ

QUIZ – Vector Space Model (VSM)
VSM – Vector Space Model
– Represents documents and queries as vectors in a multi-dimensional space
– Each dimension corresponds to a unique term from the document collection
– Term Frequency (TF): measures how frequently a term appears in a document. The more frequently a term appears, the higher its TF value.
– Inverse Document Frequency (IDF): measures how important a term is within the entire document collection. Terms that appear in many documents have lower IDF values, while terms that appear in fewer documents have higher IDF values. Here N is the total number of documents and n_t is the number of documents containing the term.
– TF-IDF Weighting: combines TF and IDF to give a weight to each term in a document, balancing the term's frequency in the document with its rarity across the collection.

VSM – Vector Space Model (QUIZ)
Apply (K3):
– 3 documents
– Calculate TF and IDF
– Calculate dot products
– Build a term-document matrix
– Additional hints: WILL BE IN THE EXAM

Solution(s) – Working with Jupyter Notebooks (QUIZ)
– You can use a limited set of formatting elements in .ipynb files, which are useful for commenting your solutions!
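A quiz-style walk-through of the K3 task above: build a term-document matrix, weight it with TF-IDF, and rank via dot products. The three documents and the query are my own toy assumptions; the IDF uses the slides' log10(1 + N/DF) form.

```python
# Build a term-document matrix, weight it with TF-IDF, and rank by
# dot product with a query vector. Toy documents are assumptions.
import math
import numpy as np

docs = [
    "information retrieval systems",
    "database systems",
    "information about database systems",
]
query = "information systems"

tokenized = [d.split() for d in docs]
terms = sorted({t for doc in tokenized for t in doc})
N = len(tokenized)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)              # relative term frequency
    df = sum(1 for d in tokenized if term in d)  # document frequency
    return tf * math.log10(1 + N / df)

# Term-document matrix: one row per term, one column per document.
A = np.array([[tfidf(t, d) for d in tokenized] for t in terms])

# Binary query vector over the same vocabulary.
q = np.array([1.0 if t in query.split() else 0.0 for t in terms])

scores = q @ A          # dot product of the query with each document column
print(dict(zip(["d1", "d2", "d3"], scores.round(3))))
```

Writing out the matrix A by hand for three short documents is exactly the exercise the quiz asks for; the code only checks the arithmetic.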
Jupyter Notebook Formatting Elements
– Structure your Markdown and Code elements

Useful Packages and Data Structures for NLP
– Pandas DataFrames (the "Excel view")
– Conversion to arrays:
  df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
– Handling matrix dimensions:
  array.transpose()
– Tokenization, preprocessing:
  !pip install numpy pandas scikit-learn nltk
  from nltk.corpus import stopwords
  from nltk.stem.snowball import SnowballStemmer
– TF-IDF:
  tfidf_matrix = vectorizer.fit_transform(processed_docs)
  # Get feature names (words)
  feature_names = vectorizer.get_feature_names_out()

Convert DataFrame to Matrix
– The prompt and an initial version are generated
– Fix the issue of having only two columns (the Pandas DataFrame is different than assumed; the index is not counted!)

Now calculate the vector product between matrix A and q

Hints for Working with Google Colab
– You need a Google account
– You can mount Google Drive
– You can navigate within the drive

Using SciKit Learn for TF-IDF Calculation
– feature_names = vectorizer.get_feature_names_out() returns the list of feature names (i.e. the terms or words) that the TfidfVectorizer has identified and used to create the TF-IDF matrix
– Vectorizer initialization: when you initialize and fit the TfidfVectorizer on your documents, it tokenizes the text and removes stop words (if specified); stemming is only applied if you supply a custom tokenizer or preprocessor
– Vocabulary creation: during this process, the vectorizer builds a vocabulary of all unique terms (words) found in the documents.
– Each term is assigned a unique index

Prompt Engineering
– Decompose the task into parts
– The dot product of arrays with shape (n, 3) and shape (n, 1) gives an error message

Coaching Time! – Continue on the Group Work Task
– Show your ideas!
– Work in groups & split the tasks!
– Problem-based learning
– Use ChatGPT, Gemini, etc. and validate the hints iteratively.
– Skim through the NLP book to get implementation ideas!
– Look at SciKit Learn or other frameworks
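The shape error mentioned under "Prompt Engineering" comes from incompatible inner dimensions in the matrix product. A minimal sketch of the problem and the transpose fix; the concrete array contents are illustrative assumptions.

```python
# Reproducing and fixing the (n, 3) x (n, 1) shape mismatch with numpy.
# The concrete numbers are illustrative assumptions.
import numpy as np

A = np.arange(12).reshape(4, 3)              # term-document matrix, shape (4, 3)
q = np.array([[1.0], [0.0], [1.0], [1.0]])   # query column vector, shape (4, 1)

try:
    A @ q                     # (4, 3) @ (4, 1): inner dimensions 3 != 4
except ValueError as e:
    print("shape error:", e)

# Fix: transpose A so the inner dimensions match: (3, 4) @ (4, 1) -> (3, 1)
scores = A.T @ q
print(scores.ravel())         # prints [15. 18. 21.]
```

This is the same `transpose()` trick listed under "Handling matrix dimensions" above: matrix multiplication requires the column count of the left operand to equal the row count of the right one.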