Podcast
Questions and Answers
In the vector space model, what do the coordinates of a document vector represent?
In the vector space model, what do the coordinates of a document vector represent?
What does a higher term frequency (TF) for a specific term in a document indicate?
What does a higher term frequency (TF) for a specific term in a document indicate?
When comparing document vectors to a query vector, what does it mean if two vectors are closer together?
When comparing document vectors to a query vector, what does it mean if two vectors are closer together?
Which term frequency value indicates that the term 'pear' appears in document D3?
Which term frequency value indicates that the term 'pear' appears in document D3?
Signup and view all the answers
Which document vector would represent a document that discusses both 'apple' and 'pear' equally?
Which document vector would represent a document that discusses both 'apple' and 'pear' equally?
Signup and view all the answers
What is the primary characteristic of Boolean retrieval that distinguishes it from best match retrieval?
What is the primary characteristic of Boolean retrieval that distinguishes it from best match retrieval?
Signup and view all the answers
In a vector-space model, what does each document represent?
In a vector-space model, what does each document represent?
Signup and view all the answers
What does the term frequency-inverse document frequency (TF-IDF) measure in document retrieval?
What does the term frequency-inverse document frequency (TF-IDF) measure in document retrieval?
Signup and view all the answers
Which similarity measure is often used to assess the relevance between two documents in vector-space models?
Which similarity measure is often used to assess the relevance between two documents in vector-space models?
Signup and view all the answers
What is a key advantage of best match retrieval compared to Boolean retrieval?
What is a key advantage of best match retrieval compared to Boolean retrieval?
Signup and view all the answers
Which Boolean connector is specifically used to ensure that results contain one or both terms?
Which Boolean connector is specifically used to ensure that results contain one or both terms?
Signup and view all the answers
What problem might arise from using popular terms in Boolean queries?
What problem might arise from using popular terms in Boolean queries?
Signup and view all the answers
Which of the following statements about document matching techniques is true?
Which of the following statements about document matching techniques is true?
Signup and view all the answers
What does cosine similarity primarily evaluate between query and document vectors?
What does cosine similarity primarily evaluate between query and document vectors?
Signup and view all the answers
How does the term frequency-inverse document frequency (tf-idf) affect document representation?
How does the term frequency-inverse document frequency (tf-idf) affect document representation?
Signup and view all the answers
Which of these techniques is used for document similarity measurements?
Which of these techniques is used for document similarity measurements?
Signup and view all the answers
What is the main advantage of using a vector space model in information retrieval?
What is the main advantage of using a vector space model in information retrieval?
Signup and view all the answers
What is a characteristic of stopword removal in the context of document processing?
What is a characteristic of stopword removal in the context of document processing?
Signup and view all the answers
In the context of vector space models, what is the effect of adding more terms?
In the context of vector space models, what is the effect of adding more terms?
Signup and view all the answers
Which characteristic of the cosine coefficient is essential in evaluating document similarity?
Which characteristic of the cosine coefficient is essential in evaluating document similarity?
Signup and view all the answers
Which term describes the overlap of vectors in a similarity measurement context?
Which term describes the overlap of vectors in a similarity measurement context?
Signup and view all the answers
Study Notes
Search Systems
- Search systems aim to locate relevant documents based on user input.
- Users expect searching by text content, or citations/links to other documents.
- Documents matching similar topics or tagged similarly are needed.
- Real-time or fast search is also expected.
Full-Text Search
- Full-text search involves processing data and handling large input/output loads.
- The desired output is fast and efficient retrieval.
- Building indexes rather than repeating searches on the same files is key.
- Querying indexes to retrieve results is crucial.
- Background processes are used for indexing.
- Web robots crawl content to build up indexes for fast access.
- Operating System indexes files in background processes.
Citations and Links
- Citation and link searches identify documents referencing other documents.
- Building up the importance of certain documents is a key feature.
- Assessing the number of citations or their direction establishes document significance.
- A single page may be referenced from many others.
- Document crawling is done via background processes.
Categories
- Users often specify categories while searching.
- Categories can include full-text based on phrases or subject matter.
- Document pictures can be categorized using color shades.
- Crawling of documents to collect image and other data is frequent.
Recommendations
- Recommendations consider similar searches from other users.
- User search history helps find similar documents.
- Correlation information is used to suggest similar documents.
- A dedicated "Recommender Systems" lecture likely covers this in more detail.
Indexing
- Building databases containing indexes is a main function.
- Distributed databases with large amounts of data are used.
- NoSQL solutions like OpenSearch are suitable.
- Web browser user interfaces query NoSQL databases.
Web Page Structure
- Web servers deliver page content to browsers.
- Web pages use HTML, CSS, and JavaScript.
- Browsers convert JavaScript to HTML for rendering.
- Browsers render HTML and CSS for layout display.
- Searching for text occurs after JavaScript evaluation.
- Headless browsers are often used for JavaScript execution and text extraction.
- Parsing libraries like BeautifulSoup make text extraction from HTML documents efficient.
Download HTML Files
- Downloading three sample HTML files is part of the analysis.
- The sample files are simple.
- JavaScript is often ignored.
- Conversion of HTML files to text format is a crucial step.
- The extension .html is changed to .txt for further processing.
Searching Text Files
- Searching for the word "software" in the downloaded text files.
- Commonly, two approaches in programming exist—interpreted or compiled languages.
- Programming languages use tools for software_engineering, and projects.
- Software_engineering addresses software system topics, development, and maintenance.
- Databases are also software designed to maintain data structures based on user queries.
Counting Words
- Counting instances of "software" in the text files using command-line tools.
- The count of "software" rather than the entire file content.
Scaling Up
- Web searches take time due to massive data volume, files.
- Time for searching from the presented example with command-line.
Conclusions
- Text searches are feasible from standard tools.
- Search time is dependent on factors.
- Search time is reliant on factors like file size, logic, file number.
- Indexing is needed for full-text searches, handling large quantities of files.
Tokenization
- Tokens are vital for building indexes for efficient data retrieval.
- Converting input text into tokens split using spaces or punctuations.
- Removing duplicates, punctuations, and converting to lower case are common data manipulations.
- Optional rules for nouns can also be applied, if necessary.
Stop Word Removal
- Common words are often not useful for search.
- Removing stop words improves search efficiency.
- Standard lists are often used to identify and remove stop words.
- Removing stop words increases search efficiency.
Word Frequency: Zipf Distribution
- Word frequency distribution is a common property in many textual datasets.
- Data exhibits properties of a Zipf distribution.
Stemming
- Singular/plural nouns may appear in different forms.
- Verbs may exhibit multiple forms (e.g. send/sending/sent).
- Reducing tokens to the stem word improves efficiency.
Stemming Approaches
- Dictionary stemming requires data and processing time.
- Algorithms are preferred for performance.
- Stemming algorithms use sets of rules.
- Stemming results aren't always totally correct.
Porter Stemming Algorithm
- Developed by Martin Porter in the 1980s.
- Removes suffixes according to rules.
- Converts words to singular form using rules (e.g. caresses -> caress, ponies-> poni, cats -> cat).
Porter Algorithm
- Stemming algorithms, like Porter, can contain mistakes.
- False connections can occur (e.g., "europe" vs "european," or “police” vs "policy").
- Proper nouns are sometimes incorrectly handled.
- The algorithm remains a baseline for non-English stemming.
Converting Queries into Tokens
- Queries are processed the same way as documents.
- Stop word removal extracts significant words, improving performance.
- Stemming converts tokens to their root form.
- Queries are converted into tokens for search.
Other Indexing Choices
- Using phrases (e.g., collection of words, more expressive) is an alternative to tokens.
- Collect phrases from multiple documents.
- Synonyms (e.g., "shut," "close") can be included.
- Care is needed to avoid ambiguous results.
- Weights are assigned to terms to reflect importance.
Term Weighting
- Processing words from documents for weighting is significant.
- Early models used absence/presence as weights to control space used.
- Advanced models use term and collection frequency (TF-IDF)
- TF measures terms within document relevance.
- IDF measures terms in entire collection for document relevance.
Term Frequency
- Term importance within a document's measured by frequency.
- A higher frequency of term suggests a greater weight, higher relevance.
- Multiple equation and measurement types for determining term frequency.
- Raw term frequency (tf(t,d)) counts the instances of a term in a document.
Inverse Document Frequency
- Identifying how common a term within multiple documents is useful.
- Rarer terms are often better discriminators for relevance among documents.
- Inverse document frequency (IDF) is computed after stop words are removed.
- Rarer terms are weighted higher, implying greater collection relevance.
TFIDF
- Combining TF and IDF to create a term's overall value representing importance across document collection and individual documents.
- TFIDF(t,d) combines term frequency and inverse document frequency.
- Weights of words within document d.
- Inverse document frequency of term t.
- Term frequency of term t in document d.
Information Retrieval Models
- Methods for matching queries with documents.
- Different models exist for information retrieval.
- Boolean logic combines search terms via "AND," "OR," "NOT."
- Hidden Markov Models (HMMs) are statistical models.
- Latent Semantic Analysis (LSA) recognizes relationships between documents and terms.
- Vector Space Model uses cosine similarity to compare document vectors.
- Probabilistic Relevance describes the probability that a document is relevant to a query.
- Language Models assess the likelihood of a document generating a query.
Boolean Retrieval
- Boolean retrieval is a common, oldest approach.
- Data description based on term presence.
- It does not implement term weights.
- Retrieval based on query formulas using Boolean logic.
Terms, Formula, and Applications
- Terms are query words.
- Formulas use Boolean logic connectors (e.g. AND, OR, NOT).
- It is similar to SQL.
- Useful in precise searches like legal or medical ones.
Example Boolean Queries
- Basic Boolean search examples (e.g., "cat," "cat OR dog," "cat AND dog," “(cat AND dog) OR budgie,” etc.) are shown.
Boolean Retrieval Usage
- Boolean retrieval narrows down document sets.
- Typically does not employ term weighting.
- Documents are commonly presented as sets.
Boolean Retrieval Problems
- Result set size can be substantial, influenced by the query terms.
- Queries with unique terms often produce small results.
- Using "OR" generates larger results, while "AND" results often smaller.
- Bad combinations can lead to large or empty sets.
- Trial-and-error can improve result set selection.
Boolean Retrieval Problems (Artificial Language)
- This approach is often considered as an artificial language, sometimes considered less user-friendly than other approaches.
- Unintuitive, misunderstood
- User interfaces aid in understanding, building queries.
- Returned documents are potentially unordered outputs
Boolean Retrieval Advantages
- Exact, unambiguous queries.
- Simple to use, easily programmed using Boolean algebra.
- Experts can use precise searches.
- Precise results.
Best Match Retrieval
- Best match retrieval is an alternative.
- Most retrieval models are best match models.
- Ranked output based on query relevance.
- Documents sorted by relevance.
- First listed documents often most relevant.
Advantages over Boolean Retrieval
- No query syntax needed.
- Natural language use.
- Easier to modify the query.
- Documents can match exactly or partially.
- More likely to find relevant documents.
- Users may still have difficulty generating effective queries.
Vector-Space Model
- Treats terms in a document as vectors.
- One vector per document.
- One dimension per term.
- Originated with G. Salton in the 1960's.
- An important information retrieval technique.
Building the Vectors
- Illustrated with four documents and term frequencies for "apple" and "pear."
Building the Vectors
- Vectors position documents in a multi-dimensional space.
- The number of dimensions is the number of terms.
- Geometric approach to identifying how similar documents are to eachother
Building the Vectors (Query Expression)
- A query is expressed as a vector of term frequencies.
- Documents close to the query vector are assumed to be more similar.
Effect of Stop Word Removal
- Removing stop words impacts vector position within a space.
Effect of Stemming
- Reducing words to their stem reduces the dimensions of the vector space.
- Stemming improves efficiency.
Frequency Measures
- IDF and tf-idf affect vector position and proximity within the vector space.
- More separation in vector space occurs when appropriate.
Adding More Terms
- Each term maps to an orthogonal axis in the vector space.
- Positioning relies on TF, IDF, or TF-IDF values.
Similarity Measures
- Simple coordinate-level matching is used to compare vectors.
- Documents with shared terms will have overlapping vectors.
- Cosine coefficients.
- Evaluating angles between query and document vectors.
Simple Matching (Dot Product)
- Documents with higher dot products with the query are assumed to be relevant.
- Dot product computed from each document's term vectors, and the query vector.
- This simple method often produces long documents as best matches because they contain more terms and higher probabilities.
Cosine Matching (Cosine Similarity)
- Length of the document vector is accounted for in cosine matching.
- Cosine(0) = 1 and Cosine(180) = -1.
- Documents closer to query imply greater similarity.
- Cosine matching, by accounting for length, avoids biasing results towards long documents, giving better results than simple matching for queries.
Simple and Cosine Compared
- This comparison showcases how simple matching and cosine similarity methods compute different values for different documents.
- These methods often yield different results due to distinct approaches used.
Advantages
- Easy to understand.
- Geometric interpretation.
- Works with other similarity measures.
- Often better performance than Boolean retrieval.
Disadvantages
- Lack of theoretical foundation.
- Limited predictive power unless accompanied by experiments.
- Limited control over selection or results.
Web Searches
- Similar process to standard information retrieval.
- Obtain data, index data, obtain query.
- Match documents with query, present to user.
Data Collection
- Web data collection differs from standard information retrieval, as it requires methods to gather data.
- There is no central control of the data.
- Data is distributed, structured (or semi-structured). Often large volumes of varied data.
- Content can be dynamically generated from web services or queries against other databases.
Data Collection (Spider/Crawler)
- Spiders (crawlers/robots) gather data recursively.
- Known URLs are used to fetch, index, and discover new URLs.
- Additional data comes via direct submissions.
- Recurrence and priority are applied for data freshness.
Adversarial Information Retrieval
- Content providers may manipulate search engines..
- Promoting their own pages is common by manipulating results, sometimes evading advertisement fees.
- Purposeful hiding of real content.
- Causing the user to look at unwanted pages.
- Some cases include legitimate efforts.
- Spamming is a common tactic.
Static Data
- Static data from known data sets, with known quality and format.
- Language, data quality, and structure are known.
- Format and language are predetermined..
Web Data
- Web data differs from static data, being dynamic, potentially unreliable, in various formats (e.g., HTML, image files, PDFs).
- Languages are varied and potentially unknown.
- Content can be outdated or manipulated, thus reliability is not assumed.
Web Search Engine Requirements
- Search engines handle indexing, capitalization, stemming, phrases, and encoding schemes.
- Data variety makes it complicated.
- Dynamic updating (e.g., deleting indexed results) is needed.
- Cached pages exist beyond when the actual page was deleted, potentially causing confusion.
Query and Information Provided
- Typical user queries shorter, only including a few keywords.
- Early search engines were often more sophisticated.
- Modern ones often hide advanced features and attempt to autocorrect.
- Users may not have proper training or experience for sophisticated search requests.
- Real-time user response is desired.
Matching Web Pages
- Varying web page quality: trustworthiness and relevance needed.
- Pages should be trustworthy, current and maintained well.
- Duplicates should be avoided.
- Pages/documents should be readable/accessible.
- Security protocols applied (e.g., SSL, DNS).
Web Page Ranking
- Scoring page quality based on various factors.
- Relevance to a query is considered.
- Non-query factors like popularity, age, reputation can impact ranking.
- Scores are combined with crawling to evaluate page relevance.
PageRank of Web Pages
- Page Rank, a method to assess page quality based on incoming links, is a measure of a page's importance.
- Used in website ranking on various search platforms.
Graph of Web Pages
- Webpages reference other webpages.
- Incoming links imply relevance.
- Isolated clusters exist (pages with no links)
PageRank - Random Walk
- Assuming random user searches.
- Following out links randomly with equal probability.
- Probability of visiting each page is calculated.
PageRank - Damping Factor
- Visiting isolated graphs are considered.
- Probabilistic link following (e.g. 85%).
- Random jump component is incorporated.
PageRank - Damping Factor (random page skipping)
- 15% probability to randomly switch pages.
- Assumes not all links are followed, and random page hopping is common.
- Probability that a link is followed is assumed equally.
Dead Ends
- Pages lacking outgoing links (dead ends)
- Random searches could "get stuck" at dead ends.
- Random teleporting (jumping to random pages) to prevent getting stuck.
PageRank
- PageRank equation and underlying mathematical relationship to assess page qualities according to its links.
- N: total number of pages; d: damping factor (typically 85%); M(p): set of pages linking to p; L(pj): total number of links from p₁.
- Iterative calculation of PageRank values.
PageRank - Page Quality
- PageRank depends on the PageRank score of pages linking to a specific page.
- Pages linked to by high quality pages suggests a high pageRank score.
- Pages are assumed equal in weighting.
Aggregated Search
- Combines search results across multiple sources.
- Categorization often referred to as verticals (News, Videos, Images, etc).
- Blended results are presented.
- Traditional search engine performance measures are problematic in this context.
Search Tools
- Various tools, libraries support searches.
Lucene
- An open-source, free information retrieval library.
- Supported by the Apache Software Foundation.
- Originally written in Java, but available for many other programming languages.
- Contains a wide array of techniques for information retrieval.
Solr
- Implementation of Lucene, a Java-based search engine
- Supported by Apache Software Foundation libraries.
- Provides full-text, faceted, real-time indexing tools.
- NoSQL capable, handles varied file formats (e.g. Word, PDF files).
- Scalable and flexible tool for enterprise searches.
Elasticsearch
- Implementation of the Lucene library for creating scalable and flexible systems.
- HTTP web service interface for full-text searches.
- JSON schemas for flexible queries.
- Open-source Apache license.
- Often used with other frameworks for data collection and processing.
- Supporting tools often used for large-scale searches, like Logstash and Kibana.
Other Tool Kits
- Whoosh: python-based tool for full-text indexing and searching.
- Indri: C++, Java, or PHP based tools are employed for information retrieval.
- Terrier: Java toolkit for information retrieval developed at the University of Glasgow.
- Bing: Web search engine sometimes providing programming interfaces.
Case Study: Company Connecting
- Case study on implementing a search system.
- Stored company details in Microsoft Access, converted to MySQL.
- Django and Elasticsearch implemented.
- Amazon Web Services (AWS) supported functionality.
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
Test your knowledge on document retrieval concepts within the vector space model. This quiz covers important topics such as term frequency, document vectors, and the difference between Boolean and best match retrieval systems. Understand how these concepts apply to information retrieval and similarity measures.