Podcast
Questions and Answers
What does the vector for document D2 represent?
What does the vector for document D2 represent?
What key concept does the distance between vectors relate to?
What key concept does the distance between vectors relate to?
Which document has the lowest term frequency for both terms?
Which document has the lowest term frequency for both terms?
How many dimensions are used in the vector representation of the documents?
How many dimensions are used in the vector representation of the documents?
Signup and view all the answers
Which document is best matched for a query close to the vector of D1?
Which document is best matched for a query close to the vector of D1?
Signup and view all the answers
What is the first step in the process of building an index from text files?
What is the first step in the process of building an index from text files?
Signup and view all the answers
Why are common words considered unhelpful for search functionality?
Why are common words considered unhelpful for search functionality?
Signup and view all the answers
What is the purpose of tokenisation in text processing?
What is the purpose of tokenisation in text processing?
Signup and view all the answers
Which of the following is NOT a part of the tokenisation process?
Which of the following is NOT a part of the tokenisation process?
Signup and view all the answers
What is referred to as 'stop words'?
What is referred to as 'stop words'?
Signup and view all the answers
How can search time be affected in the context of indexing?
How can search time be affected in the context of indexing?
Signup and view all the answers
Which step directly follows splitting text into words in the indexing process?
Which step directly follows splitting text into words in the indexing process?
Signup and view all the answers
In the content provided, how many times did the word 'software' occur in the programming_language.txt file?
In the content provided, how many times did the word 'software' occur in the programming_language.txt file?
Signup and view all the answers
Which Boolean connector used in queries will result in a smaller set of documents?
Which Boolean connector used in queries will result in a smaller set of documents?
Signup and view all the answers
What is a key advantage of Boolean retrieval compared to best match retrieval?
What is a key advantage of Boolean retrieval compared to best match retrieval?
Signup and view all the answers
In Boolean searching, what would be the outcome of a query structured as 'cat OR dog'?
In Boolean searching, what would be the outcome of a query structured as 'cat OR dog'?
Signup and view all the answers
What happens when using a combination of 'AND' and 'OR' in a Boolean query?
What happens when using a combination of 'AND' and 'OR' in a Boolean query?
Signup and view all the answers
What is one disadvantage of Boolean retrieval models?
What is one disadvantage of Boolean retrieval models?
Signup and view all the answers
Why might users struggle with Boolean queries?
Why might users struggle with Boolean queries?
Signup and view all the answers
What model represents documents as vectors in multi-dimensional space?
What model represents documents as vectors in multi-dimensional space?
Signup and view all the answers
Which query is most likely to yield a small result set?
Which query is most likely to yield a small result set?
Signup and view all the answers
What is the purpose of indexing in search systems?
What is the purpose of indexing in search systems?
Signup and view all the answers
What role do web crawlers play in building search indices?
What role do web crawlers play in building search indices?
Signup and view all the answers
How does a search system utilize citations in documents?
How does a search system utilize citations in documents?
Signup and view all the answers
What technique can be used to display the content of a web page after JavaScript evaluation?
What technique can be used to display the content of a web page after JavaScript evaluation?
Signup and view all the answers
What does full-text search entail in a search system?
What does full-text search entail in a search system?
Signup and view all the answers
What is a common output from a background process in a search system?
What is a common output from a background process in a search system?
Signup and view all the answers
What would be a likely result of utilizing a NoSQL solution like OpenSearch?
What would be a likely result of utilizing a NoSQL solution like OpenSearch?
Signup and view all the answers
What effect does stopword removal have on document representation?
What effect does stopword removal have on document representation?
Signup and view all the answers
In the context of search systems, what is the main purpose of recommendations?
In the context of search systems, what is the main purpose of recommendations?
Signup and view all the answers
Which of the following statements about stemming is true?
Which of the following statements about stemming is true?
Signup and view all the answers
What is the purpose of using inverse document frequency (idf)?
What is the purpose of using inverse document frequency (idf)?
Signup and view all the answers
Which similarity measure is commonly used for comparing document vectors?
Which similarity measure is commonly used for comparing document vectors?
Signup and view all the answers
How does adding more terms to the analysis affect the document representation?
How does adding more terms to the analysis affect the document representation?
Signup and view all the answers
What does the dot product indicate in vector similarity measures?
What does the dot product indicate in vector similarity measures?
Signup and view all the answers
What is the primary benefit of using tf.idf in document representation?
What is the primary benefit of using tf.idf in document representation?
Signup and view all the answers
What is the primary goal of document representation techniques like stemming and stopword removal?
What is the primary goal of document representation techniques like stemming and stopword removal?
Signup and view all the answers
Which document has a higher dot product with the query vector?
Which document has a higher dot product with the query vector?
Signup and view all the answers
What does a higher cosine similarity value indicate?
What does a higher cosine similarity value indicate?
Signup and view all the answers
What is a disadvantage of using simple matching algorithms?
What is a disadvantage of using simple matching algorithms?
Signup and view all the answers
What is the purpose of using cosine similarity in information retrieval?
What is the purpose of using cosine similarity in information retrieval?
Signup and view all the answers
Which document showed better matching performance using cosine similarity?
Which document showed better matching performance using cosine similarity?
Signup and view all the answers
What factor is not considered in simple matching algorithms?
What factor is not considered in simple matching algorithms?
Signup and view all the answers
Web search processes typically include all of the following steps except:
Web search processes typically include all of the following steps except:
Signup and view all the answers
In the context provided, which statement is true regarding algorithm performance?
In the context provided, which statement is true regarding algorithm performance?
Signup and view all the answers
Study Notes
Search Systems
- Search systems expect users to search documents using text, citations, or links to other documents
- Users may search for similar documents based on similar topics or other users' decisions
- Search systems should allow for real-time or fast search capabilities
Full-text search
- Full-text search involves costly processing and data input/output
- Building indices is more efficient than searching files repeatedly
- Indices are queried to return search results
- Background processes are used to create indices, including web crawlers and operating systems
- Web crawlers crawl content to build indices
- Operating systems index files using background processes
Citations and Links
- Search systems can look for documents containing citations and links
- Counting citations builds up the importance of documents
- The direction of citations, and the number of citations, indicate the importance of a page
- Web pages can be referenced by many other pages
- Documents are frequently crawled by background processes
Categories
- Users may specify categories for search queries
- Full-text search categories include phrases or subject areas
- Pictures in documents could be searched using specific color shades
- Search systems crawl documents, images, and other data
Recommendations
- Search systems can recommend similar documents based on other users' searches
- Systems use correlation information from users' searches
- Further information about recommender systems is available from specific lectures
Indexing
- Search systems build databases containing indices
- NoSQL solutions, like OpenSearch, are often used
- NoSQL databases are accessed from web browser user interfaces
Web Page Structure
- Web servers return page content to browsers
- Web pages consist of HTML, CSS, and JavaScript
- Browsers convert JavaScript into HTML and render HTML and CSS
- Search often requires evaluating JavaScript to access text content
- Parsers, like BeautifulSoup, extract text from HTML pages
Downloading HTML Files
- Three example HTML files are often downloaded for study purposes
- These files represent simple web pages
- JavaScript is ignored in the example cases
- HTML files are converted to text files
Searching Text Files
- The term "software" is often included as a search query
- Two separate approaches to searching through programming languages are available
Counting Words
- The word "software" is counted in text files
- Search systems need to identify the word "software" rather than only text fragments
Scaling Up
- Searching times vary
- Smaller files are processed faster than very large files
- Big datasets, like billions of web pages, require indexing
Conclusions
- Search systems can be built from standard tools
- Search times depend on file size and the number of files to index
- Often, indexing is necessary to conduct full-text searches on the majority of files
Selecting Tokens
- Tokens are necessary for building the index
- Input text is split into tokens using white space delimiters
- Punctuation is frequently removed
- Tokens are converted to lowercase for consistency
- Special rules for nouns may apply
- Removing nouns may be undesirable in some instances.
Splitting Text into Tokens
- Common words are irrelevant for search.
- These words are also referred to as stop words
- Stop lists are used to remove these words
Word Frequency: Zipf Distribution
A graph showing the relationship between rank and frequency of words in a document.
Stop Word Removal
- Stop word removal reduces irrelevant words from a search query
- Standard stop words lists are often used
- More specific stop lists may improve search performance.
Stemming
- Nouns can be singular or plural
- Verbs can be in different forms, requiring stemming (e.g., send, sending, sent)
- Stemming reduces tokens to their stem word
Stemming Approaches
- Dictionary stemming is more accurate but requires more processing time
- Algorithm stemming is often preferred due to its better performance
- Various stemming algorithms exist, based on rules
- Algorithm stemming results may not always be correct
Porter Stemming Algorithm
- Martin Porter developed the Porter stemming algorithm in 1980
- This algorithm removes suffixes from words
- Words are converted to their singular form using rules.
Porter Algorithm (Problems)
- The Porter algorithm can make mistakes (e.g., "europe" vs "european")
- It does not correctly handle proper nouns, like names
- It is a popular algorithm (and its basis) used in other non-English stemmers
Converting Queries into Tokens
- Search queries are treated the same way as document text
- Stop words are removed from queries
Other Indexing Choices
- Search systems can use phrases instead of tokens
- Lists of known phrases are useful
- Identifying synonyms, like "shut" and "close", is useful
- Term weighting can be based on importance
Term Weighting
- Search systems process terms from words in documents
- Early information retrieval systems use presence or absence of terms to indicate relevance
- Early weighting often saved space and was suitable for short documents or titles
- More advanced systems estimate term importance in the document as well as overall importance in the entire dataset
Term Frequency
- Term frequency (tf) is a measure of how important a word is within a single document
- Higher term frequencies often imply higher importance
- There are several ways to measure term frequencies in a document. Raw frequency is the simplest calculation
- Results affected because the document with the highest term frequency is shown first in a search query
Inverse Document Frequency
- Inverse document frequency (idf) measures how prevalent the words are within the dataset overall
- Rare terms are significantly more important than frequent terms in a corpus
TFIDF
- TF-IDF combines term frequency and inverse document frequency to represent the importance of a term within a document based on its overall frequency in the entire dataset
Information Retrieval Models
- Information retrieval models are methods for determining the best match between a query and a document
Boolean Retrieval
- The oldest information retrieval (IR) model
- It focuses solely on the presence or absence of terms to ascertain retrieval
- It doesn't use terms' weights
Boolean Retrieval (Terms, Formula, and Applications)
- Query terms utilize Boolean connectors (AND, OR, NOT).
- Boolean retrieval is similar to SQL
- Precise searches, such as in legal fields, frequently use the Boolean retrieval approach
Example Boolean Queries
- Some examples of queries using Boolean operators (AND, OR, NOT).
Boolean Retrieval Usage
- Boolean retrieval narrows down results
- It does not weight terms
Boolean Retrieval Problems
- Large result sets (especially using OR)
- Small result sets, when using AND
- Unintuitive language
Boolean Retrieval Advantages
- Exact queries
- Easy to program
- Precise results
Best Match Retrieval
- This is an alternative to Boolean retrieval.
- Most retrieval models use the best match approach
- Documents are ranked according to how well they match the query
- First result is generally the most relevant result in this approach
Advantages over Boolean Retrieval
- Queries are in natural language, reducing user effort
- Easier to modify queries
Vector-Space Model
- Terms are expressed as vectors in a multi-dimensional space.
- One axis or dimension is associated with each term
- A separate document vector is created for every document
- Proposed in 1960 and remains a popular model
Building the Vectors
- Building vectors involves assigning numerical values to represent document terms
- Values are chosen to simplify illustrations
- Vectors define positions in multi-dimensional space
Building the Vectors (Continued)
- Document vectors imply document similarity
- The closest vector to the query vector has the best match
- Closest document vectors represent the best retrieval candidate
Effect of Stopword Removal
- Stopword removal alters document vector positions
- Relevance is often measured in multi-dimensional space without using stopwords
Effect of Stemming
- Stemming affects document vector dimensionality
- Documents sharing a stem word are closer in the vector space
Frequency Measures
- IDF and tf.idf measurement techniques often create better separation in the vector space.
Adding More Terms
- Each added term results in an orthogonal axis
- TFIDF values determine position on an axis
Similarity Measures
- Methods for comparing vectors exist, including coordinate-level matching and calculating the angle between query and document vectors using the cosine coefficient
Simple Matching (Dot Product)
- The dot product between query and document vectors is calculated
- The highest dot product value corresponds to the closest document
- Simple matching returns documents containing common terms.
Cosine Matching
- Cosine similarity considers document length and shared terms
- Closer angle implies higher similarity
Simple and Cosine Compared
- Dot product measures (simple matching) and angles in multi-dimensional space (cosine) are alternatives
- Matching algorithms return slightly different results (e.g., different orderings that may yield different matches)
Advantages
- Vector space models are generally easy to understand and interpret, including their geometric view
- They work well with various similarity measures
- They often lead to better performance than Boolean models
Disadvantages
- Vector models might not have a strong theoretical foundation in some instances
- They do not inherently provide predictive power, requiring empirical evaluations
- They may lack control over document selection
Web Searches
- Similar principles apply to standard information retrieval
- Data acquisition, indexing, query matching, and presentation are steps
- The Web differs in that data is decentralized and changes dynamically
Data Collection
-
Data collection on the web has to be frequently collected and updated for a search engine
-
Data is decentralized and distributed
-
Large scale Web data often is collected dynamically and is semi-structured
-
Web data handling frequently includes various file types (PDF, HTML, word), encodings, and languages
Data Collection (Deployment)
- Web crawlers and robots collect data from web pages
- These programs (often referred to as 'spiders') crawl through web pages recursively
- They collect pages and follow links (crawling)
Adversarial Information Retrieval
- Web search engines face manipulation attempts from individuals or companies
- Content suppliers frequently try to manipulate search results
- Manipulating search results include promotion of a company's own pages without paying for advertising
- Manipulators might also intentionally hide the real content of a website, or redirect searchers to different pages
Static Data
- Static datasets are typically well-defined and reliable
Web Data
- Data on web pages can be old or have false or manipulated information
Web Search Engine Requirements
- Web search engines have policies, as well as many considerations, regarding how sites should be indexed
- Policies handle encoding schemes
- Managing dynamic pages and keeping cached pages up-to-date are extra considerations
Query and Information Provided
- Users might submit poor quality queries which might be one or two terms.
- Search engines must often clean up or modify the query itself to match a format a typical database can use
Matching Web Pages
- Web pages vary in quality
- Search engines prioritize trustworthy pages
- Web page readability must be considered
- Pop-ups and other annoyances should be avoided
- Security concerns, including HTTPS verification, need addressing
Web Page Ranking
- Scoring and ranking web pages depends on search queries
- Search engines combine multiple search scores (e.g., matching the query, popularity, age, etc.)
- Popularity is a key factor affecting web retrieval
- PageRank measurement of web page popularity
PageRank of Web Pages
- PageRank algorithm assesses the importance of web pages
Graph of Web Pages
- The web can be modeled as a graph
PageRank - Random Walk
- Web page importance assesses random user page browsing
- Probability of visiting a page is calculated
- Important pages have higher probability scores
PageRank - Damping Factor
- Adding a damping factor accounts for isolated graphs to improve PageRank accuracy
- Random jumps account for user behavior
PageRank - Damping Factor (Continued)
- Random jumps add stochastic behavior to the PageRank calculation, addressing isolated or dead-end pages
- Random jumps can occur with a given probability (e.g., 15%).
- The user might "jump" to another page with a given probability
- Normal browsing would reflect a typical use probability for following web links (e.g., 85%).
Dead Ends
- In PageRank algorithm, dead-end web pages cause issues
- Random jumps give dead end pages a chance to be revisited
PageRank Formula
- A mathematical formula is often used to calculate PageRank
PageRank - Page Quality
- An important quality factor affecting PageRank is the PageRank values of other pages that link into a page.
Aggregated Search
- Web search incorporates several sources
- Multiple sources might be combined in a search (e.g. images, news, video, blog content)
Search Tools
- Search tools used for information retrieval (often based on open source tools)
Lucene
- A frequently used (free and open-source) library for information retrieval
Solr
- Enterprise-grade search engine, based on Lucene, often used for large-scale data
Elasticsearch
- Enterprise-grade search engine, based on Lucene
- JSON documents are stored
- Solr relies on JSON format for storing and handling documents
Other Tool Kits
- Other web search tool kits are also frequently used for search functionality
- Many websites and search engines sometimes use Python
- Some search engines have APIs (Application Programming Interfaces) that allow for programmatic access
Case Study
- One example case study shows a Python-based system implementing facetted and keyword searches on a dataset
- The case study demonstrates a Python system using the case study's company names to retrieve other company data.
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
This quiz explores key concepts in text processing, including vector representation, tokenization, and the role of stop words in indexing. Test your understanding of how documents are analyzed for effective search functionality. Ideal for students studying information retrieval or data management.