Questions and Answers
What problem does the Web face regarding stored information?
Web mining is important only for retrieving textual data.
False
What is used to describe the intended documents for retrieval?
keywords
A video movie may have associated keywords such as its title, director, actors, and ______.
Match the following types of data with their respective keyword associations:
What is the primary function of document retrieval systems?
Web search engines are a common example of information-retrieval systems.
What factors influence the ranking of documents in information retrieval systems?
The formula used to measure the relevance of a document to a term is known as _____.
Which of the following statements about TF-IDF ranking is correct?
All terms used as keywords have the same level of importance in document relevance estimation.
Match the retrieval factors with their descriptions:
What happens to the relevance score of a document containing multiple occurrences of a term?
What is the purpose of inverse document frequency (IDF) in the context of document ranking?
Stop words are commonly used as keywords in information-retrieval systems.
What is the TF–IDF approach?
The formula for the relevance of a document to a set of terms is denoted as r(d, Q) and can be modified to take into account term __________.
Match the following terms with their definitions:
How can users enhance the relevance of their document queries?
Documents that contain terms occurring far apart should be ranked higher than those that contain terms close together.
What happens when a user supplies keywords that include stop words?
What is a major factor in determining the relevance ranking of a web page?
Popularity ranking involves ranking pages based on their access frequency only.
What is the basic idea behind popularity ranking in web search?
The relevance of a page can be enhanced by combining traditional TF–IDF measures with the page's __________.
Which of the following describes a challenge in determining the access frequency of web pages?
Ranking a page based solely on the number of links to it will always yield accurate popularity measurements.
What term describes the phenomenon where sites may misrepresent their access frequency?
What is the vector for document d defined as?
Relevance feedback requires users to add more keywords to their search query.
What measure is used to determine the similarity between two document vectors?
Relevance feedback can help users find relevant documents from a large set of documents matching the given query keywords, allowing users to identify one or a few of the returned documents as __________.
Match the following concepts with their descriptions:
What is a possible drawback of early Web-search engines that used only TF–IDF based relevance measures?
Clustering documents can help display a representative set when the number of documents is very large.
What is done to avoid returning multiple copies of the same document in search results?
What is the primary purpose of the PageRank algorithm?
PageRank relies on the number of links pointing to a page in order to determine its ranking.
What does the variable 'δ' represent in the PageRank algorithm?
The PageRank of a page is defined as the probability that a random walker is __________ the page at any given point in time.
Match the components of PageRank with their descriptions:
How is the jump probability matrix T defined for each link?
In PageRank, each PageRank value is initially set to 1 divided by the total number of pages (1/N).
What iterative technique is used to solve the equations generated in PageRank?
Study Notes
Web Mining
- Web mining is a critical tool for researchers to help locate interesting information on the internet.
- The internet provides easy access to numerous sources, but information overload is a challenge.
Document Searching
- Users search for specific documents or types of documents.
- Keywords, like "database system" or "stock-market scandals", help locate relevant documents.
- Documents associated with keywords matching the query are retrieved.
- Keyword-based search works for textual, video, and audio data (if they have descriptive keywords).
Document Searching (more detail)
- Keywords like title, director, actors, and genre help search for movies or video clips (video/audio tags work the same way as keywords)
- Document retrieval uses user keywords or sample documents to find relevant documents.
- Web search engines are the most common use of this system; they can search even image data with associated keywords.
Keyword Search
- Information retrieval systems allow searches using keywords and logical operators (and, or, not).
- When keywords are given without an operator, "and" is assumed implicitly.
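As a rough illustration of keyword search with logical operators, the sketch below builds a small inverted index and answers "and", "or", and "not" queries with set operations. The sample documents and the function names (`build_index`, `search_and`, `search_or`, `search_not`) are hypothetical and chosen only for this example.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "database system concepts",
    2: "stock market scandals",
    3: "database design for stock trading",
}

def build_index(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search_and(index, keywords):
    """Documents containing every keyword (the implicit 'and')."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

def search_or(index, keywords):
    """Documents containing at least one keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.union(*sets) if sets else set()

def search_not(index, all_ids, keyword):
    """Documents that do not contain the keyword."""
    return set(all_ids) - index.get(keyword.lower(), set())

index = build_index(docs)
```

A call such as `search_and(index, ["database", "stock"])` returns only the ids of documents that contain both words.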
Keyword Search (Ranking)
- Ranking documents by estimated relevance is critical.
- Factors for ranking include:
- Term frequency: how often a keyword appears in a document.
- Inverse document frequency: the fewer documents contain a keyword, the more weight that keyword receives.
Relevance Ranking Using Terms (TF-IDF)
- TF-IDF stands for Term Frequency/Inverse Document Frequency.
- n(d) is the number of terms in document d
- n(d, t) is the number of times term t appears in document d
- Relevance of document d to term t: TF(d,t) = log(1 + n(d,t)/n(d))
- The log factor avoids excessive preference for frequent terms.
- Relevance of document d to query Q: r(d, Q) = Σ TF(d, t) * IDF(t), summed over all terms t in Q.
Motivation for TF-IDF Formula
- Document length affects the number of keyword occurrences.
- Ten occurrences of a term in a long document do not make it ten times as relevant as a short document with one occurrence.
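A minimal sketch of the TF(d, t) = log(1 + n(d,t)/n(d)) formula shows how document length tempers raw counts. It assumes a document is a list of words and that "log" means the natural logarithm (the notes do not state a base); the function name `tf` is hypothetical.

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), using the natural log (an assumption)."""
    n_d = len(doc_terms)            # n(d): number of terms in the document
    if n_d == 0:
        return 0.0
    n_dt = doc_terms.count(term)    # n(d, t): occurrences of the term
    return math.log(1 + n_dt / n_d)

# A short document with one occurrence versus a long one with ten.
short_doc = ["quantum", "computing"]
long_doc = ["quantum"] * 10 + ["filler"] * 90
```

Here `tf(short_doc, "quantum")` is larger than `tf(long_doc, "quantum")`, even though the long document contains the term ten times.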
Relevance Ranking Using Terms (Cont.)
- Some systems give more weight to words that appear in titles, author lists, or headings.
- Words whose first occurrence is late in a document may be given less weight.
Relevance Ranking Using Terms (Cont. 2)
- The formulas for relevance (TF, IDF) can be extended.
- Term frequency (TF) refers to the relevance of a document to a term, regardless of the specific formula used.
- Documents are usually returned in decreasing order of their relevance score.
- Typically only a few top-ranked documents are shown, not all results.
Inverse Document Frequency
- Relevance for a query with multiple keywords is calculated by combining the relevancy of each keyword for a document.
- A simple way to combine keyword scores is to add them.
- Not all terms are equally useful as keywords: rarer terms carry more weight.
Inverse Document Frequency (Why rare terms matter more)
- Imagine a query with "computing" (common) and "quantum" (less common).
- A document containing "quantum" but not "computing" should rank higher than one containing "computing" but not "quantum".
Inverse Document Frequency (Calculating IDF)
- IDF(t) = 1/n(t), where n(t) is the number of documents containing term t.
- Relevance of document d to a set of terms Q: r(d, Q) = Σ TF(d, t) * IDF(t), summed over all terms t in Q.
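The relevance formula can be sketched in a few lines. This example assumes documents are lists of words, uses the natural logarithm for TF, and treats IDF(t) as 0 when no document contains t; the collection and the names `tf`, `idf`, and `relevance` are hypothetical.

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d))."""
    if not doc_terms:
        return 0.0
    return math.log(1 + doc_terms.count(term) / len(doc_terms))

def idf(docs, term):
    """IDF(t) = 1 / n(t), where n(t) counts the documents containing t."""
    n_t = sum(1 for terms in docs.values() if term in terms)
    return 1 / n_t if n_t else 0.0

def relevance(docs, doc_id, query):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t)."""
    return sum(tf(docs[doc_id], t) * idf(docs, t) for t in query)

# "computing" is common in this toy collection, "quantum" is rare.
docs = {
    "a": ["quantum", "computing", "basics"],
    "b": ["computing", "hardware", "guide"],
    "c": ["computing", "history", "notes"],
}
query = ["quantum", "computing"]
ranked = sorted(docs, key=lambda d: relevance(docs, d, query), reverse=True)
```

Document "a" comes first in `ranked` because it is the only one containing the rare term.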
Inverse Document Frequency (Refinement)
- Users may specify weights for query terms.
- Weights are applied by multiplying TF(d, t) by w(t) in the relevance formula, which lets users fine-tune the results.
- Terms with higher weights count more toward the score, so the ranking reflects what the user cares about most.
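A possible sketch of the weighted variant, r(d, Q) = Σ w(t) * TF(d, t) * IDF(t), is shown here. The function names and the way document counts n(t) are passed in as a dictionary are assumptions made for brevity.

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d))."""
    if not doc_terms:
        return 0.0
    return math.log(1 + doc_terms.count(term) / len(doc_terms))

def weighted_relevance(doc_terms, n_t, weights):
    """Sum of w(t) * TF(d, t) * IDF(t) over the query terms, with IDF(t) = 1 / n(t).

    n_t:     term -> number of documents containing the term (hypothetical input)
    weights: query term -> user-supplied weight w(t)
    Terms with no known document count are skipped.
    """
    return sum(
        w * tf(doc_terms, t) / n_t[t]
        for t, w in weights.items()
        if n_t.get(t)
    )

doc = ["quantum", "computing"]
n_t = {"quantum": 1, "computing": 4}
equal = weighted_relevance(doc, n_t, {"quantum": 1.0, "computing": 1.0})
boosted = weighted_relevance(doc, n_t, {"quantum": 1.0, "computing": 4.0})
```

Raising the weight of "computing" increases its contribution, so `boosted` exceeds `equal`.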
Stop Words
- Common words like "and," "or," "a," etc. have extremely low inverse document frequency, making them irrelevant for queries.
- Stop words (common words) are not considered when indexing documents.
- Stop words are discarded if present in user-supplied keywords.
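Discarding stop words from a user's keywords can be sketched as a simple filter. The stop-word list below is a small illustrative sample, not a standard list, and `remove_stop_words` is a hypothetical name.

```python
# Illustrative sample only; real systems use much longer lists.
STOP_WORDS = {"and", "or", "a", "the", "of", "in"}

def remove_stop_words(keywords):
    """Return the keywords with stop words dropped, preserving order."""
    return [k for k in keywords if k.lower() not in STOP_WORDS]
```

For example, `remove_stop_words(["the", "history", "of", "computing"])` keeps only "history" and "computing".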
Proximity
- Proximity of keywords in a document to each other affects ranking.
- Documents with closely positioned keywords get more priority.
- The ranking formula can accommodate proximity.
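The notes do not say how proximity enters the ranking formula, so the following is only one hypothetical way to do it: find the smallest distance between two keywords in a document and turn it into a boost of 1 / (1 + distance). Both `min_distance` and `proximity_boost` are invented for this sketch.

```python
def min_distance(doc_terms, term_a, term_b):
    """Smallest gap in word positions between term_a and term_b, or None if either is absent."""
    best = None
    last_a = last_b = None
    for pos, word in enumerate(doc_terms):
        if word == term_a:
            last_a = pos
        if word == term_b:
            last_b = pos
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

def proximity_boost(doc_terms, term_a, term_b):
    """Hypothetical boost: closer keywords give a value nearer to 1."""
    gap = min_distance(doc_terms, term_a, term_b)
    return 0.0 if gap is None else 1 / (1 + gap)

near = "quantum computing is fun".split()
far = "quantum physics is unrelated to cloud computing".split()
```

With this choice, `proximity_boost(near, "quantum", "computing")` is larger than the same call on `far`, so the document with adjacent keywords would be favored.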
Similarity Based Retrieval
- Some systems use similarity to retrieve documents similar to a given document A.
- Similarity is often defined based on common terms.
Similarity Based Retrieval (Finding similar terms)
- An approach is to identify terms in document A with the highest TF(A,t) * IDF(t) scores.
- These terms become the query to find similar documents
- Query terms are weighted by TF(A,t) * IDF(t)
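Picking the highest-scoring terms of document A can be sketched as follows. The example assumes documents are lists of words, that the collection includes A itself, and that ties are broken alphabetically; `top_terms` is a hypothetical helper.

```python
import math

def top_terms(doc_a, docs, k):
    """Return the k terms of doc_a with the highest TF(A, t) * IDF(t), with their scores."""
    def score(term):
        tf = math.log(1 + doc_a.count(term) / len(doc_a))
        n_t = sum(1 for d in docs if term in d)   # documents containing the term
        return tf / n_t if n_t else 0.0
    scored = {t: score(t) for t in set(doc_a)}
    # Highest score first; alphabetical order breaks ties (an arbitrary choice).
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

doc_a = ["quantum", "quantum", "computing", "intro"]
docs = [doc_a, ["computing", "intro"], ["computing", "hardware"]]
query_terms = top_terms(doc_a, docs, 2)
```

The returned pairs can serve directly as weighted query terms: "quantum" scores highest because it is frequent in A and rare elsewhere.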
Similarity Based Retrieval (Vector Space Model)
- An n-dimensional space is defined, where n is the number of distinct words in the document set.
- Each document is represented as a vector in this space.
- The coordinate of document d's vector for term t is TF(d, t) * IDF(t).
- The cosine of the angle between the vectors of two documents d and e measures their similarity.
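The vector space model can be sketched with sparse vectors stored as dictionaries, so only terms that actually occur in a document are kept. The names `tfidf_vector` and `cosine` and the toy collection are assumptions made for this example.

```python
import math

def tfidf_vector(doc_terms, docs):
    """Sparse vector for a document: term -> TF(d, t) * IDF(t)."""
    vec = {}
    for t in set(doc_terms):
        n_t = sum(1 for d in docs if t in d)   # documents containing t
        if n_t:
            vec[t] = math.log(1 + doc_terms.count(t) / len(doc_terms)) / n_t
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["quantum", "computing"],
    ["quantum", "physics"],
    ["stock", "market"],
]
v0, v1, v2 = (tfidf_vector(d, docs) for d in docs)
```

Documents that share terms, such as the first two here, get a positive cosine, while documents with no terms in common get 0.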
Relevance Feedback
- If many documents are similar to the query, the system may present only the few most relevant ones to the user.
- The user can mark some of the displayed documents as relevant, and the system reruns the search taking this feedback into account.
- This allows the user to find documents better aligned with their original search intent.
Similarity Based Retrieval (Relevance Feedback used in Document Search)
- Relevance feedback helps when the original query matches too many documents for the user to review.
- Users can tag documents as "relevant", and the system automatically refines the search to find more documents similar to them.
Clustering
- Search systems often cluster similar documents to provide a representative subset when many documents match a query.
- With clustering based on cosine similarity, the results can show documents from different clusters, giving the user a varied set of answers.
Mirroring
- Multiple copies of a document may exist on the web, for example on mirror sites.
- Search systems should identify and return only one copy of the duplicate document.
Hyperlinks
- Early search engines used only TF-IDF for ranking.
- However, TF-IDF ignores an important source of information on web pages: hyperlinks.
- Pages with many incoming links are often ranked higher than pages with fewer incoming links, since heavily linked pages tend to be popular.
Popularity Ranks
- Popularity ranking considers page popularity when ranking documents; pages linked from many other pages (popular pages) are ranked higher.
- Example: the "google.com" page is linked to from many other pages, so it is usually ranked highly in searches for "google".
Combined Measure
- Traditional ranking methods (TF-IDF) and popularity measures can be combined to provide a more comprehensive measure of a page's relevance to a given query.
- Results are returned by sorting based on the combined relevance score; higher scores appear earlier.
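The notes do not specify how the two measures are combined. One hypothetical option is a weighted sum, sketched below; the mixing parameter `alpha`, the function names, and the assumption that both scores are already on comparable scales are all illustrative choices.

```python
def combined_score(tfidf_score, popularity, alpha=0.5):
    """Hypothetical blend: alpha weights TF-IDF, (1 - alpha) weights popularity."""
    return alpha * tfidf_score + (1 - alpha) * popularity

def rank(pages, alpha=0.5):
    """pages: page id -> (tfidf_score, popularity). Returns ids by decreasing combined score."""
    return sorted(
        pages,
        key=lambda p: combined_score(pages[p][0], pages[p][1], alpha),
        reverse=True,
    )

# Page "x" matches the query text better, page "y" is more popular.
pages = {"x": (0.9, 0.1), "y": (0.5, 0.8)}
```

With `alpha=1.0` only TF-IDF counts and "x" ranks first; with `alpha=0.5` the popularity of "y" moves it ahead.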
Popularity Ranking (Defining Popularity)
- Determining popularity is difficult.
- Counting page accesses is one possible measure of popularity, but it has limitations.
- Access frequencies are hard to obtain from every website.
- Websites may falsely report their access frequency to gain a ranking advantage.
Popularity Ranking (Crawling)
- Web crawling processes analyze hyperlinks to estimate popularity.
Popularity Ranking (Popularity of Sites, not just pages)
- A popular site's linked pages benefit from its popularity.
PageRank
- Google introduced PageRank to improve query results.
- PageRank is a measure of a page's popularity based on links from other popular pages.
- Because this definition is circular, PageRank values are computed iteratively until they change only minimally between iterations.
PageRank (Random Walk Model)
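- PageRank can be understood through a model of a random web surfer (a random walker).
- At each step, the walker either jumps to a randomly chosen page or follows a link from the current page.
- When following a link, the walker picks one of the page's links at random.
- The PageRank of a page is the probability that the walker is visiting that page at any given point in time.
- The resulting definition is evaluated iteratively.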
PageRank (Random Walk Model details)
- Pages pointed to by many other pages are more likely to be visited by the random walker, so they have a higher PageRank.
- Pages linked to from pages with high PageRank also tend to have high PageRank.
PageRank (Mathematical)
- PageRank is defined by a set of linear equations that can be solved using matrix manipulation techniques.
- Each page's PageRank depends on the PageRank of the pages that link to it.
PageRank (Solving Equations)
- The set of equations that defines PageRank is solved using iterative techniques.
- Each PageRank value is initially set to 1/N, where N is the total number of pages.
- The calculation is repeated until the values change only minimally.
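The iterative procedure can be sketched as follows. The notes do not give the equations, so this example assumes the common random-walk formulation P[j] = δ/N + (1 - δ) * Σ T[i, j] * P[i], where δ is the probability of a random jump and T[i, j] = 1/N_i for each link from page i (N_i being the number of links out of page i). The value δ = 0.15, the handling of pages with no outgoing links, and the function name `pagerank` are further assumptions.

```python
def pagerank(links, delta=0.15, tol=1e-10, max_iter=1000):
    """links: page -> list of pages it links to (every page must appear as a key)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}           # initial value 1/N for every page
    for _ in range(max_iter):
        new = {p: delta / n for p in pages}    # random-jump contribution
        for i in pages:
            out = links[i]
            if out:
                # Follow a link: T[i, j] = 1 / N_i for each outgoing link.
                share = (1 - delta) * rank[i] / len(out)
                for j in out:
                    new[j] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = (1 - delta) * rank[i] / n
                for j in pages:
                    new[j] += share
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:                       # stop once values barely change
            break
    return rank

# Hypothetical four-page web: "c" has the most incoming links, "d" has none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph the values sum to 1, page "c" ends up with the highest PageRank, and page "d", which nothing links to, ends up with the lowest.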
Example
- Example values for page 1, 2, 3 and 4; probabilities of linking among the pages.
- Shows the iterative calculations for PageRank values of each page.
Description
Test your knowledge on information retrieval systems and the factors influencing document rankings. This quiz covers web mining, keyword associations, and document relevance measurements. Perfect for students studying information science or related fields.