Summary

This document presents a comprehensive overview of web mining, focusing on its role in facilitating efficient information retrieval from the vast expanse of the internet. It details techniques such as keyword search, TF-IDF ranking, and the PageRank algorithm, offering detailed examples and insights.

Full Transcript

The Web provides a convenient way to get to, and to interact with, information sources across the Internet. However, a persistent problem facing the Web is the explosion of stored information, with little guidance to help the user locate what is interesting. Web mining has played a critical role in making the Web a productive and useful tool, especially for researchers. A user may want to retrieve a particular document or a particular class of documents. The intended documents are typically described by a set of keywords: for example, the keywords “database system” may be used to locate books on database systems, and the keywords “stock” and “scandal” may be used to locate articles about stock-market scandals. Documents have a set of keywords associated with them, and documents whose keywords contain those supplied by the user are retrieved. Keyword-based document classification can be used not only for retrieving textual data, but also for retrieving other types of data, such as video and audio data, that have descriptive keywords associated with them. For instance, a video movie may have associated with it keywords such as its title, director, actors, and genre, while an image or video clip may have tags (keywords describing the image or video clip) associated with it.
- Document retrieval locates relevant documents on the basis of user input such as keywords or example documents (e.g., find documents containing the words “database systems”).
  - Can be used even on textual descriptions provided with non-textual data such as images.
  - Web search engines are the most familiar example of such systems.
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not.
  - Ands are implicit, even if not explicitly specified.
- Ranking of documents on the basis of estimated relevance to a query is critical.
- Relevance ranking is based on factors such as:
  - Term frequency: the frequency of occurrence of a query keyword in a document.
  - Inverse document frequency: how many documents the query keyword occurs in; the fewer the documents, the more importance given to the keyword.
  - Hyperlinks to documents: the more links to a document, the more important the document.
- TF-IDF (term frequency / inverse document frequency) ranking:
  - Let n(d) be the number of terms in document d, and n(d, t) the number of occurrences of term t in document d.
  - Relevance of document d to term t: TF(d, t) = log(1 + n(d, t) / n(d)). The log factor avoids giving excessive weight to frequent terms.
  - Relevance of document d to query Q: r(d, Q) = sum over t in Q of TF(d, t).
  - The number of occurrences is divided by the length of the document because a document containing 10 occurrences of a term may not be 10 times as relevant as a document containing one occurrence.
- Most systems add refinements to the above model:
  - Words that occur in the title, author list, section headings, etc. are given greater importance.
  - Words whose first occurrence is late in the document are given lower importance.
- These notions can be formalized by extensions of the formula shown above for TF(d, t).
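The TF(d, t) formula above can be sketched directly in code. This is a minimal illustration; the function name and the sample sentence are hypothetical, not from the transcript:

```python
import math

def tf(document_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), per the formula above.

    n(d) is the number of terms in the document and n(d, t) the number
    of occurrences of the given term; the log factor damps the weight
    of very frequent terms.
    """
    n_d = len(document_terms)           # n(d): terms in the document
    n_dt = document_terms.count(term)   # n(d, t): occurrences of the term
    return math.log(1 + n_dt / n_d)

# A document with 6 terms, 2 of which are "database":
score = tf("the database system stores database tables".split(), "database")
# log(1 + 2/6) = log(4/3)
```

Note that a term absent from the document gets TF = log(1 + 0) = 0, so it contributes nothing to the relevance sum.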
The relevance of a document to a term is still referred to as term frequency (TF), regardless of the exact formula used. Documents are returned in decreasing order of their relevance score; usually only the top few documents are returned, not all.

The relevance of a document to a query with two or more keywords is estimated by combining the relevance measures of the document to each keyword. A simple way of combining the measures is to add them up (as previously discussed). However, not all terms used as keywords are equal. Suppose a query uses two terms, one of which occurs frequently, such as “computing”, and another that is less frequent, such as “quantum”. A document containing “quantum” but not “computing” should be ranked higher than a document containing the term “computing” but not “quantum”.

To fix the above problem, weights are assigned to terms using the inverse document frequency (IDF), defined as IDF(t) = 1 / n(t), where n(t) denotes the number of documents (among those indexed by the system) that contain the term t. The relevance of a document d to a set of terms Q is then defined as r(d, Q) = sum over t in Q of TF(d, t) * IDF(t). This measure can be further refined if the user is permitted to specify weights w(t) for terms in the query, in which case the user-specified weights are also taken into account by multiplying TF(t) by w(t) in the above formula. The above approach of using term frequency and inverse document frequency as a measure of the relevance of a document is called the TF-IDF approach.

Almost all text documents (in English) contain words such as “and,” “or,” “a,” and so on. Hence, these words are useless for querying purposes, since their inverse document frequency is extremely low. Information-retrieval systems define a set of words, called stop words, containing 100 or so of the most common words, and ignore these words when indexing a document. Such words are not used as keywords, and are discarded if present in the keywords supplied by the user.
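The full TF-IDF ranking described above can be sketched end to end. The mini-corpus, the stop-word set, and all names (`docs`, `STOP_WORDS`, `relevance`) are illustrative assumptions, not from the transcript:

```python
import math

# Hypothetical mini-corpus; stop words are dropped before indexing.
STOP_WORDS = {"a", "an", "and", "or", "the", "on", "of"}

docs = {
    "d1": "quantum computing on a quantum device",
    "d2": "cloud computing and distributed computing",
    "d3": "computing the history of computing",
}

def terms(text):
    # Stop words are discarded, as described above.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf(doc_terms, t):
    # TF(d, t) = log(1 + n(d, t) / n(d))
    return math.log(1 + doc_terms.count(t) / len(doc_terms))

def idf(t):
    # IDF(t) = 1 / n(t), where n(t) = number of documents containing t
    n_t = sum(1 for text in docs.values() if t in terms(text))
    return 1 / n_t if n_t else 0.0

def relevance(doc_id, query):
    # r(d, Q) = sum over t in Q of TF(d, t) * IDF(t)
    doc_terms = terms(docs[doc_id])
    return sum(tf(doc_terms, t) * idf(t) for t in terms(query))

ranked = sorted(docs, key=lambda d: relevance(d, "quantum computing"),
                reverse=True)
```

For the query “quantum computing”, d1 ranks first: “quantum” occurs only in d1 (high IDF), while “computing” occurs in every document and so carries little weight.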
Another factor taken into account when a query contains multiple terms is the proximity of the terms in the document. If the terms occur close to each other in the document, the document is ranked higher than if they occur far apart. The formula for r(d, Q) can be modified to take the proximity of the terms into account.

Certain information-retrieval systems permit similarity-based retrieval. Here, the user can give the system a document A and ask the system to retrieve documents that are “similar” to A. The similarity of one document to another may be defined, for example, on the basis of common terms. One approach is to find the k terms in A with the highest values of TF(A, t) * IDF(t), and to use these k terms as a query to find the relevance of other documents. The terms in the query are themselves weighted by TF(A, t) * IDF(t).

Vector space model: define an n-dimensional space, where n is the number of words in the document set.
- The vector for document d goes from the origin to a point whose i-th coordinate is r(d, t_i) = TF(d, t_i) * IDF(t_i).
- The cosine of the angle between the vectors of two documents d and e is used as a measure of their similarity.

If the set of documents similar to a query document A is large, the system may present the user with a few of the similar documents, allow the user to choose the most relevant few, and start a new search based on similarity to A and to the chosen documents. The resultant set of documents is likely to be what the user intended to find. This idea is called relevance feedback. Relevance feedback can also be used to help users find relevant documents from a large set of documents matching the given query keywords. In such a situation, users may be allowed to identify one or a few of the returned documents as relevant; the system then uses the identified documents to find other similar ones, and the resultant set is again likely to be what the user intended to find.
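The cosine measure of the vector space model can be sketched as follows; the function name and the sample vectors are hypothetical:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors whose i-th
    coordinate would be r(d, t_i) = TF(d, t_i) * IDF(t_i)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Vectors pointing the same way are maximally similar (cosine = 1);
# vectors with no terms in common are orthogonal (cosine = 0).
same = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
disjoint = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the cosine depends only on the angle between the vectors, it is insensitive to document length: a document and a double-length copy of it get similarity 1.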
An alternative to the relevance-feedback approach is to require users to modify the query by adding more keywords. Relevance feedback can be easier to use, in addition to giving a better final set of documents as the answer. In order to show the user a representative set of documents when the number of documents is very large, a search system may cluster the documents based on their cosine similarity. A few documents from each cluster may then be shown, so that more than one cluster is represented in the set of answers.

As a special case of similarity, there are often multiple copies of a document on the Web; this could happen, for example, if a Web site mirrors the contents of another Web site. In this case, it makes no sense to return multiple copies of a highly ranked document as separate answers; duplicates should be detected, and only one copy should be returned as an answer.

Early Web search engines ranked documents by using only TF-IDF based relevance measures. However, these techniques had some limitations when used on very large collections of documents, such as the set of all Web pages. In particular, many Web pages have all the keywords specified in a typical search-engine query. Further, some of the pages that users want as answers often have just a few occurrences of the query terms, and would not get a very high TF-IDF score. However, researchers soon realized that Web pages have very important information that plain text documents do not have, namely hyperlinks. These can be exploited to get better relevance ranking; in particular, the relevance ranking of a page is influenced greatly by the hyperlinks that point to the page. The basic idea of popularity ranking (also called prestige ranking) is to find pages that are popular, and to rank them higher than other pages that contain the specified keywords.
Since most searches are intended to find information from popular pages, ranking such pages higher is generally a good idea. For instance, the term “google” may occur in vast numbers of pages, but the page google.com is the most popular among the pages that contain the term “google.” The page google.com should therefore be ranked as the most relevant answer to a query consisting of the term “google”.

Traditional measures of the relevance of a page, such as the TF-IDF based measures, can be combined with the popularity of the page to get an overall measure of the relevance of the page to the query. Pages with the highest overall relevance value are returned as the top answers to a query.

But how do we define and find the popularity of a page? One way would be to count how many times a page is accessed and use that number as a measure of the site’s popularity. However, getting such information is impossible without the cooperation of the site, and while a few sites may be persuaded to reveal this information, it is difficult to get it for all sites. Further, sites may lie about their access frequency in order to get ranked higher.

Many Web sites have links to other related sites, which can also be used to infer the popularity of the linked sites. A Web search engine can fetch Web pages (by a process called crawling) and analyze them to find links between the pages. A first solution to estimating the popularity of a page is to use the number of pages that link to the page as a measure of its popularity. However, this by itself has the drawback that many sites have a number of useful pages, yet external links often point only to the root page of the site. The root page in turn has links to other pages in the site. These other pages would then be wrongly inferred to be not very popular, and would have a low ranking in answering queries. One alternative is to associate popularity with sites, rather than with pages.
All pages at a site then get the popularity of the site, and pages other than the root page of a popular site would also benefit from the site’s popularity. However, the question of what constitutes a site then arises. In general, the Internet address prefix of a page’s URL would constitute the site corresponding to the page. However, there are many sites that host a large number of mostly unrelated pages, such as home-page servers in universities and Web portals such as groups.yahoo.com or groups.google.com. For such sites, the popularity of one part of the site does not imply the popularity of another part.

A simpler alternative is to allow the transfer of prestige from popular pages to the pages they link to. Under this scheme, in contrast to the one-person one-vote principle of democracy, a link from a popular page x to a page y is treated as conferring more prestige to page y than a link from a not-so-popular page z. This notion of popularity is in fact circular, since the popularity of a page is defined by the popularity of other pages, and there may be cycles of links between pages. However, the popularity of pages can be defined by a system of simultaneous linear equations, which can be solved by matrix manipulation techniques. The linear equations can be defined in such a way that they have a unique and well-defined solution.

It is interesting to note that the basic idea underlying popularity ranking is actually quite old, and first appeared in a theory of social networking developed by sociologists in the 1950s. In the social-networking context, the goal was to define the prestige of people. For example, the president of the United States has high prestige since a large number of people know him. If someone is known by multiple prestigious people, then she also has high prestige, even if she is not known by as large a number of people.
The use of a set of linear equations to define the popularity measure also dates back to this work.

The Web search engine Google introduced PageRank, which is a measure of the popularity of a page based on the popularity of the pages that link to it. Using the PageRank popularity measure to rank answers to a query gave results so much better than previously used ranking techniques that Google became the most widely used search engine in a rather short period of time.

PageRank can be understood intuitively using a random-walk model. Suppose a person browsing the Web performs a random walk (traversal) on Web pages as follows: the first step starts at a random Web page, and in each step, the random walker does one of the following. With probability δ, the walker jumps to a randomly chosen Web page, and with probability 1 − δ, the walker randomly chooses one of the outlinks from the current Web page and follows it. The PageRank of a page is then the probability that the random walker is visiting the page at any given point in time. Note that pages that are pointed to from many Web pages are more likely to be visited, and thus will have a higher PageRank. Similarly, pages pointed to by Web pages with a high PageRank will also have a higher probability of being visited, and thus will have a higher PageRank.

PageRank can be defined by a set of linear equations, as follows: First, Web pages are given integer identifiers. The jump probability matrix T is defined with T[i, j] set to the probability that a random walker who is following a link out of page i follows the link to page j. Assuming that each link from page i has an equal probability of being followed, T[i, j] = 1/N_i, where N_i is the number of links out of page i. Most entries of T are 0, so T is best represented as an adjacency list.
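The random-walk model described above can also be approximated by direct simulation: count how often the walker visits each page. A sketch, assuming equal-probability outlinks as above; the function name and the tiny example graph are hypothetical:

```python
import random

def simulate_pagerank(links, delta=0.2, steps=200_000, seed=42):
    """Monte-Carlo sketch of the random-walk model described above.

    With probability delta the walker jumps to a random page; otherwise
    it follows a randomly chosen outlink of the current page (each
    outlink equally likely). Visit frequencies approximate PageRank.
    """
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < delta or not links[page]:
            page = rng.choice(pages)        # random jump
        else:
            page = rng.choice(links[page])  # follow an outlink
        visits[page] += 1
    return {p: v / steps for p, v in visits.items()}

# Pages "a" and "b" both link to "c", so "c" accumulates the most visits.
rank = simulate_pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The estimates converge to the solution of the linear equations given below as the number of steps grows, which is why the equation-based formulation is preferred in practice.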
Then the PageRank P[j] for each page j can be defined as:

P[j] = δ/N + (1 − δ) * Σ_i (T[i, j] * P[i])

where δ is a constant representing the probability of a step being a random jump, and N is the number of pages.

The set of equations generated as above is usually solved by an iterative technique, starting with each P[i] set to 1/N. Each step of the iteration computes new values for each P[i] using the P values from the previous iteration. The process stops when the maximum change in any P[i] value in an iteration goes below some cutoff value.

Example with four pages:
- Page 1: 60% to page 2, 30% to page 3, 10% to page 4
- Page 2: 50% to page 1, 50% to page 3
- Page 3: 33.33% to each other page
- Page 4: 90% to page 1, 5% each to pages 2 and 3

With δ = 0.2:

P[1] = 0.2/4 + 0.8 * (P[2] * 0.5 + P[3] * 1/3 + P[4] * 0.9)
P[2] = 0.2/4 + 0.8 * (P[1] * 0.6 + P[3] * 1/3 + P[4] * 0.05)
P[3] = 0.2/4 + 0.8 * (P[1] * 0.3 + P[2] * 0.5 + P[4] * 0.05)
P[4] = 0.2/4 + 0.8 * (P[1] * 0.1 + P[3] * 1/3)

Successive iterations converge as follows:

Iteration   P[1]    P[2]    P[3]    P[4]
0           0.25    0.25    0.25    0.25
1           0.397   0.247   0.220   0.137
2           0.306   0.305   0.249   0.140
3           0.339   0.269   0.251   0.141
4           0.326   0.285   0.245   0.144
5           0.333   0.277   0.248   0.141
6           0.329   0.282   0.247   0.143
7           0.331   0.279   0.247   0.142
8           0.330   0.281   0.247   0.142
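The iterative solution for the four-page example above can be sketched as follows, with the link probabilities copied from the transcript; the variable names and convergence cutoff are illustrative:

```python
# Iterative solution of the PageRank equations for the four-page example.
delta = 0.2   # probability that a step is a random jump
N = 4         # number of pages
# T[i][j]: probability that a walker leaving page i+1 moves to page j+1
T = [
    [0.0, 0.6, 0.3, 0.1],    # page 1
    [0.5, 0.0, 0.5, 0.0],    # page 2
    [1/3, 1/3, 0.0, 1/3],    # page 3
    [0.9, 0.05, 0.05, 0.0],  # page 4
]

P = [1 / N] * N  # start from the uniform distribution, P[i] = 1/N
for _ in range(200):
    # P[j] = delta/N + (1 - delta) * sum_i T[i][j] * P[i]
    new_P = [delta / N + (1 - delta) * sum(P[i] * T[i][j] for i in range(N))
             for j in range(N)]
    if max(abs(a - b) for a, b in zip(P, new_P)) < 1e-9:  # convergence cutoff
        P = new_P
        break
    P = new_P

# The converged values match the final rows of the iteration table above.
```

Because each row of T sums to 1, the PageRank values always sum to 1, matching their interpretation as a probability distribution over pages.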
