Search Systems PDF - Computing & Information Sciences
University of Strathclyde
W. H. Bell
Summary
These lecture notes cover various aspects of search systems, including expectations, full-text search, categories, recommendations, and indexing. Topics such as web page structure, downloading HTML files, and searching text files are also discussed. The document explores different approaches to information retrieval and various search tools and techniques.
Full Transcript
Search systems
Computing & Information Sciences
W. H. Bell

Expectations
[Slide mock-up: a search box ("Some search text...") with a Search button.]
- Search documents, using text content.
- Search citations or links to other documents.
- Search for documents that are similar:
  - similar topic, using tags;
  - other users decided that they are similar.
- Real-time or fast search.

Full-text search
- Costly processing and data input/output load, yet the results of the search are needed quickly.
- Build indices, rather than searching the same files many times.
- Query the indices and return search results.
- Use background processes to build indices:
  - robots on the web crawl content to build up indices;
  - the operating system indexes files, using a background process.

Citations and links
- Search documents for citations and links.
- Build up the importance of certain documents.
- Can use the number of citations or the direction of the citations:
  - one web page might be referred to by many others.
- Crawl the documents using a background process.

Categories
- Users may specify categories.
- Generate categories:
  - full-text search: types of phrases or subject area;
  - pictures in a document: colour shades.
- Crawl documents, images and other data.

Recommendations
- Other users have searched for similar documents.
- The user has searched for similar documents.
- Use correlation information.
- See the "Recommender Systems" lecture.

Indexing
- Build a database that contains the indices.
- Distributed database, containing lots of data.
- Use a NoSQL solution, e.g. OpenSearch.
- A web browser user interface queries the NoSQL database.

Full-text search

Web page structure
- The web server returns page content to the browser.
- Web pages comprise HTML, CSS and JavaScript.
- The web browser converts JavaScript into HTML, then renders the HTML and CSS:
  - displaying layout, text, images and active content.
- Need to search the text after JavaScript evaluation:
  - run a headless browser to evaluate the JavaScript;
  - extract text with a parser, such as BeautifulSoup.
  - https://www.crummy.com/software/BeautifulSoup/

Download HTML files
- https://en.wikipedia.org/wiki/Programming_language
- https://en.wikipedia.org/wiki/Relational_database
- https://en.wikipedia.org/wiki/Recommender_system
- Download three HTML files. These are simple pages, so JavaScript is ignored in this example.
- Convert these pages to text: .html becomes .txt.

Searching text files
Find "software" in the text files:

  Programming_language.txt:software. There are, broadly, two approaches to programming language
  Programming_language.txt:than those that are interpreted in software.[better source needed]
  Programming_language.txt:computing field. Individual software projects commonly use five programming
  Programming_language.txt: * Software_engineering and List_of_software_engineering_topics
  Programming_language.txt: development and maintenance of such software systems.
  Relational_database.txt:data, as proposed by E._F._Codd in 1970. A software system used to maintain...

Counting words
Count "software" in the text files, requiring the whole word "software" rather than matching it inside other text:

  Relational_database.txt:7
  Programming_language.txt:4
  Recommender_system.txt:0

Scaling up
Searching takes time:

  time bash -c 'grep -w -c "software" *.txt | sort -k 2 -t : -nr'
  real  0m0.007s
  user  0m0.005s
  sys   0m0.004s

These files are tiny, comprising 3106 lines. There are billions of web pages:
https://www.worldwidewebsize.com/

Conclusions
- Can build text search from standard tools.
- Search time is a function of:
  - the search logic;
  - the file size;
  - the number of files.
- There are often too many files to use full-text search without indexing.
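The download-and-convert step above can be sketched in a few lines of Python. This is a minimal sketch, assuming the requests and beautifulsoup4 packages are installed; JavaScript is ignored, as it is for these simple pages, and the output filenames simply reuse the page names.

  import requests
  from bs4 import BeautifulSoup

  urls = [
      "https://en.wikipedia.org/wiki/Programming_language",
      "https://en.wikipedia.org/wiki/Relational_database",
      "https://en.wikipedia.org/wiki/Recommender_system",
  ]

  for url in urls:
      html = requests.get(url, timeout=30).text
      # Parse the HTML and keep only the text content (no JavaScript evaluation).
      text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
      name = url.rsplit("/", 1)[-1] + ".txt"  # e.g. Programming_language.txt
      with open(name, "w", encoding="utf-8") as f:
          f.write(text)
      # Whole-word count of "software", similar to grep -w -c.
      words = [w.strip('.,()[]"').lower() for w in text.split()]
      print(f"{name}:{words.count('software')}")

Every query would repeat this scan over every file, which is why the next section turns to building an index instead.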
Selecting tokens

Process overview
- Start with files.
- Convert the files to text.
- Split the text into words.
- Create tokens.
- Build the index.

Input data
- Text versions of three web pages:
  - https://en.wikipedia.org/wiki/Programming_language
  - https://en.wikipedia.org/wiki/Relational_database
  - https://en.wikipedia.org/wiki/Recommender_system
- Downloaded and extracted text files, saved as .txt files.

Tokenisation
- Need tokens to build the index.
- Split the input text into tokens:
  - split text strings using white space;
  - remove punctuation, to avoid duplicate tokens;
  - convert to lower case, to avoid duplicate tokens;
  - optional special rules for nouns, to avoid removing nouns.

Splitting text into tokens
Split the text and find the word frequency:

  the:747  of:556   a:470   and:425  to:315  in:299  is:225
  that:166 on:163   are:163 as:151   for:149 or:124  from:118

- Common words are not useful for search and need to be removed.
- These are referred to as "stop words".
- Many stop word lists exist, e.g. https://www.ranks.nl/stopwords

[Figure: word frequency follows a Zipf distribution. Source: Saif, Hassan, et al., "On stopwords, filtering and data sparsity for sentiment analysis of Twitter" (2014): 810-817.]

Stop word removal
After stop word removal, using the default stop words from https://www.ranks.nl/stopwords:

  languages:178  programming:161  language:139  systems:98  recommender:98
  data:72        system:67        may:66        can:62      relational:59
  user:57        used:56          archived:54   isbn:48

- The remaining words are more useful; other words could still be removed.
- Singular and plural forms still remain.

Stemming
- Nouns may be singular or plural: program, programs.
- Verbs may be in different forms: e.g. send, sending, sent.
- Reduce tokens to the stem word.

Stemming approaches
- Dictionary stemming:
  - more accurate, but requires more data and processing time.
- Algorithm stemming:
  - several algorithms exist, each following rules;
  - the stem result may not always be correct;
  - algorithms are often preferred over dictionaries, due to better performance.

Porter stemming algorithm
- Developed by Martin Porter in 1980.
- Implements a series of rules to remove suffixes; words are passed through the algorithm.
- Words are converted to singular form:
  - "sses" -> "ss" (example: caresses -> caress)
  - "ies"  -> "i"  (example: ponies -> poni)
  - "ss"   -> "ss" (example: caress -> caress)
  - "s"    -> ""   (example: cats -> cat)
- https://tartarus.org/martin/PorterStemmer/

Porter algorithm
- Can make mistakes:
  - missing connections: "europe" vs "european";
  - false connections: "police" vs "policy".
- Does not handle proper nouns correctly, e.g. thomas -> thoma.
- Still popular, and the basis for non-English stemmers.

Converting queries into tokens
Queries are treated in the same manner.

  Query: "I want information on the semiotic importance of Daffy Duck and
          Daffy's role in the political hagiography of Elmer Fudd"
  After stop word removal: "information semiotic importance daffy duck daffy
          role political hagiography elmer fudd"
  After stemming: "inform semiot import daffy duck daffy role polit
          hagiograph elmer fudd"

Other indexing choices
- Can use phrases, rather than tokens:
  - input a list of known phrases;
  - collect phrases from many documents.
- Add synonyms, e.g. shut and close:
  - care is needed, since words can be ambiguous.
- Weight terms according to their importance.
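The tokenisation pipeline above can be sketched in Python with NLTK's Porter stemmer. The stop-word list here is a small illustrative subset of a full list such as the one at ranks.nl, and the exact stems produced (e.g. "daffi" rather than "daffy") depend on the Porter implementation used.

  import re
  from nltk.stem import PorterStemmer  # pip install nltk

  # Illustrative subset of a stop-word list such as https://www.ranks.nl/stopwords
  STOP_WORDS = {"i", "want", "on", "the", "of", "a", "and", "to", "in", "is",
                "that", "are", "as", "for", "or", "from"}

  stemmer = PorterStemmer()

  def tokenise(text):
      # Split on white space, remove punctuation, convert to lower case.
      words = (re.sub(r"[^\w]", "", w).lower() for w in text.split())
      # Drop empty strings and stop words, then reduce each word to its stem.
      return [stemmer.stem(w) for w in words if w and w not in STOP_WORDS]

  query = ("I want information on the semiotic importance of Daffy Duck "
           "and Daffy's role in the political hagiography of Elmer Fudd")
  print(tokenise(query))
  # ['inform', 'semiot', 'import', 'daffi', 'duck', 'daffi', 'role',
  #  'polit', 'hagiographi', 'elmer', 'fudd']

Because the same function is applied to documents and to queries, a stemmed query term matches every form of the word that appeared in the indexed text.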
Weighting & frequency

Term weighting
- Terms are processed words.
- Early information retrieval systems used presence or absence:
  - 1 if the document contains the term;
  - 0 if the document does not contain the term.
- This approach saved space and was suitable for short documents or document titles.

Term weighting
- Advanced information retrieval systems use importance.
- Term weighting is based on:
  - importance in an individual document: term frequency (tf);
  - importance in the collection of documents: inverse document frequency (idf).

Term frequency
- Weight terms based on their importance in the document: higher frequency implies higher weight.
- Several possible equations and measurements exist; raw frequency is the simplest:

  tf(t, d) = n_t / T_d

  - n_t: number of times term t is found in document d.
  - T_d: total number of terms in document d, including duplicate terms.

- Using the tf(t, d) value affects query results: the query retrieves documents with high tf first.

Inverse document frequency
- How common is a term across the documents?
- A rarer term is a better discriminant.
- Calculate after stop word removal.
- Weight rare terms higher than frequent words.

[Figure: rarer words. Source: Saif, Hassan, et al., "On stopwords, filtering and data sparsity for sentiment analysis of Twitter" (2014): 810-817.]

Inverse document frequency

  idf(t) = log(N / n_t)

  - N: number of documents considered.
  - n_t: number of documents that contain term t.

- The logarithm gives a more linear response.
- Rarer terms have a higher idf(t).
- A term that is found in all documents has idf(t) = 0.
- If N = 1000 and n_t = 1, then idf(t) = 3.

TFIDF
Combine term frequency and inverse document frequency:

  tfidf(t, d) = tf(t, d) · idf(t)

  - tfidf(t, d): weight of term t within document d.
  - tf(t, d): term frequency of term t in document d.
  - idf(t): inverse document frequency of term t in the document collection.

This captures importance both to the document and to the collection of documents.
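These formulas translate directly into code. A minimal sketch with a made-up three-document collection of processed terms; base-10 logarithms are used so that the slide's check (N = 1000, n_t = 1 gives idf = 3) holds.

  import math

  # Toy collection: each document is a list of processed terms (stemmed tokens).
  docs = {
      "d1": ["relat", "databas", "softwar", "system", "softwar"],
      "d2": ["program", "languag", "softwar"],
      "d3": ["recommend", "system", "user"],
  }

  def tf(t, d):
      # Raw frequency: occurrences of term t divided by total terms in document d.
      return d.count(t) / len(d)

  def idf(t, docs):
      # log(N / n_t): N documents in total, n_t of them contain term t.
      n_t = sum(1 for d in docs.values() if t in d)
      return math.log10(len(docs) / n_t)

  def tfidf(t, d, docs):
      return tf(t, d) * idf(t, docs)

  print(tfidf("softwar", docs["d1"], docs))  # frequent in d1, but in 2 of 3 docs
  print(tfidf("user", docs["d3"], docs))     # rarer term, so a higher idf weight
  print(math.log10(1000 / 1))                # slide check: N=1000, n_t=1 -> 3.0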
Information retrieval models

Information retrieval models
- A model is a method to match a query to a document. Several models are used in information retrieval:
  - Boolean: combine search terms using Boolean logic.
  - Hidden Markov model (HMM): statistical model, assuming a Markov process.
  - Latent semantic analysis (LSA): relationship between documents and terms.
  - Vector-space model: use cosine similarity between vectors that represent the weights of terms.
  - Probabilistic relevance: probability that a document is relevant to a query.
  - Language model: a document is a good match to a query if the document is likely to generate the query.

Boolean retrieval
- The oldest model.
- Data description: presence of terms in documents. Does not use term weights.
- Retrieval is based on a query formula: Boolean logic to require more than one term.

Terms, formula and applications
- Terms are query words.
- The formula uses Boolean connectors: AND, OR, NOT. Similar to SQL.
- Used for very precise searching, e.g. legal, medical and intellectual property searches.

Example Boolean queries
- cat
  Documents containing the term "cat".
- cat OR dog
  Documents containing the term "cat" or containing the term "dog".
- cat AND dog
  Documents containing both the term "cat" and the term "dog".
- (cat AND dog) OR budgie
  Documents containing both "cat" and "dog", or containing "budgie".
- NOT ((cat AND dog AND budgie) OR (cat AND budgie) OR (cat AND dog) OR (cat))
  Documents that do not contain "cat": every inner clause requires "cat", so the whole inner expression reduces to "cat".

Boolean retrieval usage
- Try to narrow down the set of documents, using a logical formula.
- Typically, no term weighting.
- Documents are presented as a set.

Boolean retrieval problems
- Result set size:
  - queries with popular terms return large result sets, queries with rare terms return small ones;
  - using OR leads to a big result set, using AND leads to a small result set;
  - bad combinations lead to a huge result set or an empty one;
  - trial and error is needed to select a better result set.

Boolean retrieval problems
- Artificial language: unintuitive and often misunderstood.
  - Can be mitigated slightly by using a user interface to build the query logic.
- Unordered output:
  - may offer ordering by date, and the user interface may offer additional sorting options;
  - the user may have to read all returned documents.

Boolean retrieval advantages
- Exact queries with no ambiguity.
- Simple to program: Boolean algebra.
- Experts can be very specific with a search, giving very precise results.

Best match retrieval
- Best match (or partial-match) retrieval is the major alternative to Boolean retrieval; most retrieval models are best match.
- Ranked output: documents are ranked according to how closely they match the query.
- Results are easier to interpret: the first displayed document is the most likely to be relevant.

Advantages over Boolean retrieval
- No query syntax: natural language, and the query is easier to modify.
- Documents can exactly or partially match the query, so relevant documents are more likely to be found.
- Users cannot always generate good queries.

Best match retrieval example
[Figure: example of ranked best-match retrieval results.]

Vector-space model

Vector-space model
- Represents the terms in a document as a vector in a multidimensional space.
- One axis or dimension per term; one vector per document.
- Suggested by G. Salton in the 1960s, and continues to be an important model.

Building the vectors
Four documents and the term frequency for two terms (TF values chosen to simplify the illustration):

  Document   TF for apple   TF for pear   Vector
  D1         0.5            1             (0.5, 1)
  D2         1              0             (1, 0)
  D3         0              0.5           (0, 0.5)
  D4         0.25           0.25          (0.25, 0.25)

Building the vectors
- The vectors define positions in a two-dimensional space.
- The number of dimensions or axes comes from the number of terms.

[Figure: D1 = (0.5, 1), D2 = (1, 0), D3 = (0, 0.5) and D4 = (0.25, 0.25) plotted on "apple" and "pear" axes.]

Building the vectors
- The query is also expressed as a tf vector.
- Closer vectors imply similar documents.
- The best match to the query is a nearby document vector: retrieve the documents that are closest to the query.

[Figure: query vector Q plotted among the document vectors; D1 lies closest to Q.]

Effect of stopword removal
[Figure: before stop word removal, the document and query vectors bunch together along the "the" axis; after removal, they separate along the "cat" axis.]

Effect of stemming
[Figure: before stemming, "walking", "walker" and "walk" occupy orthogonal dimensions (four dimensions in total); after stemming they collapse onto a single "walk" axis.]

Frequency measures
[Figure: two panels of document vectors on "cat" and "dog" axes, comparing idf alone with tf or tf·idf; with one measure some vectors overlap, while the other gives better separation.]

Adding more terms
- Each term is an orthogonal axis.
- The position on each axis comes from tf, idf, or tf·idf.

[Figure: six documents plotted against "dog" and "frog" axes.]

Similarity measures
- Vectors can be compared in several ways.
- Normally use simple coordinate-level matching: a query is close to a document due to shared terms.
- The cosine coefficient provides a useful evaluation tool: compare the angle between the query and document vectors.

Simple matching (dot product)
Take the highest dot product between the query and document vectors. Compute the dot product for each document and the query vector:

  D · Q = Σ_{i=1..n} D_i Q_i

         Martin   Steven   David   Ian   Paul   Isla
  D1     0.8      0.1      0.3     0     0      0
  D2     0.2      0.2      0.7     0.8   0.8    0
  Q      0.4      0        0.8     0     0      0

  D1 · Q = 0.56
  D2 · Q = 0.64

D2 is a better match to the query.

Simple matching
- Returns documents that:
  - have more terms in common with the query;
  - share more higher-weight terms (idf);
  - contain more occurrences of query terms (tf).
- These are the documents most likely to be useful to the user.
- However, it is more likely to retrieve long documents: long documents contain more terms and more occurrences of terms.

Cosine matching (cosine similarity)

  cos θ = (D · Q) / (|D| |Q|)
        = Σ_{i=1..n} D_i Q_i / ( √(Σ_{i=1..n} D_i²) √(Σ_{i=1..n} Q_i²) )

- Takes the length of the document into account.
- A bigger angle implies a smaller value, e.g. cos(0°) = 1, cos(180°) = −1.

Cosine similarity
Using the same table of weights:

  cos θ(D1, Q) = 0.73
  cos θ(D2, Q) = 0.46

D1 is a better match to the query.

Simple and cosine compared

           D1     D2
  Simple   0.56   0.64
  Cosine   0.73   0.46

Matching algorithms yield different results.

Advantages
- Easy to understand, with a geometric interpretation.
- Works generally well with any similarity measure.
- Mostly better performance than Boolean.

Disadvantages
- No real theoretical basis; other best-match approaches (e.g. probabilistic) have one.
- No predictive power: experiments are needed.
- Less control in terms of selection.
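The two scores in the comparison table can be reproduced with a few lines of Python. The dot products come out at 0.56 and 0.64 as on the slides, and the cosine calculation confirms that normalising by document length reverses the ranking, putting D1 first (the exact D2 cosine computed from the table as printed is closer to 0.53, so the slide may have used slightly different weights).

  import math

  # Term weights from the table (terms: Martin, Steven, David, Ian, Paul, Isla).
  D1 = [0.8, 0.1, 0.3, 0.0, 0.0, 0.0]
  D2 = [0.2, 0.2, 0.7, 0.8, 0.8, 0.0]
  Q  = [0.4, 0.0, 0.8, 0.0, 0.0, 0.0]

  def dot(a, b):
      return sum(ai * bi for ai, bi in zip(a, b))

  def cosine(a, b):
      # Dot product normalised by the lengths of both vectors.
      return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  print(dot(D1, Q), dot(D2, Q))        # 0.56 and 0.64: simple matching prefers D2
  print(cosine(D1, Q), cosine(D2, Q))  # ~0.73 and ~0.53: cosine prefers D1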
Web searches

Web search
Similar to standard information retrieval:
- obtain data;
- index the data;
- obtain a query;
- match documents to the query;
- present the results to the user.

Data collection
- Standard information retrieval starts from a document collection; web information retrieval has to collect the data.
- No design, co-ordination or central control: distributed content creation and linking.
- Structured (databases) and semi-structured data.
- The scale is massive.
- Content can be dynamically generated:
  - web services may respond to commands;
  - data are created by web services querying databases.

Data collection
- Deploy a spider (crawler/robot) to build the corpus.
- It collects web pages recursively; for each known URL: fetch the page, index it, find new URLs.
- Additional pages come from direct submissions.
- Crawling is repeated every few days; popular pages are checked more often.

Adversarial information retrieval
- Content suppliers try to manipulate search engine results:
  - to promote their own pages and avoid paying for adverts;
  - to hide the real content of their own pages;
  - to cause users to look at their pages when searching for something else.
- Performed by companies, webmasters, consultants and malware; some legitimate and some not.
- Spamming is popular.

Static data
- Standard information retrieval has a known data set:
  - the data are reliable;
  - the language is known;
  - the format of the data is known;
  - the data are static.

Web data
- Web information retrieval has an unknown data set:
  - data in many formats (PDF, Word, HTML, encoded, ...);
  - data in many different languages;
  - data can be old, unreliable, false or manipulated;
  - the data are dynamic.

Web search engine requirements
- Include policies for indexing: which words are indexed, capitalisation, stemming, phrases.
- Support for document encoding schemes, complicated by the variety of data.
- Dynamic pages: decide when to delete search results. Cached results may exist after a page has been deleted.

Query and information provided
- Web users give very little information to the search engine:
  - the average query length is 2 or 3 terms, and most queries comprise 1 term;
  - poor comprehension of syntax.
- Early search engines provided rich syntax in the user interface; current search engines hide syntax and modify the user's query.
- Users have no training, unlike legal, medical or librarian users, so additional techniques are needed, such as real-time feedback to the user.

Matching web pages
- There is a wide range of web page quality, so relevance is not enough:
  - desire trustworthy or new information;
  - avoid duplicates;
  - select well-maintained pages;
  - readable pages: display correctly and load quickly;
  - avoid pop-ups and other annoyances;
  - security requirements: signed DNS record, SSL certificate.

Web page ranking
- Combine scores to retrieve a good page that matches the query.
- Query-specific scores, based on the match to the query.
- Non-query-specific scores: spam, porn, age of page, popularity.
- Search engines use hundreds of these scores for each page, building the scores when crawling pages.
- Popularity is a key feature of web retrieval, based on the linking of web pages: PageRank.

PageRank of web pages

Graph of web pages
- Pages refer to other pages; incoming links indicate relevance.
- Some subgraphs are isolated: no links lead to them.

[Figure: a directed graph of ten web pages, with some isolated subgraphs.]

PageRank – random walk
- Assume a user browses randomly, starting at a random page.
- All outward links are equally likely: a random walk.
- Calculate the probability of visiting each page: a more important page has a higher probability.

PageRank – damping factor
- Need to visit the isolated subgraphs, since users do not always follow links.
- Introduce a damping factor: the probability of following a link; otherwise, pick a page at random.
- E.g. 85% probability of going out on a random link (assuming each link is equally likely) and 15% probability of jumping to a random page other than the current one.

Dead ends
- The web is full of dead ends (pages with no outgoing links).
- Random browsing can get stuck at dead ends.
- At a dead end, jump to a random web page ("teleporting").

PageRank

  PR(p) = (1 − d)/N + d · Σ_{p_j ∈ M(p)} PR(p_j) / L(p_j)

  - PR(p): PageRank score of page p.
  - N: total number of pages in the collection.
  - d: tuneable parameter, the probability that the user will follow links, e.g. 85%.
  - M(p): set of all pages that link to page p.
  - L(p_j): total number of links from page p_j (its outgoing links).

The equation allows iterative computation of PageRank.
"The anatomy of a large-scale hypertextual Web search engine", S. Brin and L. Page, 1998. https://doi.org/10.1016/S0169-7552(98)00110-X

PageRank – page quality
- A page's score depends on the PageRank scores of the pages that link to it: if a good page links to a page, it is a good page.
- Each linking page contributes equally.

[Figure: pages A and B each linking to, and contributing PageRank to, another page.]
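As noted above, the equation lends itself to iterative computation. A minimal sketch on a made-up four-page graph with d = 0.85; the graph and page names are invented for illustration, and every page here has an outgoing link, so no extra dead-end handling is needed beyond the (1 − d)/N term.

  # Hypothetical link graph: page -> pages it links out to.
  links = {
      "A": ["B", "C"],
      "B": ["C"],
      "C": ["A"],
      "D": ["C"],  # D links out, but no page links to D
  }

  N = len(links)
  d = 0.85                          # probability of following a link
  pr = {p: 1.0 / N for p in links}  # start from a uniform distribution

  for _ in range(50):  # iterate until the scores settle
      pr = {
          p: (1 - d) / N
             + d * sum(pr[q] / len(links[q]) for q in links if p in links[q])
          for p in links
      }

  for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
      print(page, round(score, 3))  # C gathers the most link weight and ranks first

Page D, which nothing links to, ends up with the minimum score (1 − d)/N: it is only ever reached by teleporting.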
Aggregated search

Aggregated search
- Combine search results from different sources.
- Sources are referred to as verticals: image, video, blog, news, etc.
- Present blended search results: vertical selection and vertical presentation.
- It is problematic to implement traditional performance measures.

Search tools

Lucene
- Free and open-source information retrieval software library.
- Supported by the Apache Software Foundation.
- Originally written in Java, but ported to C#, C++, Perl, PHP, Python and Ruby.
- Supports a range of information retrieval techniques.
- Mahout is derived from Lucene.
- Many users, including Twitter and Apple: https://cwiki.apache.org/confluence/display/LUCENE/PoweredBy

Solr
- Based on Lucene. Supported by the Apache Software Foundation. Written in Java.
- Features: full-text search, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g. Word, PDF) handling.
- Designed for scalability and fault tolerance.
- A popular enterprise search engine.

Elasticsearch
- Based on Lucene.
- Features: full-text search engine with an HTTP web service interface; uses schema-free JSON documents.
- Open source, Apache license.
- A popular enterprise search engine, used together with:
  - Logstash, a data collection and log-parsing engine;
  - Kibana, an analytics and visualisation platform.
- OpenSearch is a fork of Elasticsearch.

Other toolkits
- Whoosh: full-text indexing and searching library, implemented in Python.
- Indri: information retrieval toolkit, with C++, Java and PHP programming interfaces.
- Terrier: toolkit written in Java, developed by the University of Glasgow.

Online web search engines
- Some web search engines provide access to a programming interface, e.g. Bing.
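As a small taste of one of these toolkits, here is a sketch using Whoosh: build an index over two toy documents and run a ranked query. The index directory, documents and query are made up for illustration; Whoosh's StemmingAnalyzer applies stop-word removal and stemming at both index and query time, tying together the earlier sections.

  import os
  from whoosh.index import create_in
  from whoosh.fields import Schema, ID, TEXT
  from whoosh.analysis import StemmingAnalyzer
  from whoosh.qparser import QueryParser

  # Store the title; index the body with stop-word removal and stemming.
  schema = Schema(title=ID(stored=True),
                  body=TEXT(analyzer=StemmingAnalyzer()))

  os.makedirs("indexdir", exist_ok=True)
  ix = create_in("indexdir", schema)

  writer = ix.writer()
  writer.add_document(title="Relational_database",
                      body="A software system used to maintain relational databases.")
  writer.add_document(title="Programming_language",
                      body="Individual software projects commonly use five languages.")
  writer.commit()

  with ix.searcher() as searcher:
      query = QueryParser("body", ix.schema).parse("databases")  # stemmed to match
      for hit in searcher.search(query):
          print(hit["title"], hit.score)  # ranked output, best match first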
Case Study: Company Connecting
- Company details were stored in a Microsoft Access database, which was converted to MySQL.
- A faceted and keyword search system was implemented using Django and Elasticsearch, mostly in Python.
- Elasticsearch functionality was provided through Amazon Web Services (AWS).
- https://www.interface-online.org.uk/case-studies/company-connecting
- https://www.tangowithdjango.com/
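The case study's keyword and faceted search might look something like the following sketch with the elasticsearch Python client. This is a hypothetical reconstruction, not the project's code: the index name, field names and local endpoint are invented, and a running Elasticsearch cluster is assumed (the real system used an AWS-hosted one behind a Django front end).

  from elasticsearch import Elasticsearch  # pip install elasticsearch

  es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

  # Index a company record; the fields are invented for illustration.
  es.index(index="companies", document={
      "name": "Example Software Ltd",
      "sector": "Information Technology",
      "description": "Develops full-text search software for enterprises.",
  })
  es.indices.refresh(index="companies")  # make the document searchable immediately

  # Keyword search over descriptions, with a facet (aggregation) over sectors.
  response = es.search(
      index="companies",
      query={"match": {"description": "software"}},
      aggs={"by_sector": {"terms": {"field": "sector.keyword"}}},
  )

  for hit in response["hits"]["hits"]:
      print(hit["_source"]["name"], hit["_score"])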