Big Data Systems - Use Case - Search Engines PDF

Martin Boissier Big Data Systems Data Engineering Systems Use Case – Search Engines Hasso Plattner Institute Announcements § First meeting of the Machine Learning Systems seminar § 13:30 - 15:00 § F-1.11. § No prerequisites, I promise. 🫣 § Next week’s presentation in Lecture Series on Research Methods § Science: Institutions, Processes and Misconceptions § By Stefan Neubert § Wifi for non-HPI listeners: hpi_event / poud-WOMP-pseb § Valid for October 2 Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 Timeline II Date Tuesday Wednesday 7.1./ 8.1. ML Systems II Modern Hardware I 14.1. / 15.1. Modern Hardware II Exercise 21.1. / 22.1. TBD Industry Talk 28.1. / 29.1. TBD Exercise 4.2. / 5.2. Exam Prep Buffer/Exam prep Week of 10-16th Exam 4 This Lecture 1. Big Data Applications Application / Query Language / 2. Full Stack User Story – Search Engine Analytics / Visualization § Architecture § Indexing Data Processing § Serving § Infrastructure Data Management § Monitoring § Ranking File System 3. Big Data Stack § Basic overview Virtualization / Container § Open-source stack OS / Scheduling Sources Hardware 1. W. B. Croft, D. Metzler, T. Strohman: Search Engines Information Retrieval in Practice § Book available online: https://ciir.cs.umass.edu/irbook/ 5 Where Traditional Databases are Unsuitable § Analysis over raw (unstructured) data: relational dbs esxpect orderd data and do not like text or Image data as much § Text processing § In general: If relational schema does not suit the problem well § XML, RDF (semantic web), graph processing, stream processing,... XML difficult to Access a specific Attribute in xml § Where cost-effective scalability is required: we want fast Systems but we also want them to scale § Use commodity hardware § Adaptive cluster size (horizontal scaling) § Incrementally growing, add computers without requirement for expensive reorganization that halts the system § In unreliable infrastructures: we know that Hardware will fail, and relational db Systems Need reliability § Must be able to deal with failures – hardware, software, network: § Failure is expected rather than exceptional § Transparent to applications: § Very expensive to build reliability into each application 6 Search Engines § Started in early 90s § Replaced yellow pages style indexes § Solution to the growing number of web pages § Around 2000, Google became popular § Now 90% market share Source: Wikipedia - https://en.wikipedia.org/wiki/Search_engine 7 Basic Web Search Interaction Document whats in it Store User Interaction where to find Index 8 More Detailed Interaction Key Value Stores Document MR Store Index Processing i want the best page ML Systems User Ranking Interaction Stream Processing they eg Analyse the to 10 search words Evaluation Log tracking clicks etc this also affects ranking 9 Basic Search Engine Architecture Search Engine Architecture before we can index them, we Need to find the webpages Crawler Indexer Search 11 Search Engine Components § Crawler § Crawl the internet and store relevant documents § Not part of this lecture § Indexer normal: ich look for spiegel and then find words in text inverted: i want words to Point to spiegel § Invert the files (word, [list of URLs]) § Compute a ranking (e.g., page rank), Crawler which requires an inverted graph: (Doc-URL, [URLs-pointing-to-it]) § Search Search Indexer § Read only access to index § Store user information 12 Building an Index Indexes § Data structure to find data item quickly Index file Data file § Key -> Data 10 10 § Often unique key 20 20 30 30 40 40 § Typical examples § Binary tree 50 50 § Hash table 60 60 § B-Tree 70 70 80 80 § In a search engine 90 90 § Find document that contains word(s) 100 100 § Data -> key § -> Inverted Index key found pointer to record in data file in data file 14 Inverted Index § Consider a text document collection as a relation § Each word in the text collection is a boolean attribute § An attribute is true if the document contains the word anywhere § Document(hadCat, hasDog, hasHouse, …) § Inverted Index § Build a secondary index on every attribute (word) § But: Only true values are indexed § Build index pointing from word to secondary index for that word i want to know where we found the word § Extensions § Combine with document markup: title, abstract, body, anchor, header,... § Store position of word 15 Inverted Indexes Type Position Pointer sorted title Position 5 … … Title 5 … cat‘s … Cat Header 10 Anchor 3 … … Text 57 Dog … cat chase dog … … … Title 100 Title 12 … likes the dog … Find documents that compare cats and dogs § Document mentions “dog” in title § Document mentions “cat” in an anchor text (link to another document) 16 Inverted Index § Pointers in buckets § To a document … … … cat‘s… § To a position in a document Cat … … § Extension: influences importance Dog … cat chased the dog … § Bucket does not only store position … … but also other metadata § Type (Title, Abstract, Text, Table, …) … likes the dog … § Formatting (bold, italic, …) … §… § Queries: AND, OR, NOT Inverted § Operating on pointer sets Index Buckets Documents which buckets Mention both 17 Building an Inverted Index § Input: Collection of documents § Step 1: Tokenization § Extract all words from each document § Remember the source document for each word stemming (cat _> cats) § Embarrassingly parallel it is easy to parallalise § Step 2: Inversion § Merge word lists and collect pointers to documents per unique word § Needs data exchange not as easy to parallize 18 Building an Inverted Index con‘d § Easy task § Tokenize documents (~30MB/s) § Sort and merge tokens (n log n) § At web scale, need to parallelize and distribute § ~200 million active web pages § 100 KB per page -> 18 TB (if all are one static html file) § ~ 2 weeks just for tokenization (read + write all documents) without sorting etc, this is way to slow § Parallelization and distribution is hard L § => Map Reduce framework 19 MapReduce § Programming model § Large scale, distributed data processing § Inspired by map and reduce functions in functional languages apply function to each item of a list or similar § Framework § Simple parallelization model § Shared nothing architectures (“commodity hardware”) no shared disk § By Google (Jeff Dean and Sanjay Ghemawat) § Presented 2004 at OSDI'04 § Used for Web index and many other data processing jobs § Reimplemented by Yahoo as Apache Hadoop 20 Smarter Result Ranking Result Ranking § Returning all URLs for a single web search is not good § First k in alphabetical order? § Restaurant, movie, family, peace § First hit: http://aaa.aaa/ § Can we do better? 22 Page Rank § Order web pages by their importance § Importance means § how many pages link to it § How important are these pages § Idea: How likely will a person randomly surfing the web end up on this page Newsweek cover § PageRank § Probability of reaching a certain web page by randomly clicking on links § PageRank (PR) of page A is the sum of PageRank of all pages (t1…tn) with inbound links divided by the number of their outbound links L(ti), multiplied with a damping factor d PR (t ) PR (t ) PR ( A) = (1 - d ) + d ( 1 + + n ) if page links to me but is has 1000 links it is less relevant L(t1 ) L(t n ) 23 Serving Requests Serving Requests § User access to inverted index § Input: search term User § Output: relevant URLs Interaction § For now: all URLs § Requirements § Many search terms Index § Many URLs § Frequent updates § Interactive speed ( 1 trillion (10^12) pages indexed ~ 10.000 queries per second Search must complete in ~200ms 26 Enter Key-Value Stores § Scalable container for key-value pairs § NoSQL semantics (non-relational) § KV-Stores offer simpler semantics (and syntax) in exchange for increased scalability, speed, availability, and flexibility § Small scale: Hash table § Main operations crud § Write/update put(key, value) § Read get(key) § Delete delete(key) § Usually no aggregation, no table joins, no transactions! 27 Infrastructure & Monitoring Infrastructure blade § Hardware § Network § Servers § Storage § Virtualization § Containers (Docker, Kata) 48 blades § Virtual machines (Xen, VMWare) § Scheduling we usually have more Tasks than servers By Steve Jurvetson - § Yarn, Kubernetes, Mesos https://www.flickr.com/photos/jurvetso n/157722937/, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=1724760 29 At „Web-Scale“, Failures Are the Norm § Consider the failure rate of a single machine: § E.g., HDDs have a typical mean-time between failures of around 100k hours § This means we expect a disk failure roughly every 10 years § What happens if we scale to n machines? § We now have n disks running in parallel § Probability that none of the n disks !fail at a given time: 𝑃 𝑛𝑜 𝑑𝑖𝑠𝑘 𝑐𝑟𝑎𝑠ℎ = 1 − 𝑃 𝑑𝑖𝑠𝑘 𝑐𝑟𝑎𝑠ℎ low prop, but if you have large n you know that fail will happen à Shrinks exponentially, approaches 0! à With an increasing number of machines, crash probability approaches 1.0! § Jeff Dean: “Typical first year for a cluster at Google”: § ~0.5 overheating (power down most machines in = 50 ORDER BY pid1, cnt, pid2; 41 ML Systems § Software system that executes ML apps frü her apps oder libs die ml benutzt haben ML Apps & Algorithms § Many forms (in temporal order) § Libraries Language Abstractions § Parameter server Fault Tolerance § Graph based § Linear algebra systems Execution Strategies § DL System Data Representations § Trend: End-to-end system HW & Infrastructure Image: Matthias Boehm 42 Big Data Stack Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 43 Big Data Systems § Storage § Analytical Processing § Operational Processing § Stream Processing § Graph Processing § Machine Learning 44 System Evolution § Competing trends § Specialization specialized Systems normally start to generalize some day § Generalization § Initial systems § New application (functionality/scale) § Special purpose § Leave optimizations to application § With broader use § Generalize concepts § Add DBMS concepts § New optimizations 45 Where are we heading? § Example: Porcella – Youtube SQL Engine § Problem: Many individual systems for analysis § Data silos § Complex infrastructures § Solution VLDB 2019 § Unified system § For analytics, reporting, dashboards, time series, … § SQL § DB optimizations § Modern (hardware) optimizations 46 Where are we? § Application: § Search engine provider § Examples for BDS usage § Distributed processing § Distributed storage § Stream processing § Machine learning systems § Big Data Stack 47 Next Part § Tuesday: Performance Management & Application / Query Measurement Language / Analytics / Visualization Application § Recording Data Processing § Wednesday: Big Data Data Management First exercise Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 48 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 49

Big Data Systems - Use Case - Search Engines PDF

Document Details

Tags

Related

Summary

Full Transcript