Big Data Systems - Use Case - Search Engines PDF

Summary

The document provides an overview of Big Data Systems, focusing on the application of this technology within the context of search engines. It includes information on system architecture and components. The material is presented as structured lecture content.

Full Transcript

Martin Boissier Big Data Systems Data Engineering Systems Use Case – Search Engines Hasso Plattner Institute Announcements § First meeting of the Machine Learning Systems seminar § 13:30 - 15:00 § F-1.11. § No prerequi...

Martin Boissier Big Data Systems Data Engineering Systems Use Case – Search Engines Hasso Plattner Institute Announcements § First meeting of the Machine Learning Systems seminar § 13:30 - 15:00 § F-1.11. § No prerequisites, I promise. 🫣 § Next week’s presentation in Lecture Series on Research Methods § Science: Institutions, Processes and Misconceptions § By Stefan Neubert § Wifi for non-HPI listeners: hpi_event / poud-WOMP-pseb § Valid for October 2 Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 Timeline II Date Tuesday Wednesday 7.1./ 8.1. ML Systems II Modern Hardware I 14.1. / 15.1. Modern Hardware II Exercise 21.1. / 22.1. TBD Industry Talk 28.1. / 29.1. TBD Exercise 4.2. / 5.2. Exam Prep Buffer/Exam prep Week of 10-16th Exam 4 This Lecture 1. Big Data Applications Application / Query Language / 2. Full Stack User Story – Search Engine Analytics / Visualization § Architecture § Indexing Data Processing § Serving § Infrastructure Data Management § Monitoring § Ranking File System 3. Big Data Stack § Basic overview Virtualization / Container § Open-source stack OS / Scheduling Sources Hardware 1. W. B. Croft, D. Metzler, T. Strohman: Search Engines Information Retrieval in Practice § Book available online: https://ciir.cs.umass.edu/irbook/ 5 Where Traditional Databases are Unsuitable § Analysis over raw (unstructured) data: relational dbs esxpect orderd data and do not like text or Image data as much § Text processing § In general: If relational schema does not suit the problem well § XML, RDF (semantic web), graph processing, stream processing,... XML difficult to Access a specific Attribute in xml § Where cost-effective scalability is required: we want fast Systems but we also want them to scale § Use commodity hardware § Adaptive cluster size (horizontal scaling) § Incrementally growing, add computers without requirement for expensive reorganization that halts the system § In unreliable infrastructures: we know that Hardware will fail, and relational db Systems Need reliability § Must be able to deal with failures – hardware, software, network: § Failure is expected rather than exceptional § Transparent to applications: § Very expensive to build reliability into each application 6 Search Engines § Started in early 90s § Replaced yellow pages style indexes § Solution to the growing number of web pages § Around 2000, Google became popular § Now 90% market share Source: Wikipedia - https://en.wikipedia.org/wiki/Search_engine 7 Basic Web Search Interaction Document whats in it Store User Interaction where to find Index 8 More Detailed Interaction Key Value Stores Document MR Store Index Processing i want the best page ML Systems User Ranking Interaction Stream Processing they eg Analyse the to 10 search words Evaluation Log tracking clicks etc this also affects ranking 9 Basic Search Engine Architecture Search Engine Architecture before we can index them, we Need to find the webpages Crawler Indexer Search 11 Search Engine Components § Crawler § Crawl the internet and store relevant documents § Not part of this lecture § Indexer normal: ich look for spiegel and then find words in text inverted: i want words to Point to spiegel § Invert the files (word, [list of URLs]) § Compute a ranking (e.g., page rank), Crawler which requires an inverted graph: (Doc-URL, [URLs-pointing-to-it]) § Search Search Indexer § Read only access to index § Store user information 12 Building an Index Indexes § Data structure to find data item quickly Index file Data file § Key -> Data 10 10 § Often unique key 20 20 30 30 40 40 § Typical examples § Binary tree 50 50 § Hash table 60 60 § B-Tree 70 70 80 80 § In a search engine 90 90 § Find document that contains word(s) 100 100 § Data -> key § -> Inverted Index key found pointer to record in data file in data file 14 Inverted Index § Consider a text document collection as a relation § Each word in the text collection is a boolean attribute § An attribute is true if the document contains the word anywhere § Document(hadCat, hasDog, hasHouse, …) § Inverted Index § Build a secondary index on every attribute (word) § But: Only true values are indexed § Build index pointing from word to secondary index for that word i want to know where we found the word § Extensions § Combine with document markup: title, abstract, body, anchor, header,... § Store position of word 15 Inverted Indexes Type Position Pointer sorted title Position 5 … … Title 5 … cat‘s … Cat Header 10 Anchor 3 … … Text 57 Dog … cat chase dog … … … Title 100 Title 12 … likes the dog … Find documents that compare cats and dogs § Document mentions “dog” in title § Document mentions “cat” in an anchor text (link to another document) 16 Inverted Index § Pointers in buckets § To a document … … … cat‘s… § To a position in a document Cat … … § Extension: influences importance Dog … cat chased the dog … § Bucket does not only store position … … but also other metadata § Type (Title, Abstract, Text, Table, …) … likes the dog … § Formatting (bold, italic, …) … §… § Queries: AND, OR, NOT Inverted § Operating on pointer sets Index Buckets Documents which buckets Mention both 17 Building an Inverted Index § Input: Collection of documents § Step 1: Tokenization § Extract all words from each document § Remember the source document for each word stemming (cat _> cats) § Embarrassingly parallel it is easy to parallalise § Step 2: Inversion § Merge word lists and collect pointers to documents per unique word § Needs data exchange not as easy to parallize 18 Building an Inverted Index con‘d § Easy task § Tokenize documents (~30MB/s) § Sort and merge tokens (n log n) § At web scale, need to parallelize and distribute § ~200 million active web pages § 100 KB per page -> 18 TB (if all are one static html file) § ~ 2 weeks just for tokenization (read + write all documents) without sorting etc, this is way to slow § Parallelization and distribution is hard L § => Map Reduce framework 19 MapReduce § Programming model § Large scale, distributed data processing § Inspired by map and reduce functions in functional languages apply function to each item of a list or similar § Framework § Simple parallelization model § Shared nothing architectures (“commodity hardware”) no shared disk § By Google (Jeff Dean and Sanjay Ghemawat) § Presented 2004 at OSDI'04 § Used for Web index and many other data processing jobs § Reimplemented by Yahoo as Apache Hadoop 20 Smarter Result Ranking Result Ranking § Returning all URLs for a single web search is not good § First k in alphabetical order? § Restaurant, movie, family, peace § First hit: http://aaa.aaa/ § Can we do better? 22 Page Rank § Order web pages by their importance § Importance means § how many pages link to it § How important are these pages § Idea: How likely will a person randomly surfing the web end up on this page Newsweek cover § PageRank § Probability of reaching a certain web page by randomly clicking on links § PageRank (PR) of page A is the sum of PageRank of all pages (t1…tn) with inbound links divided by the number of their outbound links L(ti), multiplied with a damping factor d PR (t ) PR (t ) PR ( A) = (1 - d ) + d ( 1 + + n ) if page links to me but is has 1000 links it is less relevant L(t1 ) L(t n ) 23 Serving Requests Serving Requests § User access to inverted index § Input: search term User § Output: relevant URLs Interaction § For now: all URLs § Requirements § Many search terms Index § Many URLs § Frequent updates § Interactive speed ( 1 trillion (10^12) pages indexed ~ 10.000 queries per second Search must complete in ~200ms 26 Enter Key-Value Stores § Scalable container for key-value pairs § NoSQL semantics (non-relational) § KV-Stores offer simpler semantics (and syntax) in exchange for increased scalability, speed, availability, and flexibility § Small scale: Hash table § Main operations crud § Write/update put(key, value) § Read get(key) § Delete delete(key) § Usually no aggregation, no table joins, no transactions! 27 Infrastructure & Monitoring Infrastructure blade § Hardware § Network § Servers § Storage § Virtualization § Containers (Docker, Kata) 48 blades § Virtual machines (Xen, VMWare) § Scheduling we usually have more Tasks than servers By Steve Jurvetson - § Yarn, Kubernetes, Mesos https://www.flickr.com/photos/jurvetso n/157722937/, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=1724760 29 At „Web-Scale“, Failures Are the Norm § Consider the failure rate of a single machine: § E.g., HDDs have a typical mean-time between failures of around 100k hours § This means we expect a disk failure roughly every 10 years § What happens if we scale to n machines? § We now have n disks running in parallel § Probability that none of the n disks !fail at a given time: 𝑃 𝑛𝑜 𝑑𝑖𝑠𝑘 𝑐𝑟𝑎𝑠ℎ = 1 − 𝑃 𝑑𝑖𝑠𝑘 𝑐𝑟𝑎𝑠ℎ low prop, but if you have large n you know that fail will happen à Shrinks exponentially, approaches 0! à With an increasing number of machines, crash probability approaches 1.0! § Jeff Dean: “Typical first year for a cluster at Google”: § ~0.5 overheating (power down most machines in = 50 ORDER BY pid1, cnt, pid2; 41 ML Systems § Software system that executes ML apps frü her apps oder libs die ml benutzt haben ML Apps & Algorithms § Many forms (in temporal order) § Libraries Language Abstractions § Parameter server Fault Tolerance § Graph based § Linear algebra systems Execution Strategies § DL System Data Representations § Trend: End-to-end system HW & Infrastructure Image: Matthias Boehm 42 Big Data Stack Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 43 Big Data Systems § Storage § Analytical Processing § Operational Processing § Stream Processing § Graph Processing § Machine Learning 44 System Evolution § Competing trends § Specialization specialized Systems normally start to generalize some day § Generalization § Initial systems § New application (functionality/scale) § Special purpose § Leave optimizations to application § With broader use § Generalize concepts § Add DBMS concepts § New optimizations 45 Where are we heading? § Example: Porcella – Youtube SQL Engine § Problem: Many individual systems for analysis § Data silos § Complex infrastructures § Solution VLDB 2019 § Unified system § For analytics, reporting, dashboards, time series, … § SQL § DB optimizations § Modern (hardware) optimizations 46 Where are we? § Application: § Search engine provider § Examples for BDS usage § Distributed processing § Distributed storage § Stream processing § Machine learning systems § Big Data Stack 47 Next Part § Tuesday: Performance Management & Application / Query Measurement Language / Analytics / Visualization Application § Recording Data Processing § Wednesday: Big Data Data Management First exercise Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 48 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 49

Use Quizgecko on...
Browser
Browser