Questions and Answers
Which of the following components is typically part of a search engine's architecture?
Traditional relational databases are well-suited for handling unstructured data.
False (B)
What term is used to describe data that does not follow a traditional schema, such as XML or RDF?
Unstructured data
In search engines, the process of _____ is critical for determining the relevance of search results.
Match the following technologies with their primary focus:
What is a primary function of virtualization in computing?
The probability of none of the disks crashing increases with more disks.
What hardware components are mentioned in the context of infrastructure?
The probability of at least one disk failure approaches _____ as the number of disks increases.
Match the following virtualization tools with their types:
What is the first meeting of the Machine Learning Systems seminar scheduled for?
There are prerequisites for the Machine Learning Systems seminar.
Who is presenting in the next week's Lecture Series on Research Methods?
The focus of the first week in the timeline is on ______.
Match the following key terms with their related topics:
What is the main topic covered on October 29th and 30th?
There is an exercise scheduled on January 28th and 29th.
What is the final week dedicated to in the timeline?
What does PageRank help determine?
PageRank considers only the quantity of inbound links to determine a page's importance.
What is the damping factor (d) used for in PageRank calculations?
A key-value store is a type of _____ database.
Match the following operations with their description in key-value stores:
Which of the following is NOT a characteristic of key-value stores?
PageRank guarantees that a page will always rank first if it has the most inbound links.
What influences the output of a search term in terms of relevant URLs?
Which of the following describes a trend in machine learning systems?
Specialized systems in big data processing will never generalize.
What are the four types of processing included in big data systems?
The system that deals with machine learning is commonly referred to as an __________ system.
Match the following components of the Big Data Stack to their functions:
What type of processing is specifically designed to handle real-time data flow?
Generalization in ML systems typically involves adding DBMS concepts.
Name one execution strategy used in ML applications.
What is a significant consideration in scaling systems for cost-effectiveness?
Search engines began to replace yellow pages style indexes in the late 80s.
What major search engine gained 90% market share around the year 2000?
An _____ is a data structure used to quickly find data items in a search engine.
Match the following components of a search engine with their descriptions:
In the context of search engines, what does the inverted index do?
Hardware failures are considered exceptional in unreliable infrastructures.
What type of index would typically not allow duplicate keys?
The __________ component of a search engine reformats and organizes data for efficient searching.
Which of the following is NOT typically used as an example of an index data structure?
User interaction is irrelevant to the ranking of pages in a search engine.
What is the primary function of a search engine crawler?
Match the following data structures with their descriptions:
The process of adjusting systems for scalability can involve _____ scaling.
What is one of the key features of data reliability in search engines?
Study Notes
Big Data Systems Use Case - Search Engines
- Search engines emerged in the early 1990s
- Initially, they worked like yellow pages, indexing web pages by content
- The proliferation of web pages necessitated better indexing methods
- Google gained prominence in the early 2000s, achieving a significant market share, nearly 90%
- Today's search engines handle hundreds of millions of products, and billions of page views and queries per day
Announcements
- The first meeting of the Machine Learning Systems seminar was scheduled for 13:30 - 15:00 in room F-1.11.
- No prerequisites are required for the seminar.
- A presentation on Research Methods (Science: Institutions, Processes and Misconceptions) by Stefan Neubert is planned for the following week.
- Wi-Fi is available to non-HPI attendees during October via hpi_event / poud-WOMP-pseb.
Timeline I
- 15.10./16.10: Introduction/Organizational & Performance Management
- 22.10./23.10: Performance Management
- 29.10./30.10: Map Reduce I
- 5.11./6.11: Map Reduce III
- 12.11./13.11: Data Centers
- 19.11/20.11: File Systems
- 26.11./27.11: Key Value Stores I
- 3.12/4.12: Key Value Stores III
- 10.12./11.12: Stream Processing I
- 17.12./18.12: ML Systems I
- Week of 10-16th: Examination week with Christmas break
Timeline II
- 7.1./8.1.: ML Systems II
- 14.1. / 15.1.: Modern Hardware II
- 21.1./22.1.: TBD
- 28.1./29.1.: TBD
- 4.2./5.2: Exam Prep
This Lecture
- Big Data Applications: Focus on applications built on top of big data
- Full Stack User Story - Search Engine: Architecture, indexing, serving, infrastructure, and monitoring of search engines
- Big Data Stack: An overview of the open-source stack used by search engines
Where Traditional Databases Are Unsuitable
- Analysis over raw, unstructured data: Relational databases expect data in a fixed schema, so they are a poor fit for text processing, XML, RDF, graph, and stream processing
- Cost-effective scalability: Systems must be able to grow by adding more computers, without a major rebuilding effort
- Unreliable hardware: The system architecture must handle failures automatically, without interrupting operation
Search Engines
- Began in the early 1990s, replacing yellow page style indexes
- Solved the problem of the burgeoning number of web pages
- Google became popular in the early 2000s, and currently maintains a large market share
Basic Web Search Interaction
- The user submits a search query, which is evaluated against the index
- The search engine returns the relevant documents to the user
More Detailed Interaction
- In addition to query and index, the flow covers user interaction, ranking, evaluation, and logging
Basic Search Engine Architecture
- High-level architecture of a search engine
Search Engine Components
- The Crawler: Crawls the internet to collect relevant documents
- The Indexer: Inverts the documents and computes a ranking
- The Search Engine: Executes searches against the inverted index.
Building an Index
- A data structure for finding data quickly by key
- Typical examples: binary tree, hash table, B-tree
Indexes
- Data structure for quick data retrieval
- Keys uniquely map to data
- Common examples include binary trees, hash tables, and B-trees.
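As a minimal sketch of the idea, assuming a small in-memory list of records (the record layout and all names here are hypothetical), a hash table can serve point lookups while a sorted key list stands in for a tree-shaped index that also supports range lookups:

```python
from bisect import bisect_left, bisect_right

# Hypothetical records: (key, payload). Keys are unique.
records = [(42, "delta"), (7, "alpha"), (19, "charlie"), (11, "bravo")]

# Hash-table index: key -> position in `records`, for point lookups.
hash_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sorted index: keys in order, standing in for a binary tree or B-tree.
sorted_keys = sorted(hash_index)


def lookup(key):
    """Return the payload for `key`, or None if the key is absent."""
    pos = hash_index.get(key)
    return None if pos is None else records[pos][1]


def range_lookup(low, high):
    """Return payloads for all keys in [low, high], in key order."""
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return [lookup(k) for k in sorted_keys[start:end]]
```

The hash table answers only exact-key lookups; the ordered structure is what makes range queries possible without scanning every record.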
Inverted Index
- Represents a text document collection as a relation
- Each word is a boolean attribute
- An attribute is true if a document contains the word anywhere
Inverted Indexes
- Relates each word to its type (e.g., title or anchor text), its position, and a pointer to the containing document
- Documents can be found based on specific words within titles or in anchor texts.
Inverted Index
- Pointers in buckets, to a document, and to a position
- Metadata storage like type, title, text, tables, and formatting
- Queries like AND, OR, NOT, and operations on pointer sets
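The boolean queries listed above can be sketched as set operations on pointer sets. This is an illustration only: the `postings` dictionary, the document ids, and the function names are made up, and each pointer set is simplified to a set of document ids.

```python
# Hypothetical postings: each word points to the set of document ids containing it.
postings = {
    "big": {1, 2, 4},
    "data": {1, 2, 3},
    "search": {2, 3},
}
all_docs = {1, 2, 3, 4}


def docs(word):
    """Pointer set for a word; empty if the word is not indexed."""
    return postings.get(word, set())


def q_and(a, b):
    """Documents containing both words: set intersection."""
    return docs(a) & docs(b)


def q_or(a, b):
    """Documents containing at least one of the words: set union."""
    return docs(a) | docs(b)


def q_not(a):
    """Documents that do not contain the word: complement within all documents."""
    return all_docs - docs(a)
```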
Building an Inverted Index
- The input is a collection of documents
- Tokenization converts documents to words
- Inversion creates pointers from words to documents
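The two steps above, tokenization and inversion, can be sketched on a single machine as follows. The tokenizer (lowercasing, alphanumeric words only) and the sample documents are assumptions of this sketch; real engines use far more elaborate text processing.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to a sorted list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return {word: sorted(pointers) for word, pointers in index.items()}


# Hypothetical two-document collection.
documents = {
    1: "Search engines index the web",
    2: "The index maps words to documents",
}
index = build_inverted_index(documents)
```

Looking up `index["index"]` then yields pointers into both documents, each with the word's position inside that document.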
Building an Inverted Index (cont'd)
- At small scale the task is easy: tokenize the documents and sort the tokens
- Web-scale demands parallelization and distribution (e.g., using a MapReduce framework) to manage the sheer volume of data.
MapReduce
- Programming model inspired by the map and reduce functions of functional programming, used for large-scale, distributed data processing
- Framework: Simple parallelization model utilizing "commodity hardware"
- Introduced by Google
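To make the programming model concrete, here is a single-process simulation of the map, shuffle, and reduce phases, applied to inverted-index construction. It is a sketch under stated assumptions: the function names are hypothetical, whitespace splitting stands in for real tokenization, and no actual distribution or fault tolerance takes place.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the grouped values into a sorted, duplicate-free list."""
    return word, sorted(set(doc_ids))


def run_job(documents):
    """Run map over every document, then shuffle, then reduce every group."""
    pairs = (pair for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())


result = run_job({1: "big data systems", 2: "big search systems", 3: "data data"})
```

Because each map call sees only one document and each reduce call sees only one key, both phases can run in parallel on separate machines; only the shuffle moves data between them.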
Smarter Result Ranking
- Returning all matching URLs in alphabetical order is of little use to the user
- PageRank improves results by ordering pages by importance, modeled as the probability that a user arrives at a page. A page's rank is built from the ranks of the pages linking to it, each divided by that page's number of outbound links and weighted by a damping factor
Page Rank
- Orders web pages according to their importance
- Importance depends on both the number and the importance of the pages that link to a page
- The PageRank of a page is estimated iteratively, combining the ranks of the linking pages with the number of outbound links on each of them
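A minimal sketch of such an iterative estimate is shown below. It uses the commonly cited update PR(p) = (1 - d)/N + d * sum(PR(q)/L(q)) over pages q linking to p, where L(q) is the number of outbound links of q. The damping factor 0.85, the fixed iteration count, the handling of pages without outbound links, and the example graph are all assumptions of this sketch, not details taken from the lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively estimate PageRank for a graph given as page -> list of outbound links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outbound links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page passes an equal share of its rank to every page it links to.
                new_rank[target] += d * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B both link to C, C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C ends up ranked highest because two pages link to it, and A outranks B because its single inbound link comes from the important page C, which shows that link quality matters as well as link count.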
Serving Requests
- Users interact with the inverted index to find relevant URLs based on search terms
- High query volume and frequent updates are typical aspects of search engines.
More on Interaction
- Internet applications deal with enormous amounts of data, billions of users, and frequent updates
- Data volumes and user queries are increasing rapidly
Enter Key-Value Stores
- Scalable container for pairs of keys and values
- Non-relational key-value stores give up rich semantics for simplicity, and gain speed, scalability, availability, and flexibility in return
- At small scale: a hash table with put, get, and delete operations
- No aggregation, joins, or transactions
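The interface described above can be sketched as a toy in-memory store. The class name, the fixed partition count, and the hash-based partitioning scheme are assumptions made for illustration; production stores add replication, persistence, and rebalancing.

```python
import hashlib


class KeyValueStore:
    """Toy key-value store: keys are hashed onto a fixed number of partitions."""

    def __init__(self, num_partitions=4):
        # Each partition is a plain hash table; in a real deployment it
        # would live on a separate node (an assumption of this sketch).
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key):
        """Pick the partition for a string key using a stable hash."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.partitions[int.from_bytes(digest[:8], "big") % len(self.partitions)]

    def put(self, key, value):
        self._partition(key)[key] = value

    def get(self, key, default=None):
        return self._partition(key).get(key, default)

    def delete(self, key):
        """Remove the key; return True if it was present."""
        partition = self._partition(key)
        if key in partition:
            del partition[key]
            return True
        return False


store = KeyValueStore()
store.put("user:1", "Ada")
```

Every operation touches exactly one key and therefore exactly one partition, which is why such stores scale out easily and why joins, aggregation, and multi-key transactions are left out.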
Infrastructure & Monitoring
- Hardware, network, servers, storage
- Virtualization using containers (Docker, Kata), and VMs (Xen, VMware)
- Scheduling frameworks (YARN, Kubernetes, Mesos)
At "Web-Scale", Failures Are the Norm
- Web-scale systems must be designed to handle frequent hardware failures
- Failure is expected rather than an exceptional occurrence
- The probability that none of n machines fails decreases exponentially as n grows, so at least one failure becomes nearly certain in a large cluster
- Systems such as Google actively anticipate potential hardware failures and have operational processes to deal with such cases
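The effect can be checked with a few lines of arithmetic. This sketch assumes that machines fail independently and with the same probability p in some time window; the value p = 0.01 is purely illustrative.

```python
def p_no_failure(n, p):
    """Probability that none of n independent machines fails, each failing with probability p."""
    return (1.0 - p) ** n


def p_at_least_one_failure(n, p):
    """Complement of the above: at least one of the n machines fails."""
    return 1.0 - p_no_failure(n, p)


# Hypothetical per-machine failure probability of 1% in some time window.
for n in (1, 10, 100, 1000):
    print(n, round(p_at_least_one_failure(n, 0.01), 4))
```

Since (1 - p)^n shrinks geometrically in n, the chance of at least one failure approaches 1 as machines are added, even when each individual machine is quite reliable.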
Stream Processing
- Data streams are potentially unbounded in size
- Results must be produced continuously, while the data is still arriving
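A small sketch of both properties, using a Python generator as a stand-in for a stream operator: the input is an unbounded counter, and one result is emitted per input element without ever storing the stream. The running mean is just an example computation chosen for this sketch.

```python
from itertools import count, islice


def running_mean(stream):
    """Yield the mean of all values seen so far, one result per input element."""
    total = 0.0
    for n, value in enumerate(stream, start=1):
        total += value
        yield total / n


# count(1) is an unbounded stream 1, 2, 3, ...; islice takes only the first few results.
first_results = list(islice(running_mean(count(1)), 4))
```

The operator keeps only constant state (a sum and a count), which is what allows it to run over input that never ends.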
Short Break
- A scheduled break, providing a rest between classes and activities
Big Data System Stack
- A stack of systems used to manage large data volumes
Google's Big Data Stack
- A wide range of services and technologies
- Hardware, Network, Operating System form the foundation
- Indexing, MapReduce, Borg, GFS, Gmail, BigTable, Pregel, Chubby are some key components
Hadoop Stack
- Based on the Google architecture
- PigLatin, Hive, Giraph, MapReduce, YARN, HDFS, HBase, ZooKeeper are some constituent pieces
- A widely used stack, based on many Google services
HDFS
- Hadoop Distributed File System (HDFS): an open-source clone of the Google File System (GFS)
- Designed for huge files, mostly append-style writes, high concurrency, and high bandwidth
- Large data blocks (64 or 128 MB)
- Primary-secondary architecture (NameNode, DataNode) for metadata and block mapping
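The block and metadata ideas above can be sketched as follows. This is not HDFS code: the function names, the replication factor of 3, and the round-robin replica placement are assumptions of this sketch (real placement policies take racks and load into account).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the block sizes named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, data_nodes, replication=3):
    """Toy NameNode metadata: block id -> DataNodes holding a replica.

    Round-robin placement is an assumption of this sketch; it yields distinct
    replicas only if there are at least `replication` DataNodes.
    """
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical 300 MB file stored on a four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The mapping returned by `place_blocks` is the kind of metadata the NameNode keeps, while the block contents themselves live only on the DataNodes.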
YARN
- Yet Another Resource Negotiator, introduced with Hadoop 2
- Manages the cluster's resources and schedules applications according to their specifications
- Independent of the application, with a choice of scheduler types (FIFO, Capacity)
HBase
- BigTable clone, extensible row store, key-value store that runs on top of HDFS
- Provides replication and a primary-secondary architecture, with primary and region nodes for metadata and data storage
- Suited to semi-structured data such as URLs, user data, and geographic data
Hadoop MapReduce
- Clone of Google MapReduce that runs on HDFS
- Programming model that processes large-scale distributed data using the map and reduce functions
- Simple parallelization model often using a shared-nothing architecture on "commodity hardware".
- JobTracker and TaskTracker handle task assignment and execution
Hive
- Data warehouse system built on top of Hadoop
- Designed for large, batch-style data warehouse (DWH) queries, which are mapped to MapReduce jobs
- Offers SQL-like syntax, indexes, and a Derby database for storing metadata
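To illustrate how a SQL-like query can map onto a MapReduce job, consider an aggregate such as `SELECT country, COUNT(*) FROM visits GROUP BY country`. The table, its columns, and the plain-Python execution below are hypothetical; this is a sketch of the idea, not of Hive's actual query planner.

```python
from collections import defaultdict

# Hypothetical table `visits` with columns (user, country).
visits = [("u1", "DE"), ("u2", "US"), ("u3", "DE"), ("u4", "FR"), ("u5", "DE")]


def map_row(row):
    """Map: emit the GROUP BY key with a count of 1."""
    _, country = row
    return country, 1


def reduce_counts(values):
    """Reduce: COUNT(*) is the sum of the emitted ones."""
    return sum(values)


def count_by_country(rows):
    """Group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return {key: reduce_counts(values) for key, values in groups.items()}


counts = count_by_country(visits)
```

The GROUP BY column becomes the key of the map output, and the aggregate function becomes the reduce step.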
ML Systems
- Software system to run machine learning applications
- Categories include ML libraries, parameter servers, graph-based systems, linear algebra systems, and deep learning systems
Big Data Stack
- Visualization, application, big data systems, and infrastructure.
- A hierarchical stack to illustrate big data systems
Big Data Systems
- Storage, analytical processing, operational processing, and stream processing
System Evolution
- Competing trends in system design include specialization and generalization
- Initial systems are often specialized for a specific functionality or scale and later generalize into more versatile, widely applicable frameworks
- As systems generalize, they incorporate and adapt existing database concepts and are optimized for performance and efficiency
Where are we heading?
- Unified analytical systems, e.g., Procella at YouTube, which serves analytics, reporting, and data analysis
- Includes SQL for analysis, database optimizations, and hardware advancements
Where are we?
- Topics include applications for search engines, distributed processing, storage, stream processing, and machine learning, as well as the big data stack.
Next Part
- Tuesday: Performance management and measurement
- Wednesday: First exercise
Questions?
- Questions can be submitted via Moodle or by email to the specified address.
- Q&A sessions are available on campus.
Description
Test your knowledge on the components and concepts related to search engine architecture and virtualization in computing. This quiz covers data handling, hardware components, and upcoming seminar details. Perfect for those interested in machine learning and information systems!