Machine Learning Systems Seminar Overview

Questions and Answers

Which of the following components is typically part of a search engine's architecture?

Text Processing (correct)

User Interface Design

Data Analysis

Indexing (correct)

Traditional relational databases are well-suited for handling unstructured data.

False (B)

What term is used to describe data that does not follow a traditional schema, such as XML or RDF?

Unstructured data

In search engines, the process of _____ is critical for determining the relevance of search results.

ranking

Match the following technologies with their primary focus:

Indexing = Organizing data for quick retrieval
Virtualization = Creating virtual instances of resources
Monitoring = Tracking the performance of systems
Data Processing = Transforming raw data into usable information

What is a primary function of virtualization in computing?

To run multiple operating systems on a single hardware platform (B)

The probability of none of the disks crashing increases with more disks.

False (B)

What hardware components are mentioned in the context of infrastructure?

Network, servers, storage

The probability of at least one disk failure approaches _____ as the number of disks increases.

1.0

Match the following virtualization tools with their types:

Docker = Containers
Xen = Virtual machines
Kubernetes = Scheduling
Mesos = Scheduling

What is the first meeting of the Machine Learning Systems seminar scheduled for?

13:30 - 15:00 (D)

There are prerequisites for the Machine Learning Systems seminar.

False (B)

Who is presenting in the next week's Lecture Series on Research Methods?

Stefan Neubert

The focus of the first week in the timeline is on ______.

Intro / Organizational

Match the following key terms with their related topics:

Map Reduce I = 29.10./30.10.
Data Centers = 12.11./13.11.
Stream Processing I = 10.12./11.12.
Machine Learning Systems II = 7.1./8.1.

What is the main topic covered on October 29th and 30th?

Map Reduce I (C)

There is an exercise scheduled on January 28th and 29th.

True (A)

What is the final week dedicated to in the timeline?

Exam

What does PageRank help determine?

The importance of web pages based on link structure (D)

PageRank considers only the quantity of inbound links to determine a page's importance.

False (B)

What is the damping factor (d) used for in PageRank calculations?

To adjust the PageRank value for the likelihood that a user randomly jumps to a different page.

A key-value store is a type of _____ database.

NoSQL

Match the following operations with their description in key-value stores:

put(key, value) = Write/update a value
get(key) = Read a value
delete(key) = Delete a value
CRUD = Create, Read, Update, Delete operations

Which of the following is NOT a characteristic of key-value stores?

Support for aggregations (A)

PageRank guarantees that a page will always rank first if it has the most inbound links.

False (B)

What influences the output of a search term in terms of relevant URLs?

Inverted index

Which of the following describes a trend in machine learning systems?

End-to-end system (A)

Specialized systems in big data processing will never generalize.

False (B)

What are the four types of processing included in big data systems?

Storage, Analytical Processing, Operational Processing, Stream Processing

The system that deals with machine learning is commonly referred to as an __________ system.

ML

Match the following components of the Big Data Stack to their functions:

Data Management = Enables efficient data storage and access
Analytics = Processes the data for insights
File System = Organizes data files on storage
Virtualization = Separates applications from hardware

What type of processing is specifically designed to handle real-time data flow?

Stream Processing (C)

Generalization in ML systems typically involves adding DBMS concepts.

True (A)

Name one execution strategy used in ML applications.

Parameter server

What is a significant consideration in scaling systems for cost-effectiveness?

Horizontal scaling with an adaptive cluster size (D)

Search engines began to replace yellow pages style indexes in the late 80s.

False (B)

What major search engine gained 90% market share around the year 2000?

Google

An _____ is a data structure used to quickly find data items in a search engine.

index

Match the following components of a search engine with their descriptions:

Crawler = Stores relevant documents by crawling the internet
Indexer = Builds the index and suggests a ranking for documents
Search = Provides read-only access to the indexed content
User Interaction = The process of users searching for information

In the context of search engines, what does the inverted index do?

Points words to a list of URLs containing that word (C)

Hardware failures are considered exceptional in unreliable infrastructures.

False (B)

What type of index would typically not allow duplicate keys?

Binary Tree

The __________ component of a search engine reformats and organizes data for efficient searching.

Indexer

Which of the following is NOT typically used as an example of an index data structure?

Web page (B)

User interaction is irrelevant to the ranking of pages in a search engine.

False (B)

What is the primary function of a search engine crawler?

To crawl the internet and store relevant documents

Match the following data structures with their descriptions:

Hash Table = Associative array for quick key-value lookups
B-Tree = A balanced tree data structure for storage
Inverted Index = Points words to lists of documents
Binary Tree = A tree structure where each node has at most two children

The process of adjusting systems for scalability can involve _____ scaling.

horizontal

What is one of the key features of data reliability in search engines?

Dealing with failures

Study Notes

Big Data Systems Use Case - Search Engines

Search engines emerged in the early 1990s
Initially, they worked like yellow pages, indexing web pages by content
The proliferation of web pages necessitated better indexing methods
Google gained prominence in the early 2000s, achieving a significant market share, nearly 90%
Today's search engines handle hundreds of millions of products, and billions of page views and queries per day

Announcements

The first meeting of the Machine Learning Systems seminar was scheduled for 13:30 - 15:00 in room F-1.11.
No prerequisites are required for the seminar.
A presentation on Research Methods (Science: Institutions, Processes and Misconceptions) by Stefan Neubert is planned for the following week.
Wi-Fi is available to non-HPI attendees in October via hpi_event / poud-WOMP-pseb.

Timeline I

15.10./16.10.: Introduction/Organizational & Performance Management
22.10./23.10.: Performance Management
29.10./30.10.: Map Reduce I
5.11./6.11.: Map Reduce III
12.11./13.11.: Data Centers
19.11./20.11.: File Systems
26.11./27.11.: Key Value Stores I
3.12./4.12.: Key Value Stores III
10.12./11.12.: Stream Processing I
17.12./18.12.: ML Systems I
Week of 10-16th: Examination week with Christmas break

Timeline II

7.1./8.1.: ML Systems II
14.1./15.1.: Modern Hardware II
21.1./22.1.: TBD
28.1./29.1.: TBD
4.2./5.2.: Exam Prep

This Lecture

Big Data Applications: Focus on applications built on top of big data
Full Stack User Story - Search Engine: Architecture, indexing, serving, infrastructure, and monitoring of search engines
Big Data Stack: An overview of the open-source stack used by search engines

Where Traditional Databases Are Unsuitable

Analysis over raw, unstructured data: Relational databases assume a fixed, structured schema, which makes them a poor fit for text, XML, RDF, graph, and stream processing
Cost-effective scalability: The system should grow by adding more computers, without a major rebuilding effort
Unreliable hardware: The system architecture must handle failures automatically, without interrupting operation

Search Engines

Began in the early 1990s, replacing yellow page style indexes
Solved the problem of the burgeoning number of web pages
Google became popular in the early 2000s, and currently maintains a large market share

Basic Web Search Interaction

The user submits a search query, which is evaluated against the index
The search engine returns the relevant documents to the user

More Detailed Interaction

Information flows between user interaction, ranking, evaluation, and logging

Basic Search Engine Architecture

High-level architecture of a search engine

Search Engine Components

The Crawler: Crawls the internet to collect relevant documents
The Indexer: Inverts the documents and computes a ranking
The Search Engine: Executes searches against the inverted index.

Building an Index

A data structure for finding data quickly by key
Typical examples: binary tree, hash table, B-tree

Indexes

Data structure for quick data retrieval
Keys uniquely map to data
Common examples include binary trees, hash tables, and B-trees.

Inverted Index

Represents a text document collection as a relation
Each word is a boolean attribute
An attribute is true if a document contains the word anywhere

Inverted Indexes

Shows relationships between types, positions, and pointers in a database
Documents can be found based on specific words within titles or in anchor texts.

Inverted Index

Each word has a bucket of pointers, each pointing to a document and to a position within it
Metadata such as type, title, text, tables, and formatting can be stored alongside
Boolean queries (AND, OR, NOT) are evaluated as operations on pointer sets
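
The pointer-set operations mentioned above can be sketched with Python sets. This is a minimal illustration, not the lecture's implementation: the toy index, the document IDs, and the function names are all made up for the example.

```python
# Hypothetical toy index: each word maps to the set of document IDs containing it.
index = {
    "big": {1, 2},
    "data": {1, 2, 3},
    "search": {2, 3},
}
all_docs = {1, 2, 3}  # assumed universe of document IDs, needed for NOT


def query_and(index, a, b):
    """Documents containing both words: set intersection."""
    return index.get(a, set()) & index.get(b, set())


def query_or(index, a, b):
    """Documents containing at least one of the words: set union."""
    return index.get(a, set()) | index.get(b, set())


def query_not(index, all_docs, a):
    """Documents that do not contain the word: set difference."""
    return all_docs - index.get(a, set())


result_and = query_and(index, "big", "search")
result_or = query_or(index, "big", "search")
result_not = query_not(index, all_docs, "big")
```

Real engines typically keep posting lists sorted and merge them instead of materializing sets, but the set view shows the semantics.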

Building an Inverted Index

The input is a collection of documents
Tokenization converts documents to words
Inversion creates pointers from words to documents
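
The three steps above (document collection, tokenization, inversion) can be sketched in a few lines of Python. The tokenizer rule, the document IDs, and the function names are assumptions chosen for illustration; production tokenizers are considerably more involved.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to a list of (document ID, word position) pointers."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical two-document collection.
docs = {
    "d1": "Big data systems",
    "d2": "Search engines index big data",
}
inverted = build_inverted_index(docs)
```

Looking up `inverted["big"]` then yields pointers into both documents, which is the word-to-document direction the inversion step creates.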

Building an Inverted Index cont'd

At small scale the task is easy: tokenize the documents and sort the tokens
Web-scale demands parallelization and distribution (e.g., using a MapReduce framework) to manage the sheer volume of data.

MapReduce

Programming model for large-scale, distributed data processing, inspired by the map and reduce functions of functional programming
Framework: A simple parallelization model that runs on "commodity hardware"
Introduced by Google
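
To make the programming model concrete, here is a single-process sketch that mimics the map, shuffle, and reduce phases for building word-to-document postings. It only imitates the data flow; it is not Hadoop or Google code, and every function name is hypothetical. A real framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: turn the grouped document IDs for one word into a sorted posting list."""
    return word, sorted(doc_ids)


def run_job(documents):
    """Run the three phases sequentially in one process."""
    mapped = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


# Hypothetical input collection.
postings = run_job({"d1": "big data", "d2": "big search"})
```

Because each map call touches only one document and each reduce call only one key, both phases can be distributed without shared state.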

Smarter Result Ranking

Returning all matching URLs in alphabetical order is unhelpful to the user
PageRank improves results by ordering pages by importance, modeled as the probability that a user arrives at a page. It depends on the inbound links, the number of outbound links on each linking page, and a damping factor

Page Rank

Orders web pages according to their importance
Importance is determined by the number and the importance of the pages that link to it
The PageRank of a page is computed iteratively: each linking page passes on its own rank, divided by its number of outbound links
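
A common formulation is PR(p) = (1 - d) / N + d * sum over pages q linking to p of PR(q) / L(q), where N is the number of pages, L(q) is the number of outbound links of q, and d is the damping factor. The sketch below iterates this formula on a tiny graph. The graph, the value d = 0.85, the iteration count, and the handling of pages without outbound links are assumptions for illustration; the notes do not specify them.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key (an assumption of this sketch).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Random-jump part: every page receives (1 - d) / N.
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its outbound links.
                share = d * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumed convention: a page without links spreads its rank over all pages.
                share = d * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this graph C ends up ranked above B: both are linked from A, but C additionally receives all of B's rank, which illustrates that the rank of the linking pages matters and not just the link count.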

Serving Requests

Users interact with the inverted index to find relevant URLs based on search terms
High query volume and frequent updates are typical aspects of search engines.

Enter Key-Value Stores

Scalable container for pairs of keys and values
Non-relational key-value stores give up rich query semantics and accept a simpler model in exchange for increased speed, scalability, availability, and flexibility
At small scale: a hash table with put, get, and delete operations
No aggregation, joins, or transactions
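
At small scale, the interface described above is just a hash table. The class below is a minimal in-memory sketch; the class name and the example keys are hypothetical, and it deliberately offers no aggregation, joins, or transactions.

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Write a value, overwriting any existing value for the key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Read the value for a key, or return the default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Delete a key; deleting an absent key is a no-op."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("url:1", "https://example.org")
store.put("url:1", "https://example.com")  # put on an existing key acts as an update
value = store.get("url:1")
store.delete("url:1")
missing = store.get("url:1")
```

A distributed store keeps the same three operations and spreads the keys over many machines, which is what makes the narrow interface attractive for scaling.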

Infrastructure & Monitoring

Hardware, network, servers, storage
Virtualization using containers (Docker, Kata), and VMs (Xen, VMware)
Scheduling frameworks (YARN, Kubernetes, Mesos)

At "Web-Scale", Failures Are the Norm

Web-scale systems must be designed to handle frequent hardware failures
Failure is expected rather than an exceptional occurrence
The probability that none of the n machines fails decreases exponentially as n increases, so the probability of at least one failure approaches 1
Systems such as Google actively anticipate potential hardware failures and have operational processes to deal with such cases
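
The effect is easy to quantify under a simplifying assumption: if each machine fails independently with probability p in some time window, then P(no failure) = (1 - p)^n and P(at least one failure) = 1 - (1 - p)^n. The value p = 0.01 and the cluster sizes below are made up for illustration.

```python
def prob_no_failure(p, n):
    """Probability that none of n machines fails, assuming independent failures."""
    return (1.0 - p) ** n


def prob_at_least_one_failure(p, n):
    """Complement: probability that at least one of n machines fails."""
    return 1.0 - prob_no_failure(p, n)


# Hypothetical per-machine failure probability within some fixed time window.
p = 0.01
failure_by_cluster_size = {
    n: prob_at_least_one_failure(p, n) for n in (1, 10, 100, 1000)
}
```

With these numbers a single machine almost never fails in the window, while a cluster of 1000 machines is nearly certain to see at least one failure.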

Stream Processing

Data streams can be potentially unlimited in size
Results continuously need to be produced
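
Both properties can be illustrated with Python generators: the input below never ends, and a result is emitted after every element instead of once at the end. The running mean and the synthetic sensor stream are hypothetical examples, not taken from the lecture.

```python
import itertools


def running_mean(stream):
    """Yield the mean of all values seen so far, one result per input element.

    Only a count and a sum are kept, so memory use stays constant
    even if the stream never ends.
    """
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    """Hypothetical unbounded stream producing 1.0, 2.0, 3.0, ..."""
    for i in itertools.count(1):
        yield float(i)


# Consume only the first five results; the stream itself is unbounded.
first_five = list(itertools.islice(running_mean(sensor_readings()), 5))
```

The key design constraint is that the operator cannot wait for the end of the input, so it has to maintain small incremental state.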

Short Break

A scheduled break, providing a rest between classes and activities

Big Data System Stack

A stack of systems used to manage large data volumes

Google's Big Data Stack

A wide range of services and technologies
Hardware, Network, Operating System form the foundation
Indexing, MapReduce, Borg, GFS, Gmail, BigTable, Pregel, Chubby are some key components

Hadoop Stack

Open-source stack modeled on Google's architecture and services
Components include Pig Latin, Hive, Giraph, MapReduce, YARN, HDFS, HBase, and ZooKeeper
Widely used in practice

HDFS

Hadoop Distributed File System (HDFS): an open-source clone of the Google File System (GFS)
Designed for huge files, append-mostly writes, high concurrency, and high bandwidth
Large data blocks (64 or 128 MB)
Primary-secondary architecture (NameNode, DataNode) for metadata and block mapping

YARN

Yet Another Resource Negotiator, introduced with Hadoop 2
Manages the cluster's resources and schedules applications according to their specifications
Independent of the application, and has flexibility in the type of scheduler (FIFO, Capacity)

HBase

BigTable clone, extensible row store, key-value store that runs on top of HDFS
It provides replication and a primary-secondary architecture, with a primary (master) node for metadata and region nodes for data storage
Suited to semi-structured data such as URLs, user data, and geographic data

Hadoop MapReduce

Clone of Google MapReduce and runs on HDFS
Programming model that processes large-scale distributed data using the map and reduce functions
Simple parallelization model often using a shared-nothing architecture on "commodity hardware".
JobTracker and TaskTracker handle task assignment and execution

Hive

Data warehouse built on top of Hadoop
Designed for large, batch-style data warehouse (DWH) queries, which are mapped to MapReduce jobs
SQL-like syntax, indexes, and a Derby database for storing metadata

ML Systems

Software system to run machine learning applications
Includes libraries, parameter servers, graph-based systems, linear algebra systems, and deep learning systems

Big Data Stack

Visualization, application, big data systems, and infrastructure.
A hierarchical stack to illustrate big data systems

Big Data Systems

Storage, analytical processing, operational processing, and stream processing

System Evolution

Competing trends in system design include specialization and generalization
Initial systems are often specialized to a specific functionality or scale, and later generalize into more versatile, applicable frameworks.
Systems that reach broader use incorporate and adapt existing database concepts and optimize for performance and efficiency.

Where are we heading?

Unified analytical systems, e.g., Procella at YouTube, for analytics, reporting, and data analysis
Includes SQL for analysis, database optimizations, and hardware advancements

Where are we?

Topics include applications for search engines, distributed processing, storage, stream processing, and machine learning, as well as the big data stack.

Next Part

Tuesday: Performance management and measurement
Wednesday: First exercise

Questions?

Questions can be submitted via Moodle or by email to the specified address.
Q&A sessions are available on campus.


