Machine Learning Systems Seminar Overview

Questions and Answers

Which of the following components is typically part of a search engine's architecture?

  • Text Processing (correct)
  • User Interface Design
  • Data Analysis
  • Indexing (correct)

    Traditional relational databases are well-suited for handling unstructured data.

    False (B)

    What term is used to describe data that does not follow a traditional schema, such as XML or RDF?

    Unstructured data

    In search engines, the process of _____ is critical for determining the relevance of search results.

    ranking

    Match the following technologies with their primary focus:

    • Indexing = Organizing data for quick retrieval
    • Virtualization = Creating virtual instances of resources
    • Monitoring = Tracking the performance of systems
    • Data Processing = Transforming raw data into usable information

    What is a primary function of virtualization in computing?

    To run multiple operating systems on a single hardware platform (B)

    The probability of none of the disks crashing increases with more disks.

    False (B)

    What hardware components are mentioned in the context of infrastructure?

    Network, servers, storage

    The probability of at least one disk failure approaches _____ as the number of disks increases.

    1.0

    Match the following virtualization tools with their types:

    • Docker = Containers
    • Xen = Virtual machines
    • Kubernetes = Scheduling
    • Mesos = Scheduling

    When is the first meeting of the Machine Learning Systems seminar scheduled?

    13:30 - 15:00 (D)

    There are prerequisites for the Machine Learning Systems seminar.

    False (B)

    Who is presenting in next week's Lecture Series on Research Methods?

    Stefan Neubert

    The focus of the first week in the timeline is on ______.

    Intro / Organizational

    Match the following key terms with their related topics:

    • Map Reduce I = 29.10./30.10.
    • Data Centers = 12.11./13.11.
    • Stream Processing I = 10.12./11.12.
    • Machine Learning Systems II = 7.1./8.1.

    What is the main topic covered on October 29th and 30th?

    Map Reduce I (C)

    There is an exercise scheduled on January 28th and 29th.

    True (A)

    What is the final week dedicated to in the timeline?

    Exam

    What does PageRank help determine?

    The importance of web pages based on link structure (D)

    PageRank considers only the quantity of inbound links to determine a page's importance.

    False (B)

    What is the damping factor (d) used for in PageRank calculations?

    To adjust the PageRank value considering the likelihood of randomly jumping to a different page.

    A key-value store is a type of _____ database.

    NoSQL

    Match the following operations with their description in key-value stores:

    • put(key, value) = Write/update a value
    • get(key) = Read a value
    • delete(key) = Delete a value
    • CRUD = Create, Read, Update, Delete operations

    Which of the following is NOT a characteristic of key-value stores?

    Support for aggregations (A)

    PageRank guarantees that a page will always rank first if it has the most inbound links.

    False (B)

    Which data structure determines the relevant URLs returned for a search term?

    Inverted index

    Which of the following describes a trend in machine learning systems?

    End-to-end system (A)

    Specialized systems in big data processing will never generalize.

    False (B)

    What are the four types of processing included in big data systems?

    Storage, Analytical Processing, Operational Processing, Stream Processing

    The system that deals with machine learning is commonly referred to as an __________ system.

    ML

    Match the following components of the Big Data Stack to their functions:

    • Data Management = Enables efficient data storage and access
    • Analytics = Processes the data for insights
    • File System = Organizes data files on storage
    • Virtualization = Separates applications from hardware

    What type of processing is specifically designed to handle real-time data flow?

    Stream Processing (C)

    Generalization in ML systems typically involves adding DBMS concepts.

    True (A)

    Name one execution strategy used in ML applications.

    Parameter server

    What is a significant consideration in scaling systems for cost-effectiveness?

    Horizontal scaling with an adaptive cluster size (D)

    Search engines began to replace yellow pages style indexes in the late 80s.

    False (B)

    What major search engine gained 90% market share around the year 2000?

    Google

    An _____ is a data structure used to quickly find data items in a search engine.

    index

    Match the following components of a search engine with their descriptions:

    • Crawler = Stores relevant documents by crawling the internet
    • Indexer = Builds and suggests a ranking for documents
    • Search = Provides read-only access to the indexed content
    • User Interaction = The process of users searching for information

    In the context of search engines, what does the inverted index do?

    Points words to a list of URLs containing that word (C)

    Hardware failures are considered exceptional in unreliable infrastructures.

    False (B)

    What type of index would typically not allow duplicate keys?

    Binary Tree

    The __________ component of a search engine reformats and organizes data for efficient searching.

    Indexer

    Which of the following is NOT typically used as an example of an index data structure?

    Web page (B)

    User interaction is irrelevant to the ranking of pages in a search engine.

    False (B)

    What is the primary function of a search engine crawler?

    To crawl the internet and store relevant documents

    Match the following data structures with their descriptions:

    • Hash Table = Associative array for quick key-value lookups
    • B-Tree = A balanced tree data structure for storage
    • Inverted Index = Points words to lists of documents
    • Binary Tree = A tree structure where each node has at most two children

    The process of adjusting systems for scalability can involve _____ scaling.

    horizontal

    What is one of the key features of data reliability in search engines?

    Dealing with failures

    Study Notes

    Big Data Systems Use Case - Search Engines

    • Search engines emerged in the early 1990s
    • Initially, they worked like yellow pages, indexing web pages by content
    • The proliferation of web pages necessitated better indexing methods
    • Google rose to prominence in the early 2000s, reaching a market share of nearly 90%
    • Today's search engines handle hundreds of millions of products, and billions of page views and queries per day

    Announcements

    • The first meeting of the Machine Learning Systems seminar was scheduled for 13:30 - 15:00 in room F-1.11.
    • No prerequisites are required for the seminar.
    • A presentation on Research Methods (Science: Institutions, Processes and Misconceptions) by Stefan Neubert is planned for the following week.
    • Wi-Fi is available to non-HPI attendees via hpi_event / poud-WOMP-pseb during October.

    Timeline I

    • 15.10./16.10.: Introduction/Organizational & Performance Management
    • 22.10./23.10.: Performance Management
    • 29.10./30.10.: Map Reduce I
    • 5.11./6.11.: Map Reduce III
    • 12.11./13.11.: Data Centers
    • 19.11./20.11.: File Systems
    • 26.11./27.11.: Key Value Stores I
    • 3.12./4.12.: Key Value Stores III
    • 10.12./11.12.: Stream Processing I
    • 17.12./18.12.: ML Systems I
    • Week of 10-16th: Examination week with Christmas break

    Timeline II

    • 7.1./8.1.: ML Systems II
    • 14.1./15.1.: Modern Hardware II
    • 21.1./22.1.: TBD
    • 28.1./29.1.: TBD
    • 4.2./5.2.: Exam Prep

    This Lecture

    • Big Data Applications: Focus on applications built on top of big data
    • Full Stack User Story - Search Engine: Architecture, indexing, serving, infrastructure, and monitoring of search engines
    • Big Data Stack: An overview of the open-source stack used by search engines

    Where Traditional Databases Are Unsuitable

    • Analysis over raw, unstructured data: Relational databases expect data in a fixed schema, which makes them a poor fit for text processing, XML, RDF, graph processing, and stream processing
    • Cost-effective scalability: The system should grow by adding more computers, without a major rebuilding effort
    • Unreliable hardware: The system architecture must handle failures automatically, without interrupting operation

    Search Engines

    • Began in the early 1990s, replacing yellow page style indexes
    • Solved the problem of the burgeoning number of web pages
    • Google became popular in the early 2000s, and currently maintains a large market share

    Basic Web Search Interaction

    • The user submits a search query, which is evaluated against the index
    • The search engine returns the relevant documents to the user

    More Detailed Interaction

    • Information flows between user interaction, ranking, evaluation, and the query log

    Basic Search Engine Architecture

    • High-level architecture of a search engine

    Search Engine Components

    • The Crawler: Crawls the internet to collect relevant documents
    • The Indexer: Inverts the documents and computes a ranking
    • The Search Engine: Executes searches against the inverted index.

    Building an Index

    • A data structure for finding data quickly by key
    • Typical examples: binary tree, hash table, B-tree

    Indexes

    • Data structure for quick data retrieval
    • Keys uniquely map to data
    • Common examples include binary trees, hash tables, and B-trees.

    Inverted Index

    • Represents a text document collection as a relation
    • Each word is a boolean attribute
    • An attribute is true if a document contains the word anywhere
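
The relational view above can be sketched in a few lines of Python. This is a minimal illustration only: the three documents and their ids are hypothetical, and a real engine would not materialize the full boolean table.

```python
# Hypothetical three-document collection, used only for illustration.
docs = {
    "d1": "big data systems",
    "d2": "search engines index data",
    "d3": "stream processing systems",
}

# The vocabulary defines the attributes (columns) of the relation.
vocabulary = sorted({word for text in docs.values() for word in text.split()})

# One row per document: an attribute is True if the word occurs anywhere in it.
relation = {
    doc_id: {word: word in text.split() for word in vocabulary}
    for doc_id, text in docs.items()
}

# Reading one column top to bottom gives the documents that contain a word.
docs_with_data = [doc_id for doc_id, row in relation.items() if row["data"]]
print(docs_with_data)  # ['d1', 'd2']
```

Most cells of such a table are False, which is why the index is stored inverted: per word, only the documents that contain it.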

    Inverted Indexes

    • Shows relationships between types, positions, and pointers in a database
    • Documents can be found based on specific words within titles or in anchor texts.

    Inverted Index

    • Pointers are stored in buckets; each pointer refers to a document and a position within it
    • Stored metadata such as type, title, text, tables, and formatting
    • Boolean queries (AND, OR, NOT) are evaluated as operations on pointer sets
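
As a rough sketch of how boolean queries reduce to operations on pointer sets, the snippet below uses Python sets as posting lists. The index contents, the document ids, and the helper name `postings` are assumptions made up for this example.

```python
# Hypothetical posting sets: each word maps to the ids of the documents containing it.
index = {
    "big": {1, 4},
    "data": {1, 2, 4},
    "search": {2, 3},
    "engine": {3, 4},
}
all_docs = {1, 2, 3, 4}

def postings(word):
    """Return the pointer set for a word (empty if the word is unknown)."""
    return index.get(word, set())

# AND is set intersection, OR is set union, NOT is set difference.
big_and_data = postings("big") & postings("data")
search_or_engine = postings("search") | postings("engine")
data_not_search = postings("data") - postings("search")
not_data = all_docs - postings("data")

print(big_and_data, search_or_engine, data_not_search, not_data)
```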

    Building an Inverted Index

    • The input is a collection of documents
    • Tokenization converts documents to words
    • Inversion creates pointers from words to documents
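
The two steps above, tokenization and inversion, can be sketched for a single machine as follows. The regular-expression tokenizer, the function names, and the example.org URLs are illustrative assumptions, not part of the lecture.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each word to {doc_id: [positions]} for every document containing it."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)

# Hypothetical documents keyed by URL.
documents = {
    "https://example.org/a": "Big data systems store big data.",
    "https://example.org/b": "Search engines index data.",
}
index = build_inverted_index(documents)
print(index["big"])  # {'https://example.org/a': [0, 4]}
```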

    Building an Inverted Index cont'd

    • At small scale the task is easy: tokenize the documents and sort the tokens
    • Web-scale demands parallelization and distribution (e.g., using a MapReduce framework) to manage the sheer volume of data.

    MapReduce

    • Programming model for large-scale, distributed data processing, inspired by the map and reduce functions of functional programming
    • Framework: Simple parallelization model that runs on "commodity hardware"
    • Developed at Google
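
To make the programming model concrete, here is a single-process simulation of the map, shuffle, and reduce phases that builds a small inverted index. It only mimics the data flow; the function names and the three documents are assumptions, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Reduce: collapse the values of one key into a sorted posting list."""
    return word, sorted(set(doc_ids))

# Hypothetical input documents.
documents = {"d1": "big data", "d2": "data systems", "d3": "big systems big"}

emitted = [pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)]
inverted_index = dict(reduce_phase(word, ids) for word, ids in shuffle(emitted).items())
print(inverted_index["big"])  # ['d1', 'd3']
```

Because each map call sees only one document and each reduce call sees only one key, both phases can run in parallel without shared state.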

    Smarter Result Ranking

    • Returning all matching URLs in alphabetical order is of little use to the searcher
    • PageRank improves results by ordering pages by importance, i.e. the probability that a user arrives at a page. It is computed from the ranks of the pages linking to the page, each divided by that page's number of outbound links, and adjusted by a damping factor

    Page Rank

    • Orders web pages according to their importance
    • Importance depends on both the number and the importance of the pages that link to a page
    • The PageRank of a page is estimated iteratively from the ranks of the linking pages and the number of outbound links on each of them
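
A minimal sketch of an iterative PageRank computation is shown below. It assumes the commonly used formulation PR(p) = (1 - d)/N + d * sum(PR(q)/out(q)) over all pages q linking to p, with damping factor d = 0.85; the lecture's exact variant may differ. The four-page link graph and the handling of pages without outbound links are assumptions for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outbound links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {}
        for page in pages:
            # Each linking page q passes on its rank, split over its outbound links.
            inbound = sum(rank[q] / len(links[q]) for q in pages if page in links[q])
            new_rank[page] = (1 - damping) / n + damping * (inbound + dangling / n)
        rank = new_rank
    return rank

# Hypothetical link graph: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph A and B each have exactly one inbound link, yet A ends up ranked higher because its link comes from the highly ranked page C, which has no other outbound links.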

    Serving Requests

    • Search terms are looked up in the inverted index to find the relevant URLs
    • High query volume and frequent updates are typical aspects of search engines.

    More on Interaction

    • Internet applications have enormous data amounts, billions of users, and frequent updates
    • Data volumes and user queries are increasing rapidly

    Enter Key-Value Stores

    • Scalable container for pairs of keys and values
    • Non-relational key-value stores give up rich query semantics and accept a simpler data model in exchange for speed, scalability, availability, and flexibility
    • At small scale: a hash table with put, get, and delete operations
    • No aggregation, joins, or transactions
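
At small scale, the interface described above fits in a few lines. The class below is an illustrative in-memory sketch backed by a Python dict; the class name and return conventions are assumptions, and it deliberately offers no aggregation, joins, or transactions.

```python
class KeyValueStore:
    """In-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Write or overwrite the value stored under key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Read the value stored under key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key if present; return True if something was deleted."""
        if key in self._data:
            del self._data[key]
            return True
        return False

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))     # {'name': 'Ada'}
print(store.delete("user:42"))  # True
print(store.get("user:42"))     # None
```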

    Infrastructure & Monitoring

    • Hardware: network, servers, storage
    • Virtualization using containers (Docker, Kata), and VMs (Xen, VMware)
    • Scheduling frameworks (Yarn, Kubernetes, Mesos)

    At "Web-Scale", Failures Are the Norm

    • Web-scale systems must be designed to handle frequent hardware failures
    • Failure is expected rather than an exceptional occurrence
    • The probability that none of n machines fails decreases exponentially as n grows
    • Systems such as Google actively anticipate potential hardware failures and have operational processes to deal with such cases
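
The effect can be checked with a short calculation. The sketch assumes that machines fail independently and uses a made-up per-machine failure probability of 0.1% for the period considered; both are assumptions for illustration, not figures from the lecture.

```python
def prob_no_failure(p_fail, n):
    """Probability that none of n independent machines fails."""
    return (1.0 - p_fail) ** n

def prob_at_least_one_failure(p_fail, n):
    """Probability that at least one of n independent machines fails."""
    return 1.0 - prob_no_failure(p_fail, n)

# Hypothetical per-machine failure probability for the period considered.
P_FAIL = 0.001

for n in (1, 10, 100, 1_000, 10_000):
    print(f"n={n:>6}  P(at least one failure) = {prob_at_least_one_failure(P_FAIL, n):.4f}")
```

Even with very reliable individual machines, the probability of at least one failure approaches 1.0 once the cluster is large enough.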

    Stream Processing

    • Data streams can be potentially unlimited in size
    • Results continuously need to be produced
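
Both properties can be illustrated with Python generators: the source below never terminates, and the operator emits an updated result after every element instead of waiting for the end of the input. The source `sensor_stream` and the choice of a running mean are assumptions for this sketch.

```python
from itertools import count, islice

def sensor_stream():
    """Hypothetical unbounded source: yields 0, 1, 2, ... forever."""
    yield from count()

def running_mean(stream):
    """Emit the mean of all values seen so far after every new element."""
    total = 0
    for seen, value in enumerate(stream, start=1):
        total += value
        yield total / seen

# Take only the first five results; the stream itself never ends.
first_results = list(islice(running_mean(sensor_stream()), 5))
print(first_results)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

The operator keeps only a running total and a counter, so its memory use stays constant no matter how long the stream runs.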

    Big Data System Stack

    • A stack of systems used to manage large data volumes

    Google's Big Data Stack

    • A wide range of services and technologies
    • Hardware, Network, Operating System form the foundation
    • Indexing, MapReduce, Borg, GFS, Gmail, BigTable, Pregel, Chubby are some key components

    Hadoop Stack

    • Widely used open-source stack modeled on Google's architecture and services
    • PigLatin, Hive, Giraph, MapReduce, YARN, HDFS, HBase, and ZooKeeper are some of its components

    HDFS

    • Hadoop Distributed File System (HDFS) - A clone of Google's File System
    • Designed for huge files, mostly append-only writes, high concurrency, and high bandwidth
    • Large data blocks (64 or 128 MB)
    • Primary-secondary architecture (NameNode, DataNode) for metadata and block mapping
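
The following toy sketch shows how a file is cut into large blocks and how a NameNode-like table could map each block to DataNodes. The round-robin placement, the replication factor of 3, and the node names are simplifying assumptions for illustration; they do not reproduce HDFS's actual placement policy.

```python
MIB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MIB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]

def build_block_map(blocks, data_nodes, replication=3):
    """Toy NameNode metadata: block number -> DataNodes holding a replica."""
    return {
        number: [data_nodes[(number + r) % len(data_nodes)] for r in range(replication)]
        for number, _ in enumerate(blocks)
    }

# A hypothetical 300 MiB file on a cluster with four DataNodes.
blocks = split_into_blocks(300 * MIB)
block_map = build_block_map(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # 3
print(block_map[2])  # ['dn3', 'dn4', 'dn1']
```

The last block is shorter than the block size (44 MiB here), and only the small block map has to live on the primary node while the data itself stays on the DataNodes.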

    YARN

    • Yet Another Resource Negotiator, introduced with Hadoop 2
    • Manages the cluster's resources and schedules applications according to their specifications
    • Independent of the application, and has flexibility in the type of scheduler (FIFO, Capacity)

    HBase

    • BigTable clone, extensible row store, key-value store that runs on top of HDFS
    • Provides replication and uses a primary-secondary architecture, with a primary node for metadata and region nodes for data storage
    • Stores semi-structured data such as URLs, user data, and geographic data

    Hadoop MapReduce

    • Clone of Google MapReduce that runs on HDFS
    • Programming model that processes large-scale distributed data using the map and reduce functions
    • Simple parallelization model often using a shared-nothing architecture on "commodity hardware".
    • JobTracker and TaskTracker handle task assignment and execution

    Hive

    • Data warehouse built on top of Hadoop
    • Designed for large, batch-style data warehousing (DWH) queries, which are mapped to MapReduce jobs
    • SQL-like syntax, indexes, and a Derby database for metadata

    ML Systems

    • Software system to run machine learning applications
    • Includes libraries, parameter servers, graph-based systems, linear algebra systems, and deep learning systems

    Big Data Stack

    • Visualization, application, big data systems, and infrastructure.
    • A hierarchical stack to illustrate big data systems

    Big Data Systems

    • Storage, analytical processing, operational processing, and stream processing

    System Evolution

    • Competing trends in system design include specialization and generalization
    • Initial systems are often specialized to a specific functionality or scale, and later generalize into more versatile, applicable frameworks.
    • As systems generalize, they incorporate and adapt existing database concepts and are optimized for performance and efficiency.

    Where are we heading?

    • Unified analytical systems, e.g., Procella at YouTube, for analytics, reporting, and data analysis
    • Includes SQL for analysis, database optimizations, and hardware advancements

    Where are we?

    • Topics include applications for search engines, distributed processing, storage, stream processing, and machine learning, as well as the big data stack.

    Next Part

    • Tuesday: Performance management and measurement
    • Wednesday: First exercise

    Questions?

    • Questions can be submitted via Moodle or by email to the specified address.
    • Q&A sessions are available on campus.
