AI/ML Systems and Data Engineering Fundamentals

Questions and Answers

In the context of deploying machine learning models, what is a key difference between the 'Expectation' and 'Reality' phases?

  • The 'Expectation' phase involves complex data handling and feedback loops, while 'Reality' is a straightforward process.
  • The 'Expectation' phase focuses on continuous data collection, while 'Reality' relies on a one-time dataset.
  • The 'Expectation' phase focuses on selecting the wrong metric, while 'Reality' focuses on selecting the right metric to optimize.
  • The 'Expectation' phase assumes minimal issues and seamless model performance, whereas 'Reality' often involves iterative adjustments and data quality issues. (correct)

What is the MOST important consideration when choosing a metric to optimize for a machine learning project?

  • The metric should align with project objectives and business needs. (correct)
  • The metric should minimize computational resources.
  • The metric should be popular and widely used in the industry.
  • The metric should be easy to calculate.

How does continuous monitoring contribute to the scaling of machine learning models in production?

  • It helps reveal new bottlenecks and detect drifts in model accuracy, prompting necessary updates or retraining. (correct)
  • It ensures that the model always performs optimally without the need of drift detection.
  • It eliminates biases in new data.
  • It reduces the need for retraining the model.

Which of the following BEST describes the role of 'project framing' in the context of deploying machine learning models?

  • It sets the foundation by addressing challenges like optimizing performance, handling data, and ensuring operational stability. (correct)

Why is it important for objectives in a machine learning project to align with business needs and stakeholder expectations?

  • To ensure the project delivers tangible value and addresses relevant business problems. (correct)

Which of the following is an example of an 'Operational Constraint' in ML deployment projects?

  • Time limitations for deployment and user expectations. (correct)

In the typical phases of an ML project, what is the PRIMARY focus of Phase 3: Evaluation & Testing?

  • Validating the model on test data and ensuring performance benchmarks are met. (correct)

What is a key difference between Multiclass and Multilabel classification tasks?

  • Multiclass classification allows an instance to belong to only one class, whereas multilabel classification allows an instance to belong to multiple classes. (correct)

When handling multilabel tasks, what is one approach to transform the problem into a set of simpler problems?

  • Decomposing it into a set of multiple binary classification problems. (correct)

Why is creating ground truth labels considered more challenging in multilabel tasks compared to multiclass tasks?

  • Because it requires determining decision boundaries for multiple labels simultaneously. (correct)

In addition to ML objectives such as performance and latency, what other objectives must be considered for ML projects?

  • Business objectives, like cost, ROI, and regulatory compliance. (correct)

What is the MOST direct way an ML project can increase a company's profits?

  • Cutting costs or increasing sales (ads, conversion rates). (correct)

In mapping ML to business objectives, why is defining the baselines important?

  • To establish a reference point against existing solutions and decide if the ML project is good enough. (correct)

Why is it crucial to consider false negatives vs. false positives in an ML project?

  • Because the cost and impact of each type of error can vary significantly depending on the application. (correct)

What is the role of 'Serialization' in data engineering?

  • It is performed when storing data for later access. (correct)

What is the MAIN purpose of agreeing upon standards for data formats?

  • To serialize data so that it can be reliably transmitted and reconstructed. (correct)

Which question relates MOST to access patterns when considering data formats?

  • How frequently will the data be accessed? (correct)

If your use case involves more frequent access to the rows, which of the following formats are MOST suitable?

  • Row-major formats. (correct)

Which statement BEST describes the difference between Text and Binary data formats?

  • Text formats represent data as character strings, making them human-readable, while binary formats use a more compact, non-human-readable representation. (correct)

What is the primary goal of 'normalization' in a relational database model?

  • Reducing data redundancy and improving data integrity. (correct)

In a relational database, which of the following is true of rows and columns?

  • Neither rows nor columns are ordered. (correct)

How does the SQL model differ from the relational model?

  • The SQL model is a query language, while the relational model is about data structure. (correct)

What is the PRIMARY role of a query optimizer in SQL?

  • To decide the most efficient way to execute a SQL query. (correct)

What is a key reason driving the adoption of NoSQL databases?

  • They address challenges related to schema management. (correct)

In NoSQL databases, how does the 'document model' represent data?

  • As individual documents; relationships between documents are less common. (correct)

In a Graph data model what is given the HIGHEST priority?

  • Complex relationships between data points. (correct)

What is a PRIMARY contrast between structured and unstructured data?

  • Structured data mandates a predefined schema, while unstructured data does not. (correct)

If "structure is assumed at read", is the data structured or unstructured?

  • Unstructured data. (correct)

What is the main operational characteristic of OnLine Transaction Processing (OLTP) systems?

  • Low latency. (correct)

In the context of OnLine Transaction Processing (OLTP), 'Atomicity' in ACID is MOSTLY important because it ensures:

  • All steps in a transaction succeed or fail as a group. (correct)

What is the PRIMARY focus of OnLine Analytical Processing (OLAP) systems?

  • Aggregating information from a large amount of data. (correct)

What is the typical data storage format approach of OnLine Analytical Processing (OLAP)?

  • Column-major. (correct)

In modern data engineering, how does the 'decoupling' of storage and processing benefit data systems?

  • By enabling processing engines that are optimized for different query types to access a central data store. (correct)

What does 'Transform' mean in the context of ETL?

  • It involves cleaning, validating, transposing, and deriving values. (correct)

What is the key difference between ETL and ELT?

  • In ETL, data is transformed before loading, while in ELT, data is loaded before transformation. (correct)

Which of the following are benefits to data relational models?

  • Data normalization leads to standardized data, which means fewer mistakes, easier database updates, and easier localization. (correct)

Third-party data is:

  • Useful for learning features. (correct)

Users' behavioral data (clicks, time spent, etc.) is often system-generated but is considered:

  • User data. (correct)

Beyond simply applying algorithms, what is a crucial aspect of the 'Train Model' step in the 'Expectation' phase of ML deployment?

  • Verifying that the collected dataset is representative and of sufficient quality. (correct)

What is a significant challenge that arises when scaling machine learning models in production environments?

  • The potential exposure of new bottlenecks in the model's performance. (correct)

When defining the 'Framing' of a machine learning project, what central question should be addressed to ensure the project's relevance?

  • Why is this project essential, and what issues does it aim to resolve? (correct)

How should objectives in a machine learning project be established to maximize their impact?

  • They should be clear, measurable, and aligned with business needs and stakeholder expectations. (correct)

What characterizes 'Data Constraints' in the context of deploying machine learning models?

  • Challenges related to labeled data, data privacy, and dataset diversity. (correct)

In the typical phases of an ML project, what is the purpose of Phase 1: Data Collection & Preparation?

  • Gathering relevant data, preprocessing, and handling missing values. (correct)

What is a key characteristic of a multilabel classification task?

  • Each sample can belong to multiple classes simultaneously. (correct)

When adapting a multilabel task for simpler processing, how can the problem be transformed?

  • Transform the problem into a set of binary classification problems. (correct)

What makes creating ground truth labels more challenging in multilabel tasks compared to multiclass tasks?

  • The need to determine all relevant labels that apply to each sample. (correct)

What is a key consideration when establishing objectives for ML projects, beyond performance and latency?

  • Balancing ML factors with cost, ROI, and regulatory compliance. (correct)

Besides increased efficiency and automation, how else can an ML project increase a company's profits?

  • By increasing customer satisfaction or decreasing operational costs. (correct)

Why is defining baselines important when mapping ML to business objectives?

  • To establish a reference point for measuring project impact. (correct)

Why is it crucial to analyze/weigh false negatives vs. false positives in an ML project?

  • To better tailor the model to business objectives. (correct)

In data engineering, what is the role of 'serialization'?

  • Converting data structures into a format for storage or transmission. (correct)

What is the main advantage of agreeing upon standards for data formats?

  • To ensure data can be transmitted and reconstructed later. (correct)

What aspect of data formats is MOST directly related to 'access patterns'?

  • How frequently the data will be accessed. (correct)

If your use case requires more frequent access to the columns, which of the following formats are MOST suitable?

  • Parquet. (correct)

What is a key difference between Text and Binary data formats regarding storage space?

  • Binary formats are typically more compact. (correct)

Why is normalization used in relational database models?

  • To reduce duplication and improve data integrity. (correct)

In a relational database, how are rows and columns characterized?

  • Both rows and columns are unordered. (correct)

How does the SQL model differ from the relational model in handling duplicate data?

  • SQL tables can contain duplicate rows, but true relations cannot. (correct)

What is the PRIMARY function of a query optimizer in SQL?

  • Determining the most efficient execution strategy for a SQL query. (correct)

What primarily drives the movement towards the adoption of NoSQL databases?

  • The inefficiency of SQL in managing schema flexibility. (correct)

In NoSQL databases, what constitutes the central mode of data representation in the 'document model'?

  • Individual documents. (correct)

What is the HIGHEST priority in a Graph data model?

  • Defining relationships between data points. (correct)

What is a PRIMARY characteristic of structured data in contrast to unstructured data?

  • Its schema is clearly defined. (correct)

When dealing with data where "structure is assumed at read," what type of data is it?

  • Unstructured data. (correct)

What characterizes the typical operational pattern of OnLine Transaction Processing (OLTP) systems?

  • Simple operations generated as users act, such as posting a tweet or ordering a ride. (correct)

Within the scope of OnLine Transaction Processing (OLTP), why is 'Atomicity' in ACID essential?

  • It ensures transactions succeed or fail as a group. (correct)

What is the PRIMARY objective of OnLine Analytical Processing (OLAP) systems?

  • To obtain aggregated information from a large amount of processed data. (correct)

What is a standard storage format approach used by OnLine Analytical Processing (OLAP) systems?

  • Column-oriented. (correct)

How does 'decoupling' storage and processing impact modern data systems?

  • It enables more flexibility to optimize for different query types. (correct)

Flashcards

Collect Data

Gather a relevant dataset for training your model.

Train Model

Apply machine learning algorithms to train the model using the collected dataset.

Deploy Model

Move the trained model to production to perform optimally in a real-world environment.

Choosing a Metric to Optimize

Selecting a performance metric that aligns with the project's objectives.

Data Collection in Production

The ongoing process of gathering, processing, and integrating new data to maintain model accuracy in production.

Training and Retraining the Model

The iterative refinement of a model by relabeling, cleaning data, and resolving data quality issues to improve performance.

Handling Edge Cases and Biases

A situation where user feedback and new data reveal model biases or poor performance on edge cases, often requiring a rollback or retraining.

Scaling and Monitoring

Monitoring a deployed model's performance to detect drifts in accuracy and addressing bottlenecks when scaling to handle larger workloads.

Dealing with Setbacks

Unexpected events require revisiting earlier steps, adjusting metrics, or starting from scratch.

Iterate and Improve

An ongoing cycle of data collection, retraining, deploying, and monitoring aimed at refining the model in dynamic, real-world conditions.

Framing the Project

Effective project framing sets the foundation by addressing challenges, optimizing performance, and ensuring operational stability.

Data Constraints

Limited labeled data, data privacy concerns, and the difficulty of obtaining diverse, high-quality datasets, all of which can affect ML model performance and reliability.

Resource Constraints

Computational power, storage, and budgetary limits common in ML deployment projects.

Operational Constraints

Time limitations for deployment, user expectations, and the need for seamless integration in machine learning projects.

Regression

A type of supervised learning where the output is a continuous value.

Classification

Task of assigning an input to one of several categories.

Binary Classification

Classification task with two mutually exclusive classes.

Multiclass Classification

Classification task with more than two mutually exclusive classes.

Multilabel Classification

Classification task where each input can be assigned to multiple classes.

User Behavioral Data

A type of data that is often system-generated but classified as user data due to its reflection of user actions and preferences.

Serialization

Converting data into a format for efficient storage or transmission.

Deserialization

Reconstructing the original data structure from its serialized form so it can be used.

JSON (JavaScript Object Notation)

Text-based data format that is human-readable and widely used for data interchange.

CSV (Comma-Separated Values)

Text-based, human-readable data format that stores data in rows and columns.

Parquet

Binary, column-major data storage format that is efficient for reading and writing large amounts of data.

Row-major Format

Data is stored row by row, which makes reading and retrieving whole rows (samples) fast.

Column-major Format

Data is stored column by column, which makes reading and retrieving individual columns (features) fast.

Relational Model

Data is organized into relations (tables), each made up of rows that share the same set of columns.

SQL

A query language used for managing and manipulating data stored in relational databases.

NoSQL Databases

A database that provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Document Model

Data is stored as self-contained documents, each treated as an individual record.

Graph Model

A database organized around nodes (entities) and edges (relationships).

Structured Data

A type of data that conforms to a defined schema, making it easy to search and analyze.

Unstructured Data

A type of data that does not conform to a pre-defined schema, offering flexibility but requiring more effort to analyze.

OnLine Transaction Processing (OLTP)

A database system optimized for handling real-time operational data and transaction processing.

OnLine Analytical Processing (OLAP)

A database system optimized for complex queries that aggregate information from large data volumes for analytics.

ETL

Extract, Transform, Load: extracting data from source databases, transforming it into the desired format, and loading it into a target database.

Study Notes

Lecture 1: AI/ML Systems Fundamentals and Data Engineering 101

Introductions

  • Course attendees introduce themselves by stating their name, background/profession, what they hope to get out of the course, and any fun thing to share.

ML in Production: Expectation vs Reality

  • Ideally, deploying an ML model is straightforward: collect data, train a model with ML algorithms, and deploy it for optimal performance.
  • In reality, it is an iterative process with continuous adjustments and feedback loops.
  • Optimizing an ML model begins with choosing performance metrics that align with project objectives.
  • Choosing the wrong metric can lead to misleading results.
  • New data must be continuously collected, processed, and integrated to maintain model accuracy.
  • Training reveals data quality issues that may require relabeling, cleaning, or starting over.
  • Iterations aim to improve the model, but new issues can arise, such as poor performance on specific classes or overfitting.
  • After deployment, user feedback and new data may expose biases or poor performance on edge cases.
  • Such problems may necessitate a rollback to a previous version or extensive retraining.
  • Scaling the model to handle larger workloads may reveal new bottlenecks.
  • Continuous monitoring is essential to detect drifts in model accuracy, which may call for frequent updates or a complete retraining pipeline.
  • Deployment rarely goes as planned, which can mean drops in performance, revenue impacts, or operational issues.
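
The monitoring point above can be illustrated with a small sketch. It tracks accuracy over a sliding window of recent predictions and flags drift when accuracy falls more than a tolerance below a baseline. The class name `AccuracyMonitor`, the window size, and the tolerance are illustrative assumptions, not something the lecture prescribes.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline              # accuracy measured before deployment
        self.tolerance = tolerance            # allowed drop before flagging drift
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, predicted, actual) -> None:
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self) -> float:
        if not self.outcomes:
            return self.baseline              # no evidence yet, assume baseline
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self) -> bool:
        return self.accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.90, window=10, tolerance=0.05)

# Nine correct predictions and one miss: accuracy stays at the baseline.
for predicted, actual in [(1, 1)] * 9 + [(1, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.drifted()

# Five more misses push older correct predictions out of the window.
for predicted, actual in [(1, 0)] * 5:
    monitor.record(predicted, actual)
degraded = monitor.drifted()
```

In practice, ground-truth labels often arrive late, so a real monitor would likely also watch input distributions; this sketch only covers the accuracy side.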

Project Considerations

  • Effective project framing sets the foundation and addresses the challenges of deploying ML models in real-world environments.
  • Focus on optimizing performance, handling data challenges, and ensuring operational stability.
  • It is essential to ask why the project matters: efforts should aim at identifying and mitigating post-deployment issues and at improving model robustness and reliability.
  • Objectives should include enhancing model accuracy across datasets, reducing the resources needed for retraining, and building continuous feedback loops.
  • Common constraints for ML deployment fall into three groups: data, resources, and operations.
  • Typical constraints include limited labeled data and data privacy (data); storage and computational power (resources); and time limits and user expectations (operations).
  • The project is structured into phases: Data Collection & Preparation, Model Development & Training, and Evaluation & Testing.
  • Evaluation & Testing validates the model and ensures performance benchmarks are met.
  • Setting up monitoring systems and schedules comes last.

Framing the Problem

  • Task types include regression and classification; classification can be binary, multiclass, or multilabel.
  • In multilabel problems, an example can belong to multiple classes at once; in multiclass problems, it belongs to exactly one.
  • A multilabel task can be handled as a set of binary classification problems, one per class, instead of as a single multiclass problem.
  • It is harder to create ground truth labels and to set decision boundaries for multilabel tasks.
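
As a minimal sketch of the decomposition described above, the snippet below turns multilabel targets into one binary (0/1) target vector per class. The function name `to_binary_problems` and the article-tag data are hypothetical.

```python
def to_binary_problems(samples, classes):
    """Turn multilabel targets into one 0/1 target vector per class."""
    return {
        cls: [1 if cls in labels else 0 for labels in samples]
        for cls in classes
    }


# Hypothetical article tags: each sample can carry several labels, or none.
samples = [{"tech", "finance"}, {"sports"}, {"tech"}, set()]
classes = ["finance", "sports", "tech"]

binary_targets = to_binary_problems(samples, classes)
# binary_targets["tech"] is [1, 0, 1, 0]: a separate binary classifier
# could be trained against each of these vectors.
```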

Project Objectives

  • ML objectives target performance and latency.
  • Business objectives aim to reduce cost, improve return on investment (ROI), and ensure regulatory compliance.
  • ML projects should aim to increase profits, either directly (through sales or conversion rates) or indirectly (through improved customer satisfaction).
  • It is important to establish baselines, set usefulness thresholds, and weigh false negatives against false positives.
  • It is also important to measure and consider interpretability and confidence in predictions.
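
To see why weighing false negatives against false positives matters, here is a small sketch that picks a decision threshold by total error cost. The scores, labels, candidate thresholds, and cost values are all made-up assumptions for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    fp, fn = confusion_counts(scores, labels, threshold)
    return fp * fp_cost + fn * fn_cost


# Hypothetical model scores with their true labels.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]
thresholds = [0.3, 0.5, 0.7]

# A missed positive costs 10x a false alarm (e.g. a screening task).
best_when_fn_costly = min(
    thresholds, key=lambda t: total_cost(scores, labels, t, fp_cost=1, fn_cost=10)
)
# A false alarm costs 10x a miss (e.g. blocking a legitimate user).
best_when_fp_costly = min(
    thresholds, key=lambda t: total_cost(scores, labels, t, fp_cost=10, fn_cost=1)
)
```

With these numbers the costly-miss setting prefers the lowest threshold (0.3) and the costly-false-alarm setting prefers the highest (0.7), even though the model and data are identical.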

Data Engineering 101

  • Data engineering involves data sources, formats, models, storage engines, and processing.
  • Data sources include user-generated and system-generated data.
  • Internal databases and third-party data also act as data sources.

Data Sources

  • User-generated data includes user inputs.
  • System-generated data includes logs, metadata, and predictions.
  • Third-party data includes social media data, income data, and demographic data.
  • Important questions include the use cases for the data, storage considerations, and accessibility requirements.

Data Storage

  • Data serialization is critical for storing and accessing data.
  • Serialization converts data into a format that can be stored or transmitted and reconstructed later.
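
A minimal round-trip sketch using Python's standard `json` module shows serialization and deserialization together. The ride record is a made-up example.

```python
import json

# Hypothetical in-memory record.
record = {"ride_id": 42, "rider": "ada", "fare": 12.5, "tags": ["pool", "airport"]}

serialized = json.dumps(record)        # data structure -> JSON text
payload = serialized.encode("utf-8")   # text -> bytes, ready to store or send

# Later, or on another machine: bytes -> text -> data structure.
restored = json.loads(payload.decode("utf-8"))
```

The round trip works because both sides agree on the format (JSON encoded as UTF-8), which is the point of agreeing on data format standards.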

Data Formats

  • Questions to consider include how frequently the data will be accessed and what hardware it will run on.
  • Human-readable (text) formats include JSON and CSV.
  • Binary formats include Parquet, Avro, Protobuf, and Pickle.
  • Unloading data from Amazon Redshift to Amazon S3 in Parquet (binary) format can be up to 2x faster and consume up to 6x less storage than text formats.
  • Column-major storage retrieves data column by column, which suits accessing features.
  • Row-major storage retrieves data row by row, which suits accessing samples.
Row-Major vs Column-Major

  • Pandas DataFrames use a column-major format, which favors accessing data by column.
  • NumPy ndarrays use a row-major format by default, with the option to specify column-major storage.
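
The difference between the two layouts is easiest to see by flattening a small matrix both ways. The sketch below uses plain Python lists rather than the NumPy or pandas APIs; the helper `read_column` is hypothetical.

```python
matrix = [
    [1, 2, 3],   # sample 0
    [4, 5, 6],   # sample 1
]
n_rows, n_cols = len(matrix), len(matrix[0])

# Row-major: all values of a row sit next to each other in memory.
row_major = [matrix[r][c] for r in range(n_rows) for c in range(n_cols)]

# Column-major: all values of a column sit next to each other.
col_major = [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]


def read_column(flat, col, order):
    """Read one column from a flattened matrix in the given layout."""
    if order == "row":
        return flat[col::n_cols]                   # strided access
    return flat[col * n_rows:(col + 1) * n_rows]   # one contiguous slice


column_from_row_major = read_column(row_major, 1, "row")
column_from_col_major = read_column(col_major, 1, "col")
```

Both reads return the same column, but the column-major read touches one contiguous run of memory, which is why columnar layouts suit feature-wise access.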

Data Models

  • Data models describe how data is represented.
  • Two main paradigms exist: relational and NoSQL.

Relational Model

  • The relational model is similar, but not identical, to the SQL model; tabular formats such as CSV and Parquet follow it.
  • Normalization structures the database to reduce redundancy and dependency.

Relational Model (Normalization)

  • Pros: fewer mistakes, easier updates, and easier localization.
  • Cons: joins across multiple large tables are slow.
  • The SQL model differs from the relational model; for example, SQL tables can contain duplicate rows, whereas true relations cannot.
  • SQL is a declarative language: a query specifies what data is wanted, and the query optimizer works out how to execute it.
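
The trade-offs of normalization and the declarative style of SQL can be sketched with Python's built-in `sqlite3` module. The table names, columns, and rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized design: publisher details live in exactly one place.
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        publisher_id INTEGER REFERENCES publishers(id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press', 'UK');
    INSERT INTO books VALUES (1, 'First Book', 1), (2, 'Second Book', 1);
""")

# Easier updates: one UPDATE corrects the country for every book.
conn.execute("UPDATE publishers SET country = 'US' WHERE id = 1")

# Declarative query: it states what is wanted; the engine's query
# optimizer decides how to execute the join.
rows = conn.execute("""
    SELECT b.title, p.name, p.country
    FROM books AS b
    JOIN publishers AS p ON p.id = b.publisher_id
    ORDER BY b.title
""").fetchall()
```

The price of this design is the join itself, which is what becomes slow when the tables are large.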

SQL

  • Understanding and using SQL is essential for data scientists.
  • Schema changes, such as adding a column or changing a column's type, pose additional challenges.

NoSQL

  • NoSQL (Not Only SQL) includes document and graph models.
  • NoSQL emerged amid negative reactions to relational database technology.
  • The main pain point driving NoSQL adoption is schema management.

NoSQL (Document and Graph Models)

  • The document model has the document as its central concept.
  • In document models, relationships between documents are usually rare.
  • Graph models emphasize the relationships (edges) between nodes.
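
A rough sketch of the two models in plain Python: a document is a self-contained nested record, while a graph stores nodes and the edges between them so that relationship queries are natural. The data and the helper `hops` are hypothetical.

```python
from collections import deque

# Document model: everything about one book lives in one record.
book_document = {
    "title": "Example Book",
    "author": "A. Writer",
    "chapters": [
        {"number": 1, "title": "Intro"},
        {"number": 2, "title": "Data"},
    ],
}
chapter_titles = [chapter["title"] for chapter in book_document["chapters"]]

# Graph model: nodes (users) and directed edges (who follows whom).
follows = {
    "ana": ["ben"],
    "ben": ["cho"],
    "cho": ["dev"],
    "dev": [],
}


def hops(graph, start, goal):
    """Breadth-first search: edges on the shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


distance = hops(follows, "ana", "dev")
unreachable = hops(follows, "dev", "ana")
```

Reading a whole document needs no joins, whereas a question like "how many hops separate two users" is awkward in a document store and direct in a graph.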

Structured vs. Unstructured Data

  • Structured data has a defined schema and is easy to search and analyze.
  • Unstructured data is flexible to changes and can handle data from any source.
  • For structured data, structure is assumed at write time.
  • For unstructured data, structure is assumed at read time.
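
The write-time versus read-time distinction can be sketched as two small functions: one validates records against a schema before storing them, the other stores raw text and imposes structure only when reading. The schema, field names, and default values are assumptions made for this example.

```python
import json

SCHEMA = {"user_id": int, "event": str}  # hypothetical schema


def write_structured(store, record):
    """Structure assumed at write: reject records that break the schema."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    store.append(record)


def read_unstructured(raw_lines):
    """Structure assumed at read: interpret raw lines only when needed."""
    for line in raw_lines:
        data = json.loads(line)
        yield {
            "user_id": int(data.get("user_id", -1)),
            "event": str(data.get("event", "unknown")),
        }


table = []
write_structured(table, {"user_id": 1, "event": "click"})
try:
    write_structured(table, {"user_id": "1", "event": "click"})
    rejected = False
except ValueError:
    rejected = True   # wrong type is caught before anything is stored

# Raw storage accepts anything; the reader decides what it means.
raw = ['{"user_id": "2", "event": "view", "extra": true}', '{"event": "scroll"}']
parsed = list(read_unstructured(raw))
```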

Choosing the Right Data Storage Engine

  • Online transaction processing (OLTP) handles transactions, such as requesting a ride on Lyft.
  • Online analytical processing (OLAP) handles analytical queries, such as aggregating data from a service like Lyft.
  • Newer paradigms decouple storage from processing.
  • Data can be stored in one place, with a processing layer on top that can be optimized for different query types.

Differences between OLTP and OLAP

OLTP:

  • Low latency
  • High availability
  • Atomicity, Consistency, Isolation, Durability (ACID) guarantees are common but not strictly necessary
  • Typically row-major

OLAP:

  • Complex queries on large volumes of data
  • Acceptable response times range from seconds to minutes or even hours
  • Typically column-major
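
Atomicity, the "A" in ACID, can be sketched with `sqlite3`: a payment consists of two updates that must succeed or fail together. The account names, balances, and the `pay` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("rider", 20), ("driver", 0)])
conn.commit()


def pay(conn, amount):
    """Move `amount` from rider to driver; both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'driver'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'rider'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = pay(conn, 15)       # succeeds: rider 5, driver 15
failed = pay(conn, 50)   # the debit violates the CHECK, so the credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed payment the driver is credited first, yet the final balances show no trace of that credit, which is exactly what "succeed or fail as a group" means.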

ETL (Extract, Transform, Load)

  • ETL stands for Extract, Transform, Load, and it is everywhere.
  • It is commonly done in batches.
  • Transform is the meaty part: cleaning, validating, transposing, deriving values, joining from multiple sources, deduplicating, splitting, aggregating, and so on.
  • It typically works by extracting standardized data to unstructured data, then changing it back through tools and infrastructure
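
A toy batch ETL run, sketched with the standard library, shows several of the Transform operations listed above (cleaning, validating, deduplicating, deriving values). The CSV contents, column names, and target table are invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy city names, a duplicate, and a missing fare.
RAW_CSV = """ride_id,city,fare_usd,distance_km
1, Austin ,12.50,5.0
2,austin,8.00,4.0
2,austin,8.00,4.0
3,Boston,,3.0
4,Boston,15.00,6.0
"""


def extract(text):
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    seen, clean = set(), []
    for row in rows:
        if not row["fare_usd"]:          # validate: drop rows missing a fare
            continue
        if row["ride_id"] in seen:       # deduplicate on ride_id
            continue
        seen.add(row["ride_id"])
        fare = float(row["fare_usd"])
        distance = float(row["distance_km"])
        clean.append((
            int(row["ride_id"]),
            row["city"].strip().title(),  # clean: normalize city names
            fare,
            round(fare / distance, 2),    # derive: fare per kilometer
        ))
    return clean


def load(conn, rows):
    conn.execute(
        "CREATE TABLE rides (ride_id INTEGER, city TEXT, fare_usd REAL, fare_per_km REAL)"
    )
    conn.executemany("INSERT INTO rides VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# An OLAP-style aggregation over the loaded data.
summary = conn.execute(
    "SELECT city, COUNT(*), SUM(fare_usd) FROM rides GROUP BY city ORDER BY city"
).fetchall()
```

Swapping the order of `transform` and `load`, so that raw rows land in the target store first and are cleaned there, would turn this into ELT.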
