Questions and Answers
In the context of deploying machine learning models, what is a key difference between the 'Expectation' and 'Reality' phases?
- The 'Expectation' phase involves complex data handling and feedback loops, while 'Reality' is a straightforward process.
- The 'Expectation' phase focuses on continuous data collection, while 'Reality' relies on a one-time dataset.
- 'Expectation' focuses on selecting the wrong metric, while 'Reality' focuses on selecting the right metric to optimize
- The 'Expectation' phase assumes minimal issues and seamless model performance, whereas 'Reality' often involves iterative adjustments and data quality issues. (correct)
What is the MOST important consideration when choosing a metric to optimize for a machine learning project?
- The metric should align with project objectives and business needs. (correct)
- The metric should minimize computational resources.
- The metric should be popular and widely used in the industry.
- The metric should be easy to calculate.
How does continuous monitoring contribute to the scaling of machine learning models in production?
- It helps reveal new bottlenecks and detect drifts in model accuracy, prompting necessary updates or retraining. (correct)
- It ensures that the model always performs optimally without the need of drift detection.
- It eliminates biases in new data.
- It reduces the need for retraining the model.
Which of the following BEST describes the role of 'project framing' in the context of deploying machine learning models?
Why is it important for objectives in a machine learning project to align with business needs and stakeholder expectations?
Which of the following is an example of an 'Operational Constraint' in ML deployment projects?
In the typical phases of an ML project, what is the PRIMARY focus of Phase 3: Evaluation & Testing?
What is a key difference between Multiclass and Multilabel classification tasks?
When handling multilabel tasks, what is one approach to transform the problem into a set of simpler problems?
Why is creating ground truth labels considered more challenging in multilabel tasks compared to multiclass tasks?
In addition to ML objectives such as performance and latency, what other objectives must be considered for ML projects?
What is the MOST direct way an ML project can increase a company's profits?
In mapping ML to business objectives, why is defining the baselines important?
Why is it crucial to consider false negatives vs. false positives in an ML project?
What is the role of 'Serialization' in data engineering?
What is the MAIN purpose of agreeing upon standards for data formats?
Which question relates MOST to access patterns when considering data formats?
If your use case involves more frequent access to the rows, which of the following formats are MOST suitable?
Which statement BEST describes the difference between Text and Binary data formats?
What is the primary goal of 'normalization' in a relational database model?
In a relational database, which of the following is true of rows and columns?
How does the SQL model differ from the relational model?
What is the PRIMARY role of a query optimizer in SQL?
What is a key reason driving the adoption of NoSQL databases?
In NoSQL databases, how does the 'document model' represent data?
In a Graph data model what is given the HIGHEST priority?
What is a PRIMARY contrast between structured and unstructured data?
If "structure is assumed at read", is the data structured or unstructured?
What is the main operational characteristic of OnLine Transaction Processing (OLTP) systems?
In the context of OnLine Transaction Processing (OLTP), 'Atomicity' in ACID is MOSTLY important because it ensures:
What is the PRIMARY focus of OnLine Analytical Processing (OLAP) systems?
What is the typical data storage format approach of OnLine Analytical Processing (OLAP)?
In modern data engineering, how does the 'decoupling' of storage and processing benefit data systems?
What does 'Transform' mean in the context of ETL?
What is the key difference between ETL and ELT?
Which of the following are benefits to data relational models?
Third-party data is:
Users' behavioral data (clicks, time spent, etc.) is often system-generated but is considered:
Beyond simply applying algorithms, what is a crucial aspect of the 'Train Model' step in the 'Expectation' phase of ML deployment?
What is a significant challenge that arises when scaling machine learning models in production environments?
When defining the 'Framing' of a machine learning project, what central question should be addressed to ensure the project's relevance?
How should objectives in a machine learning project be established to maximize their impact?
What characterizes 'Data Constraints' in the context of deploying machine learning models?
In the typical phases of an ML project, what is the purpose of Phase 1: Data Collection & Preparation?
What is a key characteristic of a multilabel classification task?
When adapting a multilabel task for simpler processing, how can the problem be transformed?
What makes creating ground truth labels more challenging in multilabel tasks compared to multiclass tasks?
What is a key consideration when establishing objectives for ML projects, beyond performance and latency?
Besides increased efficiency and automation, how else can an ML project increase a company's profits?
Why is defining baselines important when mapping ML to business objectives?
Why is it crucial to analyze/weigh false negatives vs. false positives in an ML project?
In data engineering, what is the role of 'serialization'?
What is the main advantage of agreeing upon standards for data formats?
What aspect of data formats is MOST directly related to 'access patterns'?
If your use case requires more frequent access to the columns, which of the following formats are MOST suitable?
What is a key difference between Text and Binary data formats regarding storage space?
Why is normalization used in relational database models?
In a relational database, how are rows and columns characterized?
How does the SQL model differ from the relational model in handling duplicate data?
What is the PRIMARY function of a query optimizer in SQL?
What primarily drives the movement towards the adoption of NoSQL databases?
In NoSQL databases, what constitutes the central mode of data representation in the 'document model'?
What is the HIGHEST priority in a Graph data model?
What is a PRIMARY characteristic of structured data in contrast to unstructured data?
When dealing with data where "structure is assumed at read," what type of data is it?
What characterizes the typical operational pattern of OnLine Transaction Processing (OLTP) systems?
Within the scope of OnLine Transaction Processing (OLTP), why is 'Atomicity' in ACID essential?
What is the PRIMARY objective of OnLine Analytical Processing (OLAP) systems?
What is a standard storage format approach used by OnLine Analytical Processing (OLAP) systems?
How does 'decoupling' storage and processing impact modern data systems?
Flashcards
Collect Data
Gather a relevant dataset for training your model.
Train Model
Apply machine learning algorithms to train the model using the collected dataset.
Deploy Model
Move the trained model to production to perform optimally in a real-world environment.
Choosing a Metric to Optimize
Data Collection in Production
Training and Retraining the Model
Handling Edge Cases and Biases
Scaling and Monitoring
Dealing with Setbacks
Iterate and Improve
Framing the Project
Data Constraints
Resource Constraints
Operational Constraints
Regression
Classification
Binary Classification
Multiclass Classification
Multilabel Classification
User Behavioral Data
Serialization
Deserialization
JSON (JavaScript Object Notation)
CSV (Comma-Separated Values)
Parquet
Row-major Format
Column-major Format
Relational Model
SQL
NoSQL Databases
Document Model
Graph Model
Structured Data
Unstructured Data
OnLine Transaction Processing (OLTP)
OnLine Analytical Processing (OLAP)
ETL
Study Notes
Lecture 1: AI/ML Systems Fundamentals and Data Engineering 101
Introductions
- Course attendees introduce themselves by stating their name, background/profession, what they hope to get out of the course, and any fun thing to share.
ML in Production: Expectation vs Reality
- Ideally, deploying ML models should be straightforward, beginning with collecting data, training a model via ML algorithms, and deploying for optimal performance
- In reality, it is an iterative process with continuous adjustments and feedback loops.
- Optimizing ML models begins with choosing performance metrics that align with project objectives
- Choosing the wrong metric can lead to misleading results
- New data must be continuously collected, processed, and integrated to maintain model accuracy
- Training reveals data quality issues that may require relabeling, cleaning, or starting over.
- Iterations aim to improve the model, but new issues can arise, such as poor performance on specific classes or overfitting.
- User feedback and new data may expose biases or poor performance on edge cases after deployment
- It may necessitate a rollback to previous versions or extensive retraining.
- Scaling the model to handle larger workloads may reveal new bottlenecks, so continuous monitoring is essential
- Monitoring also detects drift in model accuracy, which may require frequent updates or a complete retraining pipeline
- Deployment rarely goes as planned, which can mean drops in performance, revenue impacts, or operational issues
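The monitoring point above can be illustrated with a small sketch. It assumes that ground truth labels eventually arrive for production predictions; the `AccuracyMonitor` class, the window size, and the tolerance are hypothetical choices for illustration, not something prescribed by the lecture.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over a sliding window of recent labeled predictions."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)

    def record(self, prediction, label):
        """Store whether one prediction matched its ground truth label."""
        self.outcomes.append(prediction == label)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drift_detected(self):
        """Flag drift when rolling accuracy falls below baseline minus tolerance."""
        accuracy = self.rolling_accuracy()
        if accuracy is None:
            return False
        return accuracy < self.baseline_accuracy - self.tolerance


monitor = AccuracyMonitor(baseline_accuracy=0.90, window_size=10, tolerance=0.05)

# Nine correct predictions and one wrong one: accuracy matches the baseline.
for prediction, label in [(1, 1)] * 9 + [(1, 0)]:
    monitor.record(prediction, label)
healthy = monitor.drift_detected()

# Five more wrong predictions push older correct ones out of the window.
for prediction, label in [(1, 0)] * 5:
    monitor.record(prediction, label)
degraded = monitor.drift_detected()

print(healthy, degraded, monitor.rolling_accuracy())
```

In practice a drift flag like this would typically trigger an alert, a rollback, or a retraining job rather than a print statement.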
Project Considerations
- Effective project framing sets the foundation and addresses the challenges of deploying ML models in real-world environments.
- Focus on optimizing performance, handling data challenges, and ensuring operational stability.
- It is essential to ask why the project matters: efforts should be aimed at identifying and mitigating post-deployment issues and at improving model robustness and reliability.
- Objectives should include enhancing model accuracy across datasets, reducing the resources needed for retraining, and building continuous feedback loops
- Common constraints for ML deployment include data, resources, and operations
- Limited labeled data, data privacy, storage, computational power, time constraints, and user expectations are typical constraints
- The project is structured into phases, including Data Collection & Preparation, Model Development & Training, and Evaluation & Testing
- Evaluation & Testing validates the model and ensures performance benchmarks are met
- Setting up monitoring systems and schedules comes last
Framing the Problem
- Task types include regression and classification; classification can be binary, multiclass, or multilabel
- In multilabel problems, an example can belong to multiple classes at once; in multiclass problems, each example belongs to exactly one class
- Multilabel tasks can be handled as a set of binary problems (one per class) instead of a single multiclass problem
- It is harder to create ground truth labels and to choose decision boundaries for multilabel tasks
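As a minimal sketch of the "set of binary problems" approach mentioned above, the following splits a multilabel dataset into one binary dataset per class. The sample texts, the class names, and the `to_binary_problems` helper are made up for illustration.

```python
def to_binary_problems(samples, classes):
    """Split a multilabel dataset into one binary dataset per class.

    Each sample is a (features, set_of_labels) pair. For every class, the
    binary target is 1 if the sample carries that label and 0 otherwise.
    """
    return {
        cls: [(features, int(cls in labels)) for features, labels in samples]
        for cls in classes
    }


# Hypothetical articles: the first one belongs to two classes at once.
samples = [
    ("article about a tech IPO", {"tech", "finance"}),
    ("article about a football match", {"sports"}),
    ("article about a chip shortage", {"tech"}),
]
classes = ["tech", "finance", "sports"]

binary_problems = to_binary_problems(samples, classes)
for cls, dataset in binary_problems.items():
    print(cls, [target for _, target in dataset])
```

Each resulting dataset can then be used to train an independent binary classifier, with the predicted label set being the classes whose classifier fires.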
Project Objectives
- ML objectives target performance and latency
- Business objectives aim to reduce cost, improve return on investment (ROI), and ensure regulatory compliance.
- ML projects should aim to increase profits either directly (through sales or conversion rates) or indirectly (through improved client satisfaction)
- It is important to establish baselines, set usefulness thresholds, and weigh false negatives against false positives
- It is also important to measure and consider interpretability and confidence in predictions
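To make the baseline and false negative vs. false positive points concrete, here is a small sketch. The labels, the predictions, and the cost values (a false negative assumed to be 10x as costly as a false positive) are invented for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def total_cost(counts, fn_cost, fp_cost):
    """Weigh the two error types by their (assumed) business cost."""
    return counts["fn"] * fn_cost + counts["fp"] * fp_cost


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_model = [1, 0, 0, 1, 0, 0, 0, 0]
y_baseline = [0] * len(y_true)  # trivial baseline: always predict the majority class

counts = confusion_counts(y_true, y_model)
predicted_positive = counts["tp"] + counts["fp"]
actual_positive = counts["tp"] + counts["fn"]
precision = counts["tp"] / predicted_positive if predicted_positive else 0.0
recall = counts["tp"] / actual_positive if actual_positive else 0.0

# Assumed costs: missing a positive is 10x worse than a false alarm.
model_cost = total_cost(counts, fn_cost=10, fp_cost=1)
baseline_cost = total_cost(confusion_counts(y_true, y_baseline), fn_cost=10, fp_cost=1)

print(counts, precision, recall, model_cost, baseline_cost)
```

Comparing `model_cost` with `baseline_cost` is one simple way to check whether a model clears a usefulness threshold; with different cost assumptions the ranking can flip.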
Data Engineering 101
- Data engineering involves data sources, formats, models, storage engines, and processing
- Data sources include user-generated and system-generated data
- Internal databases and third-party data also act as data sources
Data Sources
- User-generated data includes user inputs.
- System-generated data includes logs, metadata, and predictions.
- Third-party data includes social media data, income data, and demographic data
- Important questions include the use cases for the data, storage considerations, and accessibility requirements
Data Storage
- Data serialization is critical for storing and accessing data
- Serialization converts data into a format that can be stored or transmitted and reconstructed later (deserialization)
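A minimal round-trip sketch of serialization and deserialization, using Python's standard `json` (text) and `pickle` (binary) modules. The `record` dictionary is a made-up example.

```python
import json
import pickle

# Hypothetical in-memory record to be stored and reconstructed later.
record = {"user_id": 42, "clicks": [3, 7], "premium": True}

# Text serialization: human-readable, portable across languages.
text_bytes = json.dumps(record).encode("utf-8")
restored_from_json = json.loads(text_bytes.decode("utf-8"))

# Binary serialization: Python-specific and not human-readable.
binary_bytes = pickle.dumps(record)
restored_from_pickle = pickle.loads(binary_bytes)

print(text_bytes)
print(restored_from_json == record, restored_from_pickle == record)
```

Note that `pickle` should only be used to load data from trusted sources, since unpickling can execute arbitrary code.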
Data Formats
- Questions to consider include how frequently the data is accessed and what hardware will be used to process it
- Human-readable (text) formats include JSON and CSV
- Binary formats include Parquet, Avro, Protobuf, and Pickle
- Unloading data from Amazon Redshift to Amazon S3 in Parquet (a binary format) can be up to 2x faster and consume up to 6x less storage than text formats.
- Column-major storage retrieves column-by-column for feature access
- Row-major storage retrieves row-by-row for accessing samples
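The storage difference between text and binary formats can be seen on a tiny scale. This sketch encodes the same three integers as comma-separated text and as packed 32-bit integers using the standard `struct` module; the values are arbitrary, and real formats such as Parquet add compression and metadata on top of this idea.

```python
import struct

values = [1_000_000, 2_000_000, 3_000_000]

# Text encoding: every digit and separator takes one byte.
text_encoded = ",".join(str(v) for v in values).encode("ascii")

# Binary encoding: each value is a fixed 4-byte little-endian int32.
fmt = f"<{len(values)}i"
binary_encoded = struct.pack(fmt, *values)

# The binary form is not human-readable but decodes back to the same values.
decoded = list(struct.unpack(fmt, binary_encoded))

print(len(text_encoded), len(binary_encoded), decoded)
```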
Row-Major vs Column-Major
- Pandas DataFrames are built around a column-major format, so accessing a column is fast
- NumPy ndarrays use a row-major format by default, with the option to specify column-based storage.
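To show what the two layouts mean, the sketch below flattens a small 2x3 table into a one-dimensional buffer in both orders using plain Python. It does not call NumPy or pandas; the helper functions are hypothetical and only illustrate the index arithmetic.

```python
def row_major_offset(row, col, n_rows, n_cols):
    """Position of element (row, col) when rows are stored one after another."""
    return row * n_cols + col


def column_major_offset(row, col, n_rows, n_cols):
    """Position of element (row, col) when columns are stored one after another."""
    return col * n_rows + row


table = [
    [1, 2, 3],
    [4, 5, 6],
]
n_rows, n_cols = 2, 3

# The same table laid out in memory in the two orders.
row_major = [table[r][c] for r in range(n_rows) for c in range(n_cols)]
column_major = [table[r][c] for c in range(n_cols) for r in range(n_rows)]

# Where the elements of column 1 (values 2 and 5) live in each layout.
col_positions_row_major = [row_major_offset(r, 1, n_rows, n_cols) for r in range(n_rows)]
col_positions_column_major = [column_major_offset(r, 1, n_rows, n_cols) for r in range(n_rows)]

print(row_major, col_positions_row_major)
print(column_major, col_positions_column_major)
```

In the column-major layout the elements of a column sit next to each other, which is why reading a feature (column) is cheaper there, while the row-major layout keeps each sample (row) contiguous.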
Data Models
- Data models describe how data is represented
- Two main paradigms exist: relational and NoSQL
Relational Model
- The relational model is closely related to the SQL model; tabular formats such as CSV and Parquet follow it
- Normalization involves structuring the database to reduce redundancy and dependency
Relational Model (Normalization)
- Pros include fewer mistakes, easier updates, and easier localization
- Cons include slow joins across multiple large tables
- The SQL model differs from the relational model; for example, SQL tables can contain duplicate rows, whereas a relation cannot
- SQL is a declarative language that specifies what data is wanted and relies on optimization to figure out how to execute the query
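The trade-off described above can be sketched with Python's built-in `sqlite3` module. The `publishers` and `books` tables and their contents are hypothetical; the point is that normalized data is updated in one place but needs a join to be read back together, and that the SQL query only states what is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: publisher details live in one table and are
# referenced by id, instead of being repeated on every book row.
conn.executescript("""
    CREATE TABLE publishers (
        publisher_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        publisher_id INTEGER NOT NULL REFERENCES publishers(publisher_id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press', 'US');
    INSERT INTO books VALUES (1, 'Book A', 1), (2, 'Book B', 1);
""")

# Pro: renaming the publisher is a single-row update.
conn.execute("UPDATE publishers SET name = 'Example Press Ltd' WHERE publisher_id = 1")

# Con: reading books with publisher names requires a join.
# The query is declarative; the engine decides how to execute it.
rows = conn.execute("""
    SELECT books.title, publishers.name
    FROM books
    JOIN publishers ON books.publisher_id = publishers.publisher_id
    ORDER BY books.book_id
""").fetchall()
conn.close()

print(rows)
```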
SQL
- Understanding and using SQL is essential for data scientists
- Additional challenges exist if there is a need to add a column or change a column type.
NoSQL
- NoSQL (Not Only SQL) includes document and graph models.
- There were negative reactions to relational database technology when NoSQL emerged
- A main pain point driving NoSQL adoption is schema management in relational databases
NoSQL (Document and Graph Models)
- Document model has documents as a central concept.
- In document models, relationships between documents are rare
- Graph models prioritize the relationships (edges) between nodes
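A rough sketch of the two NoSQL models using plain Python structures. The book document and the "follows" graph are invented examples; real document and graph databases provide their own storage and query languages.

```python
# Document model: one self-contained document holds everything about a book.
document = {
    "title": "Book A",
    "publisher": {"name": "Example Press", "country": "US"},
    "chapters": [{"number": 1, "title": "Intro"}],
}

# Graph model: the relationships (edges) between nodes are the main content.
# Here each user maps to the set of users they follow.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}


def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Document access: everything is nested inside the one document.
publisher_name = document["publisher"]["name"]

# Graph access: the question is about connections between nodes.
alice_reach = reachable(follows, "alice")

print(publisher_name, sorted(alice_reach))
```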
Structured vs. Unstructured Data
- Structured data has defined schemas and is easy to search / analyze
- Unstructured data is flexible to changes and handles any source.
- For structured data, the structure is assumed at write time (when the data is stored)
- For unstructured data, the structure is assumed at read time (when the data is used)
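The "structure at write" vs. "structure at read" distinction can be sketched as follows. The `SCHEMA`, the store lists, and the helper functions are hypothetical stand-ins for a real warehouse and a real data lake.

```python
import json

# Assumed schema for the structured store.
SCHEMA = {"user_id": int, "amount": float}


def write_structured(store, record):
    """Structure assumed at write: records that break the schema are rejected."""
    if set(record) != set(SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    store.append(record)


def write_unstructured(store, raw_text):
    """No structure assumed at write: anything is accepted as-is."""
    store.append(raw_text)


def read_unstructured(store):
    """Structure assumed at read: interpret what can be parsed, skip the rest."""
    parsed = []
    for raw_text in store:
        try:
            parsed.append(json.loads(raw_text))
        except json.JSONDecodeError:
            continue
    return parsed


structured_store = []
write_structured(structured_store, {"user_id": 1, "amount": 9.5})
try:
    write_structured(structured_store, {"user_id": "1", "amount": 9.5})
    rejected = False
except ValueError:
    rejected = True

raw_store = []
write_unstructured(raw_store, '{"user_id": 2, "note": "free text"}')
write_unstructured(raw_store, "not json at all")
readable = read_unstructured(raw_store)

print(rejected, len(structured_store), len(raw_store), readable)
```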
Choosing the Right Data Storage Engine
- Online transaction processing (OLTP) handles transactions, such as requesting a ride on Lyft
- Online analytical processing (OLAP) handles analytical queries, such as computing aggregations over the data of a service like Lyft
- Newer paradigms support decoupling storage from processing
- Data can be stored in one place, with a processing layer on top that can be optimized for different query types
Differences between OLTP and OLAP
OLTP:
- Low latency
- High availability
- ACID (Atomicity, Consistency, Isolation, Durability) guarantees are not strictly necessary
- Typically row-major
OLAP:
- Complex queries on large volumes of data
- Acceptable response times can be long (seconds, minutes, even hours)
- Column-major
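The two workloads can be contrasted in a small sketch using Python's built-in `sqlite3`. SQLite is a row-oriented transactional engine, not an OLAP system, so this only illustrates the shape of the queries; the `rides` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides ("
    "ride_id INTEGER PRIMARY KEY, city TEXT NOT NULL, fare REAL NOT NULL)"
)

# OLTP-style work: small, fast writes of individual rows.
# `with conn:` commits on success and rolls back on an exception.
with conn:
    conn.execute("INSERT INTO rides (city, fare) VALUES (?, ?)", ("SF", 12.0))
    conn.execute("INSERT INTO rides (city, fare) VALUES (?, ?)", ("SF", 8.0))
    conn.execute("INSERT INTO rides (city, fare) VALUES (?, ?)", ("NYC", 20.0))

# Atomicity: the second insert violates NOT NULL, so the whole
# transaction, including the first insert, is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO rides (city, fare) VALUES (?, ?)", ("NYC", 15.0))
        conn.execute("INSERT INTO rides (city, fare) VALUES (?, ?)", (None, 5.0))
except sqlite3.IntegrityError:
    pass

ride_count = conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]

# OLAP-style work: an aggregation that scans one column across many rows.
avg_fare_by_city = conn.execute(
    "SELECT city, AVG(fare) FROM rides GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(ride_count, avg_fare_by_city)
```

The aggregation touches only `city` and `fare` for every row, which is the access pattern that column-major storage is designed to serve.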
ETL (Extract, Transform, Load)
- ETL stands for Extract, Transform, Load, and it is everywhere.
- ETL is commonly done in batches.
- Transform: the meaty part involves cleaning, validating, transposing, deriving values, joining from multiple sources, deduplicating, splitting, aggregating, etc
- It typically works by extracting standardized data to unstructured data, then changing it back through tools and infrastructure
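A toy batch ETL job, as a sketch of the three steps. The CSV content, the `purchases` table, and the specific cleaning rules (drop rows with a missing amount, normalize country codes, remove duplicates) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row has a missing amount, one is a duplicate.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,US,
1,us,10.50
3,de,7.25
"""


def extract(raw_text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: validate, clean, convert types, and deduplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["amount"]:  # validate: skip rows with a missing amount
            continue
        record = (int(row["user_id"]), row["country"].upper(), float(row["amount"]))
        if record in seen:  # deduplicate identical records
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, conn):
    """Load: write the cleaned batch into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", records)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM purchases ORDER BY user_id"
).fetchall()
conn.close()

print(loaded)
```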