Data Preparation for Machine Learning
50 Questions

Questions and Answers

A data scientist is preparing data for a machine learning model and needs to handle missing values in a numeric feature. Which of the following methods is most appropriate when the data is not missing completely at random and has potential bias?

  • Imputing missing values using a model-based approach, such as k-NN or regression imputation. (correct)
  • Imputing missing values with the mean.
  • Imputing missing values with the median.
  • Removing rows with missing values.
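
As a rough illustration of the model-based option, the sketch below fills a missing value from the k most similar complete rows instead of from a global mean. The function name `knn_impute` and the sample rows are hypothetical, and the code assumes numeric features and at least one complete row.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns)."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue

        def dist(other, row=r):
            # Compare every column except the one being imputed.
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(row, other))
                                 if i != target))

        nearest = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target] = sum(n[target] for n in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Hypothetical rows of (age, income); income is missing in the last row.
data = [(25, 30.0), (27, 34.0), (52, 90.0), (55, 96.0), (26, None)]
imputed = knn_impute(data, target=1, k=2)
print(imputed[4])  # income is taken from the two rows closest in age
```

The global mean of the observed incomes here is 62.5, while the neighbor-based estimate follows the similar (younger) rows, which is why this kind of imputation can reduce bias when missingness depends on other features.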

When constructing a data ingestion pipeline for a real-time streaming application on AWS, which combination of services would provide a scalable and cost-effective solution?

  • Amazon S3 and AWS Glue
  • Amazon Kinesis and Apache Flink (correct)
  • Amazon FSx for NetApp ONTAP and AWS Lambda
  • Amazon EFS and Apache Spark

A machine learning engineer is tasked with transforming a categorical feature with high cardinality (many unique values) into a numerical format suitable for model training. Which encoding method is most likely to cause the 'curse of dimensionality'?

  • One-Hot Encoding (correct)
  • Label Encoding
  • Binary Encoding
  • Target Encoding
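
A minimal sketch of why one-hot encoding inflates dimensionality: each distinct category becomes its own column. The helper `one_hot` and the city values are illustrative assumptions, not part of the lesson.

```python
def one_hot(values):
    """Return (sorted categories, one row of 0/1 flags per input value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

cities = ["Austin", "Boston", "Austin", "Denver"]
cats, rows = one_hot(cities)
print(cats)
print(rows)
print("columns created:", len(cats))
```

With three distinct cities the feature becomes three columns; a feature with 50,000 distinct values would become 50,000 mostly-zero columns, whereas label encoding would keep a single column.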

A data scientist is preparing a dataset for training a classification model and observes a significant class imbalance. Which of the following techniques is most likely to improve model performance without introducing bias?

  • Applying a cost-sensitive learning algorithm that penalizes misclassification of the minority class more heavily. (correct; option D)
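
One common way to set the penalties in cost-sensitive learning is to weight each class inversely to its frequency. The sketch below assumes the heuristic weight = n_samples / (n_classes × class_count); the function name and the 90/10 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Hypothetical labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Errors on the minority class then contribute more to the training loss, without duplicating or discarding any rows.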

When choosing between object, shared file, and high-performance file storage on AWS for machine learning workloads, which of the following scenarios is most suitable for using Amazon EFS?

  • Providing a shared file system for multiple instances to access and modify the same data concurrently. (correct; option D)

During model development, you discover a substantial difference in proportions of labels (DPL) between your training and validation datasets. What is the most appropriate action to take?

  • Investigate for potential selection bias and consider resampling or re-weighting the datasets. (correct; option D)

You're building a machine learning model using Amazon SageMaker and need to ensure data privacy and compliance with regulations such as GDPR. Which of the following techniques is most appropriate for protecting sensitive customer data during the model training process?

  • Implementing data classification, anonymization, and masking. (correct; option C)

A company needs to combine customer data from a SQL database with website activity logs stored in S3 for a machine learning model. Which approach would ensure efficient data merging during the data preparation phase?

  • Create a Spark cluster using Amazon EMR and perform a distributed join operation. (correct; option B)

A data scientist is working on detecting outliers in a dataset. Which of the following techniques is most appropriate when they want to identify data points that deviate significantly from the norm in multiple dimensions?

  • Using clustering algorithms like DBSCAN to identify dense regions and outliers. (correct; option B)

Which of the following AWS services would you primarily use to facilitate the process of labeling a large image dataset for training a computer vision model?

  • Amazon SageMaker Ground Truth (correct; option D)

When preparing a dataset for a fraud detection model, which data transformation technique would be most suitable to convert categorical variables like 'city' into a numerical format that the model can process effectively?

  • One-Hot Encoding (correct; option C)

During data preparation, an ML engineer identifies a skewed distribution in a numerical feature. Which transformation technique would you employ to make the distribution more symmetrical and improve model performance?

  • Log Transformation (correct; option C)
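
To see the effect of a log transformation on a right-skewed feature, the sketch below compares a simple population skewness estimate before and after applying `math.log1p`. The sample values and the `skewness` helper are illustrative assumptions; the approach applies to non-negative data.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

# Hypothetical right-skewed feature with two very large values.
values = [1, 2, 2, 3, 3, 4, 5, 8, 40, 250]
logged = [math.log1p(v) for v in values]  # log(1 + v), safe for zeros

print("skew before:", round(skewness(values), 2))
print("skew after: ", round(skewness(logged), 2))
```

The transformed values are still somewhat skewed, but the long right tail is compressed, which is the behavior the answer relies on.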

A machine learning engineer aims to implement feature binning on a continuous numerical feature. What is the primary reason for applying this technique?

  • To convert the continuous feature into a categorical feature, potentially capturing non-linear relationships. (correct; option D)
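
A small sketch of binning a continuous feature into labeled ranges. The bin edges, labels, and the `bin_feature` helper are made up for illustration.

```python
from bisect import bisect_right

def bin_feature(value, edges, labels):
    """Map a number to a bin label; bin i covers edges[i-1] <= value < edges[i]."""
    return labels[bisect_right(edges, value)]

# Hypothetical age bins: <18, 18-34, 35-59, 60+
edges = [18, 35, 60]
labels = ["minor", "young_adult", "adult", "senior"]

ages = [17, 18, 42, 75]
print([bin_feature(a, edges, labels) for a in ages])
```

Once binned, the model can learn a separate effect per range rather than a single linear slope over the raw value.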

A machine learning team is preparing data for a sentiment analysis model. They need to label a large dataset of customer reviews but want to minimize manual effort. Which approach would be most effective for this task?

  • Use active learning to prioritize the most informative reviews for manual labeling. (correct; option C)

A financial institution is developing a model to predict loan defaults. They are concerned about potential bias in their training data related to demographic features. Which technique could they use to assess whether pre-training bias exists in their dataset?

  • Use disparate impact analysis to compare outcomes across different demographic groups. (correct; option A)
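
Disparate impact is often summarized as the ratio of favorable-outcome rates between two groups. The sketch below assumes binary outcomes (1 = favorable) and two hypothetical groups; the variable names are illustrative.

```python
def positive_rate(outcomes):
    """Share of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(group_d, group_a):
    """Ratio of favorable-outcome rates: group d relative to group a."""
    return positive_rate(group_d) / positive_rate(group_a)

# Hypothetical historical loan approvals for two demographic groups.
approved_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # 8 of 10 approved
approved_d = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # 4 of 10 approved

di = disparate_impact(approved_d, approved_a)
print(di)
```

A ratio near 1 indicates similar outcome rates; a ratio well below 1 is a signal to investigate the data before training.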

A healthcare provider is training a model to predict patient readmission rates. They need to ensure that the model complies with HIPAA regulations and protects patient privacy during the data preparation phase. Which strategy should they prioritize?

  • Remove all patient identifiers and apply differential privacy techniques. (correct; option C)

An e-commerce company wants to use real-time website activity data to personalize product recommendations. Which AWS service is best suited for ingesting and processing streaming data for this use case?

  • Amazon Kinesis (correct; option A)

Consider a scenario where raw data contains a feature 'transaction_amount' with several extreme outliers. Which of the following data preparation techniques would be MOST effective in mitigating the impact of these outliers on a machine learning model's performance?

  • Cap the 'transaction_amount' at a reasonable percentile (e.g., 99th percentile). (correct; option A)
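
Percentile capping (sometimes called winsorizing) can be sketched as follows. The code assumes the nearest-rank percentile definition and an integer percentile; the `cap_at_percentile` helper and the amounts are hypothetical.

```python
def cap_at_percentile(values, pct=99):
    """Cap values at the pct-th percentile (nearest-rank method, integer pct)."""
    ordered = sorted(values)
    # Integer ceiling of pct% of n, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap

# Hypothetical transaction amounts: 1..99 plus one extreme outlier.
amounts = list(range(1, 100)) + [10_000]
capped, cap = cap_at_percentile(amounts, pct=99)
print("cap:", cap, "max after capping:", max(capped))
```

The outlier row is kept, so no data is lost, but its value no longer dominates means, variances, or gradient updates.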

You're building a credit risk model and discover that your training dataset contains a disproportionately low number of defaults (highly imbalanced data). What data preparation technique would be LEAST suitable to address this imbalance?

  • Applying Z-score normalization to all numerical features. (correct; option D)

A machine learning model is underperforming due to inconsistent data quality across different data sources. Specifically, some values for a 'product_price' feature are stored in USD while others are in EUR. What data preparation step is MOST crucial?

  • Converting all 'product_price' values to a single currency using current exchange rates. (correct; option A)

Which of the following is a critical consideration when choosing between real-time and batch deployment strategies for a machine learning model?

  • The acceptable latency for generating predictions and the volume of data to be processed. (correct; option D)

A data scientist needs to automate the build process for their machine learning application within a CI/CD pipeline. Which AWS service is most suitable for this task?

  • AWS CodeBuild (correct; option A)

Which deployment strategy minimizes downtime and risk by gradually shifting user traffic from an old version of a model to a new version?

  • Blue/green deployment (correct; option D)

What is the primary benefit of using Infrastructure as Code (IaC) tools like AWS CloudFormation or AWS CDK in machine learning deployments?

  • Automating and versioning infrastructure provisioning, ensuring consistency and repeatability. (correct; option D)

A machine learning engineer is observing high latency in a deployed model. Which metric should they primarily monitor to diagnose the issue?

  • CPU utilization of the instance (correct; option C)

Which AWS service can be used to establish a private and isolated network environment for deploying SageMaker endpoints, enhancing security and compliance?

  • Amazon Virtual Private Cloud (VPC) (correct; option C)

In the context of MLOps, what is the purpose of implementing unit tests, integration tests, and end-to-end tests for machine learning models and code?

  • To validate the functionality, reliability, and performance of the ML pipeline components. (correct; option B)

Which of the following are key performance metrics for ML infrastructure?

  • Utilization, throughput, availability. (correct; option C)

A company wants to track the costs associated with different machine learning projects. Which AWS service feature helps in cost tracking and allocation using metadata?

  • Resource Tagging (correct; option B)

What is the purpose of 'least privilege access' when configuring IAM roles and policies for machine learning workflows on AWS?

  • To minimize the potential impact of security breaches by granting only necessary permissions. (correct; option A)

A machine learning model's performance degrades significantly after deployment due to changes in the incoming data. Which of the following techniques can be used to detect these changes?

  • Using SageMaker Clarify to monitor for data drift in feature distributions. (correct; option C)

Which AWS service should a machine learning engineer use to monitor and log all API calls made to their AWS resources, aiding in security auditing and compliance?

  • AWS CloudTrail (correct; option A)

When should a machine learning engineer consider using SageMaker Savings Plans for deploying ML models?

  • When the model deployment has consistent compute requirements over a 1- or 3-year term. (correct; option B)

Which of the following factors should be considered when right-sizing instances for machine learning inference?

  • Model size, expected request volume, and latency requirements. (correct; option A)

What is the primary benefit of using multi-model endpoints in SageMaker?

  • Reducing the cost of hosting multiple models by sharing resources. (correct; option B)

Which of the following is NOT a primary consideration when selecting a deployment infrastructure for a machine learning model within an existing AWS architecture?

  • The familiarity of the deployment team with new technologies. (correct; option A)

When deciding between real-time versus batch model serving strategies, which scenario would most likely benefit from a real-time serving approach?

  • Providing instant credit score assessments to users during an application process. (correct; option B)

What is the primary advantage of using Infrastructure as Code (IaC) for managing machine learning deployments on AWS?

  • It enables version-controlled, repeatable, and automated infrastructure provisioning. (correct; option B)

Which of the following is a key benefit of containerizing machine learning models for deployment?

  • Providing consistency and portability across different computing environments. (correct; option C)

When configuring VPC networking for SageMaker, which of the following is a primary security benefit?

  • Control over network access to SageMaker resources, enhancing isolation. (correct; option B)

Which of the following is a critical consideration when selecting metrics for auto scaling a SageMaker endpoint?

  • Selecting metrics that directly correlate with business outcomes and resource utilization. (correct; option A)

In the context of CI/CD for machine learning, what is the primary purpose of automated testing within the pipeline?

  • To validate model performance, data integrity, and code quality before deployment. (correct; option D)

Which of the following is a key consideration when designing a retraining strategy for a deployed machine learning model?

  • Automate the retraining process to adapt to data drift and maintain model accuracy. (correct; option A)

What is the role of a version control system (like Git) in a machine learning CI/CD pipeline?

  • To track changes to code, configurations, and data, enabling reproducibility and collaboration. (correct; option B)

When choosing a deployment orchestrator for machine learning models on AWS, which of the following factors should be prioritized?

  • The orchestrator's integration with AWS services, scalability, and support for complex workflows. (correct; option D)

In the context of multi-model deployments, what is a common strategy for efficiently serving multiple models from a single endpoint?

  • Dynamically loading models into memory based on request parameters. (correct; option B)
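
The idea of loading models on demand can be sketched with a small least-recently-used cache. This is a conceptual illustration only, not how SageMaker implements multi-model endpoints; `ModelCache` and `fake_loader` are hypothetical names.

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models in memory; evict the least recently used."""

    def __init__(self, loader, capacity=2):
        self.loader = loader          # function: model name -> model object
        self.capacity = capacity
        self._models = OrderedDict()
        self.loads = 0                # number of (slow) loads performed

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)   # mark as recently used
            return self._models[name]
        model = self.loader(name)            # cache miss: load on demand
        self.loads += 1
        self._models[name] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # evict least recently used
        return model

    def loaded(self):
        return list(self._models)

def fake_loader(name):
    # Stand-in for reading model artifacts from storage.
    return lambda x: f"{name}:{x}"

cache = ModelCache(fake_loader, capacity=2)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
print(cache.loads, cache.loaded())
```

The request names the target model, frequently used models stay in memory, and rarely used ones pay a load cost on their next request, which is the cost and latency trade-off of sharing one endpoint.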

When deploying machine learning models to edge devices, what is a primary optimization goal?

  • Reducing the model's size and computational requirements to fit within device constraints. (correct; option D)

What is a typical trade-off to consider between performance and cost when deploying ML models?

  • Increasing batch size to reduce costs can lead to higher latency. (correct; option D)

How do you automate resource provisioning for ML model deployment?

  • By using Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform. (correct; option A)

You aim to update your machine learning model in production without disrupting the service. Which deployment strategy is MOST suitable?

  • Blue/Green deployment (correct; option C)

Flashcards

Data Ingestion

The process of bringing data into a system for storage and processing.

Data Formats

Structured, semi-structured, and unstructured formats for storing data.

AWS Data Storage Services

Services like S3, Glacier, and EBS used to store data on AWS.

Streaming Data Ingestion

Ingesting data continuously as it is produced, often in real-time.

AWS Storage Trade-offs

Trade-offs to consider include cost, performance, durability, and access frequency when choosing storage solutions.

Data Extraction Methods

Techniques like APIs, web scraping, and database queries to retrieve data from sources.

Data Transformation

The process of cleaning, transforming, and structuring data so it is suitable for machine learning models.

Feature Engineering

Creating new input features from existing data to improve model performance.

Encoding Techniques

Converting categorical data into numerical format for ML models.

Bias Mitigation Strategies

Identifying and mitigating biases present in training data.

Non-Validated Formats

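Data formats such as CSV and JSON that carry no embedded schema, so structure is not enforced when the data is written. Parquet, ORC, and Avro, by contrast, store a schema with the data; RecordIO (typically protobuf-wrapped) is a binary record format accepted by several SageMaker built-in algorithms.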

SageMaker Ground Truth

A service to label datasets using human annotators. Includes features like automatic data labeling and workforce management.

Class Imbalance (CI)

A measure that indicates whether one class is disproportionately represented compared to others. Can skew model performance.

Difference in Proportions of Labels (DPL)

Differences in the proportions of labels across different subgroups or sensitive groups in a dataset.
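
Class imbalance (CI) and DPL can both be computed directly from counts. The sketch below assumes the normalized-difference form of CI and the difference-of-positive-proportions form of DPL used among SageMaker Clarify's pre-training bias metrics; the group sizes and labels are made up.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means balanced, +/-1 means one-sided."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(labels_a, labels_d):
    """DPL = q_a - q_d, where q is the share of positive (1) labels in a group."""
    q_a = sum(labels_a) / len(labels_a)
    q_d = sum(labels_d) / len(labels_d)
    return q_a - q_d

# Hypothetical dataset: 900 rows in group a, 100 rows in group d.
print(class_imbalance(900, 100))

# Hypothetical labels per group (1 = positive outcome).
print(dpl([1, 1, 1, 0], [1, 0, 0, 0]))
```

Values near zero indicate similar representation or label rates; large magnitudes are a prompt to check for selection bias before training.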

Synthetic Data Generation

Techniques to create artificial data that mimics the characteristics of real data, used to augment datasets.

Selection Bias

Biases that arise from how individuals are selected to participate in a study or dataset.

Measurement Bias

Inaccuracies in how data is measured or collected, leading to distorted results.

Dataset Splitting

Splitting a dataset into distinct subsets for training, validation, and testing the model.

Dataset Shuffling

The process of randomizing the order of data points to reduce bias from data collection order during training.
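
Shuffling and splitting are usually done together. The sketch below is one simple way to do it with a fixed seed for reproducibility; the 70/15/15 percentages and the `shuffle_and_split` helper are illustrative assumptions.

```python
import random

def shuffle_and_split(items, train_pct=70, val_pct=15, seed=0):
    """Shuffle a copy of the data, then cut it into train/validation/test."""
    rng = random.Random(seed)      # local generator: same seed, same split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = shuffle_and_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Every record lands in exactly one subset, and shuffling first prevents the collection order (for example, sorted by date or label) from leaking into the split. For time-series data, a chronological split is usually more appropriate than a random one.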

Dataset Augmentation

Creating new training examples from existing data, for example by cropping, rotating, or flipping images.

Epoch

One complete pass of the entire training dataset through the algorithm. The number of epochs is a training hyperparameter.

Batch Size

The size of data subsets used in each training iteration.

Early Stopping

Halting training when model performance stops improving on a validation set.
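
Early stopping is often implemented with a "patience" counter. The sketch below works on a precomputed list of per-epoch validation losses rather than a real training loop; the loss values and the `early_stopping_epoch` helper are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # improvement: reset counter
        else:
            waited += 1
            if waited >= patience:                      # no improvement for too long
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses per epoch.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stopped_at, best_epoch = early_stopping_epoch(losses, patience=2)
print(stopped_at, best_epoch)
```

In practice the weights from the best epoch are usually restored, so the extra epochs spent waiting do not hurt the final model.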

Distributed Training

Distributing the training workload across multiple machines or GPUs.

Regularization

Techniques, such as L1 and L2 penalties or dropout, that discourage a model from memorizing training data and so reduce overfitting.

Feature Selection

Selecting a subset of relevant features to simplify the model and improve performance.

Ensembling

Combining multiple models to improve overall predictive performance.

F1 Score

A performance metric that balances precision and recall.

Confusion Matrix

A table that describes the performance of a classification model.

Heat Maps

A color-coded grid of values. In classification, it is commonly used to visualize a confusion matrix so that frequent errors stand out.

Accuracy

The proportion of correctly predicted instances out of all instances.

Recall

The proportion of actual positives that are correctly identified.
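
The classification metrics in the flashcards above (confusion matrix, accuracy, recall, and F1 score) all derive from four counts. The sketch below assumes binary labels with 1 as the positive class; the label lists are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

The sketch does not guard against division by zero (for example, when no positives are predicted), which a production implementation would need to handle.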

RMSE

Root Mean Squared Error: the square root of the average squared difference between predictions and actual values. It penalizes large errors more heavily than small ones.
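
A short sketch of the RMSE calculation, with mean absolute error (MAE) alongside for comparison. The sample values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error, for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3, 5, 7]
predicted = [2, 5, 10]   # errors: 1, 0, -3
print(round(rmse(actual, predicted), 4), round(mae(actual, predicted), 4))
```

Because the errors are squared before averaging, the single error of 3 pulls RMSE above MAE on the same predictions.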

AUC

Area Under the (ROC) Curve: a measure of how well a classifier ranks positive examples above negative ones across all threshold settings.
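
One way to compute ROC AUC without plotting a curve is the pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that definition on made-up scores; it is quadratic in the number of examples and meant only for illustration.

```python
def auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of any single classification threshold.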

Identify Overfitting and Underfitting

The process of determining if a model is overfitting or underfitting the data.

Batch Processing

Processing data in large, predefined groups or batches, typically without real-time interaction.

CI/CD

Automating the process of building, testing, and deploying code changes.

Canary Deployment

A deployment strategy where traffic is gradually shifted from the old version to the new version.

Blue/Green Deployment

A deployment strategy where two identical environments (blue and green) are running, and traffic is switched from one to the other.

Auto Scaling

Automatic adjustment of compute resources (e.g., EC2 instances) based on demand.

AWS CloudFormation

A tool that allows you to define and provision AWS infrastructure as code.

Amazon ECS

A service that allows you to create, manage, and run containers.

Amazon EKS

A managed Kubernetes service that makes it easier to run Kubernetes on AWS.

Amazon ECR

A service to store, share, and deploy container images.

Amazon SageMaker

A fully managed machine learning service that allows you to build, train, and deploy ML models.

Model Drift

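A decline in model performance over time, typically because production data or the relationship between inputs and targets has changed since training.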

AWS CloudTrail

A service for auditing the API calls made to AWS services.

IAM Role

An IAM identity with attached permission policies that a user, application, or AWS service can assume to obtain temporary credentials for specific AWS resources.

Least Privilege

Granting only the necessary permissions to perform a task.
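
As an illustration of least privilege, the sketch below builds an IAM policy document that allows a single read action on a single S3 prefix instead of broad `s3:*` access. The bucket name and prefix are hypothetical placeholders; the policy is only constructed and printed, not applied to any account.

```python
import json

# Hypothetical policy for a training job that only needs to read its input data.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],  # read-only; no write or delete
            "Resource": "arn:aws:s3:::example-training-bucket/train/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

If the role's credentials were compromised, the exposure would be limited to reading objects under that one prefix.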

Study Notes

Data Preparation for ML

  • Data preparation for ML covers getting raw data into a form that machine learning models can train on
  • Key steps are ingestion and storage, followed by transformation and feature engineering

Ingest and Store Data

  • Important factors include data formats, ingestion methods, and choosing suitable AWS data storage services
  • Streaming data ingestion and assessing AWS storage trade-offs are considered
  • Data extraction techniques and troubleshooting storage issues are also important

Transform Data and Perform Feature Engineering

  • Includes cleaning and transforming data and feature engineering
  • Encoding techniques and data exploration are applied
  • Tools for streaming data transformation, data annotation, and labeling are used
  • Feature management, data integrity assessment, labeling, validation, and bias mitigation are also covered

ML Model Development

  • ML Model Development involves creating effective ML models
  • It covers algorithm selection and feasibility assessment
  • Models are optimized through iterative refinement
  • Models are analyzed to gauge and improve performance

Choosing a Modeling Approach

  • Key aspects are ML algorithm selection and leveraging AWS AI services
  • Assessing ML feasibility and comparing models and algorithms are performed
  • Pre-trained models are considered in relation to cost

Train and Refine Models

  • Training process fundamentals and techniques for reducing training time are important
  • Regularization techniques and hyperparameter tuning are applied
  • The effects of hyperparameters and model integration with SageMaker are assessed

Analyze Model Performance

  • Overfitting and underfitting are addressed and model evaluation techniques are applied
  • Performance baselines are created
  • Tools like SageMaker Clarify are used for insights
  • Evaluation metrics are interpreted to support reproducible experimentation

Deployment and Orchestration of ML Workflows

  • Deployment and orchestration involves infrastructure setup, automation, and managing ML workflows

Select Deployment Infrastructure

  • Best practices are followed when selecting infrastructure based on existing architecture
  • Considerations for real-time, batch model serving, compute resource provisioning, and endpoint deployments are made
  • Container selection, edge model optimization, and trade-offs related to performance, cost, and latency are addressed

Create and Script Infrastructure

  • Infrastructure is created based on existing architecture using containerization and auto scaling concepts
  • Solutions are scalable and cost-effective
  • Automating resource provisioning and building/maintaining containers are key
  • Configuring VPC networks for SageMaker and deploying models with the SageMaker SDK are important

Use Automated Orchestration Tools

  • Automated orchestration tools are used to set up CI/CD pipelines
  • AWS CI/CD tools and version control systems are used
  • ML workflows are automated with AWS orchestration services, and automated tests run in the CI/CD pipelines

ML Solution Monitoring, Maintenance, and Security

  • This domain covers monitoring model inference, optimizing infrastructure and costs, and securing AWS resources

Monitor Model Inference

  • Deployed models are monitored for drift that degrades performance
  • Data quality and production model behavior are tracked through monitoring workflows
  • Techniques for detecting data changes and tools for performance monitoring are used

Monitor and Optimize Infrastructure and Costs

  • Key performance metrics are monitored and observability tools are used
  • AWS CloudTrail, cost analysis tools, and dashboards are used
  • Rightsizing instances and monitoring latency support cost monitoring and optimization

Secure AWS Resources

  • IAM roles, SageMaker security features, and network controls are configured
  • Security practices for CI/CD, least privilege configuration, and monitoring/auditing of security measures are performed
  • Troubleshooting security issues and building secure networks are essential

Description

Explore data preparation for machine learning, covering ingestion, storage, transformation, and feature engineering. Learn data extraction, cleaning, and encoding techniques. Use tools for streaming data transformation, annotation, and feature management.
