Data Preparation for Machine Learning

Questions and Answers

A data scientist is preparing data for a machine learning model and needs to handle missing values in a numeric feature. Which of the following methods is most appropriate when the data is not missing completely at random and has potential bias?

Imputing missing values using a model-based approach, such as k-NN or regression imputation. (correct)
Imputing missing values with the mean.
Imputing missing values with the median.
Removing rows with missing values.
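
To make the model-based option concrete, here is a minimal sketch of k-NN imputation in plain Python. The rows, column layout, and the helper name knn_impute are hypothetical, and the sketch assumes every column other than the target is complete.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill None in column target_idx with the mean of the k nearest complete rows.

    Distance is Euclidean over the remaining columns (assumed to have no gaps).
    """
    complete = [r for r in rows if r[target_idx] is not None]
    imputed = []
    for row in rows:
        if row[target_idx] is not None:
            imputed.append(list(row))
            continue

        def dist(other, row=row):
            # Skip the target column so the missing value never enters the sum.
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target_idx
            ))

        nearest = sorted(complete, key=dist)[:k]
        new_row = list(row)
        new_row[target_idx] = sum(r[target_idx] for r in nearest) / len(nearest)
        imputed.append(new_row)
    return imputed

# Hypothetical rows: [age, tenure_years, income]; income is missing in the last row.
rows = [
    [25, 1, 30000.0],
    [27, 2, 34000.0],
    [52, 20, 90000.0],
    [26, 1, None],
]
filled = knn_impute(rows, target_idx=2, k=2)
print(filled[3])
```

The missing income is filled from the two most similar rows rather than from the column mean, which the 90000 row would pull upward.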

When constructing a data ingestion pipeline for a real-time streaming application on AWS, which combination of services would provide a scalable and cost-effective solution?

Amazon S3 and AWS Glue
Amazon Kinesis and Apache Flink (correct)
Amazon FSx for NetApp ONTAP and AWS Lambda
Amazon EFS and Apache Spark

A machine learning engineer is tasked with transforming a categorical feature with high cardinality (many unique values) into a numerical format suitable for model training. Which encoding method is most likely to cause the 'curse of dimensionality'?

One-Hot Encoding (correct)
Label Encoding
Binary Encoding
Target Encoding
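
The dimensionality difference between the options can be seen in a small sketch. The city values and helper names below are made up for illustration; one-hot encoding adds one column per distinct category, while label encoding keeps a single column.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def label_encode(values):
    """Map each category to a single integer code."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

# Hypothetical feature values.
cities = ["Lyon", "Oslo", "Lima", "Oslo", "Pune"]
categories, one_hot = one_hot_encode(cities)
labels = label_encode(cities)

print(len(one_hot[0]))  # number of columns created by one-hot encoding
print(labels)           # a single column of integer codes
```

With thousands of distinct values the one-hot matrix grows to thousands of sparse columns, which is the effect the question refers to.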

A data scientist is preparing a dataset for training a classification model and observes a significant class imbalance. Which of the following techniques is most likely to improve model performance without introducing bias?

Applying a cost-sensitive learning algorithm that penalizes misclassification of the minority class more heavily. (correct)
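
One common way to realize cost-sensitive learning is to weight each class inversely to its frequency and scale the loss by that weight. The sketch below assumes the "balanced" heuristic n_samples / (n_classes * class_count); the label counts and function names are illustrative.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

def weighted_log_loss(y, p, weights):
    """Binary log loss for one example, scaled by the weight of its true class."""
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return weights[y] * loss

# Hypothetical dataset: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# Two equally confident mistakes, one on each class.
missed_minority = weighted_log_loss(1, 0.1, weights)
missed_majority = weighted_log_loss(0, 0.9, weights)
print(weights, missed_minority, missed_majority)
```

The same confident mistake costs roughly nine times more on the minority class here, which pushes the optimizer to attend to it without duplicating or discarding rows.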

When choosing between object, shared file, and high-performance file storage on AWS for machine learning workloads, which of the following scenarios is most suitable for using Amazon EFS?

Providing a shared file system for multiple instances to access and modify the same data concurrently. (correct)

During model development, you discover a substantial difference in proportions of labels (DPL) between your training and validation datasets. What is the most appropriate action to take?

Investigate for potential selection bias and consider resampling or re-weighting the datasets. (correct)

You're building a machine learning model using Amazon SageMaker and need to ensure data privacy and compliance with regulations such as GDPR. Which of the following techniques is most appropriate for protecting sensitive customer data during the model training process?

Implementing data classification, anonymization, and masking. (correct)

A company needs to combine customer data from a SQL database with website activity logs stored in S3 for a machine learning model. Which approach would ensure efficient data merging during the data preparation phase?

Create a Spark cluster using Amazon EMR and perform a distributed join operation. (correct)

A data scientist is working on detecting outliers in a dataset. Which of the following techniques is most appropriate when they want to identify data points that deviate significantly from the norm in multiple dimensions?

Using clustering algorithms like DBSCAN to identify dense regions and outliers. (correct)

Which of the following AWS services would you primarily use to facilitate the process of labeling a large image dataset for training a computer vision model?

Amazon SageMaker Ground Truth (correct)

When preparing a dataset for a fraud detection model, which data transformation technique would be most suitable to convert categorical variables like 'city' into a numerical format that the model can process effectively?

One-Hot Encoding (correct)

During data preparation, an ML engineer identifies a skewed distribution in a numerical feature. Which transformation technique would you employ to make the distribution more symmetrical and improve model performance?

Log Transformation (correct)
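
The effect of a log transformation on skew can be checked numerically. This sketch uses made-up right-skewed values and a population skewness formula (the mean cubed z-score); log1p is chosen so that zeros would also be handled.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

# Hypothetical right-skewed feature with a long tail.
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 40, 200]
logged = [math.log1p(x) for x in amounts]  # log(1 + x)

print(skewness(amounts), skewness(logged))
```

The transformed values remain right-skewed on this sample, but noticeably less so than the raw ones; how much a log helps depends on the shape of the actual feature.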

A machine learning engineer aims to implement feature binning on a continuous numerical feature. What is the primary reason for applying this technique?

To convert the continuous feature into a categorical feature, potentially capturing non-linear relationships. (correct)
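
A minimal sketch of fixed-edge binning, using the standard library bisect module. The age values, bin edges, and labels are hypothetical; each bin includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right

def bin_feature(values, edges, labels):
    """Assign each value to a labeled bin; edges must be sorted ascending."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect_right(edges, v)] for v in values]

# Hypothetical continuous feature and bin definition.
ages = [17, 18, 34, 35, 64, 65, 90]
edges = [18, 35, 65]
labels = ["minor", "young_adult", "adult", "senior"]

binned = bin_feature(ages, edges, labels)
print(binned)
```

A linear model can then learn a separate effect per bin, which is one way a non-linear relationship with the raw feature gets captured.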

A machine learning team is preparing data for a sentiment analysis model. They need to label a large dataset of customer reviews but want to minimize manual effort. Which approach would be most effective for this task?

Use active learning to prioritize the most informative reviews for manual labeling. (correct)

A financial institution is developing a model to predict loan defaults. They are concerned about potential bias in their training data related to demographic features. Which technique could they use to assess whether pre-training bias exists in their dataset?

Use disparate impact analysis to compare outcomes across different demographic groups. (correct)
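
Disparate impact is often summarized as the ratio of positive-outcome rates between two groups. The sketch below uses invented group names and counts, and the 0.8 threshold is the commonly cited "four-fifths" rule of thumb, treated here as an assumption rather than a fixed standard.

```python
def positive_rate(records, group):
    """Share of records in the given group with a positive (1) outcome."""
    outcomes = [outcome for g, outcome in records if g == group]
    return sum(outcomes) / len(outcomes)

def disparate_impact(records, protected, reference):
    """Ratio of positive rates: protected group over reference group."""
    return positive_rate(records, protected) / positive_rate(records, reference)

# Hypothetical labeled training data: (group, outcome), outcome 1 = loan approved.
records = (
    [("A", 1)] * 8 + [("A", 0)] * 2 +
    [("B", 1)] * 5 + [("B", 0)] * 5
)

di = disparate_impact(records, protected="B", reference="A")
flagged = di < 0.8  # assumed rule-of-thumb threshold
print(di, flagged)
```

A ratio well below 1 suggests the labels themselves are distributed unevenly across groups before any model is trained.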

A healthcare provider is training a model to predict patient readmission rates. They need to ensure that the model complies with HIPAA regulations and protects patient privacy during the data preparation phase. Which strategy should they prioritize?

Remove all patient identifiers and apply differential privacy techniques. (correct)

An e-commerce company wants to use real-time website activity data to personalize product recommendations. Which AWS service is best suited for ingesting and processing streaming data for this use case?

Amazon Kinesis (correct)

Consider a scenario where raw data contains a feature 'transaction_amount' with several extreme outliers. Which of the following data preparation techniques would be MOST effective in mitigating the impact of these outliers on a machine learning model's performance?

Cap the 'transaction_amount' at a reasonable percentile (e.g., 99th percentile). (correct)
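
Percentile capping can be sketched in a few lines. The nearest-rank percentile definition used here is one of several conventions, and the transaction amounts are made up.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def cap_at_percentile(values, pct=99):
    """Replace every value above the pct-th percentile with that percentile."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values]

# Hypothetical transaction amounts: 1..99 plus one extreme outlier.
transaction_amount = list(range(1, 100)) + [10000]
capped = cap_at_percentile(transaction_amount, pct=99)

print(max(transaction_amount), max(capped))
```

The outlier row is kept, so no data is lost, but its value no longer dominates means, variances, or gradient updates.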

You're building a credit risk model and discover that your training dataset contains a disproportionately low number of defaults (highly imbalanced data). What data preparation technique would be LEAST suitable to address this imbalance?

Applying Z-score normalization to all numerical features. (correct)

A machine learning model is underperforming due to inconsistent data quality across different data sources. Specifically, some values for a 'product_price' feature are stored in USD while others are in EUR. What data preparation step is MOST crucial?

Converting all 'product_price' values to a single currency using current exchange rates. (correct)
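
A minimal sketch of that normalization step follows. The record layout and the EUR-to-USD rate are placeholders, not real market data; Decimal is used to avoid binary floating-point rounding on money values.

```python
from decimal import Decimal

# Hypothetical rates into the target currency (USD); real rates would come
# from a trusted source.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}

def to_usd(record):
    """Return a copy of the record with product_price expressed in USD."""
    rate = RATES_TO_USD[record["currency"]]  # KeyError on an unknown currency
    return {
        **record,
        "product_price": record["product_price"] * rate,
        "currency": "USD",
    }

# Hypothetical mixed-currency rows.
rows = [
    {"sku": "a1", "product_price": Decimal("10.00"), "currency": "EUR"},
    {"sku": "b2", "product_price": Decimal("10.00"), "currency": "USD"},
]
normalized = [to_usd(r) for r in rows]
print(normalized)
```

Failing loudly on an unknown currency code is deliberate: silently passing such rows through would reintroduce the inconsistency.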

Which of the following is a critical consideration when choosing between real-time and batch deployment strategies for a machine learning model?

The acceptable latency for generating predictions and the volume of data to be processed. (correct)

A data scientist needs to automate the build process for their machine learning application within a CI/CD pipeline. Which AWS service is most suitable for this task?

AWS CodeBuild (correct)

Which deployment strategy minimizes downtime and risk by gradually shifting user traffic from an old version of a model to a new version?

Blue/green deployment with gradual (canary or linear) traffic shifting (correct)

What is the primary benefit of using Infrastructure as Code (IaC) tools like AWS CloudFormation or AWS CDK in machine learning deployments?

Automating and versioning infrastructure provisioning, ensuring consistency and repeatability. (correct)

A machine learning engineer is observing high latency in a deployed model. Which metric should they primarily monitor to diagnose the issue?

CPU utilization of the instance (correct)

Which AWS service can be used to establish a private and isolated network environment for deploying SageMaker endpoints, enhancing security and compliance?

Amazon Virtual Private Cloud (VPC) (correct)

In the context of MLOps, what is the purpose of implementing unit tests, integration tests, and end-to-end tests for machine learning models and code?

To validate the functionality, reliability, and performance of the ML pipeline components. (correct)

Which of the following are key performance metrics for ML infrastructure?

Utilization, throughput, availability. (correct)

A company wants to track the costs associated with different machine learning projects. Which AWS service feature helps in cost tracking and allocation using metadata?

Resource Tagging (correct)

What is the purpose of 'least privilege access' when configuring IAM roles and policies for machine learning workflows on AWS?

To minimize the potential impact of security breaches by granting only necessary permissions. (correct)

A machine learning model's performance degrades significantly after deployment due to changes in the incoming data. Which of the following techniques can be used to detect these changes?

Using SageMaker Model Monitor to monitor for data drift in feature distributions. (correct)

Which AWS service should a machine learning engineer use to monitor and log all API calls made to their AWS resources, aiding in security auditing and compliance?

AWS CloudTrail (correct)

When should a machine learning engineer consider using SageMaker Savings Plans for deploying ML models?

When the model deployment has consistent compute requirements over a 1-year or 3-year term. (correct)

Which of the following factors should be considered when right-sizing instances for machine learning inference?

Model size, expected request volume, and latency requirements. (correct)

What is the primary benefit of using multi-model endpoints in SageMaker?

Reducing the cost of hosting multiple models by sharing resources. (correct)

Which of the following is NOT a primary consideration when selecting a deployment infrastructure for a machine learning model within an existing AWS architecture?

The familiarity of the deployment team with new technologies. (correct)

When deciding between real-time versus batch model serving strategies, which scenario would most likely benefit from a real-time serving approach?

Providing instant credit score assessments to users during an application process. (correct)

What is the primary advantage of using Infrastructure as Code (IaC) for managing machine learning deployments on AWS?

It enables version-controlled, repeatable, and automated infrastructure provisioning. (correct)

Which of the following is a key benefit of containerizing machine learning models for deployment?

Providing consistency and portability across different computing environments. (correct)

When configuring VPC networking for SageMaker, which of the following is a primary security benefit?

Control over network access to SageMaker resources, enhancing isolation. (correct)

Which of the following is a critical consideration when selecting metrics for auto scaling a SageMaker endpoint?

Selecting metrics that directly correlate with business outcomes and resource utilization. (correct)

In the context of CI/CD for machine learning, what is the primary purpose of automated testing within the pipeline?

To validate model performance, data integrity, and code quality before deployment. (correct)

Which of the following is a key consideration when designing a retraining strategy for a deployed machine learning model?

Automate the retraining process to adapt to data drift and maintain model accuracy. (correct)

What is the role of a version control system (like Git) in a machine learning CI/CD pipeline?

To track changes to code, configurations, and data, enabling reproducibility and collaboration. (correct)

When choosing a deployment orchestrator for machine learning models on AWS, which of the following factors should be prioritized?

The orchestrator's integration with AWS services, scalability, and support for complex workflows. (correct)

In the context of multi-model deployments, what is a common strategy for efficiently serving multiple models from a single endpoint?

Dynamically loading models into memory based on request parameters. (correct)
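
The idea of loading models on demand can be sketched as a small least-recently-used cache keyed by model name. This is an illustration of the general pattern, not how any AWS service is implemented; ModelCache, fake_loader, and the model names are all hypothetical.

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, loader, capacity):
        self._loader = loader      # callable: model name -> model object
        self._capacity = capacity
        self._models = OrderedDict()
        self.load_count = 0        # how many times a model had to be loaded

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)    # mark as most recently used
            return self._models[name]
        model = self._loader(name)            # cache miss: load on demand
        self.load_count += 1
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict least recently used
        return model

    def loaded(self):
        return list(self._models)

def fake_loader(name):
    # Stand-in for reading model artifacts from storage.
    return lambda x: f"{name} scored {x}"

cache = ModelCache(fake_loader, capacity=2)
cache.get("model_a")(1)
cache.get("model_b")(2)
cache.get("model_a")(3)          # hit: no reload
result = cache.get("model_c")(4) # miss: evicts model_b
print(cache.loaded(), cache.load_count, result)
```

The first request for a model pays the load cost, so rarely used models trade some latency for the ability to share one endpoint's memory.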

When deploying machine learning models to edge devices, what is a primary optimization goal?

Reducing the model's size and computational requirements to fit within device constraints. (correct)

What is a typical trade-off to consider between performance and cost when deploying ML models?

Increasing batch size to reduce costs can lead to higher latency. (correct)
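
A back-of-the-envelope model makes the trade-off visible. It assumes requests arrive at a steady rate, a batch is dispatched only once full, and each batch costs a fixed overhead plus a per-item time; every number below is invented for illustration.

```python
def batch_tradeoff(batch_size, arrival_rate, overhead_s, per_item_s):
    """Estimate worst-case latency and compute time per request for a batch size."""
    # The first request in a batch waits for the remaining ones to arrive.
    fill_wait_s = (batch_size - 1) / arrival_rate
    compute_s = overhead_s + per_item_s * batch_size
    return {
        "worst_case_latency_s": fill_wait_s + compute_s,
        "compute_s_per_request": compute_s / batch_size,
    }

# Hypothetical workload: 100 requests/s, 50 ms fixed overhead, 2 ms per item.
results = {
    b: batch_tradeoff(b, arrival_rate=100, overhead_s=0.05, per_item_s=0.002)
    for b in (1, 8, 32)
}
for b, r in results.items():
    print(b, r)
```

Larger batches amortize the fixed overhead across more requests, lowering compute per request, while the time spent waiting for the batch to fill raises latency.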

How do you automate resource provisioning for ML model deployment?

By using Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform. (correct)

You aim to update your machine learning model in production without disrupting the service. Which deployment strategy is MOST suitable?

Blue/Green deployment (correct)

Flashcards

Data Ingestion

The process of bringing data into a system for storage and processing.

Data Formats

Structured, semi-structured, and unstructured formats for storing data.