Machine Learning Data Preparation Techniques

Questions and Answers

A company stores historical data in .csv files in Amazon S3. Only some of the rows and columns in the .csv files are populated. The columns are not labeled. An ML engineer needs to prepare and store the data so that the company can use the data to train ML models. Select and order the correct steps from the following list to perform this task. Each step should be selected one time or not at all. (Select and order three.)

  • Create an Amazon SageMaker batch transform job for data cleaning and feature engineering.
  • Store the resulting data back in Amazon S3.
  • Use Amazon Athena to infer the schemas and available columns.
  • Use AWS Glue crawlers to infer the schemas and available columns.
  • Use AWS Glue DataBrew for data cleaning and feature engineering.

  • Use AWS Glue crawlers to infer the schemas and available columns. (correct, step 1)
  • Use AWS Glue DataBrew for data cleaning and feature engineering. (correct, step 2)
  • Store the resulting data back in Amazon S3. (correct, step 3)
  • Not selected: the Amazon SageMaker batch transform job and Amazon Athena options are distractors.

An ML engineer needs to use Amazon SageMaker Feature Store to create and manage features to train a model. Select and order the correct steps from the following list to create and use the features in Feature Store. Each step should be selected one time. (Select and order three.)

    • Access the store to build datasets for training.
    • Create a feature group.
    • Ingest the records.

  • Create a feature group. (correct, step 1)
  • Ingest the records. (correct, step 2)
  • Access the store to build datasets for training. (correct, step 3)

A company wants to host an ML model on Amazon SageMaker. An ML engineer is configuring a continuous integration and continuous delivery (CI/CD) pipeline in AWS CodePipeline to deploy the model. The pipeline must run automatically when new training data for the model is uploaded to an Amazon S3 bucket. Select and order the pipeline's correct steps from the following list. Each step should be selected one time or not at all. (Select and order three.)

    • An S3 event notification invokes the pipeline when new data is uploaded.
    • An S3 Lifecycle rule invokes the pipeline when new data is uploaded.
    • SageMaker retrains the model by using the data in the S3 bucket.
    • The pipeline deploys the model to a SageMaker endpoint.
    • The pipeline deploys the model to SageMaker Model Registry.

  • An S3 event notification invokes the pipeline when new data is uploaded. (correct, step 1)
  • SageMaker retrains the model by using the data in the S3 bucket. (correct, step 2)
  • The pipeline deploys the model to a SageMaker endpoint. (correct, step 3)
  • Not selected: the S3 Lifecycle rule and SageMaker Model Registry options are distractors.

An ML engineer is building a generative AI application on Amazon Bedrock by using large language models (LLMs). Select the correct generative AI term for each of the following descriptions. Each term should be selected one time or not at all. (Select three.)

    • Text representation of basic units of data processed by LLMs
    • High-dimensional vectors that contain the semantic meaning of text
    • Enrichment of information from additional data sources to improve a generated response

    • Embedding
    • Retrieval Augmented Generation (RAG)
    • Temperature
    • Token

    Token (description 1), Embedding (description 2), Retrieval Augmented Generation (RAG) (description 3)

    An ML engineer is working on an ML model to predict the prices of similarly sized homes. The model will base predictions on several features. The ML engineer will use the following feature engineering techniques to estimate the prices of the homes:

    • Feature splitting
    • Logarithmic transformation
    • One-hot encoding
    • Standardized distribution

    Select the correct feature engineering technique for each of the following features. Each technique should be selected one time or not at all. (Select three.)

    • City (name)
    • Type_year (type of home and year the home was built)
    • Size of the building (square feet or square meters)

    One-hot encoding (City), Feature splitting (Type_year), Standardized distribution (Size of the building)

    Study Notes

    Data Preparation Steps for Machine Learning Models

    • Data Source: Historical data in .csv files stored in Amazon S3.
    • Data Quality: Some rows and columns contain missing data; columns are unlabeled.
    • Goal: Prepare the data for machine learning models.
    • Step 1: Use AWS Glue crawlers to infer the schemas and available columns. The columns are unlabeled, so the schema must be discovered before any cleaning can happen.
    • Step 2: Use AWS Glue DataBrew for data cleaning and feature engineering. (A SageMaker batch transform job runs inference against an already trained model, so it is a distractor here.)
    • Step 3: Store the resulting data back in Amazon S3.
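The three steps above can be sketched as boto3 request payloads. This is a minimal illustration, not a deployable setup: the bucket, database, role ARNs, crawler, and recipe names are all hypothetical placeholders, and the dicts would be passed to `glue.create_crawler` and `databrew.create_recipe_job` respectively.

```python
# Sketch of the data-preparation flow: crawl -> clean with DataBrew -> write back to S3.
# All names (bucket, database, roles, crawler, recipe) are hypothetical placeholders.

def crawler_request(bucket: str) -> dict:
    """Step 1 (glue.create_crawler): infer schemas of the unlabeled .csv files."""
    return {
        "Name": "csv-schema-crawler",
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "DatabaseName": "historical_data",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/raw/"}]},
    }

def databrew_job_request(bucket: str) -> dict:
    """Step 2 (databrew.create_recipe_job): clean rows/columns and engineer features."""
    return {
        "Name": "clean-historical-csv",
        "RoleArn": "arn:aws:iam::123456789012:role/DataBrewRole",
        "DatasetName": "historical_data",
        "RecipeReference": {"Name": "fill-missing-and-encode"},
        # Step 3: the cleaned output lands back in Amazon S3.
        "Outputs": [{"Location": {"Bucket": bucket, "Key": "prepared/"}}],
    }

crawl = crawler_request("company-ml-data")
job = databrew_job_request("company-ml-data")
```

Note that step 3 is not a separate API call here: the DataBrew job's `Outputs` location is what writes the prepared data back to S3.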

    Feature Store Creation Steps

    • Goal: Create and manage features to train a machine learning model using Amazon SageMaker Feature Store.
    • Step 1: Create a feature group that defines the feature names and types.
    • Step 2: Ingest the data records into the feature group.
    • Step 3: Access the store to build datasets for training.
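As a rough sketch, the first two steps map to the `sagemaker.create_feature_group` and `sagemaker-featurestore-runtime.put_record` APIs; the group name and feature schema below are made up for illustration.

```python
# Sketch of the Feature Store steps as request payloads (names are hypothetical).

def feature_group_request() -> dict:
    """Step 1 (sagemaker.create_feature_group): define the group and its schema."""
    return {
        "FeatureGroupName": "home-features",
        "RecordIdentifierFeatureName": "home_id",
        "EventTimeFeatureName": "event_time",
        "FeatureDefinitions": [
            {"FeatureName": "home_id", "FeatureType": "String"},
            {"FeatureName": "event_time", "FeatureType": "String"},
            {"FeatureName": "size_sqft", "FeatureType": "Fractional"},
        ],
    }

def put_record_request(home_id: str, size_sqft: float, event_time: str) -> dict:
    """Step 2 (featurestore-runtime.put_record): ingest one record into the group."""
    return {
        "FeatureGroupName": "home-features",
        "Record": [
            {"FeatureName": "home_id", "ValueAsString": home_id},
            {"FeatureName": "event_time", "ValueAsString": event_time},
            {"FeatureName": "size_sqft", "ValueAsString": str(size_sqft)},
        ],
    }

# Step 3 is a read path: query the offline store (for example, through Athena)
# to assemble training datasets from the ingested records.
group = feature_group_request()
record = put_record_request("h-001", 1830.0, "2024-01-01T00:00:00Z")
```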

    Continuous Integration and Continuous Delivery (CI/CD) Pipeline for ML Model Deployment

    • Goal: Configure a CI/CD pipeline in AWS CodePipeline for automatic deployment of an ML model hosted in Amazon SageMaker. The pipeline triggers upon new data upload into Amazon S3.
    • Step 1: An S3 event notification invokes the pipeline when new data is uploaded.
    • Step 2: SageMaker retrains the model using the data from S3.
    • Step 3: The pipeline deploys the model to a SageMaker Endpoint.
    • Distractors: an S3 Lifecycle rule manages object transitions and expiration rather than invoking pipelines, and the SageMaker Model Registry catalogs model versions rather than serving predictions.
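Step 1 is typically wired up with an EventBridge rule whose target is the CodePipeline pipeline. A minimal sketch of the event pattern that matches new uploads, assuming a placeholder bucket name:

```python
# Sketch: EventBridge event pattern that starts the pipeline when a new object
# lands in the training-data bucket. The bucket name is a placeholder.
import json

def s3_upload_event_pattern(bucket: str) -> str:
    """Event pattern for an EventBridge rule targeting the CodePipeline pipeline."""
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }
    return json.dumps(pattern)

pattern = json.loads(s3_upload_event_pattern("training-data-bucket"))
```

This also shows why the Lifecycle-rule option is wrong: Lifecycle rules have no notion of invoking a target, whereas S3 "Object Created" events are exactly what EventBridge rules match on.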

    Generative AI Terms

    • Token: Text representation of basic units of data processed by LLMs (Large Language Models).
    • Embedding: High-dimensional vectors containing the semantic meaning of text.
    • Retrieval Augmented Generation (RAG): Enrichment of information from additional data sources to improve a generated response.
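A toy illustration of all three terms without any LLM: whitespace "tokens" (real tokenizers use subwords), tiny hand-made "embedding" vectors, and a nearest-neighbor lookup standing in for RAG retrieval. The words and vector values are invented for the example.

```python
# Toy illustration of token, embedding, and RAG retrieval (no model involved).
import math

def tokenize(text: str) -> list[str]:
    """Tokens: the basic units of text an LLM processes (simplified to words here)."""
    return text.lower().split()

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Embeddings: vectors that encode semantic meaning; these toy 3-d vectors are
# hand-picked so that "cat" and "kitten" point in similar directions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}

def retrieve(query: str) -> str:
    """RAG-style retrieval: fetch the most similar stored item to enrich a prompt."""
    q = embeddings[query]
    return max((k for k in embeddings if k != query), key=lambda k: cosine(q, embeddings[k]))

tokens = tokenize("Cats are great")
nearest = retrieve("kitten")  # semantically closest stored item
```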

    Feature Engineering Techniques for Home Price Prediction

    • One-Hot Encoding: encodes the categorical City (name) feature as binary indicator columns.
    • Feature Splitting: splits the combined Type_year feature into separate home-type and build-year features.
    • Standardized Distribution: rescales the numerical building-size feature to zero mean and unit variance.
    • Logarithmic Transformation: compresses heavily skewed numerical features; it is not needed for these three features, so it is the distractor.
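The three selected techniques can be sketched on a single home record. The city vocabulary, the `Type_year` value, and the size figures are invented for illustration:

```python
# Toy sketch of the three selected feature engineering techniques.
import statistics

def one_hot(city: str, vocab: list[str]) -> list[int]:
    """One-hot encoding: City (name) becomes one binary column per known city."""
    return [1 if city == v else 0 for v in vocab]

def split_type_year(type_year: str) -> tuple[str, int]:
    """Feature splitting: Type_year splits into a home type and a build year."""
    home_type, year = type_year.rsplit("_", 1)
    return home_type, int(year)

def standardize(values: list[float]) -> list[float]:
    """Standardized distribution: rescale sizes to zero mean and unit variance."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

city_vec = one_hot("Austin", ["Austin", "Boston", "Chicago"])   # -> [1, 0, 0]
home_type, year = split_type_year("condo_1998")                 # -> ("condo", 1998)
scaled = standardize([1200.0, 1500.0, 1800.0])                  # middle value -> 0.0
```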


    Description

    This quiz covers essential steps in preparing data for machine learning models, including data cleaning, feature engineering, and using Amazon SageMaker Feature Store. You'll learn how to manage data effectively and ensure high quality for your machine learning projects. Test your understanding of these crucial processes.
