SageMaker Data Wrangler Overview

Created by
@FieryBasilisk

Questions and Answers

What functionality does SageMaker Feature Store provide?

  • Data validation
  • Real-time monitoring of AWS resources
  • Storage and management of ML features (correct)
  • Version control for ML models

Which statement accurately describes AWS CloudTrail?

  • Creates a visual representation of API usage.
  • Requires a subscription fee for event history access.
  • Records only data events like get/delete actions.
  • Records user and API activity by default. (correct)

What is a limitation of Amazon Macie?

  • It works independently of S3.
  • It cannot mask PII in data. (correct)
  • It cannot discover sensitive data.
  • It cannot provide detailed reports.

In AWS Step Functions, what does a Pass state do?

    It performs no work and passes its input directly to its output.

    What is a primary benefit of using Apache Parquet files over CSV?

    Parquet files can store schema information automatically.

    What is managed by Amazon Managed Workflows for Apache Airflow (MWAA)?

    Orchestration of data pipelines in the cloud.

    Which operation does SageMaker Processing NOT perform?

    Historical data tracking

    What distinguishes Amazon CloudWatch from AWS CloudTrail?

    CloudWatch monitors resource and application metrics and logs, while CloudTrail records user and API activity.

    Which aspect of SageMaker ML Lineage Tracking is crucial?

    It establishes model governance and audit standards.

    What is a feature of Amazon Managed Service for Apache Flink?

    It can analyze both streaming and static data sources.

    What is the primary function of the Fail state in a state machine?

    To stop execution and mark it as a failure.

    Which of the following best describes the role of AWS CloudHSM?

    To provide secure storage for cryptographic keys and process cryptographic operations.

    What is the primary purpose of the Wait state in a state machine?

    To pause the execution for a specified period.

    What is the benefit of using Local Secondary Indexes (LSIs) in DynamoDB?

    They allow querying with an alternative sort key while keeping the same partition key as the base table.

    Which AWS service is primarily used to automate the movement of large datasets between on-premises storage and AWS?

    AWS DataSync

    In Kafka terminology, what role do producers play?

    They deliver data to Kafka brokers.

    What does AWS AppFlow primarily enable?

    The integration of SaaS applications with AWS services.

    How does AWS Graviton improve performance for cloud services?

    By offering better price performance.

    What does the term 'MSK serverless' refer to in the context of Kafka?

    A service that allows Kafka to operate without the need for server provisioning.

    What is the function of Cipher-based Message Authentication Codes (CMACs) in CloudHSM?

    To authenticate messages and ensure their integrity over insecure networks.

    Study Notes

    SageMaker Data Tools

    • SageMaker Data Wrangler facilitates import, preparation, transformation, featurization, and analysis of data, including running Exploratory Data Analysis (EDA).
    • Data Wrangler lacks built-in functionality to ensure data accuracy, completeness, and trustworthiness, or to identify and mask Personally Identifiable Information (PII).

    SageMaker Feature Store

    • Provides a storage and data management layer for machine learning (ML).
    • Enables creation, storage, and sharing of features for ML models.

    SageMaker ML Lineage Tracking

    • Creates and stores information regarding steps in the ML workflow.
    • Supports model governance and auditing standards while ensuring data accuracy and trustworthiness.

    SageMaker Processing

    • Managed service for executing processing workloads, data validation, and model evaluation.

    Amazon Macie

    • Utilizes machine learning to automatically discover sensitive data and privacy issues within datasets.
    • Can discover but not mask PII and is restricted to working on Amazon S3.
    • Generates detailed reports on data findings, including source information.

    AWS AppFlow

    • Connects Software as a Service (SaaS) applications with AWS services for seamless data flow.

    Amazon Managed Workflows for Apache Airflow (MWAA)

    • Managed orchestration service for Apache Airflow, which is an open-source tool for programmatic authoring, scheduling, and monitoring workflows.
    • Used to set up and operate scalable data pipelines in the cloud.

    Amazon Aurora DB

    • Fully managed relational database service that is compatible with MySQL and PostgreSQL, providing high performance.

    Fault Injection Techniques

    • Can simulate different fault scenarios in Amazon Aurora to test systems, e.g., invoking read replica failures or disk failures.

    Amazon CloudWatch

    • Monitors AWS resources and applications by storing logs and tracking metrics.
    • Supports delivering real-time log events from CloudWatch Logs to other services such as Kinesis Data Streams (KDS) or OpenSearch.

    AWS CloudTrail

    • Records user and API activity across AWS services and is enabled by default.
    • Provides an event history: a searchable, downloadable, immutable record of the past 90 days of management events, viewable at no charge.

    CloudTrail Lake

    • Managed data lake dedicated to storing user and API activities.
    • CloudTrail Insights analyzes normal usage patterns, generating insights when anomalies, such as abnormal volume or errors, occur.

    Management vs Data Events

    • Management events cover administrative (control-plane) actions on resources, while data events cover operations performed on or within a resource, such as get, put, delete, or invoke. Logging data events incurs additional costs.

    Amazon Managed Service for Apache Flink

    • A managed service for running Apache Flink, an open-source framework for stream and batch processing that supports several languages (Java, Scala, Python, SQL).
    • Capable of processing streaming and static data for time-series analytics.

    Apache Parquet vs CSV

    • Parquet is a columnar data format, enhancing efficiency and speed of data read operations compared to the row-oriented CSV format.
    • Stores its schema alongside the data, supports predicate pushdown, and uses a binary format.
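
To make the columnar layout and predicate pushdown concrete, here is a toy sketch in plain Python. It is not the Parquet file format: the row-group layout, the `to_row_groups` and `scan_latency_above` helpers, and the sample records are all assumptions made for illustration. It only shows why storing columns contiguously with min/max statistics lets a reader skip data it does not need.

```python
# Toy illustration only: real Parquet is a binary format with encodings and
# compression; this just contrasts row vs. column layout and min/max pruning.

# Row-oriented records, as they would appear in a CSV file (hypothetical data).
rows = [
    {"id": 1, "region": "us-east-1", "latency_ms": 120},
    {"id": 2, "region": "eu-west-1", "latency_ms": 340},
    {"id": 3, "region": "us-east-1", "latency_ms": 95},
    {"id": 4, "region": "ap-south-1", "latency_ms": 410},
]


def to_row_groups(records, group_size):
    """Split records into row groups, storing each column contiguously
    together with per-column (min, max) statistics."""
    groups = []
    for start in range(0, len(records), group_size):
        chunk = records[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan_latency_above(groups, threshold):
    """Return ids whose latency_ms exceeds threshold, skipping any row group
    whose max latency rules it out (the idea behind predicate pushdown)."""
    ids, groups_read = [], 0
    for group in groups:
        if group["stats"]["latency_ms"][1] <= threshold:
            continue  # whole row group pruned using statistics alone
        groups_read += 1
        cols = group["columns"]
        # Only the two needed columns are touched; "region" is never read.
        ids += [i for i, lat in zip(cols["id"], cols["latency_ms"]) if lat > threshold]
    return ids, groups_read


groups = to_row_groups(rows, group_size=2)
matching_ids, groups_read = scan_latency_above(groups, threshold=400)
print(matching_ids, groups_read)
```

A CSV reader would have to parse every field of every row to answer the same query, which is the efficiency gap the notes describe.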

    AWS Step Functions

    • Serverless orchestration service enabling integration with AWS Lambda and other services to build applications with visual workflows.
    • Workflows consist of various states, including Pass, Task, Choice, Wait, Succeed, Fail, Parallel, and Map states.
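
The state types listed above can be sketched as a small Amazon States Language definition, written here as a Python dict. The state names (`Forward`, `Pause`, `Check`, `Done`, `Reject`) and the `score` input field are hypothetical. The `simulate` helper is a local stand-in that handles only this subset and never sleeps on Wait; it is not how Step Functions itself executes workflows.

```python
import json

# Amazon States Language definition expressed as a Python dict.
# State names and the "$.score" field are hypothetical.
definition = {
    "StartAt": "Forward",
    "States": {
        "Forward": {"Type": "Pass", "Next": "Pause"},
        "Pause": {"Type": "Wait", "Seconds": 5, "Next": "Check"},
        "Check": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.score", "NumericGreaterThan": 0.5, "Next": "Done"}
            ],
            "Default": "Reject",
        },
        "Done": {"Type": "Succeed"},
        "Reject": {"Type": "Fail", "Error": "LowScore", "Cause": "score too low"},
    },
}


def simulate(defn, data):
    """Walk the definition locally and return (status, visited state names).
    Supports only Pass, Wait, Choice (NumericGreaterThan), Succeed, and Fail."""
    name, visited = defn["StartAt"], []
    while True:
        state = defn["States"][name]
        visited.append(name)
        kind = state["Type"]
        if kind == "Succeed":
            return "SUCCEEDED", visited
        if kind == "Fail":
            return "FAILED", visited
        if kind == "Choice":
            name = state["Default"]
            for rule in state["Choices"]:
                field = rule["Variable"][2:]  # strip the leading "$."
                if data[field] > rule["NumericGreaterThan"]:
                    name = rule["Next"]
                    break
        else:
            # Pass and Wait both hand their input to the next state unchanged.
            name = state["Next"]


status, path = simulate(definition, {"score": 0.9})
print(status, path)
print(json.dumps(definition, indent=2))  # JSON form of the definition
```

Running the same definition with a low score takes the `Default` branch and ends in the Fail state instead.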

    AWS CloudHSM

    • Combines cloud infrastructure with the security of Hardware Security Modules (HSMs) for cryptographic operations and key storage.
    • Use cases include managing private keys, encrypting/decrypting data, and supporting message authentication via CMACs and HMACs.
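
As a rough illustration of message authentication, the sketch below computes and verifies an HMAC with Python's standard library. It runs entirely locally and does not call CloudHSM: with an HSM the key would be generated and kept inside the device, whereas here the key, the message, and the `sign` and `verify` helpers are assumptions for demonstration. CMAC works toward the same goal using a block cipher instead of a hash function and is not shown.

```python
import hashlib
import hmac
import secrets

# Hypothetical shared key; with CloudHSM the key would stay inside the HSM.
key = secrets.token_bytes(32)
message = b"transfer 100 units to account 42"


def sign(key, message):
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key, message, tag):
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


tag = sign(key, message)
print(verify(key, message, tag))          # untouched message
print(verify(key, message + b"0", tag))   # tampered message
```

A receiver holding the same key can detect any change to the message, because the recomputed tag no longer matches.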

    Managed Streaming for Apache Kafka (MSK)

    • Fully managed service for building and running applications using Kafka, facilitating operations such as creating, updating, and deleting clusters.
    • Replicates messages between brokers for fault tolerance and uses Apache ZooKeeper for broker management.
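
The roles of producers, brokers, and replication can be pictured with a toy in-memory model. This is not a Kafka client and makes no network calls: the `Broker` and `ToyCluster` classes, the key-to-leader rule, and the sample record are all invented for illustration. In real Kafka the producer writes to a partition leader and followers copy from it; the sketch collapses that into one step.

```python
class Broker:
    """One broker holding an append-only log of records."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []


class ToyCluster:
    """Hypothetical cluster model: a record lands on a leader broker and is
    copied to the next brokers until the replication factor is met."""

    def __init__(self, broker_count, replication_factor):
        self.brokers = [Broker(i) for i in range(broker_count)]
        self.replication_factor = replication_factor

    def replicas_for(self, key):
        # Deterministic toy rule: derive the leader from the key's bytes.
        leader = sum(key.encode()) % len(self.brokers)
        return [
            self.brokers[(leader + i) % len(self.brokers)]
            for i in range(self.replication_factor)
        ]

    def produce(self, key, value):
        """What a producer does: deliver a record to the brokers."""
        targets = self.replicas_for(key)
        for broker in targets:
            broker.log.append((key, value))
        return [b.broker_id for b in targets]


cluster = ToyCluster(broker_count=3, replication_factor=2)
replica_ids = cluster.produce("sensor-a", "21.5C")

# Simulate losing the leader: the record is still held by a follower.
survivors = [b for b in cluster.brokers if b.broker_id != replica_ids[0]]
still_available = any(("sensor-a", "21.5C") in b.log for b in survivors)
print(replica_ids, still_available)
```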

    AWS DataSync

    • Automatically transfers large datasets between on-premises storage and AWS services (S3, EFS, FSx).

    AWS Schema Conversion Tool (SCT)

    • Converts database schemas from one engine to another, facilitating database migration.

    DynamoDB

    • Fast, NoSQL key-value database that employs both partition keys and sort keys.
    • Local Secondary Indexes (LSIs) keep the table's partition key but provide an alternative sort key, giving additional query flexibility.
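
One way to see how an LSI relates to the base table is to look at the request parameters for creating one. The dict below follows the shape accepted by the DynamoDB `create_table` operation in boto3; the table name `Orders`, the index name `by_total`, and the attribute names are hypothetical. The call itself is left as a comment so the sketch runs without AWS credentials.

```python
# Request parameters shaped for the DynamoDB create_table operation.
# Table, attribute, and index names are hypothetical.
table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "order_total", "AttributeType": "N"},
    ],
    # Base table: partition key plus sort key.
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "order_date", "KeyType": "RANGE"},
    ],
    # LSI: same partition key, different sort key.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "by_total",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "order_total", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}


def key_of(key_schema, key_type):
    """Return the attribute name that plays the given role (HASH or RANGE)."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == key_type)


lsi_schema = table_params["LocalSecondaryIndexes"][0]["KeySchema"]
same_partition_key = key_of(table_params["KeySchema"], "HASH") == key_of(lsi_schema, "HASH")
different_sort_key = key_of(table_params["KeySchema"], "RANGE") != key_of(lsi_schema, "RANGE")
print(same_partition_key, different_sort_key)

# With credentials configured, the table could be created with:
# import boto3
# boto3.client("dynamodb").create_table(**table_params)
```

Note that LSIs must be declared when the table is created, which is why they appear in the `create_table` parameters rather than in a later update.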

    Amazon Managed Grafana

    • Fully managed service for data visualization, providing instant queries and operational metrics visualization.

    Workflow Types

    • Step Functions offers broader AWS service integration, Glue Workflows focus on ETL tasks, and Apache Airflow is a more complex option.
    • AWS AppFlow is specifically designed to link SaaS applications with AWS services.

    AWS PrivateLink

    • Facilitates private network connections between consumer Virtual Private Clouds (VPCs) and service provider VPCs.

    CloudShell

    • Provides a browser-based, pre-authenticated shell accessible from the AWS Management Console for executing AWS CLI commands without local installation.

    AWS Cloud9

    • Web-based Integrated Development Environment (IDE) for coding and collaboration.

    Ephemeral Volume

    • Temporary storage local to individual instances, providing low-latency access.

    AWS Graviton

    • Offers up to 40% better price performance than comparable x86-based instances.

    Description

    This quiz covers the functionality of SageMaker Data Wrangler, including data import, preparation, transformation, and analysis for machine learning workflows. It also highlights related features such as Feature Store and ML Lineage Tracking, which are critical for model governance and data trustworthiness.