AWS Data Ingestion

Questions and Answers

In the context of data ingestion, what is a primary factor to consider when deciding between batch and stream ingestion?

  • The cost of storage.
  • The number of team members available.
  • The programming language used.
  • Data volume and velocity. (correct)

Which of the following best describes a typical use case for batch ingestion?

  • Ingesting sales transaction data from multiple retail locations for overnight analysis. (correct)
  • Analyzing clickstream data from a website to provide product recommendations in real-time.
  • Processing real-time stock market data for immediate trading decisions.
  • Monitoring sensor data from IoT devices for immediate anomaly detection.

In stream processing, what role do producers play?

  • They store the data in a database.
  • They analyze the data in real-time.
  • They transform the data into a usable format.
  • They put records onto a stream. (correct)

Which feature is most characteristic of data streams?

Answer: They are designed to handle high-velocity data and real-time processing.

When building a batch processing pipeline, which of the following tasks is involved in the 'Transform/Load' stage?

Answer: Identifying the source and target schemas.

Which of the following is a key characteristic of a well-designed batch processing pipeline?

Answer: It provides alerting on job failure.

What role does workflow orchestration play in batch ingestion processing?

Answer: It handles interdependencies between jobs and manages failures.

Which of the following is a purpose-built AWS tool best suited for ingesting data from SaaS applications?

Answer: Amazon AppFlow.

If you need to ingest on-premises genome sequencing data to Amazon S3, which AWS service is most appropriate?

Answer: AWS DataSync.

What is the primary purpose of AWS Data Exchange?

Answer: To integrate third-party datasets into your pipeline.

What is one key benefit of using AWS Glue for batch ingestion tasks?

Answer: Schema identification.

What is the role of AWS Glue crawlers in schema identification and data cataloging?

Answer: They derive schemas from data stores and provide them to the AWS Glue Data Catalog.

Which of the following is a key feature of AWS Glue Studio?

Answer: Low-code job creation and management.

In AWS Glue, how are jobs processed in a serverless environment?

Answer: Through the AWS Glue Spark runtime engine.

What is the purpose of AWS Glue workflows?

Answer: They provide ETL orchestration.

Which AWS service provides integrated monitoring and logging for AWS Glue, including job run insights?

Answer: Amazon CloudWatch.

When scaling AWS Glue jobs, what is the effect of increasing the number of workers?

Answer: Horizontal scaling.

For what type of batch processing workload is it most beneficial to choose a larger worker type in AWS Glue?

Answer: Processing machine learning (ML) transformations.

When building a real-time stream processing pipeline, what do 'producers' primarily do?

Answer: They put records on the stream.

Which of the following is a key characteristic of stream ingestion and processing?

Answer: It uses loose coupling.

In the context of Kinesis Data Streams, what is a shard?

Answer: A uniquely identified sequence of data records.

How does a partition key affect data records in Amazon Kinesis Data Streams?

Answer: It determines which shard to use.

What is a benefit of using Amazon Data Firehose for stream processing?

Answer: It performs no-code or low-code streaming ETL.

What is the main purpose of Amazon Managed Service for Apache Flink?

Answer: Query and analyze streaming data.

What is the role of the Kinesis Producer Library (KPL) in stream processing?

Answer: It simplifies the work of writing producers for Kinesis Data Streams.

Which components are included in the data records on Kinesis Streams?

Answer: Sequence number, partition key, and data blob.

Which action can be performed by the AWS IoT Core rules engine?

Answer: Transform and route incoming messages to AWS services.

Which availability metric can be tracked with CloudWatch for Kinesis?

Answer: Read and write failures.

What is a purpose of AWS IoT?

Answer: Securely connect, process, and act on IoT device data.

What components would you find in the AWS IoT universe?

Answer: Devices, interfaces, communications, and cloud services.

What communications protocols are used with AWS IoT?

Answer: MQTT and pub/sub.

A data engineer is tasked to create a Stream Processing Pipeline to reformat a .csv file to .json and deliver it to an S3 bucket, while minimizing the amount of code. Which service should they use?

Answer: Amazon Data Firehose.

True or False: Kinesis Data Streams allows consumer applications running on services such as Amazon EC2 to consume the ingested data.

Answer: True.

True or False. AWS Glue requires you to manually manage and maintain servers in order for it to run.

Answer: False.

You are using AWS Glue and need to run many jobs in parallel. Your data comes in the form of large, splittable files. What should you use to let the AWS Glue Spark runtime engine run many jobs in parallel?

Answer: Make sure the files are large and splittable.

You need to ingest large amounts of data to data stores, data lakes, and analytics services. What is the best method of doing this?

Answer: Amazon Data Firehose.

What is a scaling option for Kinesis Data Streams?

Answer: All of the above (scaling options include write stream capacity and consumer types).

What functionality is Amazon CloudWatch used for?

Answer: All of the above.

Which AWS service has real-time data ingestion as its main feature?

Answer: Amazon Kinesis Data Streams.

A company needs to ingest sales transaction data and also sensor data from IoT devices. Choose one primary AWS service for each data type, in order:

Answer: AWS DMS and Amazon Kinesis Data Streams.

Flashcards

  • Batch ingestion: Ingest and process records as a dataset; runs on demand, on a schedule, or based on an event.
  • Streaming ingestion: Ingest records continually, processing sets of records as they arrive on the stream.
  • Ingestion method suitability: Choose a method that suits both the amount of data being ingested and the frequency at which it arrives.
  • Batch ingestion example: Sales transaction data processed overnight and reported in the morning.
  • Streaming ingestion example: Clickstream data processed in real time to provide immediate product recommendations.
  • Batch job process: Query the source, transform the data, and load it into the pipeline.
  • Traditional ETL: Extract, Transform, Load; the traditional batch data pipeline.
  • Stream processing: Producers put records on a stream; consumers retrieve and process them.
  • Batch ingestion tasks: Writing scripts and jobs to perform the ETL or ELT process.
  • Workflow orchestration: Handles interdependencies between jobs and manages failures.
  • Amazon AppFlow: Simplifies ingestion of data from software as a service (SaaS) applications.
  • AWS DMS: Pulls data from relational databases into AWS.
  • AWS DataSync: Transfers data from file systems into AWS.
  • AWS Data Exchange: Finds and incorporates third-party datasets into your pipelines.
  • AWS Glue: Data integration service that simplifies ETL tasks.
  • AWS Glue crawlers: Derive schemas from data stores and provide them to the centralized AWS Glue Data Catalog.
  • AWS Glue Studio: Provides visual authoring and job management tools.
  • AWS Glue Spark runtime engine: Processes jobs in a serverless environment.
  • AWS Glue workflows: Provide ETL orchestration.
  • CloudWatch: Provides integrated monitoring and logging for AWS Glue.
  • Horizontal scaling: Increase the number of workers allocated to the job.
  • Vertical scaling: Choose a worker type with larger CPU, memory, and disk space.
  • Stream throughput: Plan for a resilient, scalable stream that can adapt to changing velocity and volume.
  • Loose coupling: Independent, modular ingestion, processing, and consumer components.
  • Parallel consumers: Multiple consumers act on a Kinesis stream without affecting one another.
  • Checkpointing and replay: Record order is maintained and replay is possible, supporting recovery from failure.
  • The stream: A buffer between producers and consumers.
  • Amazon Data Firehose: Ingests data from other AWS streaming services and delivers it with no-code or low-code streaming ETL.
  • Amazon Managed Service for Apache Flink: Queries and analyzes streaming data from other AWS streaming services.
  • Shard: A uniquely identified sequence of data records.
  • Data records: Contain a sequence number, partition key, and data blob.
  • Kinesis scaling: Scaling options include write stream capacity and consumer types.
  • AWS IoT Core: Used to securely connect, process, and act on IoT device data.
  • MQTT: The pub/sub messaging protocol used to communicate with IoT devices.
  • AWS IoT Core rules engine: Transforms and routes incoming messages to AWS services.
  • Sample exam scenario: A stream processing pipeline that reformats .csv data to .json before delivering to Amazon S3 (Amazon Data Firehose).

Study Notes

  • This module prepares you to perform key tasks when building an ingestion layer.
  • It also covers how purpose-built AWS services support ingestion tasks.
  • The features of AWS Glue work together to automate batch ingestion.
  • AWS streaming services and features simplify streaming ingestion.
  • Configuration options in AWS Glue and Amazon Kinesis Data Streams help you scale ingestion processing.
  • Ingesting Internet of Things (IoT) data by using AWS IoT Core has distinct characteristics.

Batch and Streaming Data Flow

  • Batch ingestion processes a batch of records as a dataset, running on demand, on a schedule, or based on an event.
  • Streaming ingestion continually ingests records and processes sets of records as they arrive on the stream.
  • Key drivers for data ingestion are data volume and velocity.

Batch Ingestion

  • Sales transaction data is sent periodically to a central location for overnight analysis and reports.

Streaming Ingestion

  • Clickstream data has a large volume of small bits of data sent at a continuous pace and must be analyzed immediately for recommendations.

Batch and Stream Processing Basics

  • Batch jobs query the source, transform the data, and load it into the pipeline.
  • Traditional ETL uses batch processing.
  • Batch ingestion involves writing scripts and jobs to perform ETL or ELT processes.
  • With stream processing, producers put records on a stream where consumers get and process them (see the sketch below).
  • Streams are designed to handle high-velocity data and real-time processing.
  • Key characteristics for pipeline design include ease of use, data volume and variety, orchestration and monitoring, scaling, and cost management.
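
A minimal sketch of the producer and consumer roles using boto3 and Kinesis Data Streams; the stream name and record contents are hypothetical, and the stream is assumed to already exist:

```python
import json
import boto3

STREAM_NAME = "clickstream-demo"  # hypothetical stream, created beforehand
kinesis = boto3.client("kinesis")

# Producer: put a record on the stream. The partition key determines
# which shard receives the record.
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps({"user_id": "u-123", "page": "/product/42"}).encode("utf-8"),
    PartitionKey="u-123",
)

# Consumer: read records from the first shard, starting at the oldest record.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    # Each data record carries a sequence number, partition key, and data blob.
    print(record["SequenceNumber"], record["PartitionKey"], record["Data"])
```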

Building a Batch Processing Pipeline

  • Start by extracting data from sources; then either transform the data and load it (ETL) or load it first and transform it in place (ELT), depending on the pipeline design.
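
To make the extract/transform/load stages concrete, here is a skeleton of an AWS Glue Spark job following the ETL pattern; the database, table, column mappings, and S3 path are hypothetical placeholders:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read from a Data Catalog table (hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_transactions"
)

# Transform: rename and retype columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("txn_id", "string", "transaction_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/transactions/"},
    format="parquet",
)
job.commit()
```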

Data Volume and Variety

  • Handle large volumes of data.
  • Support disparate source and target systems.
  • Support different data formats seamlessly.

Orchestration and Monitoring Parameters

  • Support workflow creation.
  • Provide dependency management across the workflow.
  • Support job bookmarking.
  • Enable logging.
  • Alert on job failure (see the sketch below).
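
As one example of alerting on job failure, an EventBridge rule can match Glue job state changes and forward them to an SNS topic; this is a hedged sketch in which the rule name and topic ARN are hypothetical, and the topic is assumed to already exist:

```python
import json
import boto3

events = boto3.client("events")

# Match Glue job runs that end in FAILED or TIMEOUT.
events.put_rule(
    Name="glue-job-failure-alert",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
    State="ENABLED",
)

# Route matching events to a pre-existing SNS topic (hypothetical ARN).
events.put_targets(
    Rule="glue-job-failure-alert",
    Targets=[{"Id": "notify-team",
              "Arn": "arn:aws:sns:us-east-1:123456789012:data-alerts"}],
)
```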

Match AWS Purpose-Built Tools to Data Sources

  • Amazon AppFlow ingests data from software as a service (SaaS) applications by creating connectors with filters.
  • Amazon AppFlow can map fields, perform transformations and validation, and securely transfer data to Amazon S3 or Amazon Redshift.
  • AWS DMS ingests data from relational databases, connecting to source data and formatting it for a target.
  • With AWS DMS, you can use source filters and table mappings, perform data validation, write to many AWS data stores, or create a continuous replication task.
  • AWS DataSync is used for ingesting data from file systems, applying filters to transfer a subset of files.
  • DataSync supports a variety of file systems as sources and targets, including Amazon S3 as a target.
  • DataSync can also securely transfer data between self-managed storage systems and AWS storage services.
  • AWS Data Exchange helps integrate third-party datasets into pipelines.
  • With AWS Data Exchange, you can find and subscribe to sources, preview datasets before subscribing, copy subscribed datasets to Amazon S3, and receive notifications of updates.
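
A short sketch of triggering a DataSync transfer from code; it assumes a DataSync task (with an on-premises source location and an S3 destination) was already created, and the task ARN is a placeholder:

```python
import boto3

datasync = boto3.client("datasync")

# Start an execution of a pre-configured transfer task (hypothetical ARN).
response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"
)
print("Execution started:", response["TaskExecutionArn"])
```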

Key Takeaways

  • Choose purpose-built tools that match the type of data to be ingested and simplify ingestion tasks.
  • Amazon AppFlow, AWS DMS, and DataSync each simplify the ingestion of specific data types.
  • AWS Data Exchange provides a simplified way to find and subscribe to third-party datasets.
  • AWS Glue simplifies batch ingestion tasks through schema identification, data cataloging, job authoring, and job monitoring.
  • AWS Glue offers serverless ETL processing and ETL orchestration.
  • AWS Glue crawlers derive schemas from data stores and provide them to the centralized AWS Glue Data Catalog.
  • AWS Glue Studio provides visual authoring and job management tools.
  • The AWS Glue Spark runtime engine processes jobs in a serverless environment.
  • AWS Glue workflows provide ETL orchestration.
  • CloudWatch provides integrated monitoring and logging for AWS Glue, including job run insights.
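
To illustrate how crawlers feed the Data Catalog, the sketch below creates and starts a crawler over an S3 path; the crawler name, IAM role, database, and path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that derives a schema from files in S3 and registers
# the resulting table in a Data Catalog database (hypothetical names).
glue.create_crawler(
    Name="raw-transactions-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/transactions/"}]},
)
glue.start_crawler(Name="raw-transactions-crawler")
# When the run finishes, the derived schema appears as a catalog table.
```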

Horizontal Scaling

  • Increase the number of workers allocated to the job.
  • Horizontal Scaling is used when working with large, splittable datasets.

Vertical Scaling

  • Choose a worker type with larger CPU, memory, and disk space.
  • Use vertical scaling when working with memory-intensive or disk-intensive applications, or when executing machine learning (ML) transformations (see the sketch below).
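
The two scaling approaches map directly to job-run parameters; a sketch with hypothetical job names:

```python
import boto3

glue = boto3.client("glue")

# Horizontal scaling: more workers for a large, splittable dataset.
glue.start_job_run(
    JobName="nightly-sales-etl",  # hypothetical
    WorkerType="G.1X",
    NumberOfWorkers=20,
)

# Vertical scaling: a larger worker type for memory-intensive or ML transforms.
glue.start_job_run(
    JobName="ml-feature-transform",  # hypothetical
    WorkerType="G.2X",
    NumberOfWorkers=5,
)
```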

Scaling Considerations

  • Performance goals should focus on optimizing batch processing.
  • Scale AWS Glue jobs horizontally by adding more workers, or vertically by choosing a larger worker in the job configuration.
  • Large, splittable files let the AWS Glue Spark runtime engine run many jobs in parallel with less overhead than processing many smaller files.
Stream Ingestion and Processing

  • Key characteristics for stream ingestion and processing are throughput, loose coupling, parallel consumers, checkpointing, and replay.
  • Throughput means planning for a resilient, scalable stream that can adapt to changing velocity and volume.
  • Loose coupling involves building independent ingestion, processing, and consumer components.
  • Parallel consumers allow multiple consumers on a stream to process records in parallel and independently.
  • Checkpointing and replay maintain record order, allow replay, and support marking the farthest record processed on failure.
Amazon Data Firehose and Amazon Managed Service for Apache Flink

  • Amazon Data Firehose performs no-code or low-code streaming ETL by ingesting from many AWS services and applying built-in and custom transformations.
  • With Amazon Data Firehose, you can deliver directly to data stores, data lakes, and analytics services.
  • Amazon Managed Service for Apache Flink queries and analyzes streaming data by ingesting data from other services and enriching and augmenting it across time windows.
  • Applications are built in Apache Flink by using SQL, Java, Python, or Scala.
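
Sending a record into a Firehose delivery stream is a single call; this sketch assumes a delivery stream already configured with an S3 destination and a transformation (for example, a Lambda function that reformats CSV to JSON), with a hypothetical name:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose buffers incoming records, applies any configured transformation,
# and delivers the results to the destination (for example, an S3 bucket).
firehose.put_record(
    DeliveryStreamName="csv-to-json-delivery",  # hypothetical
    Record={"Data": b"2024-01-01,store-7,19.99\n"},
)
```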
Kinesis Data Streams Scaling

  • The stream is a buffer between the producers and the consumers of the stream.
  • Scaling considerations for Kinesis Data Streams include managing scaling options, data throughput, and the capacity for data written to the stream.
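
Two of the scaling options, shown as a hedged sketch with hypothetical stream names: resharding a provisioned stream, or switching the stream to on-demand capacity mode:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: scale write capacity by changing the shard count.
kinesis.update_shard_count(
    StreamName="clickstream-demo",  # hypothetical
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

# On-demand mode: let the service manage shard capacity automatically.
kinesis.update_stream_mode(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream-demo",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```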
AWS IoT Core

  • AWS IoT Core provides the ability to securely connect, process, and act on IoT device data, including features to filter and transform.
  • AWS IoT Core can also route data to other AWS services, including streaming and storage services.
  • With AWS IoT services, you can use MQTT and a pub/sub model to communicate with IoT devices.
  • The AWS IoT Core rules engine transforms and routes incoming messages to AWS services (see the sketch below).
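
A sketch of a rules-engine rule that filters and routes device messages; the rule name, topic pattern, IAM role, and target stream are all hypothetical:

```python
import boto3

iot = boto3.client("iot")

# Select fields from an MQTT topic, filter on a condition, and route
# matching messages to a Kinesis data stream.
iot.create_topic_rule(
    ruleName="sensor_to_kinesis",  # hypothetical
    topicRulePayload={
        "sql": "SELECT device_id, temperature FROM 'sensors/+/telemetry' "
               "WHERE temperature > 60",
        "actions": [{
            "kinesis": {
                "roleArn": "arn:aws:iam::123456789012:role/IoTToKinesisRole",
                "streamName": "sensor-alerts",
                "partitionKey": "${device_id}",
            }
        }],
        "ruleDisabled": False,
    },
)
```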
