AWS Data Ingestion: Batch and Streaming

Questions and Answers

When designing a data ingestion strategy, what factors are most influential in determining whether to use batch or stream ingestion?

  • The volume of the data and the speed at which it needs to be ingested and processed. (correct)
  • The compliance requirements for data governance and the size of the data engineering team.
  • The cost of the chosen AWS services and the availability of pre-built connectors.
  • The number of different data sources and the complexity of the required transformations.

A company is migrating from an on-premises data warehouse to AWS. They need to transfer large volumes of data from their local file system to Amazon S3 for further processing. Which AWS service is purpose-built for this task?

  • AWS Data Pipeline
  • AWS Storage Gateway
  • AWS DataSync (correct)
  • AWS Transfer Family

Which of the following is a key characteristic of stream processing that distinguishes it from batch processing?

  • Processing data in large, predefined datasets.
  • Analyzing data overnight and generating reports in the morning.
  • Querying a source, transforming the data, and loading it into a pipeline.
  • Ingesting and processing records continually as they arrive. (correct)

When using AWS Glue for batch ingestion, which feature helps in automatically discovering the schema of your data?

  • AWS Glue Crawlers (correct)

An organization needs to ingest customer data from a third-party marketing platform into their data lake on AWS. They want a solution that simplifies the process of finding and subscribing to the required datasets. Which AWS service should they use?

  • AWS Data Exchange (correct)

When designing a batch processing pipeline with AWS Glue, what is the primary benefit of using Glue workflows?

  • To handle interdependencies between jobs and manage failures. (correct)

A data engineer is tasked with building a stream processing application that requires real-time analytics on data as it passes through the stream. Which AWS service is best suited for this purpose?

  • Amazon Managed Service for Apache Flink (correct)

What is the role of the Kinesis Producer Library (KPL) in the context of AWS Kinesis Data Streams?

  • It simplifies the work of writing producers that send data to Kinesis streams. (correct)

Which of the following is a key scaling consideration when using Amazon Kinesis Data Streams for stream processing?

  • Managing the number of shards to handle throughput. (correct)

An IoT platform collects data from numerous sensors in real-time. Which protocol is commonly used for communication with IoT devices in AWS IoT Core?

  • MQTT (correct)

A data engineer is setting up a new Amazon Kinesis Data Stream. What does the 'retention period' determine?

  • The duration for which data records are stored in the stream. (correct)

A financial services company needs to ingest sales transaction data from retailers around the world. The data is sent periodically to a central location, analyzed overnight, and reports are sent to branches in the morning. Which type of data ingestion is most suitable for this use case?

  • Batch ingestion (correct)

What is the main purpose of 'workflow orchestration' in a batch data ingestion pipeline?

  • To handle interdependencies between jobs and manage failures effectively. (correct)

A company wants to use Amazon AppFlow to ingest data from a SaaS application. What is a key step in configuring this data ingestion?

  • Creating a connector with appropriate filters to select the required data. (correct)

What advantage does using Amazon Data Firehose offer over directly writing to Amazon S3 from a stream processing application?

  • Built-in support for complex data transformations with minimal coding. (correct)

A company is ingesting data from various sources into AWS for analytics. Which of the following is a key benefit of using AWS Glue for this purpose?

  • Automating the ETL process with serverless ETL processing. (correct)

In the context of Amazon Kinesis Data Streams, what is the significance of a 'partition key'?

  • It determines which shard the data record is written to. (correct)

An organization is setting up AWS IoT Core to ingest data from thousands of devices. What is a primary feature of AWS IoT Core that helps in this process?

  • The ability to securely connect, process, and act on device data. (correct)

When scaling AWS Glue jobs vertically, which strategy aligns with this scaling approach?

  • Choose a worker type with larger CPU, memory, and disk space. (correct)

Which key characteristic of stream ingestion and processing allows multiple consumers to process records in parallel and independently?

  • Parallel consumers (correct)

A company is using AWS DataSync to transfer data from an on-premises file system to Amazon S3. Which functionality is provided by DataSync to efficiently manage the data transfer process?

  • Filtering capabilities to transfer a subset of files. (correct)

What does the term 'shard' refer to in the context of Amazon Kinesis Data Streams?

  • A uniquely identified sequence of data records in the stream. (correct)

An organization is planning to use AWS Glue to transform data in a batch processing pipeline. What benefit does the AWS Glue Data Catalog provide in this context?

  • It stores metadata about the data, making it available for ETL script generation. (correct)

Which AWS service simplifies the ingestion of data from a software-as-a-service (SaaS) application?

  • Amazon AppFlow (correct)

A company is scaling an AWS Glue job horizontally to process large, splittable datasets. Which approach reflects horizontal scaling in AWS Glue?

  • Adding more workers to the job. (correct)

What is a primary role of the AWS IoT Core rules engine?

  • Transforming and routing incoming messages to AWS services. (correct)

Which ingestion method uses traditional ETL?

  • Batch (correct)

What type of data might a retailer wish to analyze to provide a product recommendation?

  • Clickstream data (correct)

Which AWS service offers a simplified method for locating and subscribing to third-party datasets?

  • AWS Data Exchange (correct)

What is 'bookmarking' referring to when using AWS Glue?

  • Tracking previously processed data so that a job does not reprocess it on subsequent runs. (correct)

What type of AWS service simplifies the ingestion of specific data types?

  • Purpose-built services such as Amazon AppFlow, AWS DMS, and AWS DataSync (all of the above). (correct)

Other than Schema identification, what else does AWS Glue allow?

  • Data cataloging, job authoring and monitoring, serverless ETL processing, and ETL orchestration (all of the above). (correct)

Where do AWS Glue crawlers derive schemas from?

  • Data stores (correct)

Why is horizontal scaling used with AWS Glue?

  • Working with large, splittable datasets (correct)

What do data records include?

  • Sequence number, partition key, and data blob. (correct)

What helps you monitor how your stream handles the data that is being written to and read from it?

  • CloudWatch (correct)

With AWS IoT services, what can you use to communicate with IoT devices?

  • MQTT and a pub/sub model (correct)

Flashcards

What is Batch Ingestion?

Ingest and process records in batches as a dataset. Run on demand, on a schedule, or based on an event.

What is Stream Ingestion?

Ingest records continually and process sets of records as they arrive on the stream.

What are the module objectives?

To describe the key tasks that a data engineer performs when building an ingestion layer.

What do Batch jobs do?

Query the source, transform the data, and load it into the pipeline.

What does ETL mean?

Extract, transform and load

What does Batch Ingestion involve?

Writing scripts and jobs to perform the ETL or ELT process.

What is Amazon AppFlow?

A tool to ingest data from a software as a service (SaaS) application.

What is AWS DMS?

A tool to ingest data from your relational databases.

What is AWS DataSync?

A tool to ingest data from file systems.

What is AWS Data Exchange?

A tool to integrate third-party datasets into your pipeline.

What does AWS Glue simplify?

Schema identification, Data cataloging, Job authoring and monitoring, Serverless ETL processing, ETL orchestration

What is AWS Glue?

A fully managed data integration service which simplifies ETL tasks.

What do AWS Glue crawlers do?

Derive schemas from data stores and provide them to the centralized AWS Glue Data Catalog.

What does AWS Glue Studio provide?

Provides visual authoring and job management tools.

What does AWS Glue Spark runtime engine do?

Processes jobs in a serverless environment.

What should performance goals focus on?

The factors that are most important for your batch processing.

How do you scale AWS Glue jobs horizontally?

Adding more workers to your tasks.

How do you scale AWS Glue jobs vertically?

Choosing a larger type of worker in the job configuration.

What is a stream?

A buffer between the producers and the consumers of the stream.

What does KPL do?

Simplifies the work of writing producers for Kinesis Data Streams.

What are Shards?

A uniquely identified sequence of data records

What does Amazon Data Firehose do?

Delivers streaming data directly to storage, including Amazon S3 and Amazon Redshift.

What is Amazon Managed Service for Apache Flink?

Purpose built to perform real-time analytics on data as it passes through the stream.

What do scaling options on Kinesis Data Streams do?

Manage the throughput of data on the stream.

What does CloudTrail do?

Track API actions, including changes to stream configuration and new consumers

What does CloudWatch do?

Track record age, throttling, and write and read failures

What does AWS IoT Core provide?

Provides the ability to securely connect, process, and act on IoT device data.

What to use to Communicate with IoT devices?

MQTT and a pub/sub model

What does the AWS IoT Core rules engine do?

Transforms and routes incoming messages to AWS services

Study Notes

  • The module prepares you to list data engineer tasks for building an ingestion layer
  • The module prepares you to describe how AWS services support ingestion tasks
  • The module prepares you to illustrate how AWS Glue features automate batch ingestion
  • The module prepares you to describe AWS streaming services and features that simplify streaming ingestion
  • The module prepares you to identify configuration options in AWS Glue and Amazon Kinesis Data Streams to scale ingestion processing
  • The module prepares you to describe distinct characteristics of ingesting IoT data by using AWS IoT Core

Batch and Streaming Ingestion

  • Batch ingestion involves ingesting and processing a batch of records as a dataset
  • Batch ingestion can be run on demand, on a schedule, or based on an event
  • Streaming ingestion involves ingesting records continually and processing sets of records as they arrive on the stream

Data Volume and Velocity

  • Data volume and velocity are key factors in choosing an ingestion method
  • Ingestion method choice depends on the amount of data to be ingested
  • Ingestion method choice depends on the frequency with which new data must be ingested and processed
  • Batch ingestion example: Sales transaction data from retailers across the world is sent periodically to a central location
  • Data is analyzed overnight and reports are sent to branches in the morning in the batch ingestion example
  • Streaming ingestion example: Website clickstream data sends a large volume of small bits of data continuously
  • Data is analyzed immediately to provide a product recommendation in the streaming ingestion example

Key Takeaways - Batch and Streaming

  • Batch jobs query the source, transform data, and load it into the pipeline
  • Traditional ETL uses batch processing
  • With stream processing, producers put records on a stream where consumers get and process them
  • Streams are designed to handle high-velocity data and real-time processing

Tasks to Build a Batch Processing Pipeline

  • Tasks include Extract, Transform/Load, and Load/Transform
  • Extract data from sources
  • Transform/Load involves identifying the source and target schemas
  • Transform/Load involves securely transferring and storing the data
  • Load/Transform involves transforming the dataset
  • Load/Transform involves loading the dataset to durable storage
  • Workflow orchestration ties components together

Key Characteristics for Batch Processing

  • Ease of use: Make it flexible and offer low-code, no-code, and serverless options
  • Data volume and variety: Handle large volumes of data and support disparate source and target systems
  • Data volume and variety: Support different data formats seamlessly
  • Orchestration and monitoring: Support workflow creation and provide dependency management
  • Orchestration and monitoring: Support bookmarking, job failure alerts, and logging
  • Scaling and cost management: Enable automatic scaling and offer pay-as-you-go options

Key Takeaways - Batch Ingestion

  • Batch ingestion involves writing scripts and jobs to perform the ETL or ELT process
  • Workflow orchestration helps you handle interdependencies between jobs and manage failures within a set of jobs
  • Key characteristics for pipeline design include ease of use, data volume and variety, orchestration and monitoring, and scaling and cost management

Purpose-Built Ingestion Tools

  • AWS offers purpose-built tools to match data sources
  • Tools provide secure connections and data store integration
  • Tools provide automated updates and Amazon CloudWatch monitoring
  • Tools provide selection and transformation
  • SaaS apps are ingested with Amazon AppFlow
  • Relational databases are ingested with AWS DMS
  • File shares are ingested using DataSync
  • Third-party datasets are ingested with AWS Data Exchange

Amazon AppFlow

  • Ingests data from software as a service apps
  • Create a connector with filters
  • Map fields and perform transformations
  • Perform validation
  • Securely transfer to Amazon S3 or Amazon Redshift
  • Example: Ingest customer support ticket data from Zendesk

AWS DMS

  • Ingests data from relational databases
  • Connect to source data and format it for a target
  • Use source filters and table mappings
  • Perform data validation
  • Write to many AWS data stores
  • Create a continuous replication task
  • Example: Ingest line of business transactions from an Oracle database
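
A minimal boto3 sketch of creating a DMS continuous replication task, assuming the source and target endpoints and the replication instance already exist; the ARNs and the SALES schema below are hypothetical placeholders. The table mapping uses a selection rule to include every table in the schema, and the full-load-and-cdc migration type enables ongoing replication after the initial load.

```python
import json
import boto3

dms = boto3.client("dms")

# Table mapping: include every table in the hypothetical SALES schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# Hypothetical ARNs for an existing Oracle source endpoint, target endpoint,
# and replication instance.
dms.create_replication_task(
    ReplicationTaskIdentifier="sales-continuous-replication",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:target",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:instance",
    MigrationType="full-load-and-cdc",  # full load plus ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)
```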

AWS DataSync

  • Ingests data from file systems
  • Apply filters to transfer a subset of files
  • Use a variety of file systems as sources and targets, including Amazon S3 as a target
  • Securely transfer data between self-managed storage systems and AWS storage services
  • Example: Ingest on-premises genome sequencing data to Amazon S3
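
As an illustration of the filtering capability, a DataSync task between two existing locations can apply an include filter so that only a subset of files is transferred. This is a minimal boto3 sketch; the location ARNs and the /genomes/ path are hypothetical.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical ARNs for an on-premises NFS source location and an S3 destination location.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-source",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-s3-target",
    Name="genome-sequencing-transfer",
    # Include filter: transfer only files under the /genomes/ path.
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/genomes/*"}],
)

# Kick off the transfer for the task just created.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```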

AWS Data Exchange

  • Integrates third-party datasets into your pipeline
  • Find and subscribe to sources
  • Preview before subscribing
  • Copy subscribed datasets to Amazon S3
  • Receive notifications of updates
  • Example: Ingest de-identified clinical data from a third party

Key Takeaways - Purpose Built Ingestion Tools

  • Purpose-built tools should match the type of data to be ingested and simplify the tasks involved in ingestion
  • Amazon AppFlow, AWS DMS, and DataSync each simplify the ingestion of specific data types
  • AWS Data Exchange provides a simplified way to find and subscribe to third-party datasets

AWS Glue

  • AWS Glue simplifies batch ingestion tasks
  • AWS Glue provides schema identification
  • AWS Glue provides data cataloging
  • AWS Glue provides job authoring and monitoring
  • AWS Glue provides serverless ETL processing
  • AWS Glue provides ETL orchestration

Key points - AWS Glue

  • In schema identification and data cataloging, AWS Glue crawlers derive schemas from data stores
  • Metadata is stored in the centralized AWS Glue Data Catalog, where it is available for ETL script generation
  • Job authoring provides low-code job creation and management, a graphical interface, transformations, and monitoring
  • Data is processed from sources to storage by the AWS Glue Spark runtime engine
  • ETL orchestration supports complex multi-job, multi-crawler ETL processing and is trackable as one entity
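
For illustration, a crawler can be defined and started with boto3 so that schemas derived from an S3 prefix land in the Data Catalog. This is a sketch only; the bucket, database, table, and role names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical S3 path, catalog database, and IAM role.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_raw",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/sales/"}]},
)

glue.start_crawler(Name="sales-raw-crawler")

# After the crawler finishes, the derived schema is in the Data Catalog and
# can be inspected (or used by ETL jobs) through get_table.
table = glue.get_table(DatabaseName="sales_raw", Name="sales")
print(table["Table"]["StorageDescriptor"]["Columns"])
```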

Monitoring AWS Glue Jobs

  • AWS Glue jobs can be monitored using CloudTrail
  • CloudWatch provides AWS Glue job run insights

Key Takeaways - AWS Glue for Batch

  • AWS Glue is a fully managed data integration service that simplifies ETL tasks
  • AWS Glue crawlers derive schemas from data stores
  • AWS Glue Studio provides visual authoring and job management tools
  • AWS Glue Spark runtime engine processes jobs in a serverless environment
  • AWS Glue workflows provide ETL orchestration
  • CloudWatch provides integrated monitoring and logging

Horizontal Scaling

  • Increase the number of workers that are allocated to the job
  • Use case: working with large, splittable datasets
  • Example: Processing a large .csv file

Vertical Scaling

  • Choose a worker type with larger CPU, memory, and disk space
  • Use case: Working with memory-intensive applications
  • Example: Machine Learning transformations
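
A hedged sketch of how both scaling axes appear in a Glue job definition with boto3: NumberOfWorkers scales the job horizontally, while WorkerType scales it vertically. The job name, role, and script location below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical IAM role and ETL script location.
glue.create_job(
    Name="daily-sales-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://example-scripts/daily_sales_etl.py"},
    GlueVersion="4.0",
    WorkerType="G.2X",    # vertical scaling: a worker with more CPU, memory, and disk
    NumberOfWorkers=20,   # horizontal scaling: more workers for large, splittable datasets
)

glue.start_job_run(JobName="daily-sales-etl")
```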

Key Takeaways - Scaling Considerations for Batch

  • Performance goals should focus on the factors that are most important for your batch processing
  • AWS Glue jobs can be scaled horizontally by adding more workers
  • AWS Glue Jobs can be scaled vertically by choosing a larger type of worker in the job configuration
  • Large, splittable files let the AWS Glue Spark runtime engine run many jobs in parallel

Building a Real Time Stream Processing Pipeline

  • Tasks include Extract, Transform/Load, and Load/Transform
  • Extract involves putting records on stream (Producers)
  • Transform/Load involves getting records off the stream and transforming them (Consumers)
  • Load/Transform involves analyzing or storing processed data
  • Data moves through the pipeline continuously

Key Characteristics for Stream Ingestion and Processing

  • Throughput: Plan for a resilient, scalable stream that can adapt to changing velocity and volume
  • Loose coupling: Build independent ingestion, processing, and consumer components
  • Parallel consumers: Allow multiple consumers on a stream to process records in parallel and independently
  • Checkpointing and replay: Maintain record order, allow replay, and provide the ability to mark the farthest record processed so that consumers can recover after a failure

Purpose-Built Streaming Services

  • Data flows from data sources, to ingest and store, to transform
  • Data sources include web, sensors, devices, and social media
  • Services include Kinesis Data Streams, Amazon Data Firehose, and Amazon Managed Service for Apache Flink

Kinesis Data Streams

  • A shard is a uniquely identified sequence of data records
  • A data record is a unit of data stored and contains sequence number, partition key, and data blob
  • Producer applications put records in Kinesis Data Streams
  • Multiple consumers, such as Amazon Data Firehose, consumer applications running on Amazon EC2, Lambda functions, and Amazon Managed Service for Apache Flink, read from Kinesis Data Streams
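
The KPL is a Java library; as a simpler illustration of the producer side, the boto3 sketch below puts a single record on a stream, with the partition key determining which shard receives it. The stream name and record contents are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical clickstream event; the partition key determines the target shard.
event = {"session_id": "abc-123", "page": "/product/42", "action": "view"}

response = kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["session_id"],
)

# The response includes the shard and the sequence number assigned to the record.
print(response["ShardId"], response["SequenceNumber"])
```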

Amazon Data Firehose

  • Can perform no-code or low-code streaming ETL
  • Ingest from many AWS services including Kinesis Data Streams
  • Apply built-in and custom transformations
  • Deliver directly to data stores, data lakes, and analytics services
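
A minimal producer-side sketch: writing a batch of records to an existing Firehose delivery stream with boto3, which then buffers and delivers them to its configured destination such as Amazon S3. The delivery stream name and record contents are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Newline-delimited JSON records are a common convention for S3 delivery.
records = [
    {"Data": (json.dumps({"order_id": i, "amount": 19.99}) + "\n").encode("utf-8")}
    for i in range(10)
]

# Hypothetical delivery stream configured to buffer and deliver to Amazon S3.
firehose.put_record_batch(DeliveryStreamName="orders-to-s3", Records=records)
```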

Amazon Managed Service for Apache Flink

  • Can query and analyze streaming data
  • Can ingest from other services, including Kinesis Data Streams
  • Can enrich and augment data across time windows
  • Can be used to build applications in Apache Flink
  • Developers can use SQL, Java, Python, or Scala

Key Takeaways - Streaming

  • The stream is a buffer between the producers and the consumers
  • The KPL simplifies the work of writing producers for Kinesis Data Streams
  • Data is written to shards on the stream as a sequence of data records
  • Data records include a sequence number, partition key, and data blob
  • Amazon Data Firehose can deliver streaming data directly to storage, including Amazon S3 and Amazon Redshift
  • Amazon Managed Service for Apache Flink is purpose-built to perform real-time analytics as data passes through the stream

Configuring Kinesis Data Streams

  • Duration of data availability: set the retention period for stream records
  • Write capacity: choose the stream capacity mode, on-demand or provisioned
  • Read capacity: choose the consumer type, shared fan-out or enhanced fan-out
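
A sketch of those configuration options with boto3, assuming a provisioned-mode stream named clickstream: shard count at creation, retention period, and resharding later to adjust write throughput.

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned capacity mode: write throughput is governed by the shard count.
kinesis.create_stream(
    StreamName="clickstream",
    ShardCount=2,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")

# Retention period: how long records remain available on the stream (default is 24 hours).
kinesis.increase_stream_retention_period(StreamName="clickstream", RetentionPeriodHours=72)

# Scale write throughput later by resharding.
kinesis.update_shard_count(
    StreamName="clickstream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```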

Monitoring Kinesis Data Streams

  • CloudTrail can track API actions, including changes to stream configuration and new consumers
  • CloudWatch can track record age, throttling, and write and read failures
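
For example, the consumer-lag metric GetRecords.IteratorAgeMilliseconds can be pulled from CloudWatch to see how far behind consumers are; the stream name below is hypothetical.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",  # age of records being read (consumer lag)
    Dimensions=[{"Name": "StreamName", "Value": "clickstream"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```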

Key Takeaways - Scaling Considerations for Streaming

  • Kinesis Data Streams provides scaling options to manage the throughput of data on the stream
  • Scale how much data can be written to the stream, how long the data is stored on the stream, and how much throughput each consumer gets
  • CloudWatch provides metrics that help you monitor how your stream handles the data that is being written to and read from it

IoT (Internet of Things)

  • The IoT universe contains smart home devices, factories, farms, and industries
  • The IoT contains devices, interfaces, cloud services, apps, and communications
  • Devices are the hardware that manages interfaces and communications
  • Interfaces are components that connect devices to the physical world
  • Cloud services provide storage and processing
  • Apps provide an end user access point to devices and features
  • Communications describes the technology and protocols for communicating between devices, and between devices and services

AWS IoT Core

  • Provides the ability to securely connect, process, and act on IoT device data
  • Includes features to filter and transform data
  • Can route data to other AWS services, including streaming and storage services

AWS IoT Core - Rule Actions

  • Publishers send to AWS IoT Core
  • AWS IoT Core can dispatch to Amazon Data Firehose, Amazon S3, Lambda, and DynamoDB

Rules Engine

  • The rules engine transforms and routes data
  • AWS IoT Core can send data to Amazon Data Firehose and Amazon S3
  • AWS IoT Core can also send data to Amazon Managed Service for Apache Flink
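
As a sketch, an IoT rule can be created with boto3 using the rules engine's SQL syntax to filter and transform messages from an MQTT topic and route them to a Firehose delivery stream. The topic filter, delivery stream, and role names below are hypothetical.

```python
import boto3

iot = boto3.client("iot")

# Hypothetical MQTT topic filter, delivery stream, and IAM role.
iot.create_topic_rule(
    ruleName="sensor_telemetry_to_firehose",
    topicRulePayload={
        # Rules engine SQL: select and filter fields from incoming messages.
        "sql": "SELECT device_id, temperature, timestamp() AS ts "
               "FROM 'sensors/+/telemetry' WHERE temperature > 60",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "deliveryStreamName": "sensor-telemetry-to-s3",
                    "roleArn": "arn:aws:iam::123456789012:role/IotToFirehoseRole",
                    "separator": "\n",
                }
            }
        ],
    },
)
```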

Key Takeaways - IoT

  • AWS IoT services leverage MQTT and a pub/sub model to communicate with IoT devices
  • AWS IoT Core can securely connect, process, and act upon device data
  • The AWS IoT Core rules engine transforms and routes incoming messages to AWS services
