Data Ingestion: Batch and Streaming

Questions and Answers

In building an ingestion layer, what is a key task a data engineer must perform?

  • Designing the user interface for data interaction.
  • Implementing security protocols for data access.
  • Setting up the physical servers for data storage.
  • Orchestrating the data transformation and loading processes. (correct)

Traditional ETL processes align with which type of data ingestion?

  • Batch processing (correct)
  • Event-driven architecture
  • Real-time streaming
  • Micro-batching

Which AWS service is designed to simplify the ingestion of data from SaaS applications?

  • Amazon AppFlow (correct)
  • AWS Data Exchange
  • AWS DataSync
  • AWS DMS

For near real-time clickstream analysis, which ingestion method is most suitable?

Real-time or streaming ingestion

When designing a batch processing pipeline, what factor is important to consider for scaling and cost management?

Enabling automatic scaling and pay-as-you-go options.

Which of these options best describes the function of AWS Glue crawlers?

To derive schemas from data stores.

When choosing between horizontal and vertical scaling for AWS Glue workers, which scenario benefits most from vertical scaling?

Dealing with memory-intensive applications.

With stream processing, what role do producers play?

They put records onto a stream.

If you need to ingest data from relational databases into AWS, which service would be most appropriate?

AWS Database Migration Service (DMS)

Which of the following is a key characteristic of stream ingestion and processing?

It is designed for handling high-velocity data and real-time analytics.

Which AWS service is designed to ingest data from file systems?

AWS DataSync

What functionality does the AWS Data Exchange provide?

A way to find and subscribe to third-party datasets

AWS Glue simplifies batch ingestion tasks by providing serverless ETL processing. What does ETL stand for?

Extract, Transform, Load

What is the purpose of the Kinesis Producer Library (KPL)?

To simplify the process of writing producers for Kinesis Data Streams.

What is the primary function of AWS IoT Core?

To enable secure connection, processing, and acting on IoT device data.

Which of the following AWS Glue features enables visual authoring and job management?

AWS Glue Studio

What is a shard in Amazon Kinesis Data Streams?

A uniquely identified sequence of data records in a stream.

What type of model facilitates communication with IoT devices in conjunction with AWS IoT services?

MQTT and pub/sub

Which AWS service allows you to perform real-time analytics on streaming data as it passes through the stream?

Amazon Managed Service for Apache Flink

What is the purpose of using AWS Glue workflows?

To orchestrate ETL tasks.

Which of the following is a function of the AWS IoT Core rules engine?

Transforming and routing incoming messages to AWS services.

Which AWS service is best suited for ingesting de-identified clinical data from a third party?

AWS Data Exchange.

When is choosing batch ingestion a suitable processing type?

When sales transaction data from retailers across the world is sent periodically to a central location.

When is selecting streaming ingestion a suitable processing type?

When data must be analyzed immediately.

Which of the following is NOT a task that AWS Glue simplifies for batch ingestion?

Network configuration.

Which of the following is a characteristic for batch processing design choices?

Ease of use.

Which component of the AWS IoT universe connects devices to the physical world?

Interfaces

Which component of the AWS IoT universe describes protocols for communicating between devices?

Communications

Which component of the AWS IoT universe is the end-user access point to devices and features?

Apps

Which component of the AWS IoT universe describes storage and processing services?

Cloud services

Within Kinesis Data Streams, what does a 'partition key' determine?

Which shard the data record belongs to.

What does DataSync do when ingesting data?

Securely transfer data between self-managed storage systems and AWS storage services.

A data engineer is building a batch processing pipeline for a large dataset stored in Amazon S3. Given the dataset can’t be split, which AWS Glue scaling strategy is most effective for accelerating the job?

Use vertical scaling to choose a larger worker.

An organization is capturing high-velocity clickstream data from its website and needs to process and analyze this data in near real-time to provide personalized recommendations to users. The data volume fluctuates significantly throughout the day. What is the MOST suitable approach for ingesting and processing this data?

Use Amazon Kinesis Data Streams to ingest the clickstream data, then use Amazon Kinesis Data Analytics for real-time processing.

A financial services company needs to ingest real-time stock ticker data into AWS for analysis. They require a solution that can scale to handle high data volumes, ensure low latency, and integrate with various analytics services. Which AWS service is best suited to ingest and process this data?

Amazon Kinesis Data Streams.

A large-scale manufacturing company wants to collect sensor data from thousands of machines to predict maintenance needs and optimize production efficiency. They need to ingest this data into AWS, process it in real time, and store it for historical analysis. Which combination of AWS services is MOST suitable for this scenario?

AWS IoT Core, Amazon Kinesis Data Streams, Amazon S3.

A data engineer is setting up a Kinesis Data Stream and notices that consumers are experiencing throttling issues during peak periods. To address this, what scaling adjustments can be made to improve throughput?

Increase the number of shards.

Flashcards

Batch Ingestion

Ingest and process a batch of records as a dataset, which can run on demand, on a schedule, or based on an event.

Streaming Ingestion

Ingest records continually and process sets of records as they arrive on the stream.

Batch Ingestion Example

When data is sent periodically to a central location, analyzed overnight, and reports are sent in the morning.

Streaming Ingestion Example

When clickstream data from a website are analyzed immediately to provide product recommendations.

Batch job actions

Query the source, transform the data, and load it into the pipeline.

Batch Ingestion Implies

Writing scripts and jobs to perform ETL or ELT.

Batch processing pipeline steps

Connecting to sources, selecting data, identifying schemas, transferring securely, transforming, and loading the dataset.

Workflow orchestration

Handle interdependencies and manage failures within a set of jobs in batch ingestion.

AWS Glue

A fully managed data integration service that simplifies ETL tasks by deriving schemas from data stores, providing visual authoring and job management, and running serverless ETL processing.

Purpose-built Tool Selection

Choose purpose-built tools that match the type of data to be ingested and simplify the tasks that are involved in ingestion.

Amazon AppFlow

To ingest data from a software as a service (SaaS) application.

AWS DMS Usage

To ingest data from relational databases.

AWS DataSync Usage

To ingest data from file systems.

AWS Data Exchange

To integrate third-party datasets into your pipeline

AWS Glue Purpose

A fully managed data integration service that simplifies ETL tasks.

AWS Glue Crawlers

Derive schemas from data stores and provide them to the centralized AWS Glue Data Catalog.

AWS Glue Studio

It provides visual authoring and job management tools.

AWS Glue Spark Runtime

Processes jobs in a serverless environment.

AWS Glue Workflows

Provide ETL orchestration.

CloudWatch and Glue

Provides integrated monitoring and logging for AWS Glue.

Horizontal Scaling in AWS Glue

Increase the number of workers that are allocated to the job.

Vertical Scaling in AWS Glue

Choose a worker type with larger CPU, memory, and disk space.

Stream Throughput

A resilient, scalable stream that can adapt to changing velocity and volume.

Loose Coupling

Build independent ingestion, processing, and consumer components.

Parallel Consumers

Allow multiple consumers on a stream to process records in parallel and independently.

Checkpointing and Replay

Maintain record order and allow replay.

Kinesis Data Streams

Tool for ingesting and storing streaming data.

Amazon Data Firehose

Tool to transform and load streaming data for later analysis.

Amazon Managed Service for Apache Flink

Tool to query and analyze streaming data.

What is a Stream?

Provides stream ingestion and acts as a buffer between producers and consumers.

Record Actions

Put records on the stream (Producers).

Scaling Options

Consumer read throughput, data availability (retention), and write capacity.

CloudTrail actions

Track API actions, including changes to stream configuration and new consumers.

Records in CloudWatch

Track record age, throttling, and write and read failures.

Internet of Things (IoT)

A system of interconnected devices, interfaces, and communications.

Devices within IoT

Hardware that manages interfaces and communications.

AWS IoT Core

Provides the ability to securely connect, process, and act on IoT device data

AWS Messaging with IoT

Can use MQTT and a pub/sub model to communicate with IoT devices.

Rules Engine with AWS IoT

The AWS IoT Core rules engine transforms and routes incoming messages to AWS services.

Study Notes

  • This module goes over the primary tasks a data engineer must perform when building an ingestion layer
  • It describes how AWS services support ingestion tasks and automated batch ingestion
  • It also identifies streaming services and features that simplify streaming ingestion
  • It identifies configuration options in AWS Glue and Amazon Kinesis Data Streams
  • It details the scaling of ingestion processing, and the characteristics of ingesting IoT data when using AWS IoT Core

Batch vs Stream Ingestion

  • Batch ingestion processes a batch of records as a dataset
    • Runs on demand, on a schedule, or based on an event
  • Streaming ingestion ingests records continually and processes sets of records as they arrive
  • Data volume and velocity are primary drivers for deciding on which ingestion method to use
  • The method should fit both the amount of data ingested and the frequency of ingestion
  • Batch jobs query the source, transform the data, and load it into the pipeline
  • Traditional ETL uses batch processing
  • With stream processing, producers put records on a stream where consumers get and process them
  • Streams are designed to handle high-velocity data and real-time processing; a minimal contrast of the two modes is sketched below
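
The following minimal sketch (plain Python, no AWS calls) illustrates the contrast: the batch function receives the whole dataset at once, while the streaming function processes each record as it arrives. The record fields are hypothetical.

def process(record):
    # Placeholder transformation applied to a single record.
    return {**record, "amount_usd": record["amount"] * record["fx_rate"]}

# Batch: the dataset is available up front and processed as one unit,
# on demand, on a schedule, or in response to an event.
def run_batch(dataset):
    return [process(r) for r in dataset]

# Streaming: records arrive continually and are processed as they appear.
def run_streaming(stream):
    for record in stream:   # 'stream' can be a generator that never ends
        yield process(record)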

Batch Ingestion Processing Pipeline

  • To build a batch processing pipeline:
    • Extract: Connect to sources and select data
    • Transform/Load: Identify the source and target schemas, transfer and store data securely, and transform the dataset
    • Load/Transform: Load the dataset to durable storage
  • Key characteristics for pipeline design include ease of use, data volume and variety, orchestration and monitoring, and scaling and cost management
  • Batch ingestion involves writing scripts and jobs to perform the ETL or ELT process (a minimal script of this shape is sketched after this list)
  • Workflow orchestration helps handle interdependencies between jobs, and manage failures within a set of jobs
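
A minimal sketch of such a batch script, assuming a CSV dataset in Amazon S3 and placeholder bucket and key names (error handling and orchestration omitted):

import csv
import io

import boto3

s3 = boto3.client("s3")

# Extract: connect to the source and select the data.
obj = s3.get_object(Bucket="example-raw-bucket", Key="sales/2024-01-01.csv")
rows = list(csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8"))))

# Transform: apply a simple change to each record.
for row in rows:
    row["amount"] = f'{float(row["amount"]):.2f}'

# Load: write the transformed dataset to durable storage.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
s3.put_object(Bucket="example-curated-bucket", Key="sales/2024-01-01.csv",
              Body=out.getvalue().encode("utf-8"))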

Purpose-Built AWS tools for Batch Ingestion

  • AWS provides purpose-built tools that match the type of data to be ingested and simplify the tasks involved
  • These tools also provide: secure connections, data store integration, automated updates, Amazon CloudWatch monitoring, and selection and transformation
  • For SaaS applications, use Amazon AppFlow to ingest data
    • This involves creating a connector with filters, mapping fields, performing transformations, and validating the data
    • Finally, securely transfer data to Amazon S3 or Amazon Redshift
  • For relational databases, use AWS DMS to ingest data
    • Connect to source data and format it for a target
    • Use source filters and table mappings
    • Perform data validation and writes to many AWS data stores
    • You can also create a continuous replication task (starting an existing task is sketched after this list)
  • For file systems, use DataSync to ingest data
    • Apply filters to transfer a subset of files
    • It can use a variety of file systems as sources and targets, including Amazon S3
    • Securely transfer data between self-managed storage systems and AWS storage services
  • For third-party datasets in your pipeline, use AWS Data Exchange
    • Find and subscribe to sources and preview before subscribing
    • Copy subscribed datasets to Amazon S3
    • Receive notifications of updates
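
As an example of the DMS item above, the sketch below starts an existing replication task from code; the task (with its source endpoint, target endpoint, and table mappings) is assumed to already exist, and the ARN is a placeholder.

import boto3

dms = boto3.client("dms")

response = dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK",
    # 'start-replication' performs the initial load; ongoing replication (CDC)
    # follows if the task was created for continuous replication.
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])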

AWS Glue for Batch Ingestion

  • AWS Glue simplifies batch ingestion tasks.
  • AWS Glue’s functions encompass schema identification, data cataloging, job authoring and monitoring, serverless ETL processing, and ETL orchestration
  • AWS Glue is a fully managed data integration service that simplifies ETL tasks.
  • AWS Glue crawlers derive schemas from data stores and provide them to the centralized AWS Glue Data Catalog (see the sketch after this list).
  • AWS Glue Studio provides visual authoring and job management tools.
  • The AWS Glue Spark runtime engine processes jobs in a serverless environment.
  • AWS Glue workflows provide ETL orchestration.
  • CloudWatch provides integrated monitoring and logging for AWS Glue, including job run insights.
  • Based on Apache Spark, it is fully managed, serverless, and optimized for fast queries across large datasets
  • AWS Glue workflows also support complex, multi-job, multi-crawler ETL processing that is trackable as one entity and runs on a schedule or on demand
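
A minimal sketch of the crawler-to-catalog flow, using placeholder crawler, database, and table names:

import boto3

glue = boto3.client("glue")

# Crawl the source data store; the crawler writes or updates table
# definitions in the AWS Glue Data Catalog.
glue.start_crawler(Name="example-sales-crawler")

# Once the crawl completes, read the derived schema back from the catalog.
table = glue.get_table(DatabaseName="example_db", Name="sales")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])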

Scaling AWS Glue

  • Performance goals should focus on the factors that are most important for your batch processing
  • Scale AWS Glue jobs horizontally, by adding more workers
  • Scale AWS Glue jobs vertically, by choosing a larger type of worker in the job configuration
  • Large, splittable files let the AWS Glue Spark runtime engine process data in parallel with less overhead than processing many smaller files
  • When using AWS Glue workers, you can increase the number of workers allocated to the job for horizontal scaling
  • You can also choose a worker type with larger CPU, memory, and disk space for vertical scaling; both options are shown in the sketch below
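
The sketch below shows both levers on a single job run; the job name is a placeholder, and the worker type and count should be tuned to the workload.

import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="example-batch-etl",
    WorkerType="G.2X",     # vertical scaling: a larger worker (more CPU, memory, disk)
    NumberOfWorkers=20,    # horizontal scaling: more workers in parallel
)
print(run["JobRunId"])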

Stream Processing

  • Tasks to build a real-time stream processing pipeline include:
    • Extract: Put records on the stream (Producers) and provide secure, durable storage
    • Transform/Load: Get records off the stream (Consumers) and Transform records (Consumers)
    • Load/Transform: Analyze or store processed data
  • Data moves through the pipeline continuously
  • Key characteristics for stream ingestion and processing:
    • Throughput: Plan for a resilient, scalable stream that can adapt to changing velocity and volume
    • Loose coupling: Build independent ingestion, processing, and consumer components
    • Parallel consumers: Allow multiple consumers on a stream to process records in parallel and independently
    • Checkpointing and replay: Maintain record order and allow replay, and support the ability to mark the farthest record processed on failure (a consumer loop that checkpoints in this way is sketched below)
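
The sketch below shows one way a consumer might checkpoint, assuming a placeholder stream and shard; in practice the checkpoint would be stored durably (for example in a database) rather than in a local variable, and libraries such as the Kinesis Client Library handle this for you.

import time

import boto3

kinesis = boto3.client("kinesis")
checkpoint = None  # sequence number of the last record successfully processed

while True:
    if checkpoint is None:
        iterator = kinesis.get_shard_iterator(
            StreamName="example-clickstream",
            ShardId="shardId-000000000000",
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
    else:
        # Resume just after the checkpoint, replaying nothing already handled.
        iterator = kinesis.get_shard_iterator(
            StreamName="example-clickstream",
            ShardId="shardId-000000000000",
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=checkpoint,
        )["ShardIterator"]

    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])                  # stand-in for real processing
        checkpoint = record["SequenceNumber"]  # mark the farthest record processed
    time.sleep(1)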

Stream Processing Services from AWS

  • Kinesis Data Streams - ingests the data stream
  • Amazon Data Firehose - transforms and loads the data for future analyses
  • Amazon S3 - stores the transformed data
  • Amazon Managed Service for Apache Flink - processes and analyzes data in real time
  • Amazon OpenSearch Service - can also process and analyze data in real time, depending on how it's configured

Kinesis Data Streams

  • The stream is a buffer between the producers and the consumers of the stream
  • The Kinesis Producer Library (KPL) simplifies the work of writing producers for Kinesis Data Streams
  • Data is written to shards on the stream as a sequence of data records
  • Data records include a sequence number, partition key, and data blob
  • A shard is a uniquely identified sequence of data records
  • A partition key determines which shard to use, as shown in the producer sketch below
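
A minimal producer sketch using the plain API (the KPL is a higher-level alternative); the stream name and payload are placeholders.

import json

import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps({"user_id": "u-123", "page": "/checkout"}).encode("utf-8"),
    PartitionKey="u-123",  # records with the same key hash to the same shard
)
# The response shows which shard the partition key mapped to.
print(response["ShardId"], response["SequenceNumber"])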

Amazon Data Firehose

  • You can perform no-code or low-code stream ETL
  • Ingest data from many AWS services, including Kinesis Data Streams
  • Apply built-in and custom transformations
  • Deliver data directly to data stores, data lakes, and analytics services; a batched write is sketched below
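
A minimal sketch of writing directly to a delivery stream; the stream name and records are placeholders (Firehose can also read from a Kinesis data stream with no producer code at all).

import json

import boto3

firehose = boto3.client("firehose")

events = [{"sensor": "press-01", "temp_c": 71.2},
          {"sensor": "press-02", "temp_c": 69.8}]

firehose.put_record_batch(
    DeliveryStreamName="example-to-s3",
    Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events],
)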

AWS IoT

  • By using AWS IoT services, you can use MQTT and a pub/sub model to communicate with IoT devices
  • You can use AWS IoT Core to securely connect, process, and act upon device data
  • The AWS IoT Core rules engine transforms and routes incoming messages to AWS services
  • A common pattern uses the rules engine to route device messages from AWS IoT Core to destinations such as Amazon Data Firehose and Amazon S3; publishing a test message to an MQTT topic is sketched below
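
A minimal sketch of publishing a test message through the AWS IoT Core data plane; the topic and payload are placeholders, and real devices would normally use an IoT device SDK over MQTT rather than boto3.

import json

import boto3

iot_data = boto3.client("iot-data")

iot_data.publish(
    topic="factory/line1/telemetry",  # rules and subscribers match on this topic
    qos=1,
    payload=json.dumps({"machine_id": "m-42", "vibration": 0.07}),
)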

Scaling Kinesis Data Streams

  • Scaling options help manage the throughput of data on the stream
  • The amount of data that can be written, the length of time that data is stored on the stream, and the throughput each consumer gets can all be scaled (resharding and retention changes are sketched below)
  • CloudWatch provides metrics to help monitor what the stream handles
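
A minimal sketch of two scaling adjustments; the stream name and values are placeholders.

import boto3

kinesis = boto3.client("kinesis")

# More shards means more write capacity and more total read throughput.
kinesis.update_shard_count(
    StreamName="example-clickstream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

# Retention can be raised so consumers have longer to catch up or replay.
kinesis.increase_stream_retention_period(
    StreamName="example-clickstream",
    RetentionPeriodHours=72,
)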
