Questions and Answers
When designing a batch processing pipeline, which characteristic primarily focuses on the ability to handle varying data formats?
- Data volume and variety (correct)
- Orchestration and monitoring
- Ease of use
- Scaling and cost management
Which AWS service is best suited for ingesting social media feeds and analyzing sentiment in real time?
- Amazon AppFlow
- AWS DMS
- AWS Data Exchange
- Amazon Kinesis Data Streams (correct)
Which AWS service is designed to simplify the ingestion of data from SaaS applications?
- AWS Data Exchange
- Amazon AppFlow (correct)
- AWS DataSync
- AWS DMS
A company needs to migrate a large number of files from an on-premises file server to Amazon S3. Which AWS service is most appropriate for this task?
When designing a stream processing application, which characteristic helps ensure minimal impact if one component fails?
Which of the following is a key feature of AWS Glue that helps in automating batch ingestion?
Which AWS service should a data engineer use to ingest data from an Oracle database into Amazon S3?
What is a primary use case for AWS Data Exchange?
In AWS Glue, what is the purpose of Crawlers?
What is the function of the Kinesis Producer Library (KPL)?
For batch ingestion, what does 'Workflow Orchestration' primarily help with?
Which AWS service allows for connecting, processing, and acting upon IoT device data?
Which AWS service is purpose-built for performing real-time analytics on streaming data?
What is a key consideration when using AWS Glue for batch processing?
When configuring Kinesis Data Streams, what does the 'retention period' define?
When scaling AWS Glue jobs, what is the effect of 'horizontal scaling'?
In the context of Kinesis Data Streams, what does a 'shard' represent?
What is the role of the 'rules engine' in AWS IoT Core?
Which of the following is a characteristic of stream processing that is NOT typically a characteristic of batch processing?
What benefit does the pay-as-you-go pricing model offer within batch processing?
Which of the following ingestion scenarios is AWS DataSync LEAST suited for?
A data engineer needs to choose the correct AWS Glue worker type. Which jobs benefit most from selecting a worker type with larger memory and disk space?
Which of the following AWS services simplifies data ingestion from multiple sources by offering schema identification, data cataloging, and ETL orchestration?
What is a key benefit of Amazon Data Firehose's no-code or low-code streaming ETL capabilities?
Which of the following represents the correct order of operations as part of the batch ingestion data flow?
Which two characteristics should data engineers consider when identifying an appropriate data ingestion method?
What is a key difference between ETL and ELT?
Which AWS service is best suited for setting up a continuous capture task to load real-time data changes from an on-premises database into Amazon RDS?
What is a component that is part of the AWS IoT Core framework?
Which two AWS services best capture metrics to monitor a Kinesis data stream: record age, throttling, and write and read failures?
Which of the following is NOT a task for building a batch processing pipeline?
A manufacturing company wants to collect sensor data from its factory equipment and analyze it in real-time to predict equipment failures. Which AWS services should they consider to ingest, process, and then analyze the real-time streaming sensor data?
What is the purpose of supporting bookmarking within batch processing?
What type of data does streaming ingestion use?
What would be the benefit of using the serverless job processing paradigm?
What type of environment is AWS Glue Spark's runtime engine?
What functionality does CloudTrail offer?
Which Amazon service would you use to find and subscribe to third-party data sets?
What are some advantages of stream ingestion?
What is the primary purpose of Amazon Managed Service for Apache Flink?
Which of the following is an example of batch ingestion?
Flashcards
Batch Ingestion
Ingest and process records as a dataset; run on demand, schedule, or event-based.
Streaming Ingestion
Ingest records continually and process sets as they arrive in the stream.
Batch Job Actions
Query the source, transform data, and load it into a pipeline.
Stream Processing Actions
Batch Ingestion Example
Streaming Ingestion Example
Batch Ingestion Implementation
Workflow Orchestration
Purpose-built Tools
Amazon AppFlow
AWS DMS Usage
AWS DataSync Usage
AWS Data Exchange
AWS Glue
AWS Glue Crawlers
AWS Glue Studio
AWS Glue Spark Runtime
AWS Glue Workflows
CloudWatch for AWS Glue
Horizontal Scaling - AWS Glue
Vertical Scaling - AWS Glue
Stream Throughput Key
Loose Coupling
What is a parallel consumer?
Checkpointing and replay ability.
Shard definition
Data Record
Amazon Data Firehose.
Managed Apache Flink
Kinesis Data Scaling
CloudTrail for Kinesis
CloudWatch for Kinesis
IoT Devices
IoT Interfaces
IoT Cloud Services
IoT Apps
AWS IoT Core Purpose
AWS IoT Rules engine
Study Notes
- The module prepares you to:
- List data engineer tasks for building an ingestion layer
- Describe how AWS services support ingestion tasks
- Illustrate how AWS Glue features automate batch ingestion
- Describe how AWS streaming services simplify streaming ingestion
- Identify configuration options in AWS Glue and Amazon Kinesis Data Streams
- Describe ingesting Internet of Things (IoT) data using AWS IoT Core
Batch vs. Streaming Ingestion
- Batch ingestion processes records as a dataset on demand, on a schedule, or on an event.
- Streaming ingestion continually ingests records and processes them as they arrive.
- Important factors when choosing an ingestion method are data volume and velocity.
- Batch ingestion example: sales data sent periodically, analyzed overnight, and reported in the morning.
- Streaming ingestion example: clickstream data processed immediately to provide product recommendations.
- Batch jobs query the source, transform the data, and load it into the pipeline.
- Traditional ETL uses batch processing.
- With stream processing, producers put records on a stream for consumers to process.
- Streams handle high-velocity data and real-time processing.
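The batch-versus-streaming contrast can be sketched in a few lines of Python; `batch_ingest` and `stream_ingest` are illustrative names, not AWS APIs:

```python
def batch_ingest(source):
    """Pull the whole dataset at once, then transform it as one unit."""
    dataset = list(source)                # query the source
    return [r.upper() for r in dataset]   # transform the full dataset

def stream_ingest(source):
    """Transform each record as it arrives, yielding results continuously."""
    for record in source:                 # records arrive one at a time
        yield record.upper()              # process immediately

clicks = ["view", "add-to-cart", "purchase"]
assert batch_ingest(clicks) == ["VIEW", "ADD-TO-CART", "PURCHASE"]
assert next(stream_ingest(iter(clicks))) == "VIEW"
```

The batch version cannot emit anything until the whole dataset is read; the streaming version produces output per record, which is why streams suit real-time use cases.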
Batch Processing Pipeline
- Tasks for building batch pipeline:
- Extract by connecting to sources and selecting data
- Transform/Load by identifying source and target schemas and securely transferring data
- Load/Transform by transforming the dataset and loading it to durable storage
- Orchestrate workflows
- Key characteristics for batch processing design choices:
- Ease of use: flexible, low-code/no-code options, serverless options
- Data volume and variety: handle large volumes, support disparate systems/formats
- Orchestration and monitoring: support workflow creation, dependency management, bookmarking, alerting, and logging
- Scaling and cost management: automatic scaling, pay-as-you-go options
- Batch ingestion involves writing scripts and jobs to perform the ETL or ELT process.
- Workflow orchestration is helpful to handle interdependencies between jobs and manage failures.
- Pipeline design should include ease of use, data volume, variety, orchestration, monitoring, scaling, and cost management.
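Bookmarking, mentioned above, can be illustrated with a minimal sketch. AWS Glue manages real job bookmarks for you; the `run_batch_job` function and `id` field here are invented for illustration:

```python
def run_batch_job(records, bookmark):
    """Process only records newer than the bookmark; return the new bookmark."""
    new = [r for r in records if r["id"] > bookmark]
    for r in new:
        pass  # transform and load would happen here
    # Advance the bookmark to the farthest record processed.
    return max((r["id"] for r in new), default=bookmark)

data = [{"id": 1}, {"id": 2}, {"id": 3}]
bm = run_batch_job(data, bookmark=0)   # first run processes all three records
assert bm == 3
bm = run_batch_job(data, bookmark=bm)  # rerun skips already-processed records
assert bm == 3
```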
AWS Purpose-Built Tools
- AWS offers purpose-built tools that match data types and simplify ingestion tasks.
- SaaS apps use Amazon AppFlow
- Creates a connector with filters
- Maps fields and performs transformations
- Performs validation
- Securely transfers to Amazon S3 or Amazon Redshift
- Suitable for use cases like ingesting customer support ticket data from Zendesk.
- Relational databases use AWS DMS
- Connects to source data
- Formats the data for a target
- Uses source filters and table mappings
- Performs data validation
- Writes to many AWS data stores
- Creates a continuous replication task
- Suitable for use cases like ingesting line of business transactions from an Oracle database
- File shares use DataSync
- Applies filters to transfer a subset of files
- Uses a variety of file systems as sources and target, including Amazon S3 as a target
- Securely transfers data between self-managed storage systems and AWS storage services
- Suitable for use cases such as ingesting on-premises genome sequencing data to Amazon S3
- Third-party datasets use AWS Data Exchange
- Finds and subscribes to sources
- Previews before subscribing
- Copies subscribed datasets to Amazon S3
- Receives notifications of updates
- Suitable for use cases such as ingesting de-identified clinical data from a third party
- Amazon AppFlow, AWS DMS, and DataSync simplify specific data type ingestion.
- AWS Data Exchange simplifies subscription to third-party datasets.
- These tools support secure connections, data store integration, automated updates, CloudWatch monitoring, selection, and transformation.
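The source-to-service pairings above can be summarized as a simple lookup; the keys are descriptive labels, not AWS identifiers:

```python
# Illustrative mapping of source type to the AWS purpose-built ingestion
# service described in the notes above (not an AWS API).
INGESTION_SERVICE = {
    "saas_app": "Amazon AppFlow",
    "relational_db": "AWS DMS",
    "file_share": "AWS DataSync",
    "third_party_dataset": "AWS Data Exchange",
}

assert INGESTION_SERVICE["relational_db"] == "AWS DMS"
```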
AWS Glue
- AWS Glue simplifies batch ingestion with schema identification, data cataloging, job authoring/monitoring, serverless ETL processing, and ETL orchestration.
- AWS Glue crawlers derive schemas from data stores for the AWS Glue Data Catalog.
- Job authoring enables low-code job creation for ETL management
- AWS Glue Studio provides visual authoring and job management tools.
- The AWS Glue Spark runtime engine processes jobs in a serverless environment.
- AWS Glue workflows provide ETL orchestration.
- CloudWatch provides integrated monitoring and logging for AWS Glue, including job run insights.
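As a sketch, a crawler definition might look like the following boto3-style parameters. The bucket, IAM role ARN, database name, and schedule are placeholders, and the actual `create_crawler` call (commented out) would need AWS credentials:

```python
# Parameters for AWS Glue's create_crawler API, with placeholder values.
crawler_params = {
    "Name": "sales-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    "DatabaseName": "sales_db",                                 # placeholder
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # nightly run, Glue cron syntax
}

# With credentials configured, the crawler would be created like this:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
```

Once run, the crawler derives the schema from the S3 path and registers a table in the Data Catalog under `sales_db`.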
Scaling Considerations for Batch Processing
- To scale AWS Glue jobs horizontally, more workers can be added
- Suitable for large, splittable datasets, such as processing a large .csv file
- To scale AWS Glue jobs vertically, choose a worker type with larger CPU, memory, and disk space in the job configuration
- Suitable for memory-intensive or disk-intensive applications, such as machine learning transformations
- Performance goals should focus on important factors for batch processing.
- Large, splittable files let the AWS Glue Spark runtime engine run many jobs in parallel with less overhead than processing many smaller files.
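The two scaling choices can be sketched as AWS Glue job parameters. `WorkerType` and `NumberOfWorkers` are real Glue job settings, but the selection logic and worker counts here are illustrative:

```python
def glue_scaling(job_profile):
    """Pick an illustrative worker configuration for a Glue job profile."""
    if job_profile == "large_splittable_csv":
        # Horizontal: more standard workers process file splits in parallel.
        return {"WorkerType": "G.1X", "NumberOfWorkers": 20}
    if job_profile == "ml_transform":
        # Vertical: a larger worker type gives each task more memory and disk.
        return {"WorkerType": "G.2X", "NumberOfWorkers": 5}
    return {"WorkerType": "G.1X", "NumberOfWorkers": 2}

assert glue_scaling("large_splittable_csv")["NumberOfWorkers"] == 20
assert glue_scaling("ml_transform")["WorkerType"] == "G.2X"
```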
Real-Time Stream Processing Pipeline
- For real-time stream processing pipeline:
- Extract puts records on the stream (producers)
- Transform/Load provides secure, durable storage and gets records off the stream (consumers)
- Load/Transform transforms records (consumers) and analyzes or stores processed data
- Data moves through the pipeline continuously
- Key characteristics of stream ingestion and processing:
- Throughput: Plan for a resilient, scalable stream that can adapt to changing velocity and volume
- Loose coupling: Build independent ingestion, processing, and consumer components
- Parallel consumers: Allow multiple consumers on a stream to process records in parallel and independently
- Checkpointing and replay: Maintain record order and allow replay; support marking the farthest record processed on failure
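Checkpointing can be sketched as follows; the `consume` function and sequence numbers are illustrative, not the Kinesis Client Library API:

```python
def consume(stream, checkpoint):
    """Process records after the checkpoint; return the new checkpoint."""
    processed = []
    for seq, record in stream:
        if seq <= checkpoint:
            continue  # already processed before the failure; skip on replay
        processed.append(record)
        checkpoint = seq  # mark the farthest record processed
    return checkpoint, processed

stream = [(1, "a"), (2, "b"), (3, "c")]
cp, out = consume(stream, checkpoint=0)
assert (cp, out) == (3, ["a", "b", "c"])
cp, out = consume(stream, checkpoint=2)  # replay after a failure at seq 2
assert out == ["c"]
```

Because the checkpoint persists across restarts, a failed consumer resumes without reprocessing or losing records.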
Services that Simplify Stream Ingestion
- Kinesis Data Streams ingest and store data from various sources
- Web, Sensors, Devices, Social Media, etc
- Amazon Data Firehose transforms and loads data for future analysis
- Amazon S3
- Amazon Managed Service for Apache Flink processes and analyzes data in real-time
- OpenSearch Service
- For Kinesis Data Streams:
- A shard uniquely identifies a sequence of data records.
- A partition key determines which shard to use.
- A data record contains a sequence number, partition key, and data blob
- Amazon Data Firehose performs no-code or low-code streaming ETL
- It can ingest from many AWS services, including Kinesis Data Streams
- Apply built-in and custom transformations
- Deliver directly to data stores, data lakes, and analytics services.
- Amazon Managed Service for Apache Flink can query and analyze streaming data
- It can ingest from other services, including Kinesis Data Streams
- Enrich and augment data across time windows
- Build Applications in Apache Flink
- Use SQL, Java, Python, or Scala.
- Monitoring a Kinesis Data Stream
- CloudTrail tracks API actions, including changes to stream configuration and new consumers
- CloudWatch tracks record age, throttling, and write and read failures
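The partition-key-to-shard routing can be sketched as follows. Kinesis hashes the partition key with MD5 to pick a shard by hash-key range; the even split of the 128-bit space here is a simplification:

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Map a partition key to a shard index via its MD5 hash."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    shard_width = 2**128 // num_shards  # each shard owns a hash-key range
    return min(h // shard_width, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which preserves per-key ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
assert 0 <= shard_for("user-7", 4) < 4
```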
Stream Scaling Considerations
- Three scaling configurations for Kinesis Data Streams
- Duration of data availability: set how long stream records are available
- Write capacity: Choose the stream capacity mode: on-demand or provisioned
- Read capacity: Choose consumer types: shared fan-out or enhanced fan-out
- Components:
- Producers
- Data stream
- Consumers
- The stream is a buffer between producers and consumers.
- KPL simplifies the work of writing Kinesis Data Streams producers.
- Data is written to shards on the stream as a sequence of data records.
- Records include a sequence number, partition key, and data blob.
- Amazon Data Firehose delivers streaming data directly to storage, including Amazon S3 and Amazon Redshift.
- Amazon Managed Service for Apache Flink performs real-time analytics on data as it passes through the stream.
- Kinesis Data Streams provides scaling options to manage throughput and storage.
- CloudWatch provides metrics to monitor data handling in the stream.
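The retention and write-capacity knobs map to real Kinesis APIs (`IncreaseStreamRetentionPeriod` and `UpdateShardCount`). A hedged sketch of the request parameters follows; the stream name is a placeholder and no call is made here:

```python
# Extend how long records stay available on the stream.
retention_params = {
    "StreamName": "example-stream",       # placeholder
    "RetentionPeriodHours": 48,           # default is 24 hours
}

# Re-shard a provisioned-mode stream to change write throughput.
shard_params = {
    "StreamName": "example-stream",       # placeholder
    "TargetShardCount": 8,
    "ScalingType": "UNIFORM_SCALING",
}

# With credentials configured, the calls would look like:
# import boto3
# kinesis = boto3.client("kinesis")
# kinesis.increase_stream_retention_period(**retention_params)
# kinesis.update_shard_count(**shard_params)
```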
Ingesting IoT Data
- AWS IoT services use MQTT and a pub/sub model for IoT device communication.
- AWS IoT Core securely connects, processes, and acts upon device data.
- The AWS IoT Core rules engine transforms and routes incoming messages to AWS services.
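A rule such as `SELECT temperature FROM 'factory/sensors' WHERE temperature > 60` filters incoming MQTT messages and routes matches to a target service. A plain-Python sketch of that behavior follows; `apply_rule` is illustrative, not an AWS API:

```python
def apply_rule(message, threshold=60):
    """Return the routed payload if the rule's WHERE clause matches, else None."""
    if message.get("temperature", 0) > threshold:
        return {"temperature": message["temperature"]}  # the SELECT clause
    return None

# Messages arriving on the topic; only the hot reading is routed onward
# (the list stands in for a target such as a Kinesis stream).
routed = []
for msg in [{"temperature": 72}, {"temperature": 40}]:
    out = apply_rule(msg)
    if out:
        routed.append(out)
assert routed == [{"temperature": 72}]
```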