Questions and Answers
When designing a data ingestion strategy, what factors are most influential in determining whether to use batch or stream ingestion?
- The volume of the data and the speed at which it needs to be ingested and processed. (correct)
- The compliance requirements for data governance and the size of the data engineering team.
- The cost of the chosen AWS services and the availability of pre-built connectors.
- The number of different data sources and the complexity of the required transformations.
A company is migrating from an on-premises data warehouse to AWS. They need to transfer large volumes of data from their local file system to Amazon S3 for further processing. Which AWS service is purpose-built for this task?
- AWS Data Pipeline
- AWS Storage Gateway
- AWS DataSync (correct)
- AWS Transfer Family
Which of the following is a key characteristic of stream processing that distinguishes it from batch processing?
- Processing data in large, predefined datasets.
- Analyzing data overnight and generating reports in the morning.
- Querying a source, transforming the data, and loading it into a pipeline.
- Ingesting and processing records continually as they arrive. (correct)
When using AWS Glue for batch ingestion, which feature helps in automatically discovering the schema of your data?
An organization needs to ingest customer data from a third-party marketing platform into their data lake on AWS. They want a solution that simplifies the process of finding and subscribing to the required datasets. Which AWS service should they use?
When designing a batch processing pipeline with AWS Glue, what is the primary benefit of using Glue workflows?
A data engineer is tasked with building a stream processing application that requires real-time analytics on data as it passes through the stream. Which AWS service is best suited for this purpose?
What is the role of the Kinesis Producer Library (KPL) in the context of AWS Kinesis Data Streams?
Which of the following is a key scaling consideration when using Amazon Kinesis Data Streams for stream processing?
An IoT platform collects data from numerous sensors in real-time. Which protocol is commonly used for communication with IoT devices in AWS IoT Core?
A data engineer is setting up a new Amazon Kinesis Data Stream. What does the 'retention period' determine?
A financial services company needs to ingest sales transaction data from retailers around the world. The data is sent periodically to a central location, analyzed overnight, and reports are sent to branches in the morning. Which type of data ingestion is most suitable for this use case?
What is the main purpose of 'workflow orchestration' in a batch data ingestion pipeline?
A company wants to use Amazon AppFlow to ingest data from a SaaS application. What is a key step in configuring this data ingestion?
What advantage does using Amazon Data Firehose offer over directly writing to Amazon S3 from a stream processing application?
A company is ingesting data from various sources into AWS for analytics. Which of the following is a key benefit of using AWS Glue for this purpose?
In the context of Amazon Kinesis Data Streams, what is the significance of a 'partition key'?
An organization is setting up AWS IoT Core to ingest data from thousands of devices. What is a primary feature of AWS IoT Core that helps in this process?
When scaling AWS Glue jobs vertically, which strategy aligns with this scaling approach?
Which key characteristic of stream ingestion and processing allows multiple consumers to process records in parallel and independently?
A company is using AWS DataSync to transfer data from an on-premises file system to Amazon S3. Which functionality is provided by DataSync to efficiently manage the data transfer process?
What does the term 'shard' refer to in the context of Amazon Kinesis Data Streams?
An organization is planning to use AWS Glue to transform data in a batch processing pipeline. What benefit does the AWS Glue Data Catalog provide in this context?
Which AWS service simplifies the ingestion of data from a software-as-a-service (SaaS) application?
A company is scaling an AWS Glue job horizontally to process large, splittable datasets. Which approach reflects horizontal scaling in AWS Glue?
What is a primary role of the AWS IoT Core rules engine?
Which ingestion method uses traditional ETL?
What type of data might a retailer wish to analyze to provide a product recommendation?
Which AWS service offers a simplified method for locating and subscribing to third-party datasets?
What does 'bookmarking' refer to when using AWS Glue?
What type of AWS service simplifies the ingestion of specific data types?
Other than schema identification, what else does AWS Glue allow?
Where do AWS Glue crawlers derive schemas from?
Why is horizontal scaling used with AWS Glue?
What do data records include?
What helps you monitor how your stream handles the data that is being written to and read from it?
With AWS IoT services, what can you use to communicate with IoT devices?
Flashcards
What is Batch Ingestion?
Ingest and process records in batches as a dataset. Run on demand, on a schedule, or based on an event.
What is Stream Ingestion?
Ingest records continually and process sets of records as they arrive on the stream.
What are the module objectives?
Key tasks that a data engineer performs when building an ingestion layer.
What do Batch jobs do?
Query the source, transform data, and load it into the pipeline.
What does ETL mean?
Extract, transform, and load.
What does Batch Ingestion involve?
Writing scripts and jobs to perform the ETL or ELT process.
What is Amazon AppFlow?
A service that ingests data from software as a service (SaaS) apps.
What is AWS DMS?
AWS Database Migration Service, which ingests data from relational databases and can run continuous replication tasks.
What is AWS DataSync?
A service that ingests data from file systems and securely transfers data between self-managed storage systems and AWS storage services.
What is AWS Data Exchange?
A service that integrates third-party datasets into your pipeline: find, preview, and subscribe to sources, then copy subscribed datasets to Amazon S3.
What does AWS Glue simplify?
Batch ingestion tasks: schema identification, data cataloging, job authoring and monitoring, serverless ETL processing, and ETL orchestration.
What is AWS Glue?
A fully managed data integration service that simplifies ETL tasks.
What do AWS Glue crawlers do?
Derive schemas from data stores.
What does AWS Glue Studio provide?
Visual authoring and job management tools, including low-code job creation, transformations, and monitoring.
What does AWS Glue Spark runtime engine do?
Processes jobs in a serverless environment.
What should performance goals focus on?
The factors that are most important for your batch processing.
What is the stream?
A buffer between the producers and the consumers.
How do you scale AWS Glue jobs horizontally?
Increase the number of workers that are allocated to the job.
How do you scale AWS Glue jobs vertically?
Choose a worker type with larger CPU, memory, and disk space.
What does KPL do?
The Kinesis Producer Library simplifies the work of writing producers for Kinesis Data Streams.
What are Shards?
Uniquely identified sequences of data records within a stream.
What does Amazon Data Firehose do?
Performs no-code or low-code streaming ETL and delivers streaming data directly to data stores, data lakes, and analytics services, including Amazon S3 and Amazon Redshift.
What is Amazon Managed Service for Apache Flink?
A service purpose-built to perform real-time analytics on data as it passes through the stream.
What do scaling options on Kinesis Data Streams do?
Manage the throughput of data on the stream: how much data can be written, how long it is stored, and how much throughput each consumer gets.
What does CloudTrail do?
Tracks API actions, including changes to stream configuration and new consumers.
What does CloudWatch do?
Provides metrics that help you monitor how your stream handles the data that is being written to and read from it.
What does AWS IoT Core provide?
The ability to securely connect, process, and act on IoT device data.
What do you use to communicate with IoT devices?
MQTT and a pub/sub model.
What does the AWS IoT Core rules engine do?
Transforms and routes incoming messages to AWS services.
Study Notes
Module Objectives
- This module prepares you to:
- List data engineer tasks for building an ingestion layer
- Describe how AWS services support ingestion tasks
- Illustrate how AWS Glue features automate batch ingestion
- Describe AWS streaming services and features that simplify streaming ingestion
- Identify configuration options in AWS Glue and Amazon Kinesis Data Streams to scale ingestion processing
- Describe distinct characteristics of ingesting IoT data by using AWS IoT Core
Batch and Streaming Ingestion
- Batch ingestion involves ingesting and processing a batch of records as a dataset
- Batch ingestion can be run on demand, on a schedule, or based on an event
- Streaming ingestion involves ingesting records continually and processing sets of records as they arrive on the stream
Data Volume and Velocity
- Data volume and velocity are key factors in choosing an ingestion method
- Ingestion method choice depends on the amount of data to be ingested
- Ingestion method choice depends on the frequency with which new data must be ingested and processed
- Batch ingestion example: Sales transaction data from retailers across the world is sent periodically to a central location
- Data is analyzed overnight and reports are sent to branches in the morning in the batch ingestion example
- Streaming ingestion example: Website clickstream data sends a large volume of small bits of data continuously
- Data is analyzed immediately to provide a product recommendation in the streaming ingestion example
Key Takeaways - Batch and Streaming
- Batch jobs query the source, transform data, and load it into the pipeline
- Traditional ETL uses batch processing
- With stream processing, producers put records on a stream where consumers get and process them
- Streams are designed to handle high-velocity data and real-time processing
Tasks to Build a Batch Processing Pipeline
- Tasks include Extract, Transform/Load, and Load/Transform
- Extract data from sources
- Transform/Load involves identifying the source and target schemas
- Transform/Load involves securely transferring and storing the data
- Load/Transform involves transforming the dataset
- Load/Transform involves loading the dataset to durable storage
- Workflow orchestration ties components together
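To make these stages concrete, here is a minimal AWS Glue ETL script sketch in PySpark. The database, table, field, and bucket names are hypothetical, and it assumes the source table has already been cataloged (for example, by a crawler).

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table from the Data Catalog (hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_transactions")

# Transform: rename and retype fields.
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("txn_id", "string", "transaction_id", "string"),
    ("amount", "string", "amount", "double"),
])

# Load: write the result to durable storage as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped, connection_type="s3",
    connection_options={"path": "s3://my-ingest-bucket/curated/"},
    format="parquet")

job.commit()
```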
Key Characteristics for Batch Processing
- Ease of use: Make it flexible and offer low-code, no-code, and serverless options
- Data volume and variety: Handle large volumes of data and support disparate source and target systems
- Data volume and variety: Support different data formats seamlessly
- Orchestration and monitoring: Support workflow creation and provide dependency management
- Orchestration and monitoring: Support bookmarking, job failure alerts, and logging
- Scaling and cost management: Enable automatic scaling and offer pay-as-you-go options
Key Takeaways - Batch Ingestion
- Batch ingestion involves writing scripts and jobs to perform the ETL or ELT process
- Workflow orchestration helps you handle interdependencies between jobs and manage failures within a set of jobs
- Key characteristics for pipeline design include ease of use, data volume and variety, orchestration and monitoring, and scaling and cost management
Purpose-Built Ingestion Tools
- AWS offers purpose-built tools to match data sources
- Tools provide secure connections and data store integration
- Tools provide automated updates and Amazon CloudWatch monitoring
- Tools provide selection and transformation
- SaaS apps are ingested with Amazon AppFlow
- Relational databases are ingested with AWS DMS
- File shares are ingested using DataSync
- Third-party datasets are ingested with AWS Data Exchange
Amazon AppFlow
- Ingests data from software as a service apps
- Create a connector with filters
- Map fields and perform transformations
- Perform validation
- Securely transfer to Amazon S3 or Amazon Redshift
- Example: Ingest customer support ticket data from Zendesk
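As a rough illustration of those configuration steps, a hedged boto3 sketch follows. The flow name, connector profile, bucket, and task shape are assumptions (a Zendesk connector profile is presumed to already exist in AppFlow); consult the AppFlow API reference for exact structures.

```python
import boto3

appflow = boto3.client("appflow")

# Define a flow: source (SaaS connector), destination (S3), and field mapping.
appflow.create_flow(
    flowName="zendesk-tickets-to-s3",
    triggerConfig={"triggerType": "OnDemand"},
    sourceFlowConfig={
        "connectorType": "Zendesk",
        "connectorProfileName": "my-zendesk-profile",  # assumed to exist
        "sourceConnectorProperties": {"Zendesk": {"object": "tickets"}}},
    destinationFlowConfigList=[{
        "connectorType": "S3",
        "destinationConnectorProperties": {
            "S3": {"bucketName": "my-ingest-bucket"}}}],
    tasks=[{
        "taskType": "Map_all",          # map all source fields unchanged
        "sourceFields": [],
        "connectorOperator": {"Zendesk": "NO_OP"},
        "taskProperties": {}}])

# Run the flow on demand.
appflow.start_flow(flowName="zendesk-tickets-to-s3")
```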
AWS DMS
- Ingests data from relational databases
- Connect to source data and format it for a target
- Use source filters and table mappings
- Perform data validation
- Write to many AWS data stores
- Create a continuous replication task
- Example: Ingest line of business transactions from an Oracle database
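A minimal boto3 sketch of a continuous replication (CDC) task follows; the ARNs are placeholders, the endpoints and replication instance are assumed to exist, and the table mapping selects a hypothetical SALES schema.

```python
import json

import boto3

dms = boto3.client("dms")

# Full load followed by continuous replication from the source database.
dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-lob-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INST",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-tables",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)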
AWS DataSync
- Ingests data from file systems
- Apply filters to transfer a subset of files
- Use a variety of file systems as sources and targets, including Amazon S3 as a target
- Securely transfer data between self-managed storage systems and AWS storage services
- Example: Ingest on-premises genome sequencing data to Amazon S3
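A brief boto3 sketch under assumed names: the source (on-premises file share) and destination (Amazon S3) DataSync locations are presumed to exist already, and an include filter transfers only a subset of files.

```python
import boto3

datasync = boto3.client("datasync")

# Location ARNs are placeholders created beforehand (file-share source, S3 target).
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="genome-transfer")

# Apply an include filter so only matching files are transferred.
datasync.start_task_execution(
    TaskArn=task["TaskArn"],
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/genomes/*"}])
```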
AWS Data Exchange
- Integrates third-party datasets into your pipeline
- Find and subscribe to sources
- Preview before subscribing
- Copy subscribed datasets to Amazon S3
- Receive notifications of updates
- Example: Ingest de-identified clinical data from a third party
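A hedged boto3 sketch of copying a subscribed revision to Amazon S3; the dataset ID, revision ID, and bucket are placeholders, and the dataset is assumed to come from an existing AWS Data Exchange subscription.

```python
import boto3

dx = boto3.client("dataexchange")

# Export a subscribed revision's assets to S3 (hypothetical IDs and bucket).
job = dx.create_job(
    Type="EXPORT_REVISIONS_TO_S3",
    Details={"ExportRevisionsToS3": {
        "DataSetId": "dataset-id",
        "RevisionDestinations": [{
            "RevisionId": "revision-id",
            "Bucket": "my-ingest-bucket",
            "KeyPattern": "clinical/${Asset.Name}"}]}})
dx.start_job(JobId=job["Id"])
```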
Key Takeaways - Purpose Built Ingestion Tools
- Purpose-built tools should match the type of data to be ingested and simplify the tasks involved in ingestion
- Amazon AppFlow, AWS DMS, and DataSync each simplify the ingestion of specific data types
- AWS Data Exchange provides a simplified way to find and subscribe to third-party datasets
AWS Glue
- AWS Glue simplifies batch ingestion tasks
- AWS Glue provides schema identification
- AWS Glue provides data cataloging
- AWS Glue provides job authoring and monitoring
- AWS Glue provides serverless ETL processing
- AWS Glue provides ETL orchestration
Key points - AWS Glue
- In schema identification and data cataloging, AWS Glue crawlers derive schemas from data stores
- Metadata for ETL script generation is sent to AWS Glue
- In job authoring, there is low-code job creation and management, a graphical interface, transformations, and monitoring
- Data is processed from sources to storage by the AWS Glue Spark runtime engine
- ETL orchestration supports complex multi-job, multi-crawler ETL processing and is trackable as one entity
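For example, a crawler can be created and started with a few boto3 calls; the crawler name, IAM role, database, and S3 path below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# The crawler scans the S3 path, derives the schema, and catalogs it.
glue.create_crawler(
    Name="raw-sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-ingest-bucket/raw/sales/"}]})
glue.start_crawler(Name="raw-sales-crawler")
```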
Monitoring AWS Glue Jobs
- AWS Glue jobs can be monitored using CloudTrail
- CloudWatch provides AWS Glue job run insights
Key Takeaways - AWS Glue for Batch
- AWS Glue is a fully managed data integration service that simplifies ETL tasks
- AWS Glue crawlers derive schemas from data stores
- AWS Glue Studio provides visual authoring and job management tools
- AWS Glue Spark runtime engine processes jobs in a serverless environment
- AWS Glue workflows provide ETL orchestration
- CloudWatch provides integrated monitoring and logging
Horizontal Scaling
- Increase the number of workers that are allocated to the job
- Use case: working with large, splittable datasets
- Example: Processing a large .csv file
Vertical Scaling
- Choose a worker type with larger CPU, memory, and disk space
- Use case: Working with memory-intensive applications
- Example: Machine Learning transformations
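Both scaling approaches can be expressed when starting a job run. A minimal boto3 sketch, assuming jobs named csv-batch-job and ml-transform-job already exist:

```python
import boto3

glue = boto3.client("glue")

# Horizontal scaling: more standard workers for a large, splittable .csv dataset.
glue.start_job_run(JobName="csv-batch-job",
                   WorkerType="G.1X", NumberOfWorkers=40)

# Vertical scaling: fewer but larger workers for a memory-intensive ML transform.
glue.start_job_run(JobName="ml-transform-job",
                   WorkerType="G.2X", NumberOfWorkers=10)
```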
Key Takeaways - Scaling Considerations for Batch
- Performance goals should focus on what factors are most important for your batch processing
- AWS Glue jobs can be scaled horizontally by adding more workers
- AWS Glue Jobs can be scaled vertically by choosing a larger type of worker in the job configuration
- Large, splittable files let the AWS Glue Spark runtime engine run many jobs in parallel
Building a Real Time Stream Processing Pipeline
- Tasks include Extract, Transform/Load, and Load/Transform
- Extract involves putting records on stream (Producers)
- Transform/Load involves getting records off the stream and transforming them (Consumers)
- Load/Transform involves analyzing or storing processed data
- Data moves through the pipeline continuously
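On the consumer side, a minimal sketch of a Kinesis-triggered Lambda handler: records arrive base64-encoded in the event payload, and the handler performs the Transform/Load step (the processing here is illustrative).

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data is base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Transform/Load: process, enrich, or forward the decoded record.
        print(payload)
```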
Key Characteristics for Stream Ingestion and Processing
- Throughput: Plan for a resilient, scalable stream that can adapt to changing velocity and volume
- Loose coupling: Build independent ingestion, processing, and consumer components
- Parallel consumers: Allow multiple consumers on a stream to process records in parallel and independently
- Checkpointing and replay: Maintain record order and allow replay, and the ability to mark the farthest record processed on failure
Purpose-Built Streaming Services
- Data flows from data sources, to ingest and store, to transform
- Data sources include web, sensors, devices, and social media
- Services include Kinesis Data Streams, Amazon Data Firehose, and Amazon Managed Service for Apache Flink
Kinesis Data Streams
- A shard is a uniquely identified sequence of data records
- A data record is the unit of data stored in a stream and contains a sequence number, a partition key, and a data blob
- Producer applications put records on a Kinesis data stream
- Multiple consumers, such as Amazon Data Firehose, consumer applications running on Amazon EC2, Lambda functions, and Amazon Managed Service for Apache Flink, read from Kinesis Data Streams
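A minimal producer sketch with boto3; the stream name and payload are hypothetical. The partition key determines which shard receives the record, so records sharing a key keep their order within that shard.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Put one record on the stream; same partition key -> same shard, ordered.
kinesis.put_record(
    StreamName="clickstream-stream",
    PartitionKey="user-1234",
    Data=json.dumps({"url": "/products/42", "event": "click"}).encode())
```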
Amazon Data Firehose
- Can perform no-code or low-code streaming ETL
- Ingest from many AWS services including Kinesis Data Streams
- Apply built-in and custom transformations
- Deliver directly to data stores, data lakes, and analytics services
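For comparison with writing to Amazon S3 yourself, a producer can hand records to Firehose and let the service handle buffering, batching, retries, and delivery; the delivery stream name below is hypothetical.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers and delivers these records to the configured destination.
firehose.put_record_batch(
    DeliveryStreamName="clickstream-to-s3",
    Records=[{"Data": (json.dumps({"event_id": i}) + "\n").encode()}
             for i in range(10)])
```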
Amazon Managed Service (AMS) for Apache Flink
- AMS for Apache Flink can query and analyze streaming data
- AMS for Apache Flink can ingest from other services including Kinesis Data Streams
- AMS for Apache Flink can enrich and augment data across time windows
- AMS for Apache Flink can build applications in Apache Flink
- Developers can use SQL, Java, Python, or Scala
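A minimal PyFlink sketch of windowed analytics over a Kinesis-backed table, assuming the Flink Kinesis connector is available to the runtime; the table, stream, and field names are hypothetical.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by a Kinesis data stream (hypothetical names).
t_env.execute_sql("""
    CREATE TABLE clickstream (
        user_id STRING,
        url STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'clickstream-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Count clicks per user over 1-minute tumbling windows.
result = t_env.sql_query("""
    SELECT user_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clickstream
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.execute().print()
```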
Key Takeaways - Streaming
- The stream is a buffer between the producers and the consumers
- The KPL simplifies the work of writing producers for Kinesis Data Streams
- Data is written to shards on the stream as a sequence of data records
- Data records include a sequence number, partition key, and data blob
- Amazon Data Firehose can deliver streaming data directly to storage, including Amazon S3 and Amazon Redshift
- Amazon Managed Service for Apache Flink is purpose-built to perform real-time analytics as data passes through the stream
Configuring Kinesis Data Streams
- Duration of data availability: set the retention period, which controls how long records remain available on the stream
- Write capacity: choose a stream capacity mode, either on-demand or provisioned
- Read capacity: choose a consumer type, either shared fan-out or enhanced fan-out
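A short boto3 sketch of these settings (the stream name is hypothetical): on-demand mode for write capacity, plus a longer retention period.

```python
import boto3

kinesis = boto3.client("kinesis")

# On-demand capacity mode scales write throughput automatically.
kinesis.create_stream(
    StreamName="clickstream-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"})

# Retention period: extend data availability on the stream to 72 hours.
kinesis.increase_stream_retention_period(
    StreamName="clickstream-stream", RetentionPeriodHours=72)
```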
Monitoring Kinesis Data Streams
- CloudTrail can track API actions, including changes to stream configuration and new consumers
- CloudWatch can track record age, throttling, and write and read failures
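For example, consumer lag can be read from the GetRecords.IteratorAgeMilliseconds metric, which reports how far behind the latest record a consumer is; the stream name below is hypothetical.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Maximum iterator age over the past hour, in 5-minute periods.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "clickstream-stream"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"])
print(stats["Datapoints"])
```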
Key Takeaways - Scaling Considerations for Streaming
- Kinesis Data Streams provides scaling options to manage the throughput of data on the stream
- Scale how much data can be written to the stream, how long the data is stored on the stream, and how much throughput each consumer gets
- CloudWatch provides metrics that help you monitor how your stream handles the data that is being written to and read from it
IoT (Internet of Things)
- The IoT universe contains smart home devices, factories, farms, and industries
- The IoT contains devices, interfaces, cloud services, apps, and communications
- Devices are the hardware that manages interfaces and communications
- Interfaces are components that connect devices to the physical world
- Cloud services provide storage and processing
- Apps provide an end user access point to devices and features
- Communications describes the technology and protocols for communicating between devices, and between devices and services
AWS IoT Core
- Provides the ability to securely connect, process, and act on IoT device data
- Includes features to filter and transform data
- Can route data to other AWS services, including streaming and storage services
AWS IoT Core - Rule Actions
- Publishers send to AWS IoT Core
- AWS IoT Core can dispatch to Amazon Data Firehose, Amazon S3, Lambda, and DynamoDB
Rules Engine
- The rules engine transforms and routes data
- AWS IoT Core sends to Amazon Data Firehose and Amazon S3
- IoT Core sends to Amazon Managed Service for Apache Flink
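A hedged boto3 sketch of one such rule; the rule name, topic filter, IAM role, and delivery stream are hypothetical. The rule selects hot readings from device telemetry and routes them to a Firehose delivery stream.

```python
import boto3

iot = boto3.client("iot")

# Filter and transform incoming MQTT messages, then route them to Firehose.
iot.create_topic_rule(
    ruleName="hot_sensor_readings",
    topicRulePayload={
        "sql": "SELECT temperature, deviceId FROM 'sensors/+/telemetry' "
               "WHERE temperature > 40",
        "actions": [{"firehose": {
            "roleArn": "arn:aws:iam::123456789012:role/IotToFirehoseRole",
            "deliveryStreamName": "sensor-to-s3"}}]})
```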
Key Takeaways - IoT
- AWS IoT services leverage MQTT and a pub/sub model to communicate with IoT devices
- AWS IoT Core can securely connect, process, and act upon device data
- The AWS IoT Core rules engine transforms and routes incoming messages to AWS services
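A minimal publish sketch using the AWS IoT Device SDK v2 for Python (pip install awsiotsdk); the endpoint, certificate paths, client ID, and topic are hypothetical placeholders for a registered device.

```python
from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Mutual-TLS connection to the AWS IoT Core endpoint (placeholder values).
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="abc123-ats.iot.us-east-1.amazonaws.com",
    cert_filepath="device.pem.crt",
    pri_key_filepath="private.pem.key",
    ca_filepath="AmazonRootCA1.pem",
    client_id="sensor-001")
connection.connect().result()

# Publish telemetry on an MQTT topic (pub/sub model).
connection.publish(
    topic="sensors/sensor-001/telemetry",
    payload='{"temperature": 22.5}',
    qos=mqtt.QoS.AT_LEAST_ONCE)
connection.disconnect().result()
```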