Delta Lake Streaming Fundamentals

Questions and Answers

What is the primary function of Delta Lake's transaction logs in streaming?

To enable efficient streaming by allowing Spark to read incremental data

What type of information is provided by the micro-batch execution metrics in the query progress logs?

Micro-batch ID and trigger time

What is the purpose of correlating query progress logs with Delta table history?

To understand streaming semantics with Delta Lake

What is the function of the checkpoint location in a Delta Lake stream?

To track the progress of the stream

What is the significance of the 'numFiles' property in the Delta table history?

It specifies the number of files in the Delta table

What type of metrics are provided by the streaming state metrics in the query progress logs?

State information

What is the purpose of generating query progress logs after each micro-batch?

To provide execution details about the stream

What is the primary function of the stream in the example provided?

To read from Delta table 'delta_key_val' and write to Delta table 'delta_stream'

What is the number of input rows after the second micro-batch?

5,120

What is the purpose of the commits folder in the checkpoint location?

To represent the completion of a micro-batch

What happens when new files are added to the source Delta table?

A new micro-batch is triggered

What is stored in the offset files inside the offsets folder?

All the details mentioned

What is the purpose of the metadata folder?

To store the stream id

What do query progress logs contain?

Metrics and Delta transaction logs

What is the benefit of using stream checkpoints?

Better stream fault tolerance

What is the purpose of the offsets folder in the checkpoint location?

To store the offset files for each micro-batch

Study Notes

Delta Lake Streaming Under the Hood

  • The presentation discusses Delta Lake streaming under the hood, focusing on Structured Streaming internals, Delta table properties, and common issues with mitigation strategies.

Structured Streaming Internals

  • A sample stream reads from a Delta table and writes to another Delta table with a specified checkpoint location (sketched below).
  • The stream processes data in micro-batches, tracking progress in the checkpoint location.
  • Delta Lake's transaction logs enable efficient streaming, allowing Spark to read incremental data from the source table.
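
A minimal PySpark sketch of such a stream, assuming a Spark session with Delta Lake configured; the table paths and checkpoint path here are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-stream-demo").getOrCreate()

    # Read incrementally from the source Delta table
    source_df = (spark.readStream
                 .format("delta")
                 .load("/data/source_table"))        # hypothetical path

    # Write to a destination Delta table; the checkpoint location is
    # where Spark persists the stream's progress between micro-batches
    query = (source_df.writeStream
             .format("delta")
             .option("checkpointLocation", "/data/_checkpoints/demo")  # hypothetical path
             .start("/data/dest_table"))             # hypothetical path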

Query Progress Logs

  • Query progress logs are JSON logs generated by Spark after each micro-batch, providing execution details about the stream.
  • Logs are categorized into five types (a sketch for inspecting them follows this list):
    • Micro-batch execution metrics (e.g., micro-batch ID, trigger time)
    • Source and sink metrics (e.g., start offset, end offset, number of input rows)
    • Stream performance metrics (e.g., input rows per second, processed rows per second)
    • Batch duration metrics (e.g., time spent on different activities)
    • Streaming state metrics (e.g., state information)
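
For reference, the latest of these logs can be pulled straight from the StreamingQuery handle; in PySpark, lastProgress returns the most recent micro-batch's progress log as a dict (query is assumed to be the handle returned by start() above):

    import json

    progress = query.lastProgress       # None until the first micro-batch finishes
    if progress:
        print(json.dumps(progress, indent=2))          # the full JSON log
        print(progress["batchId"])                     # micro-batch execution metrics
        print(progress["sources"][0]["numInputRows"])  # source metrics
        print(progress["durationMs"])                  # batch duration metrics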

Correlating Query Progress Logs with Delta Table History

  • Example stream: reading from Delta table "delta_key_val" and writing to Delta table "delta_stream".
  • Properties: maxFilesPerTrigger=5, trigger interval=60 seconds (configured as in the sketch after this list).
  • Delta table history: version 0 with 8 files, numFiles=8, numOutputRows=16,384.
  • Query progress logs after the first micro-batch:
    • Start offset: null
    • End offset: reservoir version 0, index 0-4
    • Number of input rows: 10,240
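
A sketch of how those two properties are typically set, reusing the table names from the example (the paths are assumptions):

    # Cap each micro-batch at 5 source files
    source_df = (spark.readStream
                 .format("delta")
                 .option("maxFilesPerTrigger", 5)
                 .load("/data/delta_key_val"))        # assumed path for the source table

    # Fire a micro-batch at most once every 60 seconds
    query = (source_df.writeStream
             .format("delta")
             .option("checkpointLocation", "/data/_checkpoints/delta_stream")
             .trigger(processingTime="60 seconds")
             .start("/data/delta_stream"))            # assumed path for the destination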

Streaming Semantics with Delta Lake

  • Correlating the query progress logs with the source and destination table histories shows how each micro-batch maps to Delta versions (see the DESCRIBE HISTORY sketch after this list).
  • After the first micro-batch:
    • Query progress logs: start offset=null, end offset=reservoir version 0, index 0-4
    • Delta table history: version 0 with 8 files, numFiles=8, numOutputRows=16,384
    • Destination table history: version 1 with 5 files, numFiles=5, numOutputRows=10,240
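
One way to pull the table-history side of this correlation is DESCRIBE HISTORY, whose operationMetrics column carries numFiles and numOutputRows per commit (the path-based table names are assumptions):

    # Delta history for the source and destination tables
    src_hist = spark.sql("DESCRIBE HISTORY delta.`/data/delta_key_val`")
    dst_hist = spark.sql("DESCRIBE HISTORY delta.`/data/delta_stream`")

    # numFiles / numOutputRows per version, as cited in the bullets above
    (dst_hist.selectExpr(
         "version",
         "operationMetrics['numFiles'] AS numFiles",
         "operationMetrics['numOutputRows'] AS numOutputRows")
     .show())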

Next Micro-Batches

  • After the second micro-batch:
    • Start offset: reservoir version 0, index 4
    • End offset: reservoir version 0, index 7
    • Number of input rows: 5,120
  • After the third micro-batch:
    • Start offset: reservoir version 0, index 7
    • End offset: reservoir version 0, index 7
    • Number of input rows: 0
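
This progression can be reproduced from the query handle: recentProgress retains the last few progress logs, and each source entry carries startOffset, endOffset, and numInputRows.

    # Print the offset progression across the retained micro-batches
    for p in query.recentProgress:
        src = p["sources"][0]
        print(p["batchId"],
              src["startOffset"],    # null/None for the very first micro-batch
              src["endOffset"],
              src["numInputRows"])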

Adding New Files to the Source Delta Table

  • Eight new files are added to the source Delta table, creating version 1.
  • Query progress logs after the addition:
    • Start offset: reservoir version 0, index 7
    • End offset: reservoir version 1, index 4
    • Number of input rows: 10,240
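
Appending data to the source table is enough to commit such a new version and wake the stream; a minimal sketch (the schema and file count are assumptions):

    # Append new rows to the source table; the commit creates version 1,
    # and the running stream picks the new files up as a fresh micro-batch
    (spark.range(0, 16384)
     .selectExpr("id AS key", "id * 2 AS val")   # assumed schema
     .repartition(8)                             # roughly eight new files
     .write.format("delta")
     .mode("append")
     .save("/data/delta_key_val"))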

Streaming Checkpoint

  • The checkpoint directory contains the persisted progress of the stream.
  • Each micro-batch's execution is divided into three steps:
    1. Construction of the micro-batch
    2. Processing the micro-batch
    3. Committing the micro-batch

Checkpoint Location

  • The checkpoint location contains three kinds of content: offsets, metadata, and commits, stored in JSON format (a listing sketch follows).
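
A quick way to confirm those folders, assuming the checkpoint lives on a plain local filesystem path:

    import os

    checkpoint = "/data/_checkpoints/delta_stream"   # hypothetical path
    for entry in sorted(os.listdir(checkpoint)):
        print(entry)   # expect offsets, metadata, commits (plus any Spark internals)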

Offsets Folder

  • One file is generated inside the offsets folder for every micro-batch, written when the batch starts.
  • Each file contains batch details, streaming state details, and source details; the focus here is on the source details.
  • Example: the offset file for micro-batch id 0 contains information that maps to the query progress log (see the parsing sketch below).
  • Spark stores the content of the end offset, not the start offset: the start offset is redundant, since it can be obtained from the previous micro-batch's offset file.
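
A sketch of reading one of these offset files; structured streaming metadata logs usually begin with a version marker line (e.g. v1) followed by JSON lines, and a Delta source offset carries fields such as reservoirVersion and index (exact field names can vary across Delta versions):

    import json

    offset_path = "/data/_checkpoints/delta_stream/offsets/0"   # micro-batch id 0
    with open(offset_path) as f:
        lines = f.read().splitlines()

    # First line is the log format version; the last JSON line is the
    # source offset, i.e. the end offset of this micro-batch
    print(lines[0])
    source_offset = json.loads(lines[-1])
    print(source_offset.get("reservoirVersion"), source_offset.get("index"))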

Metadata Folder

  • Contains the stream id, a unique id generated when a streaming query starts.

Commits Folder

  • One file per micro-batch is created under the commits folder in step 3 of the execution.
  • The file marks the completion of the micro-batch; if it is missing, Spark has to handle the previously failed batch first.
  • This is why, when configuration options are changed for a stream, they might not kick in until that pending batch completes.
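
That behavior is easy to observe: an offset file without a matching commit file marks a batch that started but never completed. A minimal check, under the same path assumption as above:

    import os

    checkpoint = "/data/_checkpoints/delta_stream"
    offsets = {f for f in os.listdir(os.path.join(checkpoint, "offsets"))
               if not f.startswith(".")}
    commits = {f for f in os.listdir(os.path.join(checkpoint, "commits"))
               if not f.startswith(".")}

    # Batch ids that wrote an offset file but never committed
    print("in-flight batches:", sorted(offsets - commits))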

Query Progress Logs

  • They contain different metrics that explain the stream execution patterns.
  • They map to the Delta transaction logs.

Stream Checkpoints

  • Offsets and commits are used to track stream progress and provide fault tolerance capabilities.

Upcoming Topics

  • Variations of Delta streams, such as Trigger.Once and maxBytesPerTrigger.
  • Delta table properties.
  • Common issues faced when running Delta streams in production, and mitigation strategies.
