Questions and Answers
What does the bronze level in Medallion Architecture represent?
- Materialized views of cleaned data
- Business-level aggregated data
- Raw ingestion and history (correct)
- Filtered, cleaned, and augmented data
Which statement correctly describes Live Tables in Delta Live Tables?
- They are materialized views updated by a pipeline. (correct)
- They do not manage data dependencies.
- They provide real-time data processing only.
- They are designed for batch-only processing.
What is a key feature of Streaming Live Tables?
- They process old data multiple times.
- They are always built without a data source.
- They do not support stateful computations.
- They guarantee exactly-once processing of input rows. (correct)
What role does Delta Live Tables play in data quality?
Which of the following best characterizes the silver level in Medallion Architecture?
What is the main purpose of Delta Live Tables?
Which benefit does Delta Live Tables offer concerning latency?
What is required to create a Live Table in Delta Live Tables?
What does the gold level in Medallion Architecture involve?
Which streaming sources are compatible with Streaming Live Tables?
What is the first step in creating a live table pipeline?
What distinguishes development mode from production mode?
How does DLT manage dependencies in a Live Table pipeline?
What is the purpose of using Expectations in data quality management?
What does the command 'Select * from cloud_files(files);' accomplish?
What is a key characteristic of production mode regarding cluster management?
How can another table created in a different notebook be referenced?
What happens when an expectation on data quality fails?
What does DLT do to ensure data lineage capturing?
What type of SQL command is 'EXPECT (timestamp_col > 'timestamp value') ON VIOLATION DROP ROW;' classified as?
What is the default behavior of DLT when handling bad records?
Which feature allows the visualization of data flows between tables in a pipeline?
Which of the following is NOT a function of the event log?
Which requirement must be met for a streaming table using the SQL stream() function?
What does the command 'Create Streaming Live table my_stream as Select * from STREAM(table_name);' do?
What is a limitation for streaming tables regarding the APPLY CHANGES INTO command?
What is one purpose of configuration parameters in DLT?
When targeting schemas in pipelines, what term is used to refer to the active schema?
Which of the following is NOT a type of record that DLT tracks?
In a streaming context, what must be true regarding the data being read?
Flashcards
Medallion Architecture
A data pipeline architecture with three levels: bronze, silver, and gold.
Bronze Level (Medallion)
Raw data ingestion and history in a data pipeline.
Silver Level (Medallion)
Filtered, cleaned, and augmented data in a data pipeline.
Gold Level (Medallion)
Business-level data in a data pipeline, including aggregates.
Delta Live Tables (DLT)
Declarative tools for building batch and streaming data pipelines on Delta Lake, with built-in data quality control.
Live Tables
Materialized views for the lakehouse, defined by SQL queries and kept up to date by a pipeline.
Streaming Live Tables (SLT)
Live tables based on Spark Structured Streaming that compute results over append-only streams with exactly-once processing.
Live Table Pipeline
The pipeline that creates and updates live tables; built from one or more notebooks and then run.
Development Mode
Pipeline mode that reuses a long-running cluster and disables retries for faster iteration and debugging.
Production Mode
Pipeline mode that shuts down clusters after completion and uses escalating retries for reliability.
Dependent Tables & LVS
Live dependencies from the same pipeline are read from the Live schema; DLT resolves execution order, parallelism, and lineage.
Data Quality Expectations
Tests that ensure data quality in production, declared as SQL constraints or Python decorators with an action on violation.
Pipelines UI
Interface for visualizing data flows, metadata, quality, update history, and pipeline controls.
Event Log
Automatically records all pipeline operations, provenance, and data quality statistics.
SQL Stream Function
stream() reads a stream of new records from an append-only Delta table instead of a snapshot.
Streaming Table Restrictions
Streaming tables cannot be the target of APPLY CHANGES INTO, cannot define aggregate functions, and cannot be tables on which DML (delete/update) has been executed.
Configuration Parameters
Key-value pairs defined on the pipeline that can be used in any notebook, in SQL or Python.
Study Notes
Medallion Architecture
- Defines three levels for data pipelines: bronze, silver, and gold.
- Bronze represents raw data ingestion and history.
- Silver involves filtering, cleaning, and augmenting data.
- Gold represents business-level data, including aggregates.
Delta Live Tables (DLT)
- Focus on simplifying ETL processes on Delta Lake.
- Provide declarative tools to build batch and streaming data pipelines.
- Offer built-in declarative quality control, allowing users to declare data quality expectations and actions to take when violations occur.
- Enable easy scaling of infrastructure alongside data.
Live Tables
- Materialized views for the lakehouse.
- Defined by SQL queries.
- Created and kept up-to-date by pipelines.
- Provide tools for managing dependencies, controlling quality, automating operations, simplifying collaboration, saving costs, and reducing latency.
Creating Live Tables
- Use SQL syntax:
CREATE LIVE TABLE table_name AS SELECT col_name FROM another_table;
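- The same table can also be declared in Python; a minimal sketch, assuming the notebook is attached to a DLT pipeline and that table_name, another_table, and col_name are placeholders from the SQL example above:
import dlt

@dlt.table(name="table_name")
def table_name():
    # spark is available in DLT notebooks; read the upstream table and keep one column
    return spark.read.table("another_table").select("col_name")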
Streaming Live Tables (SLT)
- Based on Spark Structured Streaming.
- Ensure exactly-once processing of input rows.
- Inputs are only read once.
- Compute results over append-only streams, such as Kafka, Kinesis, or Auto Loader.
- Allow for cost and latency reduction by avoiding reprocessing of old data.
- Created with SQL:
CREATE STREAMING LIVE TABLE table_name AS SELECT * FROM cloud_files(files);
(using Auto Loader)
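- A rough Python equivalent of the Auto Loader example, as a sketch; the path and format below are placeholders, and the notebook is assumed to be attached to a DLT pipeline:
import dlt

@dlt.table(name="table_name")
def table_name():
    # Incrementally ingest new files with Auto Loader (an append-only source)
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/path/to/files"))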
Live Table Pipeline
- Requires three steps:
- Create the Live Table in a notebook.
- Create the pipeline with one or more notebooks.
- Run the pipeline.
Development vs Production Mode
- Development Mode: reuses a long-running cluster for faster iteration and disables retries on errors so failures surface immediately for debugging.
- Production Mode: cuts cost by shutting down clusters shortly after the pipeline completes (within 5 minutes) and uses escalating retries, including cluster restarts, to stay reliable in the face of transient issues.
Dependent Tables and Live Virtual Schema (LVS)
- Dependencies owned by other producers are read from the catalog or Spark data sources as normal.
- Live dependencies from the same pipeline are read from the Live schema.
- DLT detects Live dependencies and executes all operations in the correct order.
- DLT handles parallelism and captures the lineage of the data.
- Live dependencies are referenced using SQL:
CREATE LIVE TABLE table_name AS SELECT * FROM Live.live_table_name;
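- The same live dependency can be expressed in Python with dlt.read(); a minimal sketch with placeholder names:
import dlt

@dlt.table(name="table_name")
def table_name():
    # dlt.read() references a table defined in the same pipeline (the Live schema),
    # so DLT knows this table must be updated after live_table_name
    return dlt.read("live_table_name")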
Data Quality with Expectations
- Expectations serve as tests that ensure data quality in production.
- Implemented using SQL constraints or Python decorators.
- Examples:
- SQL:
CONSTRAINT valid_timestamp EXPECT (timestamp_col > 'timestamp value') ON VIOLATION DROP ROW;
- Python:
@dlt.expect_or_drop("valid_timestamp", col("timestamp_col") > 'timestamp value')
- Can be configured to track bad records, drop bad records, or abort processing on a single bad record (see the Python sketch below).
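- A minimal Python sketch of the three violation policies (track, drop, abort), with illustrative table and column names:
import dlt

@dlt.table(name="clean_events")
@dlt.expect("valid_timestamp", "timestamp_col > '2020-01-01'")  # track: log violations, keep the rows
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")               # drop: discard violating rows
@dlt.expect_or_fail("non_negative", "amount >= 0")              # abort: fail the update on any violation
def clean_events():
    return dlt.read("raw_events")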
Pipelines UI
- Allows visualizing all pipeline information.
- Features include:
- Visualization of data flows between tables.
- Discovery of metadata and quality of tables.
- Access to historical updates.
- Control operations: switching between dev and prod modes, deleting the pipeline, managing user permissions, scheduling, etc.
Event Log
- Automatically records all pipeline operations.
- Records operational statistics, including time and current status of operations, pipeline cluster configurations, and row count.
- Captures provenance, including table schemas, definitions, declared properties, table-level lineage, and query plans used to update tables.
- Tracks data quality through expectations, including pass/fail statistics and input/output rows that caused expectation failures.
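- The event log is itself stored as a Delta table and can be queried; a sketch, assuming a legacy (Hive metastore) pipeline configured with an explicit storage location (the path below is hypothetical; newer workspaces also expose an event_log() table-valued function):
# Hypothetical storage location set in the pipeline settings
storage_location = "dbfs:/pipelines/my_pipeline"
# Legacy pipelines write the event log under <storage>/system/events
events = spark.read.format("delta").load(f"{storage_location}/system/events")
events.select("timestamp", "event_type", "message").show(truncate=False)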
SQL Stream Function
- The stream() function allows streaming data from any Delta table.
- Reads a stream of new records instead of a snapshot.
- Applicable only to append-only tables.
- Example:
CREATE STREAMING LIVE TABLE my_stream AS SELECT * FROM STREAM(table_name);
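- A rough Python counterpart, assuming table_name is an append-only Delta table outside the pipeline (for tables defined in the same pipeline, dlt.read_stream() can be used instead):
import dlt

@dlt.table(name="my_stream")
def my_stream():
    # Stream newly appended records from the source table rather than reading a snapshot
    return spark.readStream.table("table_name")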
Restrictions for Streaming Tables
- Cannot be the target of APPLY CHANGES INTO (CDC).
- Cannot define an aggregate function.
- Cannot be a table on which DML operations (delete/update) have been executed.
Configuration Parameters
- Enable modularization by creating variables that can be used in any notebook.
- Stored as key-value pairs.
- Example: "my_vars.etl_path": "s3://path/json/"
- Accessed in SQL:
CREATE STREAMING LIVE TABLE table_name AS SELECT * FROM cloud_files("${my_vars.etl_path}", "json")
- Accessed in Python:
input_path = spark.conf.get("my_vars.etl_path")
spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load(input_path)
Description
This quiz covers the Medallion Architecture framework, highlighting the three levels of data processing: bronze, silver, and gold. It also explores Delta Live Tables (DLT), which simplify ETL processes in Delta Lake and enable users to build resilient data pipelines using SQL. Test your knowledge on these essential concepts in data engineering!