Pipelines with Databricks Delta Live Tables

Summary

This document provides an overview of Databricks Delta Live Tables (DLT), a framework for building data pipelines in a lakehouse environment. It covers the three layers of the medallion architecture (bronze, silver, gold), the two types of DLT tables (live and streaming live), and functionality such as declarative quality control, easy scaling, and visualization tools. The focus is on making ETL processes easier and more efficient.

Full Transcript


Pipelines with Databricks Delta Live Tables
#deltalivestables #databricks #medallionarchitecture

Medallion Architecture

Defines three layers for the pipeline's data:
- bronze: raw ingestion and history
- silver: filtered, cleaned, augmented
- gold: business-level aggregates

Formally, there are two types of DLT tables:
- live tables: materialized views that return the current results of their defining query with each refresh
- streaming live tables: designed for incremental, near-real-time data processing

Delta Live Tables (DLT)

DLT focuses on making ETL easy on Delta Lake. It lets you:
- operate with agility: declarative tools to build batch and streaming data pipelines
- trust the data: DLT has built-in declarative quality control; you declare quality expectations and the actions to take
- scale with reliability: easily scale infrastructure alongside the data

Live tables are materialized views for the lakehouse. They are defined by a SQL query and are created and kept up to date by a pipeline. Live tables provide tools to:
- manage dependencies
- control quality
- automate operations
- simplify collaboration
- save costs
- reduce latency

Live tables are created with the syntax:

SQL
CREATE LIVE TABLE table_name
AS SELECT col_name FROM another_table;

Streaming Live Tables

Based on Spark Structured Streaming, streaming live tables are stateful in the sense that they ensure exactly-once processing of input rows, and inputs are only read once. They compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage). They also let you reduce cost and latency by avoiding reprocessing of old data.

SQL
CREATE STREAMING LIVE TABLE table_name
AS SELECT * FROM cloud_files("path", "json"); -- using Auto Loader

Live Table pipeline

Creating a live table pipeline requires three steps:
1. create the live table(s) in a notebook
2. create the pipeline with one or more notebooks
3. run the pipeline

Development vs production mode

Fast iteration or enterprise-grade reliability:

Development mode:
- reuses long-running clusters for fast iteration
- no retries on errors, enabling faster debugging

Production mode:
- cuts cost by turning off clusters as soon as they are done (within 5 minutes)
- escalating retries, including cluster restarts, ensure reliability in the face of transient issues

Dependent tables: declaring dependencies

Using the LIVE virtual schema:
- dependencies owned by other products are simply read from the catalog or Spark data sources as normal
- LIVE dependencies from the same pipeline are read from the LIVE schema
- DLT detects LIVE dependencies and executes all operations in the correct order
- DLT handles parallelism and captures the lineage of the data

SQL
CREATE LIVE TABLE table_name
AS SELECT * FROM LIVE.live_table_name; -- refers to a table in the LIVE virtual schema

Note that this allows you to reference tables created in other notebooks.

Data quality with Expectations

Ensure correctness with expectations. Expectations are tests that ensure data quality in production. Examples:

SQL
CONSTRAINT valid_timestamp EXPECT (timestamp_col > 'timestamp value') ON VIOLATION DROP ROW

%python
@dlt.expect_or_drop("valid_timestamp", col("timestamp_col") > 'timestamp value')

Expectations are true-or-false expressions used to validate each row during processing.
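To tie the pieces above together, here is a minimal Python sketch of a DLT notebook that combines a streaming live table ingested with Auto Loader, a LIVE dependency, and a declarative expectation. The path, table names, and column names (orders_raw, orders_clean, order_id, etc.) are hypothetical placeholders, not from the original notes; `spark` is the session provided implicitly in Databricks notebooks.

%python
import dlt

# Bronze: streaming live table, incrementally ingested with Auto Loader.
# The path, format, and table/column names are hypothetical placeholders.
@dlt.table(name="orders_raw", comment="Raw orders ingested from cloud storage")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/orders/")  # hypothetical path
    )

# Silver: live table (materialized view) that reads the bronze table through
# the LIVE virtual schema; DLT detects the dependency and orders the updates.
@dlt.table(name="orders_clean", comment="Cleaned orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows that fail
def orders_clean():
    return dlt.read("orders_raw").select("order_id", "customer_id", "amount")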
DLT offers flexible policies on how to handle records that violate expectations (a Python sketch of these three policies appears at the end of this section):
- track the number of bad records (default behavior)
- drop bad records
- abort processing on a single bad record

Pipelines UI

The pipelines UI lets you visualize all the information about a pipeline:
- visualize data flows between tables
- discover the metadata and quality of each table
- access historical updates
- control operations: switch between dev and prod modes, delete the pipeline, grant permissions to users, schedule runs, etc.
- dive deep into events

When created, a pipeline targets a specific schema. Within the pipeline this schema is referred to as the LIVE schema. This allows you to target different schemas in each environment without changing the code.

Event log

The event log automatically records all pipeline operations:
- Operational statistics: time and current status of all operations, pipeline cluster configurations, row counts
- Provenance: table schemas, definitions and declared properties, table-level lineage, query plans used to update tables
- Data quality: expectation results (pass, failure, and drop statistics), input/output rows that caused expectation failures

SQL STREAM function

The SQL STREAM() function allows streaming data from any Delta table:
- reads a stream of new records instead of a snapshot
- the streamed table must be append-only
- any append-only Delta table can be read as a stream (e.g. from the LIVE schema, from the catalog, or just from a path)

SQL
CREATE STREAMING LIVE TABLE my_stream
AS SELECT * FROM STREAM(table_name);

The streamed table may not:
- be the target of APPLY CHANGES INTO (CDC)
- define an aggregate function
- be a table on which you have executed DML to delete or update a row (GDPR requirements)

Configuration parameters

Configuration parameters allow you to modularize the code by creating variables that can be used in any notebook. These variables are a mapping of key-value pairs. Example, given the parameter "my_vars.etl_path": "s3://path/json/":

SQL
CREATE STREAMING LIVE TABLE table_name
AS SELECT * FROM cloud_files("${my_vars.etl_path}", "json");

%python
import dlt

@dlt.table
def data():
    input_path = spark.conf.get("my_vars.etl_path")
    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load(input_path)
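To make the three violation policies listed earlier concrete, here is a minimal Python sketch using the corresponding DLT expectation decorators. The table name, upstream table, and constraint expressions are hypothetical placeholders.

%python
import dlt

@dlt.table(name="events_validated")  # hypothetical table name
@dlt.expect("has_user_id", "user_id IS NOT NULL")                  # track: count violations, keep the row
@dlt.expect_or_drop("positive_amount", "amount > 0")               # drop: discard rows that fail
@dlt.expect_or_fail("known_event_type", "event_type IS NOT NULL")  # abort: fail the update on a single bad row
def events_validated():
    return dlt.read("events_raw")  # hypothetical upstream live table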
