38 Questions
Match the following Delta Lake operations with their descriptions:
- UPDATE = Update rows that match a predicate condition
- DELETE = Delete rows that match a predicate condition
- RESTORE = Rollback a table to an earlier version
- INSERT = Insert values directly into table
Match the following Delta Lake operations with their SQL syntax:
- UPDATE = UPDATE tableName SET event = 'click' WHERE event = 'clk'
- DELETE = DELETE FROM tableName WHERE date < '2017-01-01'
- RESTORE = RESTORE tableName VERSION AS OF 0
- INSERT = INSERT INTO TABLE tableName VALUES (...)
Match the following Delta Lake utility methods with their descriptions:
- DESCRIBE DETAIL = View table details
- DESCRIBE FORMATTED = View table details in a formatted way
- RESTORE = Rollback a table to an earlier version
- CREATE = Create and use a managed database
Match the following Delta Lake operations with their purposes:
- TIME TRAVEL = Rollback a table to an earlier version
- UPDATE = Modify existing rows in a table
- DELETE = Remove rows from a table
- MERGE = Merge data from multiple sources
Match the following Delta Lake concepts with their descriptions:
- Managed database = Saved in the Hive metastore
- ACID transactions = Brings transactions to Apache Spark and big data workloads
- Delta Lake = Open source storage layer
- Databricks = Platform for running Delta Lake
Match the following Delta Lake operations with their examples:
- INSERT = INSERT INTO TABLE tableName VALUES (...)
- UPDATE = UPDATE tableName SET event = 'click' WHERE event = 'clk'
- DELETE = DELETE FROM tableName WHERE date < '2017-01-01'
- RESTORE = RESTORE tableName VERSION AS OF 0
Match the following Delta Lake concepts with their descriptions:
- DDL = Data Definition Language for defining database schema
- DML = Data Manipulation Language for modifying data
- TIME TRAVEL = Rollback a table to an earlier version
- UTILITY METHODS = Methods for viewing table details
Match the following Delta Lake operations with their descriptions:
- CREATE DATABASE dbName; = Create a new database
- DROP DATABASE IF EXISTS dbName; = Delete a database
- USE dbName = Specify the default database
- CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format
Match the following Delta Lake operations with their descriptions:
- MERGE INTO target = Upsert data into a Delta Lake table
- INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
- VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
- ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
Match the following Delta Lake operations with their descriptions:
- SELECT * FROM delta.`path/to/delta_table` = Query a Delta Lake table by path
- df = spark.read.table('tableName') = Read a Delta Lake table into a DataFrame
- SELECT * FROM [dbName.] tableName = Query a Delta Lake table by name
- spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python
Match the following Delta Lake operations with their descriptions:
- CLONE sourceName [VERSION AS OF 0] = Deep clone a Delta Lake table
- CREATE TABLE [dbName.] targetName = Create a new Delta Lake table
- MERGE INTO target USING updates = Merge data into a Delta Lake table
- VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
Match the following Delta Lake operations with their descriptions:
- ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
- MERGE INTO target USING updates = Upsert data into a Delta Lake table
- CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format
- SELECT * FROM delta.`path/to/delta_table` = Query a Delta Lake table by path
Match the following Delta Lake operations with their descriptions:
- MERGE INTO logs USING newDedupedLogs = Insert data with deduplication
- INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
- VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
- ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
Match the following Delta Lake operations with their descriptions:
- CREATE TABLE [dbName.] targetName = Create a new Delta Lake table
- SELECT * FROM [dbName.] tableName = Query a Delta Lake table by name
- MERGE INTO target USING updates = Upsert data into a Delta Lake table
- spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python
Match the following Delta Lake operations with their descriptions:
- CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format
- VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
- ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
- MERGE INTO target USING updates = Upsert data into a Delta Lake table
Match the following Delta Lake operations with their descriptions:
- INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
- MERGE INTO logs USING newDedupedLogs = Insert data with deduplication
- VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
- CREATE TABLE [dbName.] targetName = Create a new Delta Lake table
Match the following Delta Lake operations with their descriptions:
- df = spark.read.table('tableName') = Read a Delta Lake table into a DataFrame
- spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python
- ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
- CLONE sourceName [VERSION AS OF 0] = Deep clone a Delta Lake table
Match the following SQL syntax with their purposes:
- SELECT * FROM tableName VERSION AS OF 12 = Retrieve a specific version of a table
- SELECT colA, colB FROM tableName WHERE colNameA > 0 = Filter data based on a condition
- SELECT * FROM tableName VERSION AS OF 12 EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11 = Find changes between two versions of a table
- df1 = spark.read.format("delta").load(pathToTable) = Load a Delta table into a DataFrame
Match the following Delta Lake commands with their purposes:
- CREATE TABLE ... USING DELTA = Create a new Delta Lake table
- ALTER TABLE ... ADD CONSTRAINT = Add a constraint to a table
- OPTIMIZE tableName = Compact data files
- DESCRIBE HISTORY tableName = View transaction log
Match the following Delta Lake features with their descriptions:
- Auto-optimize tables = Automatically optimize write performance
- Cache frequently queried data = Store frequently accessed data in memory
- Delta Log = Record of all changes made to a table
- Time Travel = Query historical versions of a table
Match the following Delta Lake commands with their effects:
- COPY INTO ... FROM ... = Copy new data into a Delta Lake table
- ALTER TABLE ... SET TBLPROPERTIES = Set table properties
- SELECT * FROM tableName VERSION AS OF 0 = Query the initial version of a table
- CREATE TABLE ... PARTITIONED BY = Create a table with partitioned columns
Match the following Delta Lake constraints with their purposes:
- NOT NULL = Ensure a column is not null
- CHECK (date > "1900-01-01") = Check if a date is within a certain range
- ZORDER BY (colNameA, colNameB) = Optimize data files for query performance
- PARTITIONED BY (time, date) = Partition a table by time and date columns
Match the following Delta Lake commands with their effects:
- CACHE SELECT * FROM tableName = Cache frequently queried data
- ALTER TABLE ... DROP CONSTRAINT = Remove a constraint from a table
- SHOW TBLPROPERTIES tableName = View table properties
- CREATE TABLE ... AS SELECT * FROM tableName = Create a new table with data from another table
Match the following Delta Lake features with their benefits:
- Time Travel = Enable querying historical versions of a table
- Delta Log = Provide a transaction log for auditing
- Auto-optimize tables = Improve write performance
- Cache frequently queried data = Improve query performance
Match the following Delta Lake commands with their purposes:
- ALTER TABLE ... CHANGE COLUMN ... SET NOT NULL = Add a not null constraint to a column
- CREATE TABLE ... (id INT, ...) = Define a table schema with explicit columns
- SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18" = Query a specific version of a table
- OPTIMIZE tableName ZORDER BY (colNameA, colNameB) = Optimize data files for query performance
Match the following Delta Lake features with their descriptions:
- Delta Lake = A data storage system that supports ACID transactions
- Delta Log = A record of all changes made to a table
- Time Travel = A feature that allows querying historical versions of a table
- Auto-optimize tables = A feature that automatically optimizes write performance
Match the following Delta Lake commands with their effects:
- CREATE TABLE ... LIKE ... = Create a new table with the same schema as another table
- ALTER TABLE ... RENAME TO ... = Rename a table
- COPY INTO ... FROM ... FILEFORMAT = DELTA = Copy data into a Delta Lake table
- DESCRIBE HISTORY tableName = View the transaction log for a table
Match the following Delta Lake features with their benefits:
- ACID transactions = Ensure data consistency and integrity
- Delta Cache = Improve query performance by caching frequently accessed data
- Auto-optimize tables = Improve write performance by optimizing data files
- Time Travel = Enable querying historical versions of a table for auditing and debugging
Match the following Delta Lake operations with their purposes:
- Read data from pandas DataFrame = Load data into a Delta Lake table
- Write DataFrame in Delta Lake format = Save data from a DataFrame into a Delta Lake table
- Delete rows that match a predicate condition = Remove specific data from a Delta Lake table
- Run Spark SQL queries in Python = Execute SQL queries on a Delta Lake table
Match the following Delta Lake operations with their SQL syntax:
- Upsert (update + insert) using MERGE = WHEN MATCHED THEN UPDATE... WHEN NOT MATCHED THEN INSERT...
- Delete rows that match a predicate condition = DELETE FROM table_name WHERE condition
- Update rows that match a predicate condition = UPDATE table_name SET column = value WHERE condition
- Insert data into a Delta Lake table = INSERT INTO table_name VALUES (value1, value2, ...)
Match the following Delta Lake utility methods with their descriptions:
- Vacuum old files = Remove old versions of a Delta Lake table
- Clone a Delta Lake table = Create a copy of a Delta Lake table
- Get DataFrame representation of a Delta Lake table = Retrieve a DataFrame from a Delta Lake table
- Restore a table to a previous version = Revert a Delta Lake table to a previous state
Match the following Delta Lake operations with their requirements:
- Read data using Apache Spark = Spark session
- Save DataFrame in Delta Lake format = Delta Lake format
- Run Spark SQL queries in Python = Spark SQL
- Streaming writes (Delta table as a sink) = Delta Lake table
Match the following Delta Lake concepts with their descriptions:
- ACID transactions = Atomicity, Consistency, Isolation, Durability
- Delta Lake table = A table stored in Delta Lake format
- Merge = Upsert operation in Delta Lake
- Optimize = Compacting small files into larger ones
Match the following Delta Lake operations with their examples:
- Read data from pandas DataFrame = df = spark.createDataFrame(pdf)
- Write DataFrame in Delta Lake format = df.write.format("delta").mode("append")
- Delete rows that match a predicate condition = deltaTable.delete("date < '2017-01-01'")
- Run Spark SQL queries in Python = spark.sql("SELECT * FROM tableName")
Match the following Delta Lake concepts with their purposes:
- Versioning = Tracking changes to a Delta Lake table
- Time travel = Querying a Delta Lake table at a specific point in time
- Checkpoints = Saving the state of a streaming query
- Merge = Upserting data into a Delta Lake table
Match the following Delta Lake operations with their SQL syntax:
- Insert data into a Delta Lake table = INSERT INTO table_name VALUES (value1, value2, ...)
- Update rows that match a predicate condition = UPDATE table_name SET column = value WHERE condition
- Delete rows that match a predicate condition = DELETE FROM table_name WHERE condition
- Merge data into a Delta Lake table = MERGE INTO table_name USING source_table ON condition
Match the following Delta Lake utility methods with their purposes:
- Vacuum old files = Removing old versions of a Delta Lake table
- Clone a Delta Lake table = Creating a copy of a Delta Lake table
- Get DataFrame representation of a Delta Lake table = Retrieving a DataFrame from a Delta Lake table
- Restore a table to a previous version = Reverting a Delta Lake table to a previous state
Match the following Delta Lake operations with their examples:
- Streaming reads (Delta table as streaming source) = df = spark.readStream.format("delta").load("/path/to/delta_table")
- Streaming writes (Delta table as a sink) = df.writeStream.format("delta").option("checkpointLocation", "path/to/checkpoints")
- Read data from a Delta Lake table = df = deltaTable.toDF()
- Write DataFrame in Delta Lake format = df.write.format("delta").mode("append")
Study Notes
Delta Lake DDL/DML
- Update rows that match a predicate condition using the `UPDATE` statement: `UPDATE tableName SET event = 'click' WHERE event = 'clk'`
- Delete rows that match a predicate condition using the `DELETE` statement: `DELETE FROM tableName WHERE date < '2017-01-01'`
- Insert values directly into a table using the `INSERT` statement: `INSERT INTO tableName VALUES (8003, "Kim Jones", "2020-12-18", 3.875)`
- Upsert (update + insert) using the `MERGE` statement: `MERGE INTO target USING updates ON target.Id = updates.Id WHEN NOT MATCHED THEN INSERT (date, Id, data) VALUES (date, Id, data)`
- Rollback a table to an earlier version using the `RESTORE` statement: `RESTORE tableName VERSION AS OF 0` or `RESTORE tableName TIMESTAMP AS OF "2020-12-18"`
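The `MERGE` example above only covers the not-matched branch. A fuller upsert that also updates matching rows might look like this sketch (the `loans` and `loan_updates` names are illustrative, not from the original cheat sheet):

```sql
-- Hypothetical upsert: update existing rows, insert new ones.
MERGE INTO loans AS target
USING loan_updates AS updates
ON target.Id = updates.Id
WHEN MATCHED THEN
  UPDATE SET target.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, Id, data) VALUES (updates.date, updates.Id, updates.data)
```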
Table Management
- Create a Delta Lake table using the `CREATE TABLE` statement: `CREATE TABLE tableName USING DELTA AS SELECT * FROM tableName`
- Create a Delta Lake table with a specified schema: `CREATE TABLE tableName (id INT, name STRING, date DATE, int_rate FLOAT) USING DELTA`
- Convert a Parquet table to Delta Lake format in place: `CONVERT TO DELTA tableName`
- Alter a table schema to add columns: `ALTER TABLE tableName ADD COLUMNS (col_name data_type)`
- Add a constraint to a table: `ALTER TABLE tableName ADD CONSTRAINT dateWithinRange CHECK (date > "1900-01-01")`
- Drop a constraint: `ALTER TABLE tableName DROP CONSTRAINT dateWithinRange`
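Putting the table-management statements above together, a plausible end-to-end sequence might be (the `loans` table name and `region` column are illustrative):

```sql
-- Define a table, then evolve its schema and constraints.
CREATE TABLE loans (id INT, name STRING, date DATE, int_rate FLOAT) USING DELTA;
ALTER TABLE loans ADD COLUMNS (region STRING);
ALTER TABLE loans ADD CONSTRAINT dateWithinRange CHECK (date > "1900-01-01");
ALTER TABLE loans DROP CONSTRAINT dateWithinRange;
```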
Performance Optimizations
- Compact data files with the `OPTIMIZE` statement: `OPTIMIZE tableName`
- Use `ZORDER` to co-locate related data in the same files: `OPTIMIZE tableName ZORDER BY (colNameA, colNameB)`
- Auto-optimize tables: `ALTER TABLE tableName SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)`
- Cache frequently queried data in the Delta Cache: `CACHE SELECT * FROM tableName`
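A related maintenance command that appears in the matching questions but not in these notes is `VACUUM`, which deletes data files no longer referenced by the table. A sketch, assuming the default 7-day (168-hour) retention window:

```sql
-- Preview which files would be deleted, then actually vacuum them.
VACUUM tableName RETAIN 168 HOURS DRY RUN;
VACUUM tableName RETAIN 168 HOURS;
```

Note that vacuuming files inside the retention window would break time travel to the versions that reference them, which is why the retention period is bounded below by default.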
Time Travel
- View the transaction log (aka Delta Log) using the `DESCRIBE HISTORY` statement: `DESCRIBE HISTORY tableName`
- Query historical versions of Delta Lake tables: `SELECT * FROM tableName VERSION AS OF 0` or `SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"`
- Find changes between two versions of a table: `SELECT * FROM tableName VERSION AS OF 12 EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11`
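Combining these with `RESTORE` from the DDL/DML section gives a plausible recovery workflow (version numbers are illustrative):

```sql
-- Inspect the log, compare two versions, then roll back.
DESCRIBE HISTORY tableName;
SELECT * FROM tableName VERSION AS OF 12
EXCEPT ALL
SELECT * FROM tableName VERSION AS OF 11;
RESTORE tableName VERSION AS OF 11;
```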
Delta Lake Fundamentals
- Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
- It provides a reliable data storage layer for Apache Spark and other big data engines.
Reading and Writing Data with Delta Lake
- Read data from a pandas DataFrame using `spark.createDataFrame(pdf)` and then save it in Delta Lake format.
- Read data using Apache Spark by specifying the format (e.g., parquet, csv, json) and loading from a path.
- Read data by table name using `spark.table("events")`.
- Save a DataFrame in Delta Lake format using `write.format("delta")`, specifying the mode (append, overwrite) and optional partitioning.
Delta Lake DDL/DML Operations
- Delete rows that match a predicate condition using `deltaTable.delete("date < '2017-01-01'")`.
- Update rows that match a predicate condition using `deltaTable.update(condition = "eventType = 'clk'", set = { "eventType": "'click'" })`.
- Upsert (update + insert) using MERGE, with options for `whenMatchedUpdate`, `whenMatchedUpdateAll`, `whenNotMatchedInsert`, and `whenMatchedDelete`.
- Clone a Delta Lake table using `deltaTable.clone(target="/path/to/delta_table/", isShallow=True, replace=True)`.
Utility Methods
- Run Spark SQL queries in Python using `spark.sql("SELECT * FROM tableName")`.
- Get a DataFrame representation of a Delta Lake table using `deltaTable.toDF()`.
- Vacuum old files using `deltaTable.vacuum()` or `deltaTable.vacuum(100)`.
Performance Optimizations
- Compact data files with Optimize and Z-Order using `spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]")`.
- Auto-optimize tables using the Databricks Delta Lake auto-optimize feature.
- Insert with deduplication using MERGE.
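The insert-with-deduplication pattern mentioned above (and in the matching questions) is an insert-only `MERGE`: new records are inserted only when no existing row matches. A sketch, with `logs`, `newDedupedLogs`, and `uniqueId` as illustrative names:

```sql
-- Insert only records whose uniqueId is not already present.
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED THEN
  INSERT *
```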