Cheat Sheet: Delta Lake DDL/DML Operations
Questions and Answers

Match the following Delta Lake operations with their descriptions:

UPDATE = Update rows that match a predicate condition
DELETE = Delete rows that match a predicate condition
RESTORE = Rollback a table to an earlier version
INSERT = Insert values directly into a table

Match the following Delta Lake operations with their SQL syntax:

UPDATE = UPDATE tableName SET event = 'click' WHERE event = 'clk'
DELETE = DELETE FROM tableName WHERE date < '2017-01-01'
RESTORE = RESTORE tableName VERSION AS OF 0
INSERT = INSERT INTO TABLE tableName VALUES (...)

Match the following Delta Lake utility methods with their descriptions:

DESCRIBE DETAIL = View table details
DESCRIBE FORMATTED = View table details in a formatted way
RESTORE = Rollback a table to an earlier version
CREATE = Create and use a managed database

Match the following Delta Lake operations with their purposes:

TIME TRAVEL = Rollback a table to an earlier version
UPDATE = Modify existing rows in a table
DELETE = Remove rows from a table
MERGE = Merge data from multiple sources

Match the following Delta Lake concepts with their descriptions:

Managed database = Saved in the Hive metastore
ACID transactions = Brings transactions to Apache Spark and big data workloads
Delta Lake = Open source storage layer
Databricks = Platform for running Delta Lake

Match the following Delta Lake operations with their examples:

INSERT = INSERT INTO TABLE tableName VALUES (...)
UPDATE = UPDATE tableName SET event = 'click' WHERE event = 'clk'
DELETE = DELETE FROM tableName WHERE date < '2017-01-01'
RESTORE = RESTORE tableName VERSION AS OF 0

Match the following Delta Lake concepts with their descriptions:

DDL = Data Definition Language for defining database schema
DML = Data Manipulation Language for modifying data
TIME TRAVEL = Rollback a table to an earlier version
UTILITY METHODS = Methods for viewing table details

Match the following Delta Lake operations with their descriptions:

CREATE DATABASE dbName; = Create a new database
DROP DATABASE IF EXISTS dbName; = Delete a database
USE dbName = Specify the default database
CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format

Match the following Delta Lake operations with their descriptions:

MERGE INTO target = Upsert data into a Delta Lake table
INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table

Match the following Delta Lake operations with their descriptions:

SELECT * FROM delta.`path/to/delta_table` = Query a Delta Lake table by path
df = spark.read.table('tableName') = Read a Delta Lake table into a DataFrame
SELECT * FROM [dbName.] tableName = Query a Delta Lake table by name
spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python

Match the following Delta Lake operations with their descriptions:

CLONE sourceName [VERSION AS OF 0] = Deep clone a Delta Lake table
CREATE TABLE [dbName.] targetName = Create a new Delta Lake table
MERGE INTO target USING updates = Merge data into a Delta Lake table
VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table

Match the following Delta Lake operations with their descriptions:

ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
MERGE INTO target USING updates = Upsert data into a Delta Lake table
CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format
SELECT * FROM delta.`path/to/delta_table` = Query a Delta Lake table by path

Match the following Delta Lake operations with their descriptions:

MERGE INTO logs USING newDedupedLogs = Insert data with deduplication
INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table

Match the following Delta Lake operations with their descriptions:

CREATE TABLE [dbName.] targetName = Create a new Delta Lake table
SELECT * FROM [dbName.] tableName = Query a Delta Lake table by name
MERGE INTO target USING updates = Upsert data into a Delta Lake table
spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python

Match the following Delta Lake operations with their descriptions:

CONVERT TO DELTA [dbName.]tableName = Convert a Parquet table to Delta Lake format
VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
MERGE INTO target USING updates = Upsert data into a Delta Lake table

Match the following Delta Lake operations with their descriptions:

INSERT OVERWRITE loan_by_state_delta VALUES (...) = Replace data in a Delta Lake table
MERGE INTO logs USING newDedupedLogs = Insert data with deduplication
VACUUM tableName [RETAIN num HOURS] [DRY RUN] = Delete old files from a Delta Lake table
CREATE TABLE [dbName.] targetName = Create a new Delta Lake table

Match the following Delta Lake operations with their descriptions:

df = spark.read.table('tableName') = Read a Delta Lake table into a DataFrame
spark.sql('SELECT * FROM tableName') = Run a SQL query on a Delta Lake table from Python
ALTER TABLE tableName ADD COLUMNS (...) = Add columns to a Delta Lake table
CLONE sourceName [VERSION AS OF 0] = Deep clone a Delta Lake table

Match the following SQL syntax with their purposes:

SELECT * FROM tableName VERSION AS OF 12 = Retrieve a specific version of a table
SELECT colA, colB FROM tableName WHERE colNameA > 0 = Filter data based on a condition
EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11 = Find changes between two versions of a table
df1 = spark.read.format("delta").load(pathToTable) = Load a Delta table into a DataFrame

Match the following Delta Lake commands with their purposes:

CREATE TABLE ... USING DELTA = Create a new Delta Lake table
ALTER TABLE ... ADD CONSTRAINT = Add a constraint to a table
OPTIMIZE tableName = Compact data files
DESCRIBE HISTORY tableName = View transaction log

Match the following Delta Lake features with their descriptions:

Auto-optimize tables = Automatically optimize write performance
Cache frequently queried data = Store frequently accessed data in memory
Delta Log = Record of all changes made to a table
Time Travel = Query historical versions of a table

Match the following Delta Lake commands with their effects:

COPY INTO ... FROM ... = Copy new data into a Delta Lake table
ALTER TABLE ... SET TBLPROPERTIES = Set table properties
SELECT * FROM tableName VERSION AS OF 0 = Query the initial version of a table
CREATE TABLE ... PARTITIONED BY = Create a table with partitioned columns

Match the following Delta Lake constraints with their purposes:

NOT NULL = Ensure a column is not null
CHECK date > "1900-01-01" = Check if a date is within a certain range
ZORDER BY (colNameA, colNameB) = Optimize data files for query performance
PARTITIONED BY (time, date) = Partition a table by time and date columns

Match the following Delta Lake commands with their effects:

CACHE SELECT * FROM tableName = Cache frequently queried data
ALTER TABLE ... DROP CONSTRAINT = Remove a constraint from a table
SHOW TBLPROPERTIES tableName = View table properties
CREATE TABLE ... AS SELECT * FROM tableName = Create a new table with data from another table

Match the following Delta Lake features with their benefits:

Time Travel = Enable querying historical versions of a table
Delta Log = Provide a transaction log for auditing
Auto-optimize tables = Improve write performance
Cache frequently queried data = Improve query performance

Match the following Delta Lake commands with their purposes:

ALTER TABLE ... CHANGE COLUMN ... SET NOT NULL = Add a not null constraint to a column
CREATE TABLE ... (id INT, ...) = Define a table schema with explicit columns
SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18" = Query a specific version of a table
OPTIMIZE tableName ZORDER BY (colNameA, colNameB) = Optimize data files for query performance

Match the following Delta Lake features with their descriptions:

Delta Lake = A data storage system that supports ACID transactions
Delta Log = A record of all changes made to a table
Time Travel = A feature that allows querying historical versions of a table
Auto-optimize tables = A feature that automatically optimizes write performance

Match the following Delta Lake commands with their effects:

CREATE TABLE ... LIKE ... = Create a new table with the same schema as another table
ALTER TABLE ... RENAME TO ... = Rename a table
COPY INTO ... FROM ... FILEFORMAT = DELTA = Copy data into a Delta Lake table
DESCRIBE HISTORY tableName = View the transaction log for a table

Match the following Delta Lake features with their benefits:

ACID transactions = Ensure data consistency and integrity
Delta Cache = Improve query performance by caching frequently accessed data
Auto-optimize tables = Improve write performance by optimizing data files
Time Travel = Enable querying historical versions of a table for auditing and debugging

Match the following Delta Lake operations with their purposes:

Read data from pandas DataFrame = Load data into a Delta Lake table
Write DataFrame in Delta Lake format = Save data from a DataFrame into a Delta Lake table
Delete rows that match a predicate condition = Remove specific data from a Delta Lake table
Run Spark SQL queries in Python = Execute SQL queries on a Delta Lake table

Match the following Delta Lake operations with their SQL syntax:

Upsert (update + insert) using MERGE = WHEN MATCHED THEN UPDATE... WHEN NOT MATCHED THEN INSERT...
Delete rows that match a predicate condition = DELETE FROM table_name WHERE condition
Update rows that match a predicate condition = UPDATE table_name SET column = value WHERE condition
Insert data into a Delta Lake table = INSERT INTO table_name VALUES (value1, value2, ...)

Match the following Delta Lake utility methods with their descriptions:

Vacuum old files = Remove old versions of a Delta Lake table
Clone a Delta Lake table = Create a copy of a Delta Lake table
Get DataFrame representation of a Delta Lake table = Retrieve a DataFrame from a Delta Lake table
Restore a table to a previous version = Revert a Delta Lake table to a previous state

Match the following Delta Lake operations with their requirements:

Read data using Apache Spark = Spark session
Save DataFrame in Delta Lake format = Delta Lake format
Run Spark SQL queries in Python = Spark SQL
Streaming writes (Delta table as a sink) = Delta Lake table

Match the following Delta Lake concepts with their descriptions:

ACID transactions = Atomicity, Consistency, Isolation, Durability
Delta Lake table = A table stored in Delta Lake format
Merge = Upsert operation in Delta Lake
Optimize = Compacting small files into larger ones

Match the following Delta Lake operations with their examples:

Read data from pandas DataFrame = df = spark.createDataFrame(pdf)
Write DataFrame in Delta Lake format = df.write.format("delta").mode("append")
Delete rows that match a predicate condition = deltaTable.delete("date < '2017-01-01'")
Run Spark SQL queries in Python = spark.sql("SELECT * FROM tableName")

Match the following Delta Lake concepts with their purposes:

Versioning = Tracking changes to a Delta Lake table
Time travel = Querying a Delta Lake table at a specific point in time
Checkpoints = Saving the state of a streaming query
Merge = Upserting data into a Delta Lake table

Match the following Delta Lake operations with their SQL syntax:

Insert data into a Delta Lake table = INSERT INTO table_name VALUES (value1, value2, ...)
Update rows that match a predicate condition = UPDATE table_name SET column = value WHERE condition
Delete rows that match a predicate condition = DELETE FROM table_name WHERE condition
Merge data into a Delta Lake table = MERGE INTO table_name USING source_table ON condition

Match the following Delta Lake utility methods with their purposes:

Vacuum old files = Removing old versions of a Delta Lake table
Clone a Delta Lake table = Creating a copy of a Delta Lake table
Get DataFrame representation of a Delta Lake table = Retrieving a DataFrame from a Delta Lake table
Restore a table to a previous version = Reverting a Delta Lake table to a previous state

Match the following Delta Lake operations with their examples:

Streaming reads (Delta table as streaming source) = df = spark.readStream.format("delta").load("/path/to/delta_table")
Streaming writes (Delta table as a sink) = df.writeStream.format("delta").option("checkpointLocation", "path/to/checkpoints")
Read data from a Delta Lake table = df = deltaTable.toDF()
Write DataFrame in Delta Lake format = df.write.format("delta").mode("append")

Study Notes

Delta Lake DDL/DML

  • Update rows that match a predicate condition using UPDATE statement: UPDATE tableName SET event = 'click' WHERE event = 'clk'
  • Delete rows that match a predicate condition using DELETE statement: DELETE FROM tableName WHERE date < '2017-01-01'
  • Insert values directly into a table using INSERT statement: INSERT INTO tableName VALUES (8003, "Kim Jones", "2020-12-18", 3.875)
  • Upsert (update + insert) using MERGE statement: MERGE INTO target USING updates ON target.Id = updates.Id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT (date, Id, data) VALUES (date, Id, data)
  • Rollback a table to an earlier version using RESTORE statement: RESTORE tableName VERSION AS OF 0 or RESTORE tableName TIMESTAMP AS OF "2020-12-18"
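
A minimal PySpark sketch of these statements run through spark.sql; the events table, its columns, and the updates source are assumptions made only for this example:

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Build a Delta-enabled Spark session (requires the delta-spark package).
    builder = (
        SparkSession.builder.appName("delta-dml")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Assumed: a Delta table `events` and a source `updates` already exist.
    spark.sql("UPDATE events SET event = 'click' WHERE event = 'clk'")
    spark.sql("DELETE FROM events WHERE date < '2017-01-01'")
    spark.sql("""
        MERGE INTO events AS target
        USING updates ON target.Id = updates.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
    spark.sql("RESTORE events VERSION AS OF 0")  # syntax as given in the cheat sheet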

Table Management

  • Create a Delta Lake table from a query using CREATE TABLE statement: CREATE TABLE tableName USING DELTA AS SELECT * FROM sourceTable
  • Create a Delta Lake table with a specified schema: CREATE TABLE tableName (id INT, name STRING, date DATE, int_rate FLOAT) USING DELTA
  • Convert a Parquet table to Delta Lake format in place: CONVERT TO DELTA tableName
  • Alter table schema — add columns: ALTER TABLE tableName ADD COLUMNS (col_name data_type)
  • Alter table — add constraint: ALTER TABLE tableName ADD CONSTRAINT dateWithinRange CHECK (date > '1900-01-01')
  • Drop constraint: ALTER TABLE tableName DROP CONSTRAINT dateWithinRange
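
These table-management statements, sketched under the same assumptions; the loans table and grade column are illustrative names:

    # Assumes the Delta-enabled `spark` session from the earlier sketch.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS loans
          (id INT, name STRING, date DATE, int_rate FLOAT)
        USING DELTA
    """)
    spark.sql("CONVERT TO DELTA parquet.`/path/to/parquet_table`")  # in-place conversion
    spark.sql("ALTER TABLE loans ADD COLUMNS (grade STRING)")
    spark.sql("ALTER TABLE loans ADD CONSTRAINT dateWithinRange "
              "CHECK (date > '1900-01-01')")
    spark.sql("ALTER TABLE loans DROP CONSTRAINT dateWithinRange")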

Performance Optimizations

  • Compact data files with OPTIMIZE statement: OPTIMIZE tableName
  • Use ZORDER to optimize data files: OPTIMIZE tableName ZORDER BY (colNameA, colNameB)
  • Auto-optimize tables: ALTER TABLE tableName SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
  • Cache frequently queried data in Delta Cache: CACHE SELECT * FROM tableName
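
A short sketch of the optimization commands; note that CACHE SELECT and auto-optimize are Databricks features, and the table and column names are illustrative:

    # Assumes `spark` and a Delta table `loans` as in the sketches above.
    spark.sql("OPTIMIZE loans")                       # compact small files
    spark.sql("OPTIMIZE loans ZORDER BY (date, id)")  # co-locate data for common filters
    spark.sql("ALTER TABLE loans SET TBLPROPERTIES "
              "(delta.autoOptimize.optimizeWrite = true)")
    spark.sql("CACHE SELECT * FROM loans")            # Databricks Delta Cache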

Time Travel

  • View transaction log (aka Delta Log) using DESCRIBE HISTORY statement: DESCRIBE HISTORY tableName

  • Query historical versions of Delta Lake tables: SELECT * FROM tableName VERSION AS OF 0 or SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"

  • Find changes between 2 versions of a table: SELECT * FROM tableName VERSION AS OF 12 EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11
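
A hedged sketch of time travel from Python, assuming a Delta table named loans with at least the versions referenced:

    # Assumes `spark` and an existing Delta table `loans`.
    spark.sql("DESCRIBE HISTORY loans").show()        # inspect the Delta Log
    v0 = spark.sql("SELECT * FROM loans VERSION AS OF 0")
    asof = spark.sql("SELECT * FROM loans TIMESTAMP AS OF '2020-12-18'")
    diff = spark.sql("""
        SELECT * FROM loans VERSION AS OF 12
        EXCEPT ALL
        SELECT * FROM loans VERSION AS OF 11
    """)  # rows added in version 12 relative to version 11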

Delta Lake Fundamentals

  • Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

  • It provides a reliable data storage layer for Apache Spark and other big data engines.

Reading and Writing Data with Delta Lake

  • Read data from a pandas DataFrame using spark.createDataFrame(pdf) and then save it in Delta Lake format.
  • Read data using Apache Spark by specifying the format (e.g., parquet, csv, json) and loading from a path.
  • Read data by table name using spark.table("events").
  • Save a DataFrame in Delta Lake format using write.format("delta") and specify the mode (append, overwrite) and optional partitioning.
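
For example, a minimal read/write round trip; the data, path, and table name are made up:

    import pandas as pd

    # Assumes the Delta-enabled `spark` session from the first sketch.
    pdf = pd.DataFrame({"id": [1, 2], "event": ["clk", "click"]})
    df = spark.createDataFrame(pdf)                   # pandas -> Spark DataFrame
    df.write.format("delta").mode("overwrite").saveAsTable("events")   # by name
    df.write.format("delta").mode("append").save("/tmp/delta/events")  # by path
    by_name = spark.table("events")
    by_path = spark.read.format("delta").load("/tmp/delta/events")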

Delta Lake DDL/DML Operations

  • Delete rows that match a predicate condition using deltaTable.delete("date < '2017-01-01'").
  • Update rows that match a predicate condition using deltaTable.update(condition = "eventType = 'clk'", set = { "eventType": "'click'" } ).
  • Upsert (update + insert) using MERGE with options for whenMatchedUpdate, whenMatchedUpdateAll, whenNotMatchedInsert, and whenMatchedDelete.
  • Clone a Delta Lake table using deltaTable.clone(target="/path/to/delta_table/", isShallow=True, replace=True).
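
The same operations through the Python DeltaTable API, as a sketch; updatesDF is a hypothetical source DataFrame:

    from delta.tables import DeltaTable

    # Assumes `spark` and the Delta table written at /tmp/delta/events above.
    deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
    deltaTable.delete("event = 'stale'")              # delete by predicate
    deltaTable.update(condition="event = 'clk'", set={"event": "'click'"})
    (deltaTable.alias("t")
        .merge(updatesDF.alias("u"), "t.id = u.id")   # updatesDF: hypothetical
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
    # Per the cheat sheet (a Databricks API):
    deltaTable.clone(target="/tmp/delta/events_copy", isShallow=True, replace=True)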

Utility Methods

  • Run Spark SQL queries in Python using spark.sql("SELECT * FROM tableName").
  • Get a DataFrame representation of a Delta Lake table using deltaTable.toDF().
  • Vacuum old files using deltaTable.vacuum() or deltaTable.vacuum(100).
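
A brief sketch of the utility calls; the path and retention window are illustrative:

    from delta.tables import DeltaTable

    # Assumes `spark`; the path is illustrative.
    deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
    df = deltaTable.toDF()                            # DataFrame view of the table
    deltaTable.vacuum()                               # default retention (7 days)
    deltaTable.vacuum(100)                            # retain only the last 100 hours
    spark.sql("SELECT * FROM events").show()          # Spark SQL from Python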

Performance Optimizations

  • Compact data files with Optimize and Z-Order using spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]").
  • Auto-optimize tables by setting the Databricks table property delta.autoOptimize.optimizeWrite = true.
  • Insert with deduplication using MERGE.
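
Insert-with-deduplication can be expressed as an insert-only merge; the logs and newDedupedLogs names below are hypothetical:

    # Rows from newDedupedLogs whose uniqueId is not already in logs are
    # inserted; matching rows are left untouched (no WHEN MATCHED clause).
    spark.sql("""
        MERGE INTO logs
        USING newDedupedLogs
        ON logs.uniqueId = newDedupedLogs.uniqueId
        WHEN NOT MATCHED THEN INSERT *
    """)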


Description

Test your understanding of Delta Lake DDL/DML operations, including update, delete, merge, and alter table, as well as time travel and rollback capabilities.
