Questions and Answers
A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables. Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame. Which of the following describes how a data lakehouse could alleviate this issue?
Which Structured Streaming query is performing a hop from a Silver table to a Gold table?
A data engineer only wants to execute the final block of a Python program if the Python variable `day_of_week` is equal to 1 and the Python variable `review_period` is `True`. Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
A data engineer needs to apply custom logic to string column `city` in table `stores` for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF). Which of the following code blocks creates this SQL UDF?
A data engineer needs to access the view created by the sales team, using a shared cluster. The data engineer has been provided usage permissions on the catalog and schema. In order to access the view created by the sales team, what are the minimum permissions the data engineer would require in addition?
Which two conditions are applicable for governance in Databricks Unity Catalog? (Choose two.)
Flashcards
What is a data lakehouse?
A data lakehouse uses an open data format, such as Parquet, and combines transactional ACID properties with data warehousing capabilities. This allows both real-time and batch analytics on the same data, eliminating the need for separate data lakes and data warehouses.
What are the layers of a data pipeline?
A data pipeline in Databricks typically follows a layered approach, starting with raw data and progressively refining it through various stages.
- Raw: Untouched data, as it arrives from its source.
- Bronze: Basic data processing and cleaning, ensuring consistency and format.
- Silver: Further data enrichment and transformations, bringing the data closer to its final analytical form.
- Gold: Final, analytical data, ready for reporting and analysis.
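A minimal PySpark sketch of this layered refinement, assuming a Databricks notebook where `spark` is available; the paths, table names, and transformations are hypothetical:

```python
from pyspark.sql import functions as F

# Bronze: land the raw records mostly as-is, adding ingestion metadata
bronze_df = (spark.read.format("json").load("/data/raw/orders")      # hypothetical raw path
                  .withColumn("ingest_time", F.current_timestamp()))
bronze_df.write.format("delta").mode("append").saveAsTable("orders_bronze")

# Silver: clean and conform the Bronze data
silver_df = (spark.table("orders_bronze")
                  .filter(F.col("order_id").isNotNull())
                  .dropDuplicates(["order_id"]))
silver_df.write.format("delta").mode("overwrite").saveAsTable("orders_silver")

# Gold: aggregate into an analytics-ready table
gold_df = (spark.table("orders_silver")
                .groupBy("region")
                .agg(F.sum("amount").alias("total_sales")))
gold_df.write.format("delta").mode("overwrite").saveAsTable("orders_gold")
```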
What is Delta Live Tables?
Delta Live Tables (DLT) in Databricks enables a more efficient and automated way to build data pipelines by providing a declarative approach using SQL. With DLT, data engineers can define the desired data transformations and constraints, and DLT manages the execution and monitoring of the pipeline. This simplifies the process of building data pipelines while ensuring data quality and consistency.
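DLT pipelines can also be defined in Python with the `dlt` module; below is a minimal sketch (run as part of a DLT pipeline, not as a standalone notebook) in which the source path, table names, and expectation are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/data/raw/orders"))                     # hypothetical landing path

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # rows violating this expectation are dropped
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("processed_time", F.current_timestamp())
```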
When should you use a `CREATE STREAMING LIVE TABLE`?
What's the difference between a managed table and an external table in Databricks?
What is a cluster pool in Databricks?
What is `spark.sql` in PySpark?
What is Auto Loader in Databricks?
How do you configure a Structured Streaming job to run every 5 seconds?
What is the `COPY INTO` command in Databricks?
How does the `MERGE` command handle duplicate records in Databricks?
What are expectations in Delta Live Tables?
What are constraints in Delta Live Tables?
What does the `ON VIOLATION DROP ROW` option do in Delta Live Tables?
What does the `ON VIOLATION FAIL UPDATE` option do in Delta Live Tables?
What are SQL UDFs in Databricks?
What is the `PIVOT` transformation in SQL?
What is the `count_if` function in SQL?
What's the `count` function in SQL?
What is Databricks Repos?
What is Unity Catalog in Databricks?
What are roles in Databricks Unity Catalog?
What is the `USING DELTA` clause for in Databricks?
What is a Databricks Job?
How can you schedule a Databricks Job?
What is a task in a Databricks Job?
What is a task dependency in a Databricks Job?
What is an alert in Databricks?
What languages are typically used for data engineering tasks?
What is the `spark.delta.table` function in PySpark?
What is the `spark.table` function in PySpark?
What is the `dbutils.sql` function in Databricks?
What is a Bronze table in a data pipeline?
What is a Silver table in a data pipeline?
What is a Gold table in a data pipeline?
What is a dashboard in Databricks?
What is a SQL endpoint in Databricks?
What are alerts in Databricks?
What does the `SELECT * FROM sales` query do in SQL?
What is a jobs cluster?
What is a serverless SQL warehouse?
What is a Databricks SQL dashboard?
When does Auto Loader come in handy?
What are checkpoints in Structured Streaming?
What are write-ahead logs in Delta Lake?
What are idempotent sinks?
How can you assign full permissions on a table to a specific user?
How can you give a user permission to view a table?
How can you avoid unnecessary compute costs from SQL endpoints?
How do you improve the startup time of a Spark cluster in Databricks?
Study Notes
Databricks Certified Data Engineer Associate Exam Notes
- Question 163: Migrating a data pipeline to Delta Live Tables requires rewriting the pipeline in Python.
- Question 161: Testing data quality returned by a Delta query needs to be done using Python, rather than SQL.
- Question 160: Using a data lakehouse alleviates siloed data analysis and engineering teams by establishing a single source of truth.
- Question 159: The Structured Streaming query that performs a hop from a Silver table to a Gold table uses the `withColumn` function to calculate `avgPrice` and `writeStream` with `append` mode.
- Question 153: To conditionally execute code in Python, use `if` statements, checking for equality in `day_of_week` and boolean values in `review_period` (see the conditional-block sketch after these notes).
- Question 150: A custom SQL user-defined function (UDF) is best used to apply complex logic to string columns (see the SQL UDF sketch after these notes).
- Question 146: To access a view in a shared cluster, data engineers need SELECT permissions on the view and the underlying table.
- Question 145: Databricks Unity Catalog governance requires the catalog and schema to have a managed location.
- Question 144: Constraints in Delta Live Tables can prevent the ingestion of data rows that do not comply with a set criteria.
- Question 142: Migrating a data pipeline to Delta Live Tables might need different notebook sources and a batch source rather than streaming.
- Question 141: Structured Streaming writes to a new table.
- Question 140: Delta Lake improves data architecture by unifying siloed data architectures via a standardized format.
- Question 139: The best approach to query a specific prior version of a Delta table in healthcare is to reference the specific version from the Delta transaction log.
- Question 138: ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE statements in constraints can be used to implement data validation.
- Question 137: The `count_if` function can return the number of rows where a given condition holds true; `count` with a `WHERE x IS NULL` filter can return the number of a given column's null values (see the `count_if` sketch after these notes).
- Question 136: A Job cluster is better suited for scheduled Python notebooks.
- Question 135: Full privileges in Databricks involve granting `ALL PRIVILEGES` on the target table (see the GRANT sketch after these notes).
- Question 133: To prevent unnecessary compute costs on a Databricks SQL query, set a limit on DBUs consumed by the SQL Endpoint or specify a termination date.
- Question 129: Use the Databricks Jobs UI to find the reason a notebook in a Data Job is running slowly. Use ‘Runs’ and ‘Task’ tabs.
- Question 128: To run a new task before the original task in a job, add the new task as a dependency of the original task so that it completes before the original task starts.
- Question 127: To improve the startup time of Databricks clusters, use Databricks SQL endpoints or jobs clusters using a cluster pool.
- Question 126: The expected behavior in Delta Live Tables when a batch of data violates timestamp constraints is to drop those rows from the main table and log them as invalid entries.
- Question 125: A streaming hop involves reading from a raw source and writing to a Bronze table.
- Question 124: Silver tables typically contain less data than Bronze tables.
- Question 123:
- Question 122: Auto Loader is usable for streaming workloads.
- Question 121: For a continuous pipeline in development mode with valid datasets, all datasets will be updated at intervals until the pipeline is shut down, and the compute resources will persist to allow for additional testing.
- Question 120: Use `CREATE STREAMING LIVE TABLE` when you need to process data incrementally in a continuous pipeline, rather than `CREATE LIVE TABLE`, which is for one-time processing.
- Question 119: Gold tables contain aggregations.
- Question 118: Use DLT’s checkpointing and write-ahead logs to track the offset range of data processing.
- Question 117: Review the DLT pipeline page and error button on individual tables to see where the data is being dropped.
- Question 116: The most applicable tool for monitoring data quality is Delta Live Tables.
- Question 115:
- Question 114: Use `trigger(processingTime="5 seconds")` for frequent micro-batch processing in a streaming pipeline (see the trigger sketch after these notes).
- Question 113: The `FILTER` higher-order function, combined with SQL, allows the desired manipulation at scale.
- Question 112: Use `LEFT JOIN` to combine data from two tables in a query.
- Question 110: `PIVOT` is used for converting data from a long format to a wide format in SQL (see the `PIVOT` sketch after these notes).
- Question 109: Parquet files enable partitioning, which is beneficial for performance and data management.
- Question 108: Use the `USING CSV` clause to create external tables from CSV files in Databricks.
- Question 107: When `DROP TABLE IF EXISTS` is executed, the files are deleted when the table is a managed table.
- Question 106: Databricks databases are stored in the `dbfs:/user/hive/warehouse` location.
- Question 105: Use `spark.table("sales")` in PySpark to access a Delta table.
- Question 104: Use `MERGE` or `UPSERT` to avoid writing duplicate records to a Delta table (see the `MERGE` sketch after these notes).
- Question 103: Using `FORMAT_OPTIONS` along with the `COPY INTO` command is required (see the `COPY INTO` sketch after these notes).
- Question 102: A `Table` is used to present data from multiple tables in the form of a single, relational object in Databricks.
- Question 100: To insert new rows into a table, use the `INSERT INTO ... VALUES` command.
- Question 99: Use `DESCRIBE DATABASE` to determine the location of a database in Databricks.
- Question 98: Use `CREATE OR REPLACE TABLE`.
- Question 97: To sync changes to a Databricks Repo from a Git repository, perform a `Pull`.
- Question 96: Use Databricks Catalog Explorer to access and view table permissions.
- Question 95: Open source technologies are beneficial in cloud-based platforms, particularly regarding vendor lock-in.
- Question 94: Databricks Repos are advantageous because they include versioning for notebooks.
- Question 93: The Databricks web application is located in the control plane.
- Question 92: Use cluster pools when you need to use multiple, smaller clusters but want to group them for a specific function.
- Question 91: Data lakehouses frequently use ACID transactions to improve data quality consistency.
- Question 90: Grant the `ALL PRIVILEGES` access on the table.
- Question 89: Use the Data Explorer's “Permissions” tab to examine table ownership.
- Question 88:
- Question 87: Use the Auto Stop feature to reduce the total running time of the SQL endpoint.
- Question 86: To make a task wait for another task to complete before it begins, add the dependency in the Databricks Jobs UI.
- Question 85:
- Question 84:
- Question 83: Utilize the `datetime` module in Python to dynamically define scheduling for Databricks jobs.
- Question 82: Setting the SQL endpoint to serverless reduces query duration.
- Question 81: Use Databricks Notebooks versioning when an update is needed in a new task.
- Question 80: To limit compute costs on a Databricks SQL query, use the query's refresh schedule to end after a certain number of refreshes, or set a limit on DBUs.
- Question 79: A streaming query that reads from a raw data source, processes the data in each trigger, and appends to a Bronze table performs the raw-to-Bronze hop.
- Question 78: Use Databricks Alerts with a new webhook to notify the team.
- Question 77: If migrating to Delta Live Tables from a JSON streaming input, avoid using a batch source and use Python.
- Question 76: Streaming workloads are compatible with Auto Loader.
- Question 75: Use `STREAM` to process a live table in real time.
- Question 74: At least one notebook library must be specified during Delta Live Tables pipeline creation.
- Question 73: Use a Gold table to collect summary statistics.
- Question 72: Use `CREATE STREAMING LIVE TABLE` when there's a need for batch and incremental updates in the pipeline.
- Question 71: In development mode, datasets will be repeatedly updated without immediate termination of the pipeline or compute resources.
- Question 70: JSON files are text-based, hence it's typical for Auto Loader to interpret JSON columns as strings.
- Question 69: Use `trigger(availableNow=True)` in the micro-batch trigger line to ensure a query processes all available data in micro-batches.
- Question 68: Use `LEFT JOIN` to combine the data from two tables in a query that joins on `customer_id`.
- Question 67:
- Question 66: Use `MERGE` to update or insert data into a table.
- Question 65: Managed tables delete data files.
- Question 64: A Databricks database is located within the `dbfs:/user/hive/warehouse` directory.
- Question 63: `spark.sql` is the appropriate call to execute a SQL query using the variable `table_name`.
- Question 62: Use a `FILTER` higher-order function in SQL to filter for employees who have worked more than 5 years.
- Question 61: Finding null `member_id`s is accomplished by using `count_if` in a SQL query with an `IS NULL` condition.
- Question 60: Use `spark.sql` to run the query in PySpark.
- Question 59: Create a `View` object to represent the joined data from the two tables, without storing a copy.
- Question 58: Parquet files (rather than CSV) result in optimized data structures for `CREATE TABLE` operations.
- Question 57: `PIVOT` in SQL is the keyword used to reshape a table from a long format to a wide format.
- Question 56: To include SQL in a Python notebook, add the `%sql` magic command at the top of the cell, before the SQL statement.
- Question 55: Use `CREATE TABLE IF NOT EXISTS` to generate an empty Delta table.
- Question 54: Use a `Data Lakehouse` to unify siloed data.
- Question 53:
- Question 52: `Parquet` is the principal file format for Delta table data.
- Question 51:
- Question 50: Use `INSERT INTO my_table VALUES` to append new data to the table.
- Question 49: Single-node clusters are ideal when working with small datasets.
- Question 47: Leveraging open-source technology in a Databricks Lakehouse can help prevent vendor lock-in.
- Question 46: Use Databricks Repos, rather than Notebooks versioning, for version control.
- Question 45: Grant usage access on the database to the team using `GRANT USAGE ON DATABASE customers TO team`.
- Question 44: Use `GRANT ALL PRIVILEGES ON DATABASE customers TO team` to grant full permissions on the database to a new team in Databricks.
- Question 43: Using cluster pools improves cluster startup time for multiple tasks that run nightly.
- Question 42: Review the Databricks Jobs’ “Runs” tab to identify a slow-running notebook in a job.
- Question 41: The `Alert` configuration, and specifically a `Webhook`, is used for notifications in Databricks.
- Question 40: Leverage Databricks' Auto Stop feature to minimize the total running time of the SQL endpoint when it is no longer needed for immediate use.
- Question 39: To improve the latency of Databricks SQL queries, increase the size of the SQL endpoint.
- Question 38: Limit the consumption of DBUs to control compute costs when an SQL query is part of a job with a dynamically changing schedule.
- Question 37: Create a new task, with the original task as a dependency, within the same Databricks Job to accomplish a sequential task.
- Question 36: View Databricks data quality statistics to pinpoint data issues within the DLT pipeline.
- Question 35: The `groupBy()`, `agg()`, `writeStream`, and `outputMode('complete')` methods in PySpark are used to perform the hop from a Silver to a Gold table (see the Silver-to-Gold sketch after these notes).
- Question 34: Use Auto Loader to identify and ingest only new files in a shared location (see the Auto Loader sketch after these notes).
- Question 33: Use `CREATE STREAMING` when working with incremental data.
- Question 32: When data violates constraints, it's dropped from the main table and logged as invalid (see the expectations sketch after these notes).
- Question 31: Use `trigger(processingTime="5 seconds")` in Structured Streaming.
- Question 30: Spark Structured Streaming is how Auto Loader processes data incrementally.
- Question 29: Raw data is frequently less refined than data in Bronze tables.
- Question 28: Gold tables generally hold aggregates, while Silver tables typically do not.
- Question 27: Use Checkpointing and Write-Ahead logs.
- Question 26: Ensure that the pipeline's resources continue to be available for testing in production mode for continuous processing.
- Question 25: Use Delta Live Tables to automate data quality monitoring.
- Question 24: A Table is the most appropriate data entity, as it contains all data.
- Question 23: Managed tables delete both their data and metadata when the table is dropped.
- Question 22: Use `if day_of_week == 1 and review_period == "True":` to begin the conditionally executed code block.
- Question 21: Use the `UNION` command to prevent duplicate data when combining data from two Delta tables.
- Question 20: The `url` option for the `CREATE TABLE` statement in Databricks should use a JDBC connection string (e.g., `jdbc:sqlite:/path/to/file.db`).
- Question 19: When a `COPY INTO` statement does not ingest new rows, it could be due to missing files, the file not being in the right format, a table refresh being needed, or the file already having been copied.
- Question 17: A data engineer needs to use a SQL UDF to make a function call in Databricks.
- Question 16: A data engineer would use the MERGE command to incrementally update a Delta Table while avoiding duplicates.
- Question 15: Array functions provide the ability to work with different data types in a single operation.
- Question 14: Use the `COMMENT "Contains PII"` statement as part of the `CREATE TABLE` statement to denote the inclusion of Personally Identifiable Information (PII) in the new table.
- Question 13: Access the location of a database in Databricks by issuing a `DESCRIBE DATABASE ...` command, not a `DROP DATABASE ...` command.
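Code Sketches

The sketches below illustrate several commands referenced in the notes above. They assume a Databricks notebook where `spark` is defined; all paths, table names, and column names are hypothetical unless they come from the questions themselves.

Auto Loader ingesting only new files from a landing location into a Bronze table as a streaming raw-to-Bronze hop (Questions 122, 34, 30):

```python
bronze_query = (
    spark.readStream.format("cloudFiles")                           # Auto Loader source
         .option("cloudFiles.format", "json")                       # raw files are JSON
         .option("cloudFiles.schemaLocation", "/chk/orders/schema") # where the inferred schema is tracked
         .load("/data/raw/orders")                                  # hypothetical landing path
         .writeStream
         .option("checkpointLocation", "/chk/orders/bronze")        # checkpoint for incremental progress
         .outputMode("append")
         .table("orders_bronze")
)
```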
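A Silver-to-Gold hop that aggregates with `groupBy()`/`agg()` and writes with `outputMode('complete')` (Questions 159 and 35); a sketch, assuming the Silver table already exists:

```python
from pyspark.sql import functions as F

gold_query = (
    spark.readStream.table("orders_silver")                 # stream from the Silver table
         .groupBy("region")
         .agg(F.avg("price").alias("avgPrice"))             # Gold-level aggregate
         .writeStream
         .option("checkpointLocation", "/chk/orders/gold")  # hypothetical checkpoint path
         .outputMode("complete")                            # rewrite the aggregate result each micro-batch
         .table("orders_gold")
)
```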
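Configuring trigger intervals in Structured Streaming: a fixed 5-second micro-batch cadence versus processing everything currently available and then stopping (Questions 114, 31, 69):

```python
stream_df = spark.readStream.table("orders_bronze")   # hypothetical streaming source

# Run a micro-batch every 5 seconds
(stream_df.writeStream
          .trigger(processingTime="5 seconds")
          .option("checkpointLocation", "/chk/demo/processing_time")
          .table("orders_silver"))

# Process all currently available data in micro-batches, then stop
(stream_df.writeStream
          .trigger(availableNow=True)
          .option("checkpointLocation", "/chk/demo/available_now")
          .table("orders_silver_backfill"))
```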
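Delta Live Tables expectations corresponding to `ON VIOLATION DROP ROW` and `ON VIOLATION FAIL UPDATE`, shown here with the Python `dlt` API rather than the SQL `CONSTRAINT` clause; table names and conditions are hypothetical (Questions 138, 126, 32):

```python
import dlt

@dlt.table(comment="Events with valid timestamps only")
@dlt.expect_or_drop("valid_timestamp", "event_time IS NOT NULL")   # offending rows are dropped and logged
def events_silver():
    return dlt.read_stream("events_bronze")

@dlt.table(comment="Payments that must all be positive")
@dlt.expect_or_fail("positive_amount", "amount > 0")               # any offending row fails the update
def payments_silver():
    return dlt.read("payments_bronze")
```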
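Creating and applying a SQL UDF from PySpark via `spark.sql`, using the `city` column of the `stores` table from the question; the function name and logic are hypothetical (Questions 150, 17):

```python
# Define the SQL UDF once; it can then be called from any SQL query
spark.sql("""
CREATE OR REPLACE FUNCTION format_city(city STRING)
RETURNS STRING
RETURN CONCAT(UPPER(SUBSTRING(city, 1, 1)), LOWER(SUBSTRING(city, 2)))
""")

# Apply the UDF at scale
spark.sql("SELECT city, format_city(city) AS city_formatted FROM stores").show()
```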
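Counting rows that satisfy a condition with `count_if`, alongside a plain `count` restricted to null values (Questions 137, 61); the `members` table and `member_id` column are hypothetical:

```python
spark.sql("""
SELECT
  count_if(member_id IS NULL)                    AS null_member_ids,     -- rows where the condition holds
  count(CASE WHEN member_id IS NULL THEN 1 END)  AS null_member_ids_alt  -- same result using count
FROM members
""").show()
```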
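An upsert with `MERGE` that avoids writing duplicate records into a Delta table (Questions 104, 66, 16); table and key names are hypothetical:

```python
spark.sql("""
MERGE INTO customers AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
""")
```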
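`COPY INTO` with `FORMAT_OPTIONS`; because the command is idempotent, files that were already loaded are skipped, which is one reason a `COPY INTO` statement may appear to ingest no new rows (Questions 103, 19). The path and options are hypothetical:

```python
spark.sql("""
COPY INTO sales
FROM '/data/landing/sales'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
""")
```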
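Granting privileges on tables and databases (Questions 135, 90, 45, 44); the user names are hypothetical, while `team` and `customers` come from the notes:

```python
# Full permissions on a table for a specific user
spark.sql("GRANT ALL PRIVILEGES ON TABLE customers TO `new.engineer@company.com`")

# Read-only access to a table for another user
spark.sql("GRANT SELECT ON TABLE customers TO `analyst@company.com`")

# Usage and full privileges on a database for a team
spark.sql("GRANT USAGE ON DATABASE customers TO team")
spark.sql("GRANT ALL PRIVILEGES ON DATABASE customers TO team")
```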
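Reshaping a long table into a wide one with `PIVOT` (Questions 110, 57); the `sales` columns and quarter values are hypothetical:

```python
spark.sql("""
SELECT *
FROM (SELECT region, quarter, amount FROM sales)
PIVOT (
  SUM(amount)
  FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
)
""").show()
```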
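The conditional block from Questions 153 and 22. The two notes disagree on whether `review_period` holds a boolean or the string "True"; this sketch assumes the boolean form:

```python
day_of_week = 1
review_period = True

if day_of_week == 1 and review_period:
    # the final block runs only on day 1 and only during a review period
    print("Running end-of-review processing")
```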
Description
Prepare for the Databricks Certified Data Engineer Associate Exam with these concise notes. This quiz covers essential concepts such as Delta Live Tables, data lakehouses, and Python functionalities in data engineering. Test your knowledge and ensure you're ready for the certification challenges ahead.