Questions and Answers
A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables. Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame. Which of the following describes how a data lakehouse could alleviate this issue?
Which Structured Streaming query is performing a hop from a Silver table to a Gold table?
A data engineer only wants to execute the final block of a Python program if the Python variable `day_of_week` is equal to 1 and the Python variable `review_period` is `True`. Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
A data engineer needs to apply custom logic to string column `city` in table `stores` for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF). Which of the following code blocks creates this SQL UDF?
A data engineer needs to access the view created by the sales team, using a shared cluster. The data engineer has been provided usage permissions on the catalog and schema. In order to access the view created by the sales team, what are the minimum permissions the data engineer would require in addition?
Which two conditions are applicable for governance in Databricks Unity Catalog? (Choose two.)
Flashcards
What is a data lakehouse?
A data lakehouse uses an open data format, such as Parquet, and combines transactional ACID properties with data warehousing capabilities. This allows both real-time and batch analytics on the same data, eliminating the need for separate data lakes and data warehouses.
What are the layers of a data pipeline?
A data pipeline in Databricks typically follows a layered approach, starting with raw data and progressively refining it through various stages.
- Raw: Untouched data, as it arrives from its source.
- Bronze: Basic data processing and cleaning, ensuring consistency and format.
- Silver: Further data enrichment and transformations, bringing the data closer to its final analytical form.
- Gold: Final, analytical data, ready for reporting and analysis.
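A minimal PySpark sketch of this layered refinement, assuming a Databricks notebook where `spark` is available; the paths, table names, and transformations are hypothetical:

```python
from pyspark.sql import functions as F

# Bronze: land the raw records mostly as-is, adding ingestion metadata
bronze_df = (spark.read.format("json").load("/data/raw/orders")      # hypothetical raw path
                  .withColumn("ingest_time", F.current_timestamp()))
bronze_df.write.format("delta").mode("append").saveAsTable("orders_bronze")

# Silver: clean and conform the Bronze data
silver_df = (spark.table("orders_bronze")
                  .filter(F.col("order_id").isNotNull())
                  .dropDuplicates(["order_id"]))
silver_df.write.format("delta").mode("overwrite").saveAsTable("orders_silver")

# Gold: aggregate into an analytics-ready table
gold_df = (spark.table("orders_silver")
                .groupBy("region")
                .agg(F.sum("amount").alias("total_sales")))
gold_df.write.format("delta").mode("overwrite").saveAsTable("orders_gold")
```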
What is Delta Live Tables?
Delta Live Tables (DLT) in Databricks enables a more efficient and automated way to build data pipelines by providing a declarative approach using SQL. With DLT, data engineers can define the desired data transformations and constraints, and DLT manages the execution and monitoring of the pipeline. This simplifies the process of building data pipelines while ensuring data quality and consistency.
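DLT pipelines can also be defined in Python with the `dlt` module; below is a minimal sketch (run as part of a DLT pipeline, not as a standalone notebook) in which the source path, table names, and expectation are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/data/raw/orders"))                     # hypothetical landing path

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # rows violating this expectation are dropped
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("processed_time", F.current_timestamp())
```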
When should you use a `CREATE STREAMING LIVE TABLE`?
What's the difference between a managed table and an external table in Databricks?
What is a cluster pool in Databricks?
What is `spark.sql` in PySpark?
What is Auto Loader in Databricks?
How do you configure a Structured Streaming job to run every 5 seconds?
What is the `COPY INTO` command in Databricks?
How does the `MERGE` command handle duplicate records in Databricks?
What are expectations in Delta Live Tables?
What are constraints in Delta Live Tables?
What does the `ON VIOLATION DROP ROW` option do in Delta Live Tables?
What does the `ON VIOLATION FAIL UPDATE` option do in Delta Live Tables?
What are SQL UDFs in Databricks?
What is the `PIVOT` transformation in SQL?
What is the `count_if` function in SQL?
What's the `count` function in SQL?
What is Databricks Repos?
What is Unity Catalog in Databricks?
What are roles in Databricks Unity Catalog?
What is the `USING DELTA` clause for in Databricks?
What is a Databricks Job?
How can you schedule a Databricks Job?
What is a task in a Databricks Job?
What is a task dependency in a Databricks Job?
What is an alert in Databricks?
What languages are typically used for data engineering tasks?
What is the `spark.delta.table` function in PySpark?
What is the `spark.table` function in PySpark?
What is the `dbutils.sql` function in Databricks?
What is a Bronze table in a data pipeline?
What is a Silver table in a data pipeline?
What is a Gold table in a data pipeline?
What is a dashboard in Databricks?
What is a SQL endpoint in Databricks?
What are alerts in Databricks?
What does the `SELECT * FROM sales` query do in SQL?
What is a jobs cluster?
What is a serverless SQL warehouse?
What is a Databricks SQL dashboard?
When does Auto Loader come in handy?
What are checkpoints in Structured Streaming?
What are write-ahead logs in Delta Lake?
What are idempotent sinks?
How can you assign full permissions on a table to a specific user?
How can you give a user permission to view a table?
How can you avoid unnecessary compute costs from SQL endpoints?
How do you improve the startup time of a Spark cluster in Databricks?
Study Notes
Databricks Certified Data Engineer Associate Exam Notes
- Question 163: Migrating a data pipeline to Delta Live Tables requires rewriting the pipeline in Python.
- Question 161: Testing data quality returned by a Delta query needs to be done using Python, rather than SQL.
- Question 160: Using a data lakehouse alleviates siloed data analysis and engineering teams by establishing a single source of truth.
- Question 159: The Structured Streaming query that performs a hop from a Silver table to a Gold table uses the `withColumn` function to calculate `avgPrice` and `writeStream` with `append` mode.
- Question 153: To conditionally execute code in Python, use `if` statements, checking for equality in `day_of_week` and boolean values in `review_period` (see the conditional-block sketch after these notes).
- Question 150: A custom SQL user-defined function (UDF) is best used to apply complex logic to string columns (see the SQL UDF sketch after these notes).
- Question 146: To access a view in a shared cluster, data engineers need SELECT permissions on the view and the underlying table.
- Question 145: Databricks Unity Catalog governance requires the catalog and schema to have a managed location.
- Question 144: Constraints in Delta Live Tables can prevent the ingestion of data rows that do not comply with a set criteria.
- Question 142: Migrating a data pipeline to Delta Live Tables might need different notebook sources and a batch source rather than streaming.
- Question 141: Structured Streaming writes to a new table.
- Question 140: Delta Lake improves data architecture by unifying siloed data architectures via a standardized format.
- Question 139: The best approach to query a specific prior version of a Delta table in healthcare is to reference the specific version from the Delta transaction log.
- Question 138: ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE statements in constraints can be used to implement data validation.
- Question 137: The `count_if` function can return the number of rows where a given condition holds true; `count` with a `WHERE x IS NULL` filter can return the number of a given column's null values (see the `count_if` sketch after these notes).
- Question 136: A Job cluster is better suited for scheduled Python notebooks.
- Question 135: Full privileges in Databricks involve granting `ALL PRIVILEGES` on the target table (see the GRANT sketch after these notes).
- Question 133: To prevent unnecessary compute costs on a Databricks SQL query, set a limit on DBUs consumed by the SQL Endpoint or specify a termination date.
- Question 129: Use the Databricks Jobs UI to find the reason a notebook in a Data Job is running slowly. Use ‘Runs’ and ‘Task’ tabs.
- Question 128: To run a new task before the original task in a job, add the new task as a dependency of the original task so that it completes before the original task starts.
- Question 127: To improve the startup time of Databricks clusters, use Databricks SQL endpoints or jobs clusters using a cluster pool.
- Question 126: The expected behavior in Delta Live Tables when a batch of data violates timestamp constraints is to drop those rows from the main table and log them as invalid entries.
- Question 125: A streaming hop involves reading from a raw source and writing to a Bronze table.
- Question 124: Silver tables typically contain less data than Bronze tables.
- Question 123:
- Question 122: Auto Loader is usable for streaming workloads.
- Question 121: For a continuous pipeline in development mode with valid datasets, all datasets will be updated at intervals until the pipeline is shut down, and the compute resources will persist to allow for additional testing.
- Question 120: Use `CREATE STREAMING LIVE TABLE` when you need to process data incrementally in a continuous pipeline, rather than `CREATE LIVE TABLE`, which is for one-time processing.
- Question 119: Gold tables contain aggregations.
- Question 118: Use DLT’s checkpointing and write-ahead logs to track the offset range of data processing.
- Question 117: Review the DLT pipeline page and error button on individual tables to see where the data is being dropped.
- Question 116: The most applicable tool for monitoring data quality is Delta Live Tables.
- Question 115:
- Question 114: Use `trigger(processingTime="5 seconds")` for frequent micro-batch processing in a streaming pipeline (see the trigger sketch after these notes).
- Question 113: The `FILTER` higher-order function, combined with SQL, allows the desired manipulation at scale.
- Question 112: Use `LEFT JOIN` to combine data from two tables in a query.
- Question 110: `PIVOT` is used for converting data from a long format to a wide format in SQL (see the `PIVOT` sketch after these notes).
- Question 109: Parquet files enable partitioning, which is beneficial for performance and data management.
- Question 108: Use the `USING CSV` clause to create external tables from CSV files in Databricks.
- Question 107: When `DROP TABLE IF EXISTS` is executed, the files are deleted when the table is a managed table.
- Question 106: Databricks databases are stored in the `dbfs:/user/hive/warehouse` location.
- Question 105: Use `spark.table("sales")` in PySpark to access a Delta table.
- Question 104: Use `MERGE` or `UPSERT` to avoid writing duplicate records to a Delta table (see the `MERGE` sketch after these notes).
- Question 103: Using `FORMAT_OPTIONS` along with the `COPY INTO` command is required (see the `COPY INTO` sketch after these notes).
- Question 102: A `Table` is used to present data from multiple tables in the form of a single, relational object in Databricks.
- Question 100: To insert new rows into a table, use the `INSERT INTO ... VALUES` command.
- Question 99: Use `DESCRIBE DATABASE` to determine the location of a database in Databricks.
- Question 98: Use `CREATE OR REPLACE TABLE`.
- Question 97: To sync changes to a Databricks Repo from a Git repository, perform a `Pull`.
- Question 96: Use Databricks Catalog Explorer to access and view table permissions.
- Question 95: Open source technologies are beneficial in cloud-based platforms, particularly regarding vendor lock-in.
- Question 94: Databricks Repos are advantageous because they include versioning for notebooks.
- Question 93: The Databricks web application is located in the control plane.
- Question 92: Use cluster pools when you need to use multiple, smaller clusters but want to group them for a specific function.
- Question 91: Data lakehouses frequently use ACID transactions to improve data quality consistency.
- Question 90: Grant the `ALL PRIVILEGES` access on the table.
- Question 89: Use the Data Explorer's “Permissions” tab to examine table ownership.
- Question 88:
- Question 87: Use the Auto Stop feature to reduce the total running time of the SQL endpoint.
- Question 86: To make a task wait for another task to complete before it begins, add the dependency in the Databricks Jobs UI.
- Question 85:
- Question 84:
- Question 83: Utilize the `datetime` module in Python to dynamically define scheduling for Databricks jobs.
- Question 82: Setting the SQL endpoint to serverless reduces query duration.
- Question 81: Use Databricks Notebooks versioning when an update is needed in a new task.
- Question 80: To limit compute costs on a Databricks SQL query, use the query's refresh schedule to end after a certain number of refreshes, or set a limit on DBUs.
- Question 79: A streaming query that reads from a raw data source, processes the data in each trigger, and appends to a Bronze table performs the raw-to-Bronze hop.
- Question 78: Use Databricks Alerts with a new webhook to notify the team.
- Question 77: If migrating to Delta Live Tables from a JSON streaming input, avoid using a batch source and use Python.
- Question 76: Streaming workloads are compatible with Auto Loader.
- Question 75: Use `STREAM` to process a live table in real time.
- Question 74: At least one notebook library must be specified during Delta Live Tables pipeline creation.
- Question 73: Use a Gold table to collect summary statistics.
- Question 72: Use `CREATE STREAMING LIVE TABLE` when there's a need for batch and incremental updates in the pipeline.
- Question 71: In development mode, datasets will be repeatedly updated without immediate termination of the pipeline or compute resources.
- Question 70: JSON files are text-based, hence it's typical for Auto Loader to interpret JSON columns as strings.
- Question 69: Use `trigger(availableNow=True)` in the micro-batch trigger line to ensure a query processes all available data in micro-batches.
- Question 68: Use `LEFT JOIN` to combine the data from two tables in a query that joins on `customer_id`.
- Question 67:
- Question 66: Use `MERGE` to update or insert data into a table.
- Question 65: Managed tables delete data files.
- Question 64: A Databricks database is located within the `dbfs:/user/hive/warehouse` directory.
- Question 63: `spark.sql` is the appropriate call to execute a SQL query using the variable `table_name`.
- Question 62: Use a `FILTER` higher-order function in SQL to filter for employees who have worked more than 5 years.
- Question 61: Finding null `member_id`s is accomplished by using `count_if` in a SQL query with an `IS NULL` condition.
- Question 60: Use `spark.sql` to run the query in PySpark.
- Question 59: Create a `View` object to represent the joined data from the two tables, without storing a copy.
- Question 58: Parquet files (rather than CSV) result in optimized data structures for `CREATE TABLE` operations.
- Question 57: `PIVOT` in SQL is the keyword used to reshape a table from a long format to a wide format.
- Question 56: To include SQL in a Python notebook, add the `%sql` magic command at the top of the cell, before the SQL statement.
- Question 55: Use `CREATE TABLE IF NOT EXISTS` to generate an empty Delta table.
- Question 54: Use a `Data Lakehouse` to unify siloed data.
- Question 53:
- Question 52: `Parquet` is the principal file format for Delta table data.
- Question 51:
- Question 50: Use `INSERT INTO my_table VALUES` to append new data to the table.
- Question 49: Single-node clusters are ideal when working with small datasets.
- Question 47: Leveraging open-source technology in a Databricks Lakehouse can help prevent vendor lock-in.
- Question 46: Use Databricks Repos, rather than Notebooks versioning, for version control.
- Question 45: Grant usage access on the database to the team using `GRANT USAGE ON DATABASE customers TO team`.
- Question 44: Use `GRANT ALL PRIVILEGES ON DATABASE customers TO team` to grant full permissions on the database to a new team in Databricks.
- Question 43: Using cluster pools improves cluster startup time for multiple tasks that run nightly.
- Question 42: Review the Databricks Jobs’ “Runs” tab to identify a slow-running notebook in a job.
- Question 41: The `Alert` configuration, and specifically a `Webhook`, is used for notifications in Databricks.
- Question 40: Leverage Databricks' Auto Stop feature to minimize the total running time of the SQL endpoint when it is no longer needed for immediate use.
- Question 39: To improve the latency of Databricks SQL queries, increase the size of the SQL endpoint.
- Question 38: Limit the consumption of DBUs to control compute costs when an SQL query is part of a job with a dynamically changing schedule.
- Question 37: Create a new task, with the original task as a dependency, within the same Databricks Job to accomplish a sequential task.
- Question 36: View Databricks data quality statistics to pinpoint data issues within the DLT pipeline.
- Question 35: The `groupBy()`, `agg()`, `writeStream`, and `outputMode('complete')` methods in PySpark are used to perform the hop from a Silver to a Gold table (see the Silver-to-Gold sketch after these notes).
- Question 34: Use Auto Loader to identify and ingest only new files in a shared location (see the Auto Loader sketch after these notes).
- Question 33: Use `CREATE STREAMING` when working with incremental data.
- Question 32: When data violates constraints, it's dropped from the main table and logged as invalid (see the expectations sketch after these notes).
- Question 31: Use `trigger(processingTime="5 seconds")` in Structured Streaming.
- Question 30: Spark Structured Streaming is how Auto Loader processes data incrementally.
- Question 29: Raw data is frequently less refined than data in Bronze tables.
- Question 28: Gold tables generally hold aggregates, while Silver tables typically do not.
- Question 27: Use Checkpointing and Write-Ahead logs.
- Question 26: Ensure that the pipeline's resources continue to be available for testing in production mode for continuous processing.
- Question 25: Use Delta Live Tables to automate data quality monitoring.
- Question 24: A Table is the most appropriate data entity, as it contains all data.
- Question 23: Managed tables delete both their data and metadata when the table is dropped.
- Question 22: Use `if day_of_week == 1 and review_period == "True":` to begin the conditionally executed code block.
- Question 21: Use the `UNION` command to prevent duplicate data when combining data from two Delta tables.
- Question 20: The `url` option for the `CREATE TABLE` statement in Databricks should use a JDBC connection string (e.g., `jdbc:sqlite:/path/to/file.db`).
- Question 19: When a `COPY INTO` statement does not ingest new rows, it could be due to missing files, the file not being in the right format, a table refresh being needed, or the file already having been copied.
- Question 17: A data engineer needs to use a SQL UDF to make a function call in Databricks.
- Question 16: A data engineer would use the MERGE command to incrementally update a Delta Table while avoiding duplicates.
- Question 15: Array functions provide the ability to work with different data types in a single operation.
- Question 14: Use the `COMMENT "Contains PII"` statement as part of the `CREATE TABLE` statement to denote the inclusion of Personally Identifiable Information (PII) in the new table.
- Question 13: Access the location of a database in Databricks by issuing a `DESCRIBE DATABASE ...` command, not a `DROP DATABASE ...` command.
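Code Sketches

The sketches below illustrate several commands referenced in the notes above. They assume a Databricks notebook where `spark` is defined; all paths, table names, and column names are hypothetical unless they come from the questions themselves.

Auto Loader ingesting only new files from a landing location into a Bronze table as a streaming raw-to-Bronze hop (Questions 122, 34, 30):

```python
bronze_query = (
    spark.readStream.format("cloudFiles")                           # Auto Loader source
         .option("cloudFiles.format", "json")                       # raw files are JSON
         .option("cloudFiles.schemaLocation", "/chk/orders/schema") # where the inferred schema is tracked
         .load("/data/raw/orders")                                  # hypothetical landing path
         .writeStream
         .option("checkpointLocation", "/chk/orders/bronze")        # checkpoint for incremental progress
         .outputMode("append")
         .table("orders_bronze")
)
```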
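A Silver-to-Gold hop that aggregates with `groupBy()`/`agg()` and writes with `outputMode('complete')` (Questions 159 and 35); a sketch, assuming the Silver table already exists:

```python
from pyspark.sql import functions as F

gold_query = (
    spark.readStream.table("orders_silver")                 # stream from the Silver table
         .groupBy("region")
         .agg(F.avg("price").alias("avgPrice"))             # Gold-level aggregate
         .writeStream
         .option("checkpointLocation", "/chk/orders/gold")  # hypothetical checkpoint path
         .outputMode("complete")                            # rewrite the aggregate result each micro-batch
         .table("orders_gold")
)
```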
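Configuring trigger intervals in Structured Streaming: a fixed 5-second micro-batch cadence versus processing everything currently available and then stopping (Questions 114, 31, 69):

```python
stream_df = spark.readStream.table("orders_bronze")   # hypothetical streaming source

# Run a micro-batch every 5 seconds
(stream_df.writeStream
          .trigger(processingTime="5 seconds")
          .option("checkpointLocation", "/chk/demo/processing_time")
          .table("orders_silver"))

# Process all currently available data in micro-batches, then stop
(stream_df.writeStream
          .trigger(availableNow=True)
          .option("checkpointLocation", "/chk/demo/available_now")
          .table("orders_silver_backfill"))
```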
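Delta Live Tables expectations corresponding to `ON VIOLATION DROP ROW` and `ON VIOLATION FAIL UPDATE`, shown here with the Python `dlt` API rather than the SQL `CONSTRAINT` clause; table names and conditions are hypothetical (Questions 138, 126, 32):

```python
import dlt

@dlt.table(comment="Events with valid timestamps only")
@dlt.expect_or_drop("valid_timestamp", "event_time IS NOT NULL")   # offending rows are dropped and logged
def events_silver():
    return dlt.read_stream("events_bronze")

@dlt.table(comment="Payments that must all be positive")
@dlt.expect_or_fail("positive_amount", "amount > 0")               # any offending row fails the update
def payments_silver():
    return dlt.read("payments_bronze")
```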
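Creating and applying a SQL UDF from PySpark via `spark.sql`, using the `city` column of the `stores` table from the question; the function name and logic are hypothetical (Questions 150, 17):

```python
# Define the SQL UDF once; it can then be called from any SQL query
spark.sql("""
CREATE OR REPLACE FUNCTION format_city(city STRING)
RETURNS STRING
RETURN CONCAT(UPPER(SUBSTRING(city, 1, 1)), LOWER(SUBSTRING(city, 2)))
""")

# Apply the UDF at scale
spark.sql("SELECT city, format_city(city) AS city_formatted FROM stores").show()
```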
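Counting rows that satisfy a condition with `count_if`, alongside a plain `count` restricted to null values (Questions 137, 61); the `members` table and `member_id` column are hypothetical:

```python
spark.sql("""
SELECT
  count_if(member_id IS NULL)                    AS null_member_ids,     -- rows where the condition holds
  count(CASE WHEN member_id IS NULL THEN 1 END)  AS null_member_ids_alt  -- same result using count
FROM members
""").show()
```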
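An upsert with `MERGE` that avoids writing duplicate records into a Delta table (Questions 104, 66, 16); table and key names are hypothetical:

```python
spark.sql("""
MERGE INTO customers AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
""")
```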
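`COPY INTO` with `FORMAT_OPTIONS`; because the command is idempotent, files that were already loaded are skipped, which is one reason a `COPY INTO` statement may appear to ingest no new rows (Questions 103, 19). The path and options are hypothetical:

```python
spark.sql("""
COPY INTO sales
FROM '/data/landing/sales'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
""")
```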
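Granting privileges on tables and databases (Questions 135, 90, 45, 44); the user names are hypothetical, while `team` and `customers` come from the notes:

```python
# Full permissions on a table for a specific user
spark.sql("GRANT ALL PRIVILEGES ON TABLE customers TO `new.engineer@company.com`")

# Read-only access to a table for another user
spark.sql("GRANT SELECT ON TABLE customers TO `analyst@company.com`")

# Usage and full privileges on a database for a team
spark.sql("GRANT USAGE ON DATABASE customers TO team")
spark.sql("GRANT ALL PRIVILEGES ON DATABASE customers TO team")
```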
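Reshaping a long table into a wide one with `PIVOT` (Questions 110, 57); the `sales` columns and quarter values are hypothetical:

```python
spark.sql("""
SELECT *
FROM (SELECT region, quarter, amount FROM sales)
PIVOT (
  SUM(amount)
  FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
)
""").show()
```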
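The conditional block from Questions 153 and 22. The two notes disagree on whether `review_period` holds a boolean or the string "True"; this sketch assumes the boolean form:

```python
day_of_week = 1
review_period = True

if day_of_week == 1 and review_period:
    # the final block runs only on day 1 and only during a review period
    print("Running end-of-review processing")
```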
Description
Prepare for the Databricks Certified Data Engineer Associate Exam with these concise notes. This quiz covers essential concepts such as Delta Live Tables, data lakehouses, and Python functionalities in data engineering. Test your knowledge and ensure you're ready for the certification challenges ahead.