Databricks Data Engineer Exam Notes

Questions and Answers

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables. Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

  • The pipeline will need to be written entirely in Python (correct)
  • The pipeline will need to use a batch source in place of a streaming source
  • The pipeline will need to be written entirely in SQL
  • The pipeline will need to stop using the medallion-based multi-hop architecture

A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?

  • SELECT * FROM sales
  • spark.delta.table
  • spark.sql (correct)
  • spark.table

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame. Which of the following describes how a data lakehouse could alleviate this issue?

  • Both teams would reorganize to report to the same department
  • Both teams would be able to collaborate on projects in real-time
  • Both teams would respond more quickly to ad-hoc requests
  • Both teams would use the same source of truth for their work (correct)

Which Structured Streaming query is performing a hop from a Silver table to a Gold table?

    Answer (D):

        (spark.table("sales")
            .withColumn("avgPrice", col("sales") / col("units"))
            .writeStream
            .option("checkpointLocation", checkpointPath)
            .outputMode("append")
            .table("newSales"))

    A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True. Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

    Answer (B): if day_of_week == 1 and review_period:
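
    A minimal Python sketch of this control flow (the variable values here are illustrative):

        day_of_week = 1        # e.g., Monday
        review_period = True

        # The final block runs only when both conditions hold
        if day_of_week == 1 and review_period:
            print("Running the end-of-review logic...")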

    A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF). Which of the following code blocks creates this SQL UDF?

    Answer (A):

        CREATE FUNCTION combine_nyc(city STRING)
        RETURNS STRING
        BEGIN CASE WHEN city = "brooklyn" THEN "new york" ELSE city END;
        END;
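
    For reference, a minimal working sketch of such a SQL UDF using the RETURN-expression syntax of Databricks SQL functions, run from Python via spark.sql (the function, table, and column names follow the question):

        spark.sql("""
            CREATE OR REPLACE FUNCTION combine_nyc(city STRING)
            RETURNS STRING
            RETURN CASE WHEN city = "brooklyn" THEN "new york" ELSE city END
        """)

        # Apply the UDF at scale to the city column of the stores table
        spark.sql("SELECT combine_nyc(city) AS city FROM stores").show()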

    A data engineer needs to access the view created by the sales team, using a shared cluster. The data engineer has been provided usage permissions on the catalog and schema. In order to access the view created by the sales team, what are the minimum permissions the data engineer would require in addition?

    Answer (D): Needs SELECT permission only on the VIEW.

    Which two conditions are applicable for governance in Databricks Unity Catalog? (Choose two.)

    Answer (D, E): You can have more than one metastore within a Databricks account console, but only one per region. If a metastore is not associated with a storage location, it is mandatory to associate each catalog with a managed location.

    Flashcards

    What is a data lakehouse?

    A data lakehouse uses an open data format, such as Parquet, and combines transactional ACID properties with data warehousing capabilities. This allows both real-time and batch analytics on the same data, eliminating the need for separate data lakes and data warehouses.

    What are the layers of a data pipeline?

    A data pipeline in Databricks typically follows a layered approach, starting with raw data and progressively refining it through various stages.

    • Raw: Untouched data, as it arrives from its source.
    • Bronze: Basic data processing and cleaning, ensuring consistency and format.
    • Silver: Further data enrichment and transformations, bringing the data closer to its final analytical form.
    • Gold: Final, analytical data, ready for reporting and analysis.

    What is Delta Live Tables?

    Delta Live Tables (DLT) in Databricks enables a more efficient and automated way to build data pipelines by providing a declarative approach using SQL or Python. With DLT, data engineers define the desired data transformations and constraints, and DLT manages the execution and monitoring of the pipeline. This simplifies the process of building data pipelines while ensuring data quality and consistency.

    When should you use a CREATE STREAMING LIVE TABLE?

    When you use a CREATE STREAMING LIVE TABLE instead of a CREATE LIVE TABLE, it indicates that you want the data to be processed incrementally as it arrives. This is useful for streaming sources, where new data is constantly being added.
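
    A minimal sketch of the same idea using the DLT Python API (this runs inside a Delta Live Tables pipeline, not a plain notebook; the table, path, and column names are hypothetical):

        import dlt
        from pyspark.sql.functions import col

        @dlt.table(comment="Incrementally ingested bronze events")
        def bronze_events():
            # Streaming read: new files are picked up incrementally as they arrive
            return (spark.readStream
                         .format("cloudFiles")
                         .option("cloudFiles.format", "json")
                         .load("/mnt/raw/events"))

        @dlt.table(comment="Cleaned silver events")
        def silver_events():
            # Reading the upstream live table as a stream keeps processing incremental
            return dlt.read_stream("bronze_events").where(col("event_id").isNotNull())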

    What's the difference between a managed table and an external table in Databricks?

    Managed tables in Databricks have both their metadata and their data files managed by Databricks, stored in a location controlled by the metastore. A simple DROP TABLE command deletes a managed table along with its data files and metadata. External tables store their data at a location you specify; dropping an external table removes only the metadata, and you remain responsible for the underlying data files.
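
    A short sketch of the difference, run via spark.sql (the table names and storage path are hypothetical):

        # Managed table: Databricks controls the storage location;
        # DROP TABLE removes both the metadata and the data files.
        spark.sql("CREATE TABLE IF NOT EXISTS customers_managed (id INT, name STRING)")

        # External table: the data lives at a location you own;
        # DROP TABLE removes only the metadata and leaves the files in place.
        spark.sql("""
            CREATE TABLE IF NOT EXISTS customers_external (id INT, name STRING)
            LOCATION '/mnt/external/customers'
        """)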

    What is a cluster pool in Databricks?

    A cluster pool in Databricks keeps a set of idle, ready-to-use instances of a chosen instance type, so that clusters created from the pool can acquire workers without waiting for new cloud instances to be provisioned. This enables faster startup times for jobs, as much of the provisioning work is already done when the job starts.

    What is spark.sql in PySpark?

    The spark.sql function in PySpark executes a SQL query passed as a string and returns the result as a DataFrame. This provides a unified way to mix SQL and Python in Databricks: the query can reference any table or view in the metastore, and the resulting DataFrame can be used for further PySpark operations or tests.
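
    A minimal sketch of running the analyst's query from PySpark (the sales table is the one referenced in the question; the filter and checks are illustrative):

        # spark.sql returns a DataFrame, so the result can be tested with Python code
        df = spark.sql("SELECT * FROM sales WHERE units > 0")

        assert "units" in df.columns      # schema check
        assert df.count() >= 0            # trivial sanity check on the result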

    What is Auto Loader in Databricks?

    Auto Loader is a feature in Databricks that acts as a smart data ingest tool for streaming data sources. It automatically detects new files in cloud storage and ingests them into a Delta table. This allows for near real-time data processing and eliminates the need for manual file monitoring and ingestion.
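
    A minimal Auto Loader sketch, assuming a hypothetical JSON landing folder, checkpoint paths, and target table:

        (spark.readStream
              .format("cloudFiles")                    # Auto Loader source
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
              .load("/mnt/landing/orders")             # new files here are detected automatically
              .writeStream
              .option("checkpointLocation", "/mnt/checkpoints/orders")
              .trigger(availableNow=True)              # process everything available, then stop
              .toTable("bronze_orders"))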

    How do you configure a Structured Streaming job to run every 5 seconds?

    The trigger(processingTime="5 seconds") option in Structured Streaming lets you configure the query to process data in micro-batches, with a predetermined interval of 5 seconds. This means the query will run every 5 seconds to process the data that has arrived since the last run.
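
    A minimal sketch, assuming hypothetical source and target tables and a checkpoint path:

        (spark.readStream
              .table("bronze_sales")
              .writeStream
              .option("checkpointLocation", "/mnt/checkpoints/silver_sales")
              .trigger(processingTime="5 seconds")     # run a micro-batch every 5 seconds
              .outputMode("append")
              .toTable("silver_sales"))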

    What is the COPY INTO command in Databricks?

    The COPY INTO command in Databricks is a convenient way to copy data from a file or directory in cloud storage into a Delta table. It supports various file formats, including Parquet, CSV, JSON, and Avro, and offers various options to control the data copying process, such as filtering, transformations, and error handling.
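
    A minimal COPY INTO sketch run from Python (the target table, path, and options are hypothetical; the target Delta table must already exist):

        spark.sql("""
            COPY INTO bronze_orders
            FROM '/mnt/landing/orders'
            FILEFORMAT = PARQUET
            COPY_OPTIONS ('mergeSchema' = 'true')
        """)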

    How does the MERGE command handle duplicate records in Databricks?

    The MERGE INTO command in Databricks matches incoming records against the rows already in a Delta table using a join condition that you specify. Matched rows can be updated with the new values and unmatched rows inserted (via WHEN MATCHED and WHEN NOT MATCHED clauses), so duplicate records are not inserted and the data in the table remains accurate and consistent.
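
    A minimal upsert sketch with MERGE INTO (the table names and the customer_id key are hypothetical):

        spark.sql("""
            MERGE INTO customers AS target
            USING customer_updates AS source
            ON target.customer_id = source.customer_id
            WHEN MATCHED THEN UPDATE SET *        -- existing records are updated, not duplicated
            WHEN NOT MATCHED THEN INSERT *        -- genuinely new records are inserted
        """)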

    What are expectations in Delta Live Tables?

    An expectation in Delta Live Tables is like a rule that helps ensure data quality and integrity throughout your data pipeline. When you define an expectation, you are essentially saying that the data should meet certain criteria. For example, a timestamp in your data should be within a specific range. If the data violates an expectation, Delta Live Tables can handle the violation in different ways, such as dropping the record or failing the pipeline, depending on your configuration.

    What are constraints in Delta Live Tables?

    The CONSTRAINT clause in Delta Live Tables allows you to define and enforce specific rules on your data, ensuring data quality and consistency. It works in conjunction with expectations to define the criteria for valid data. You can define a constraint to drop rows that violate certain criteria.

    What does the ON VIOLATION DROP ROW option do in Delta Live Tables?

    The ON VIOLATION DROP ROW option in Delta Live Tables specifies how to handle data that violates constraints. If a constraint violation occurs, the entire row of data will be discarded and dropped from the table. This strict approach helps guarantee data quality by ensuring that invalid data is not included in the table.

    What does the ON VIOLATION FAIL UPDATE option do in Delta Live Tables?

    The ON VIOLATION FAIL UPDATE option in Delta Live Tables dictates how a job should handle data that violates constraints. When this option is set, the entire job will immediately fail, preventing the entry of invalid or erroneous data into the tables. This helps ensure data accuracy and consistency by terminating the processing when constraints are not met.
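
    A minimal DLT Python sketch showing the three expectation behaviors described in the preceding cards (the dataset, column, and constraint names are hypothetical):

        import dlt

        @dlt.table
        @dlt.expect("recent_timestamp", "event_ts > '2020-01-01'")   # violations are logged, rows kept
        @dlt.expect_or_drop("valid_id", "id IS NOT NULL")            # ON VIOLATION DROP ROW
        @dlt.expect_or_fail("positive_amount", "amount > 0")         # ON VIOLATION FAIL UPDATE
        def silver_events():
            return dlt.read_stream("bronze_events")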

    What are SQL UDFs in Databricks?

    SQL UDFs (User Defined Functions) in Databricks provide a way to extend the SQL language by allowing you to customize data transformations. These functions can be written in SQL and called within SQL queries, simplifying the implementation of complex logic.

    What is the PIVOT transformation in SQL?

    A PIVOT transformation in SQL is used to reshape data by moving one or more columns from rows into columns. This is useful for converting a table from a long format where data is stacked vertically in rows to a wide format, where data is spread horizontally across columns.
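
    A minimal PIVOT sketch run from Python (the sales table and its region, quarter, and revenue columns are hypothetical):

        # One row per region, with quarterly revenue spread across columns
        spark.sql("""
            SELECT *
            FROM (SELECT region, quarter, revenue FROM sales)
            PIVOT (SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
        """).show()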

    What is the count_if function in SQL?

    The count_if function in SQL enables you to count the rows in a table based on a specific filtering condition. You can provide a boolean expression as an argument, which is evaluated for each row. If the condition is true, the counter increments. Otherwise, the counter remains unchanged.
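
    A minimal count_if sketch (the sales table and units column are hypothetical):

        spark.sql("""
            SELECT count_if(units > 10)    AS large_orders,    -- rows matching a condition
                   count_if(units IS NULL) AS missing_units    -- rows with NULL in a column
            FROM sales
        """).show()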

    What's the count function in SQL?

    The count function in SQL returns the total number of rows in a table when used as count(*); when given a column, count(column) returns the number of non-null values in that column.

    What is Databricks Repos?

    Databricks Repos is a feature that allows you to version control your entire data engineering project in Databricks through Git integration. This enables a collaborative development environment, allowing multiple engineers to work on a project together and track changes, reducing the chance of conflicts.

    What is Unity Catalog in Databricks?

    Unity Catalog in Databricks provides a central repository for managing data assets, enabling secure and governed data sharing across different teams and departments. It enables fine-grained access controls and data governance policies, ensuring data security and compliance.

    What are roles in Databricks Unity Catalog?

    Roles in Databricks Unity Catalog determine the level of access a user has to a specific resource, such as a database, schema, or table. For example, you could assign a role that grants read-only access to specific tables within a database.

    What is the USING DELTA clause for in Databricks?

    When a Delta table is being created in Databricks, the USING DELTA clause can be included in the CREATE TABLE statement to specify that the table should be created as a Delta table (on recent Databricks Runtime versions, Delta is already the default table format). This guarantees transactional ACID properties and the other features provided by Delta Lake.
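
    A minimal sketch (the table name and columns are hypothetical):

        spark.sql("""
            CREATE TABLE IF NOT EXISTS transactions (
                txn_id BIGINT,
                amount DOUBLE
            )
            USING DELTA
        """)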

    What is a Databricks Job?

    A Databricks Job is a way to automate your data engineering tasks. You can create jobs in Databricks to run regularly, such as processing data or building models. Jobs can consist of multiple tasks.

    How can you schedule a Databricks Job?

    Databricks Jobs can be set to run at specific intervals, such as daily, hourly, or even more frequently. This allows for consistent and automated data processing. You can determine the trigger schedule, allowing for continuous data processing or periodic data ingestion based on your need.

    What is a task in a Databricks Job?

    A task in a Databricks Job represents an individual step in your automated workflow. You can add multiple tasks to a Job, making it possible to run a series of steps seamlessly. Tasks can run notebooks, SQL queries, Python scripts, or other workloads.

    What is a task dependency in a Databricks Job?

    When you set up a task dependency in a Databricks Job, it creates a sequence of tasks that need to be run in a specific order. This ensures that the tasks are executed in the correct order based on their dependencies, thereby maintaining the flow of your automated data processes.

    What is an alert in Databricks?

    An alert in Databricks provides a means of monitoring your jobs and data pipelines. When an alert is triggered, it can notify you through various channels, such as email or webhooks, so you can take appropriate action quickly.

    What languages are typically used for data engineering tasks?

    When developing data pipelines, Data Engineers typically use Python or SQL. Databricks allows the use of both languages within data pipelines, providing flexibility in implementing different parts of the pipeline based on each language's strengths.

    What is spark.delta.table function in PySpark?

    spark.delta.table is not an actual PySpark function; it appears in the exam question only as a distractor. To work with Delta tables programmatically, use spark.table (or spark.read.table) to read them as DataFrames, or the DeltaTable class from the Delta Lake Python API for operations such as reading, history inspection, and merges.
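
    A sketch of the Delta Lake Python API that does exist, assuming the delta-spark package available on Databricks (the table name is hypothetical):

        from delta.tables import DeltaTable

        dt = DeltaTable.forName(spark, "sales")   # handle to an existing Delta table
        dt.toDF().show()                          # read it as a DataFrame
        dt.history().show()                       # inspect its transaction history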

    What is the spark.table function in PySpark?

    The spark.table function in PySpark is a general purpose function that allows you to access any table in Databricks, including both managed and external tables. This versatility simplifies table access and reduces the need for specific functions for different table types.

    What is the dbutils.sql function in Databricks?

    dbutils does not provide a sql function; this option is another distractor in the exam question. To execute SQL programmatically from Python code in Databricks, pass the SQL string to spark.sql, which runs the query and returns a DataFrame.

    What is a Bronze table in a data pipeline?

    A Bronze table in a data pipeline typically represents the initial landing stage for raw data. The data in a Bronze table is usually unstructured or semi-structured, and its primary focus is to ingest the data as it arrives from the source.

    What is a Silver table in a data pipeline?

    A Silver table in a data pipeline often serves as an intermediary stage for data transformation and enrichment. It contains structured data and incorporates basic data quality checks, preparing the data for further analysis.

    What is a Gold table in a data pipeline?

    A Gold table in a data pipeline is the end destination for analytical data. It contains highly refined, consistent, and well-structured data, ready for analysis, reporting, and business intelligence tasks.

    What is a dashboard in Databricks?

    A dashboard in Databricks allows you to visualize and monitor your data in a dynamic, interactive way. You can create dashboards to track your data pipelines, performance metrics, and business insights.

    What is a SQL endpoint in Databricks?

    A SQL endpoint in Databricks creates a secure and efficient way for your data team to access and query data from the SQL warehouse, through an interface similar to a standard SQL database.

    What are alerts in Databricks?

    Databricks allows you to implement various types of alerts to monitor data quality and pipeline health. These alerts can be triggered when data fails to meet expectations or when a job or query experiences errors.

    What does the SELECT * FROM sales query do in SQL?

    The SELECT * FROM sales query is a simple way to select all the rows and columns from a table called 'sales' in SQL. It can be used for extracting data from a table or as a base for further operations like sorting, filtering, or aggregation.

    What is a jobs cluster?

    An all-purpose cluster suits interactive development and ad-hoc work. A jobs cluster, by contrast, is created for a scheduled job run and terminated when the run finishes, making it the more cost-efficient choice for automated, high-throughput production workloads.

    What is a serverless SQL warehouse?

    Serverless SQL warehouse provides compute power on-demand, so you only pay for the resources used, eliminating the need for idle resources.

    What is a Databricks SQL dashboard?

    When a Databricks SQL dashboard needs to be refreshed, the associated SQL endpoint needs to be running to process the queries.

    When does Auto Loader come in handy?

    When using Auto Loader, Databricks handles the automatic detection and ingestion of new files from cloud storage locations.

    What are checkpoints in Structured Streaming?

    Checkpoints are used to record the progress of data processing in Structured Streaming jobs. They track the offset range of data processed during each trigger, enabling the job to recover gracefully from failures or restarts.

    What are write-ahead logs in Delta Lake?

    Write-ahead logs (WAL) record the offset range of incoming data before a Structured Streaming micro-batch is processed, and Delta Lake's transaction log likewise records every change made to a Delta table. Together they provide a reliable history of data modifications, enabling graceful recovery after failures and reverting to past states if necessary.

    What are idempotent sinks?

    Idempotent sinks ensure that writing the same data to the sink (the final output) more than once, for example after a failure and retry, produces the same result as writing it once, so no duplicates are introduced.

    How can you assign full permissions on a table to a specific user?

    The GRANT ALL PRIVILEGES command in Databricks allows you to grant all the necessary permissions to a specific user or group on a table or database.

    How can you give a user permission to view a table?

    The GRANT USAGE command in Databricks grants the basic permission needed on a database or catalog so that a user can reference the objects inside it. It does not by itself give access to the contents of a table or view; that additionally requires a privilege such as SELECT on the object.
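
    A minimal sketch of these grants using the legacy GRANT syntax that also appears in the study notes below (the principal and object names are hypothetical):

        # Reference-level access on the database: the group can see and reference its objects
        spark.sql("GRANT USAGE ON DATABASE customers TO `analysts`")

        # Reading a specific table (or view) additionally requires SELECT on that object
        spark.sql("GRANT SELECT ON TABLE customers.orders TO `analysts`")

        # Full permissions on the whole database
        spark.sql("GRANT ALL PRIVILEGES ON DATABASE customers TO `analysts`")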

    How can you avoid unnecessary compute costs from SQL endpoints?

    Auto Stop feature allows you to automatically stop the SQL endpoint when it remains idle for a configured duration, saving compute costs.

    How do you improve the startup time of a Spark cluster in Databricks?

    With a Databricks cluster pool, a set of ready-to-use instances is kept warm so that clusters for different jobs can draw from it, enabling much faster startup times than provisioning new cloud instances for every run.

    Study Notes

    Databricks Certified Data Engineer Associate Exam Notes

    • Question 163: Migrating a data pipeline to Delta Live Tables requires rewriting the pipeline in Python.
    • Question 161: To test data returned by a SQL query from Python, run the query with spark.sql and work with the resulting DataFrame in PySpark.
    • Question 160: Using a data lakehouse alleviates siloed data analysis and engineering teams by establishing a single source of truth.
    • Question 159: The Structured Streaming query that performs a hop from a Silver table to a Gold table uses the withColumn function to calculate avgPrice and writeStream with append mode.
    • Question 153: To conditionally execute code in Python, use if statements, checking for equality in day_of_week and boolean values in review_period.
    • Question 150: A custom SQL User Defined Function (UDF) is best used to apply complex logic to string columns.
    • Question 146: To access a view on a shared cluster (with USE permissions on the catalog and schema already granted), the data engineer additionally needs only SELECT permission on the view.
    • Question 145: Databricks Unity Catalog governance: an account can have more than one metastore but only one per region, and if a metastore has no managed storage location, each catalog must be associated with a managed location.
    • Question 144: Constraints in Delta Live Tables can prevent the ingestion of data rows that do not comply with a set criteria.
    • Question 142: Migrating a data pipeline to Delta Live Tables might need different notebook sources and a batch source rather than streaming.
    • Question 141: Structured Streaming writes to a new table.
    • Question 140: Delta Lake improves data architecture by unifying siloed data architectures via a standardized format.
    • Question 139: The best approach to query a specific prior version of a Delta table in healthcare is to reference the specific version from the Delta transaction log.
    • Question 138: ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE statements in constraints can be used to implement data validation.
    • Question 137: The count_if function can return the number of rows where a given condition holds true. The count where x is null can return the number of a given column’s null values.
    • Question 136: A Job cluster is better suited for scheduled Python notebooks.
    • Question 135: Full privileges in Databricks involve granting ALL PRIVILEGES on the target table.
    • Question 133: To prevent unnecessary compute costs on a Databricks SQL query, set a limit on DBUs consumed by the SQL Endpoint or specify a termination date.
    • Question 129: Use the Databricks Jobs UI to find the reason a notebook in a Data Job is running slowly. Use ‘Runs’ and ‘Task’ tabs.
    • Question 128: To have a new task run before the original task in a job, add the new task as a dependency of the original task so that the original task starts only after the new task completes.
    • Question 127: To improve the startup time of Databricks clusters, use Databricks SQL endpoints or jobs clusters using a cluster pool.
    • Question 126: The expected behavior in Delta Live Tables when a batch of data violates timestamp constraints is to drop those rows from the main table and log them as invalid entries.
    • Question 125: A streaming hop involves reading from a raw source and writing to a Bronze table.
    • Question 124: Silver tables typically contain less data than Bronze tables.
    • Question 123: 
    • Question 122: Auto Loader is usable for streaming workloads.
    • Question 121: For a continuous pipeline in development mode with valid datasets, all datasets will be updated at intervals until the pipeline is shut down, and the compute resources will persist to allow for additional testing.
    • Question 120: Use CREATE STREAMING LIVE TABLE when you need to process data incrementally in a continuous pipeline, rather than CREATE LIVE TABLE, which fully recomputes its results on each pipeline update.
    • Question 119: Gold tables contain aggregations.
    • Question 118: Structured Streaming uses checkpointing and write-ahead logs to track the offset range of the data being processed in each trigger.
    • Question 117: Review the DLT pipeline page and error button on individual tables to see where the data is being dropped.
    • Question 116: The most applicable tool for monitoring data quality is Delta Live Tables.
    • Question 115: 
    • Question 114: Use trigger(processingTime="5 seconds") for frequent micro-batch processing in a streaming pipeline.
    • Question 113: The FILTER higher-order function, combined with SQL, allows the desired manipulation at scale.
    • Question 112: Use LEFT JOIN to combine data from two tables in a query.
    • Question 110: PIVOT is used for converting data from a long format to a wide format in SQL.
    • Question 109: Parquet files enable partitioning, which is beneficial for performance and data management.
    • Question 108: Use the USING CSV clause to create external tables from CSV files in Databricks.
    • Question 107: When DROP TABLE IF EXISTS is executed, the files are deleted when the table is a managed table.
    • Question 106: Databricks databases are stored in the dbfs:/user/hive/warehouse location.
    • Question 105: Use spark.table("sales") in PySpark to access a Delta table.
    • Question 104: Use MERGE or UPSERT to avoid writing duplicate records to a Delta table.
    • Question 103: Using FORMAT_OPTIONS along with the COPY INTO command is required.
    • Question 102: A Table is used to present data from multiple tables in the form of a single, relational object in Databricks.
    • Question 100: To insert new rows into a table, use INSERT INTO ... VALUES command.
    • Question 99: Use DESCRIBE DATABASE (or DESCRIBE SCHEMA) to determine the location of a database in Databricks.
    • Question 98:  Use CREATE OR REPLACE TABLE.
    • Question 97: To sync changes to a Databricks Repo from a Git repository, perform a Pull.
    • Question 96: Use Databricks Catalog Explorer to access and view table permissions.
    • Question 95: Open source technologies are beneficial in cloud-based platforms, particularly regarding vendor lock-in.
    • Question 94: Databricks Repos are advantageous because they include versioning for notebooks.
    • Question 93:  The Databricks web application is located in the control plane.
    • Question 92: Use cluster pools when you need to use multiple, smaller clusters but want to group them for a specific function.
    • Question 91: Data lakehouses frequently use ACID transactions to improve data quality consistency.
    • Question 90: Grant the ALL PRIVILEGES access on the table.
    • Question 89: Use the Data Explorer's “Permissions” tab to examine table ownership.
    • Question 88:  
    • Question 87: Use the Auto Stop feature to reduce the total running time of the SQL endpoint.
    • Question 86: Trigger another task to complete before it begins by adding the dependency in the Databricks Jobs UI.
    • Question 85:  
    • Question 84:  
    • Question 83: Utilize the datetime module in Python to dynamically define scheduling for Databricks jobs.
    • Question 82: Setting the SQL endpoint to serverless reduces query duration.
    • Question 81: Use Databricks Notebooks versioning when an update is needed in a new task.
    • Question 80: To limit compute costs on a Databricks SQL query, use the query's refresh schedule to end after a certain number of refreshes, or set a limit on DBUs.
    • Question 79: A query that reads from a raw data source, processes the data in each trigger, and appends it to a Bronze table is performing the raw-to-Bronze hop.
    • Question 78: Use Databricks Alerts with a new webhook to notify the team.
    • Question 77: If migrating to Delta Live Tables from a JSON streaming input, avoid using a batch source and use Python.
    • Question 76: Streaming workloads are compatible with Auto Loader.
    • Question 75: Use STREAM to process a live table in real time.
    • Question 74: At least one notebook library must be specified during Delta Live Tables pipeline creation.
    • Question 73: Use a Gold table to collect summary statistics. 
    • Question 72: Use CREATE STREAMING LIVE TABLE when the pipeline needs to process data incrementally from a streaming source.
    • Question 71: In development mode, datasets will be repeatedly updated without immediate termination of the pipeline or compute resources.
    • Question 70: JSON files are text-based, hence it's typical for Auto Loader to interpret JSON columns as strings.
    • Question 69: Use trigger(availableNow=True) in the micro-batch trigger line to ensure a query processes all available data in micro batches. 
    • Question 68: Use LEFT JOIN to combine the data from two tables in a query that joins on customer_id.
    • Question 67:  
    • Question 66: Use MERGE to update or insert data into a table.
    • Question 65: Dropping a managed table deletes its data files along with its metadata.
    • Question 64: A Databricks database is located within the dbfs:/user/hive/warehouse directory.
    • Question 63: spark.sql is the appropriate call to execute the SQL query using the variable, table_name
    • Question 62: Use a FILTER higher-order function in SQL to filter for employees who have worked more than 5 years.
    • Question 61: Finding the null member_ids is accomplished by using count_if in a SQL query with an IS NULL condition.
    • Question 60: Use spark.sql to run the query in PySpark.
    • Question 59: Create a View object to represent the joined data from the two tables without storing a copy of the data.
    • Question 58: Parquet files (rather than CSV) result in optimized data structures for the CREATE TABLE operations.
    • Question 57: PIVOT in SQL is the keyword used to reshape a table from a long format to a wide format.
    • Question 56: To include SQL in a Python notebook, add %sql before any SQL statement.
    • Question 55: Use CREATE TABLE IF NOT EXISTS to generate an empty Delta table.
    • Question 54: Use Data Lakehouse to unify siloed data.
    • Question 53:  
    • Question 52: Parquet is the principal file format for Delta Table data.
    • Question 51: 
    • Question 50: Use INSERT INTO my_table VALUES to append new data to the table.
    • Question 49: Single-node clusters are ideal when working with small datasets.
    • Question 48: Access table permissions in the Databricks Data Explorer.
    • Question 47:  Leveraging open-source technology in a Databricks Lakehouse can help prevent vendor lock-in.
    • Question 46: Use Databricks Repos, rather than Notebooks versioning, for version control.
    • Question 45: Grant usage access on the database to the team using GRANT USAGE ON DATABASE customers TO team.
    • Question 44: Use GRANT ALL PRIVILEGES ON DATABASE customers TO team to grant full permissions on the database to a new team in Databricks.
    • Question 43: Using cluster pools improves cluster startup time for multiple tasks that run nightly.
    • Question 42: Review the Databricks Jobs’ “Runs” tab to identify a slow-running notebook in a job.
    • Question 41: The Alert, and specifically Webhook, configuration is used for notifications in Databricks.
    • Question 40: Leverage Databricks' Auto Stop feature to minimize the total running time of the SQL endpoint when it is no longer needed for immediate use.
    • Question 39: To improve the latency of Databricks SQL queries, increase the cluster size of the SQL endpoint.
    • Question 38: Limit the consumption of DBUs to control compute costs when an SQL query is part of a job with a dynamically changing schedule.
    • Question 37: Create a new task, with the original task as a dependency, within the same Databricks Job to accomplish a sequential task.
    • Question 36:  View Databricks data quality statistics to pinpoint data issues within the DLT pipeline.
    • Question 35: The groupBy(), agg(), writeStream, and outputMode('complete') methods in PySpark Structured Streaming are used to perform the hop from the Silver to the Gold table.
    • Question 34: Use Auto Loader to identify and ingest only new files in a shared location.
    • Question 33: Use CREATE STREAMING when working with incremental data.
    • Question 32: When data violates constraints, it's dropped from the main table and logged as invalid. 
    • Question 31: Use trigger(processingTime="5 seconds") in Structured Streaming.
    • Question 30:  Spark Structured Streaming is how Auto Loader processes data incrementally. 
    • Question 29:  Raw data is frequently less refined than data in Bronze tables.
    • Question 28: Gold tables generally hold aggregates, while Silver tables typically do not.
    • Question 27: Use Checkpointing and Write-Ahead logs.
    • Question 26: Ensure that the pipeline's resources continue to be available for testing when running in production mode with continuous processing.
    • Question 25: Use Delta Live Tables to automate data quality monitoring.
    • Question 24: A Table is the most appropriate data entity, as it contains all data.
    • Question 23: Managed tables delete data and metadata if the table is not needed.
    • Question 22: Use if day_of_week == 1 and review_period == "True":  to create the code block's conditional run block.
    • Question 21: Use the UNION command to prevent duplicate data, when combining data from two Delta tables.
    • Question 20: The url option for the CREATE TABLE statement in Databricks should use a JDBC connection string (e.g., jdbc:sqlite:/path/to/file.db).
    • Question 19: When a COPY INTO statement does not ingest new rows, it could be due to missing files, the file not being in the right format, a table refresh being needed or the file already having been copied.
    • Question 18: Ensure necessary Python control flow exists in the script in place of manual access to the associated SQL endpoint.
    • Question 17: A data engineer needs to use a SQL UDF to make a function call in Databricks.
    • Question 16: A data engineer would use the MERGE command to incrementally update a Delta Table while avoiding duplicates.
    • Question 15: Array functions provide the ability to work with array (collection) columns, for example exploding, filtering, or transforming their elements in a single operation.
    • Question 14: Use the COMMENT "Contains PII" statement as part of the CREATE TABLE statement to denote the inclusion of Personally Identifiable information (PII) in the new table. 
    • Question 13: Access the location of a database in Databricks by issuing a DESCRIBE DATABASE ... command, not a DROP DATABASE... command.

    Description

    Prepare for the Databricks Certified Data Engineer Associate Exam with these concise notes. This quiz covers essential concepts such as Delta Live Tables, data lakehouses, and Python functionalities in data engineering. Test your knowledge and ensure you're ready for the certification challenges ahead.
