M4 The Extract, Load, and Transform Data Pipeline Pattern

Summary

This document provides an overview of the Extract, Load, and Transform (ELT) data pipeline pattern, focusing on data transformation with BigQuery and Dataform on Google Cloud. It covers the baseline ELT architecture, BigQuery's SQL scripting and scheduling capabilities, and the functionality and use cases of Dataform.

Full Transcript


In this module, you learn to:
- Explain the baseline extract, load, and transform (ELT) architecture diagram.
- Understand a common ELT pipeline on Google Cloud.
- Describe BigQuery's SQL scripting and scheduling capabilities.
- Explain the functionality and use cases for Dataform.

In this module, first, you review the baseline extract, load, and transform architecture diagram. Second, you look at a common ELT pipeline on Google Cloud. Then, you review BigQuery's SQL scripting and scheduling capabilities. Finally, you look at the functionality and use cases for Dataform.

In this section, you explore: Extract, Load, and Transform (ELT) Architecture; SQL Scripting and Scheduling with BigQuery; and Dataform.

Once data is loaded into BigQuery, there are multiple ways to transform it:
- Procedural language
- Scheduled queries (scheduling capabilities)
- Scripting and programming languages (Jupyter Notebook)
- Dataform (dependency management)

Extract, load, and transform centers around data being loaded into BigQuery first. Once data is loaded, there are multiple ways to transform it. Procedural languages like SQL can be used to transform data. Scheduled queries can be used to transform data on a regular basis. Scripting and programming languages like Python can be used to transform data. And a tool like Dataform simplifies transformation beyond basic programming options.

Use an ELT pipeline to apply transformations after data is staged in BigQuery: structured data lands in Cloud Storage, is loaded into BigQuery staging tables, is transformed with SQL scripts or a Dataform SQL workflow, and is written to BigQuery production tables.

In an extract, load, and transform pattern pipeline, structured data is first loaded into BigQuery staging tables. Transformations are then applied within BigQuery itself using SQL scripts or tools like Dataform with SQL workflows. The transformed data is finally moved to production tables in BigQuery, ready for use. This approach leverages BigQuery's processing power for efficient data transformation.

In this section, you explore: Extract, Load, and Transform (ELT) Architecture; SQL Scripting and Scheduling with BigQuery; and Dataform.

BigQuery supports procedural language queries to perform multiple statements in a sequence. Procedural language in BigQuery:
- Runs multiple statements in a sequence with shared state.
- Automates management tasks, such as creating or dropping tables.
- Implements complex logic using programming constructs like IF and WHILE.
- Can declare user-created variables or reference BigQuery's system variables.
- Can use transactions (COMMIT, ROLLBACK).

    -- Create variables.
    DECLARE book_name STRING DEFAULT 'Ulysses';
    DECLARE book_year INT64 DEFAULT 1922;

    -- Create a temporary table called Books.
    EXECUTE IMMEDIATE
      "CREATE TEMP TABLE Books (title STRING, publish_date INT64)";

    -- Add a row for Ulysses, using the declared variables and the ? placeholder.
    EXECUTE IMMEDIATE
      "INSERT INTO Books (title, publish_date) VALUES(?, ?)"
      USING book_name, book_year;

With support for procedural language, BigQuery allows the execution of multiple SQL statements in sequence with shared state. This enables the automation of tasks like table creation, implementation of complex logic using constructs like IF and WHILE, and the use of transactions for data integrity. You can also declare variables and reference system variables within your procedural code.
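The slide example shows only DECLARE and EXECUTE IMMEDIATE. The IF, WHILE, and transaction constructs mentioned above follow the same multi-statement scripting model; below is a minimal sketch, assuming illustrative staging_sales, daily_totals, and archived_sales tables that are not part of the course materials.

    -- Loop and conditional example: backfill a daily summary for the last 7 days.
    DECLARE day_offset INT64 DEFAULT 0;
    DECLARE target_date DATE;

    WHILE day_offset < 7 DO
      SET target_date = DATE_SUB(CURRENT_DATE(), INTERVAL day_offset DAY);

      -- Only insert a summary row if the day has not been processed yet.
      IF NOT EXISTS (
        SELECT 1 FROM dataset_name.daily_totals WHERE report_date = target_date
      ) THEN
        INSERT INTO dataset_name.daily_totals (report_date, total_revenue)
        SELECT target_date, SUM(revenue)
        FROM dataset_name.staging_sales
        WHERE sale_date = target_date;
      END IF;

      SET day_offset = day_offset + 1;
    END WHILE;

    -- Transaction example: move rows between two tables atomically.
    BEGIN TRANSACTION;
    INSERT INTO dataset_name.archived_sales
    SELECT * FROM dataset_name.staging_sales;
    DELETE FROM dataset_name.staging_sales WHERE TRUE;
    COMMIT TRANSACTION;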
User-defined functions (UDFs) transform data by using SQL or JavaScript. Usage notes:
- Can be persistent (CREATE FUNCTION) or temporary (CREATE TEMPORARY FUNCTION).
- Use SQL where possible. Use JavaScript if the same cannot be accomplished with SQL.
- JavaScript UDFs can use additional libraries.
- The bigquery-utils project and the bigquery-public-data.persistent_udfs public dataset contain community-contributed UDFs.

    # create a permanent user-defined function
    CREATE FUNCTION dataset_name.AddFourAndDivide(x INT64, y INT64)
    RETURNS FLOAT64 AS (
      (x + 4) / y
    );

    # use the function in your SQL script
    SELECT
      val,
      dataset_name.AddFourAndDivide(val, 2)
    FROM ...;

BigQuery supports user-defined functions, or UDFs, for custom data transformations using SQL or JavaScript. These UDFs can be persistent or temporary, and it is recommended to use SQL UDFs when possible. JavaScript UDFs offer the flexibility to use external libraries, and community-contributed UDFs are available for reuse.

Stored procedures execute a collection of SQL statements. Usage notes:
- Can be called from queries or other stored procedures.
- Accept input values and return values as output.
- Can access or modify data across multiple datasets.
- Can contain multi-statement queries and support procedural language statements.

    # create a stored procedure
    CREATE OR REPLACE PROCEDURE dataset_name.create_customer()
    BEGIN
      DECLARE id STRING;
      SET id = GENERATE_UUID();
      INSERT INTO dataset_name.customers (customer_id) VALUES(id);
      SELECT FORMAT("Created customer %s", id);
    END;

    # call the procedure in your SQL script
    CALL dataset_name.create_customer();

Stored procedures are pre-compiled SQL statement collections, streamlining database operations by encapsulating complex logic for enhanced performance and maintainability. Benefits include reusability, parameterization for flexible input, and transaction handling. Stored procedures are called from applications or within SQL scripts, promoting modular design.

Run stored procedures for Apache Spark on BigQuery. Spark procedures on BigQuery:
- Can be defined using the BigQuery PySpark Editor in the UI.
- Can be defined using the CREATE PROCEDURE statement:
  - Supports Python, Java, or Scala.
  - Can point to code in a file on Cloud Storage.
  - Code can be defined inline in the BigQuery SQL Editor.

    CREATE OR REPLACE PROCEDURE dataset_name.word_count()
    WITH CONNECTION `.`
    OPTIONS(engine="SPARK", runtime_version="1.1")
    LANGUAGE PYTHON AS R"""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-bq-wordcount").getOrCreate()

    # Load data from BigQuery.
    words = spark.read.format("bigquery") \
      .option("table", "bigquery-public-data:samples.shakespeare") \
      .load()
    words.createOrReplaceTempView("words")

    # Perform word count.
    word_count = words.select('word', 'word_count') \
      .groupBy('word').sum('word_count') \
      .withColumnRenamed("sum(word_count)", "sum_word_count")
    # [...]
    """

BigQuery supports running stored procedures for Apache Spark. Apache Spark stored procedures on BigQuery can be defined in the BigQuery PySpark editor or using the CREATE PROCEDURE statement with Python, Java, or Scala code. The code can be stored in a Cloud Storage file or defined inline within the BigQuery SQL editor.
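Both SQL and Spark stored procedures are invoked with CALL. The course example takes no arguments, but stored procedures can also declare IN and OUT parameters; the variant below is a hypothetical sketch, not part of the course materials.

    # A hypothetical variant of create_customer that returns the generated ID
    # through an OUT parameter.
    CREATE OR REPLACE PROCEDURE dataset_name.create_customer_with_id(OUT new_id STRING)
    BEGIN
      SET new_id = GENERATE_UUID();
      INSERT INTO dataset_name.customers (customer_id) VALUES(new_id);
    END;

    # In a separate script, pass a variable to receive the OUT value.
    DECLARE created_id STRING;
    CALL dataset_name.create_customer_with_id(created_id);
    SELECT created_id;

    # The Spark procedure from the slide is invoked the same way.
    CALL dataset_name.word_count();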
Remote functions provide transformations with more complex programming logic. They provide a direct integration to code hosted on Cloud Run functions, and you call the function the same way as a UDF.

Cloud Run function in Python:

    import functions_framework
    import json
    import urllib.request

    # Return the length of an object on Cloud Storage.
    @functions_framework.http
    def object_length(request):
        calls = request.get_json()['calls']
        replies = []
        for call in calls:
            # Each call is a list of arguments; the first is the signed URL.
            object_content = urllib.request.urlopen(call[0]).read()
            replies.append(len(object_content))
        return json.dumps({'replies': replies})

Usage in BigQuery:

    # create the remote function in BigQuery
    CREATE FUNCTION dataset_name.object_length(signed_url STRING) RETURNS INT64
    REMOTE WITH CONNECTION `.`
    OPTIONS(
      endpoint = "https://[...].cloudfunctions.net/object_length",
      max_batching_rows = 1
    );

Remote functions extend BigQuery's capabilities by integrating with Cloud Run functions. This enables complex data transformations using Python code. You define the remote function in BigQuery, specifying the connection and endpoint to your Cloud Run function. This function can then be called directly within your SQL queries, similar to a UDF, allowing for seamless integration of custom logic. The example here shows a Python function that gets a list of signed URLs pointing to Cloud Storage objects and returns the length of these objects. In BigQuery, we just need to register the function. We call it object_length(). Then, we can use it as is in SQL.

Use Jupyter notebooks for data exploration and transformation. Python notebook features:
- Join datasets or aggregate data.
- Parse data from complex structures using SQL or Python functions.
- Schedule notebooks to execute at a specified frequency.
- BigQuery DataFrames is integrated into notebooks, with no setup required.
- Use matplotlib, seaborn, and other popular libraries to visualize data.

    # Use BigQuery DataFrames for results not fitting into runtime memory.
    import bigframes.pandas as bf

    bf.options.bigquery.location = "your-location"
    bf.options.bigquery.project = "your-project-id"

    df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # Find the heaviest penguin species using the groupby operation to
    # calculate the mean body_mass_g:
    (
        df["body_mass_g"]
        .groupby(by=df["species"])
        .mean()
        .sort_values(ascending=False)
        .head(10)
    )

Jupyter Notebooks coupled with BigQuery DataFrames facilitate efficient data exploration and transformation. This integration emphasizes the ability to handle large datasets that exceed runtime memory, perform complex data manipulations using SQL or Python, and schedule notebook executions. The seamless integration of BigQuery DataFrames and popular visualization libraries further streamlines the entire process.

Save your queries and schedule them for repeating use cases.

Saved queries (RUN > SAVE, for example "Great insight, save it for later"):
- Are visible in Explorer > Queries.
- Support version control.
- Can be shared with users or groups.
- Can be downloaded as .sql files, and other queries can be uploaded from .sql files.

Scheduled queries (RUN > SCHEDULE, for example "This query should run hourly"):
- Set the repeat frequency.
- Set start and end times.
- Specify a destination for query results.
- Can use runtime parameters.

Tip: use Dataform for SQL workflows.

BigQuery offers the option to save and schedule queries for repeated use. You can save queries, manage versions, and share them with others. Scheduling allows you to automate query execution by setting frequency, start and end times, and result destinations. Dataform is recommended for more complex SQL workflows.
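As a sketch of a query written with scheduling in mind (the orders table and its columns are illustrative, not from the course), the @run_date runtime parameter that BigQuery scheduled queries provide can scope each run to a single day, while the destination table and write disposition are set in the schedule configuration rather than in the SQL:

    -- Aggregate one day of orders per scheduled run.
    SELECT
      customer_id,
      SUM(order_total) AS daily_total
    FROM dataset_name.orders
    WHERE order_date = @run_date
    GROUP BY customer_id;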
What if you wanted to perform additional operations after query completion? After a scheduled query in BigQuery writes its output table, dependent operations might include running the next SQL script (other SQL scripts), running a data quality test, or configuring security.

Often, there are needs to perform additional tasks after a scheduled query is executed in BigQuery. These include tasks such as triggering subsequent SQL scripts, running data quality tests on the output, or configuring security measures. In an ideal situation, these actions can be automated, ensuring data pipelines remain efficient and reliable.

In this section, you explore: Extract, Load, and Transform (ELT) Architecture; SQL Scripting and Scheduling with BigQuery; and Dataform.

Dataform is a serverless framework to develop and operationalize ELT pipelines in SQL. Data moves from source systems into BigQuery staging tables, is transformed with Dataform (model data, assert data quality, document data), and is written to BigQuery production tables.

Dataform is a serverless framework that simplifies the development and management of ELT pipelines using SQL. Dataform enables data transformation within BigQuery, ensuring data quality and providing documentation. This approach streamlines the process of moving data from source systems to production tables in BigQuery, making operations more efficient and manageable.

Dataform unifies data transformation, assertion, and automation in BigQuery. Without Dataform, this would be a time-consuming and error-prone process: defining tables in SQL, configuring dependencies, testing data quality, documenting data, scheduling pipelines, managing code, and monitoring jobs would each involve separate tools and manual steps (BigQuery, a custom framework, manual checks, Data Catalog, Cloud Composer, GitHub or GitLab, and Cloud Monitoring).

Dataform streamlines data operations in BigQuery by unifying transformation, assertion, and automation. Without Dataform, tasks like defining tables, managing code, testing data quality, and scheduling pipelines would be time-consuming and prone to errors. They could also involve multiple tools and manual processes. Dataform simplifies these tasks within BigQuery, improving efficiency and data reliability.

SQL workflow development and compilation run in Dataform; workflow execution runs in BigQuery:
1. SQL workflow development (Dataform): sqlx development using SQL and JavaScript.
2. SQL workflow compilation (Dataform): real-time compilation into SQL, with dependency and error checks.
3. SQL workflow invocation (BigQuery): SQL transformations and materialization, run manually or on a schedule.

Dataform and BigQuery work together to manage SQL workflows. With Dataform, developers create and compile SQL workflows using SQL and JavaScript. Dataform then performs real-time compilation, including dependency checks and error handling. Finally, the compiled SQL workflows are executed within BigQuery, enabling SQL transformations and materialization either on-demand or through scheduled runs.

Development happens in workspaces with default files and folders:
- definitions folder: contains sqlx files.
- includes folder: contains JavaScript files.
- .gitignore file: lists files and directories to ignore during Git commits.
- package.json file (optional): defines JavaScript packages and required versions.
- package-lock.json file (optional): contains installed packages with exact versions.
- workflow_settings.yaml file: contains basic settings required to compile the project.
- You can add custom files like README.md as well.

Development using Dataform utilizes workspaces containing default files and folders. Key folders include "definitions" for SQLX files and "includes" for JavaScript files. The ".gitignore" file is used for managing Git commits. Developers may also use "package.json" and "package-lock.json" for handling JavaScript dependencies. The "workflow_settings.yaml" file stores project compilation settings, and custom files like "README.md" can also be added.
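As a rough sketch of what workflow_settings.yaml holds, the fields below are placeholders for compilation defaults; the exact set of settings depends on the Dataform core version in use.

    # Placeholder values; adjust for your own project.
    defaultProject: my-project-id
    defaultLocation: US
    defaultDataset: dataform_staging
    defaultAssertionDataset: dataform_assertions
    dataformCoreVersion: 3.0.0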
sqlx file structure:

    config {
      // specify query metadata
      // document data
      // define data quality tests
    }

    js {
      // define local JavaScript functions and constants
    }

    pre_operations {
      // define SQL statements to be executed before table creation
    }

    -- SQL body for core work
    -- generate SQL code with JavaScript

    post_operations {
      // define SQL statements to be executed after table creation
    }

- config block: define the output table or view, dependencies, assertions, and documentation.
- js block: define reusable constants and functions for the local SQL body.
- pre_operations block: define SQL to be executed before SQL body execution (e.g., creating a UDF).
- SQL body: define the SQL execution and dependencies.
- post_operations block: define SQL to be executed after SQL body execution (e.g., granting access using IAM).

The sqlx file structure provides a clear framework for organizing SQL code and associated tasks. It begins with a config block for metadata and data quality tests, followed by a js block to define reusable JavaScript functions. The pre_operations block handles SQL statements executed before the main SQL body, which defines the core SQL logic. Finally, the post_operations block contains SQL statements to be run after the main execution, ensuring a structured and efficient workflow.

sqlx development takes care of boilerplate code by replacing redundant SQL with definitions.

sqlx definition:

    config { type: "table" }

    SELECT
      country AS country,
      ${mapping.region("country")} AS country_group,
      device_type AS device_type,
      SUM(revenue) AS revenue,
      SUM(pageviews) AS pageviews
    FROM
      ${ref("table_1")}
    GROUP BY
      1, 2, 3

Compiled SQL:

    CREATE OR REPLACE TABLE `projectid.dataset.new_table` AS
    SELECT
      country AS country,
      CASE
        WHEN country IN ('US', 'CA') THEN 'NA'
        WHEN country IN ('GB', 'FR', 'DE') THEN 'EU'
        WHEN country IN ('AU') THEN country
        ELSE 'Other countries'
      END AS country_group,
      device_type AS device_type,
      SUM(revenue) AS revenue,
      SUM(pageviews) AS pageviews
    FROM
      `projectid.dataset.table_1`
    GROUP BY
      1, 2, 3

SQLX development streamlines SQL code by replacing repetitive patterns with concise definitions. The code example demonstrates how a complex CASE statement for categorizing countries can be replaced with a simple function call, ${mapping.region("country")}. This approach improves code readability and maintainability by reducing boilerplate code and promoting reusability.

Create table and view definitions that will compile into SQL statements:
- Declaration (config { type: "declaration" }): references existing BigQuery tables as static data sources.
- Table (config { type: "table" }): compiles into CREATE OR REPLACE TABLE AS SELECT.
- Incremental table (config { type: "incremental" }): creates a table as above and updates it with new data.
- View (config { type: "view" }): compiles into CREATE OR REPLACE VIEW (can be materialized).

With Dataform, table and view definitions should be created in a specific manner so that they can be compiled into SQL statements. Key configuration types are: declaration for referencing existing BigQuery tables, table for creating or replacing tables with a SELECT statement, incremental for creating tables and updating them with new data, and view for creating or replacing views, which can optionally be materialized.
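As a sketch of how an incremental definition can limit each run to new rows, Dataform's when(), incremental(), and self() helpers can append only recent records; the events_raw source and its columns below are illustrative, not from the course.

    config {
      type: "incremental"
    }

    SELECT
      event_id,
      user_id,
      event_ts
    FROM ${ref("events_raw")}

    -- On incremental runs, only load rows newer than what the table already holds.
    ${when(incremental(),
      `WHERE event_ts > (SELECT MAX(event_ts) FROM ${self()})`)}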
Use assertions to define data quality tests and operations to run custom SQL statements.

Assertion (config { type: "assertion" }) — assertion types:
- not NULL checks
- unique key checks
- custom logic

Operations (config { type: "operations" }) — use this for:
- custom SQL (e.g., DML, BQML)
- custom CREATE TABLE statements (e.g., BigLake tables)

Dataform offers assertions to define data quality tests, ensuring data consistency and accuracy. Assertions can be written in SQL or JavaScript, providing flexibility for complex checks. Operations allow you to run custom SQL statements before, after, or during pipeline execution. These two options enable custom data transformations, data quality checks, and other tasks within your workflows. By combining assertions and operations, Dataform empowers you to create robust and reliable data pipelines in BigQuery.

Dependencies can be configured in two ways.

Implicit dependency declaration: specify the table or view name in the JavaScript ref() function. If you want to reference a table without creating a dependency, use resolve() instead.

    config {
      ...
    }

    SELECT ... FROM ${ref("customer_details")}

Explicit dependency declaration: specify the list of dependencies in the dependencies array. Dependencies can be multiple instances of data source declarations, tables/views, custom SQL operations, and assertions.

    config {
      ...
      dependencies: ["customer_details"]
    }

    SELECT ...

Dataform provides two methods to manage dependencies: implicit declaration and explicit declaration. Implicit declaration is when you reference tables or views directly within your SQL using the ref() function. Explicit declaration is when you list dependencies within a config block using the dependencies array. It is also possible to use the resolve() function to reference without creating a dependency.

Dataform compiles your definitions into SQL scripts and chains them into a workflow. For example, this definition of a customer_details table, which depends on the customer_source declaration:

    config {
      type: "table",
      name: "customer_details",
      description: "Customer details table",
    }

    SELECT ...
    FROM ${ref("customer_source")}

compiles into:

    CREATE OR REPLACE TABLE `..customer_details`
    OPTIONS(description='''Customer details table''')
    AS (
      SELECT ...
      FROM `..customer_source`
    );

Dataform allows you to compile user-defined table definitions into executable SQL scripts. The sample code shows a customer_details table being created or replaced based on a customer_source table using a SELECT statement. Dataform manages the dependencies between these tables and orchestrates their execution within a workflow. This process streamlines data transformation and ensures efficient data pipeline management.

The compiled graph visualizes the SQL workflow with its definitions and dependencies. The sample graph contains customer_source (declaration), customer_intermediate (table), customer_rowConsistency (assertion), customer_ml_training (operations), and customer_prod_view (view).

Dataform SQL workflows are best visualized in graph format. The sample workflow starts with a declaration of customer_source followed by a customer_intermediate table, likely derived from a source system as a pre-processed data source. Next, customer_rowConsistency applies assertions for data quality checks. The graph then splits into two paths. In one path, an operation named customer_ml_training is invoked. It performs operations on the validated data. In the other path, a view named customer_prod_view is created.
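Row-level checks like the customer_rowConsistency assertion in the graph above can also be declared as built-in assertions in the config block of the table they validate. A minimal sketch, with column names and conditions chosen purely for illustration:

    config {
      type: "table",
      assertions: {
        uniqueKey: ["customer_id"],
        nonNull: ["customer_id"],
        rowConditions: ["signup_date <= CURRENT_DATE()"]
      }
    }

    SELECT
      customer_id,
      signup_date,
      country
    FROM ${ref("customer_source")}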
SQL workflows can be scheduled and executed on a recurring basis:
- Internal triggers: manual execution (manually in the Dataform UI) or Dataform workflow configurations (scheduled in Dataform).
- External triggers: Cloud Scheduler and Workflows (scheduled using Workflows), or Cloud Composer (scheduled and orchestrated).
- In every case, the SQL workflow execution runs in BigQuery.

There are several scheduling and execution mechanisms for Dataform SQL workflows. One path is through internal triggers. These include manual execution in the Dataform UI or scheduled configurations within Dataform itself. The other path is through external triggers. These include tools like Cloud Scheduler and Cloud Composer. Ultimately, all workflows are executed within BigQuery, showcasing its central role in this process.

Lab: Create and Execute a SQL Workflow in Dataform (30 min)

Learning objectives:
- Create a Dataform repository.
- Create and initialize a Dataform development workspace.
- Create and execute a SQL workflow.
- View execution logs in Dataform.

In this lab, you use Dataform to create and execute a SQL workflow. First, you create a Dataform repository. Second, you create and initialize a Dataform development workspace. Then, you create and execute a SQL workflow. Finally, you view execution logs in Dataform to confirm completion.
