M5: The Extract, Transform, and Load Data Pipeline Pattern

Summary

This document provides an overview of the extract, transform, and load (ETL) data pipeline pattern using Google Cloud services. It details the various components and their relationships, including batch and streaming data processing. The document is suitable for professionals in data engineering and related fields.

Full Transcript


In this module, you learn to:
1. Explain the baseline extract, transform, and load (ETL) architecture diagram.
2. Identify the GUI tools on Google Cloud used for ETL data pipelines.
3. Explain batch data processing using Dataproc.
4. Use Dataproc Serverless for Spark for ETL.
5. Explain streaming data processing options.
6. Explain the role Bigtable plays in data pipelines.

In this module, first, you review the baseline extract, transform, and load architecture diagram. Second, you look at the GUI tools on Google Cloud used for ETL data pipelines. Third, you review batch data processing using Dataproc. Then, you examine using Dataproc Serverless for Spark for ETL. Next, you review streaming data processing options on Google Cloud. Finally, you examine the role Bigtable plays in data pipelines.

In this section, you explore the Extract, Transform, and Load (ETL) architecture.

Google Cloud provides multiple services for distributed data processing: UI- and code-friendly tools (Dataprep, Data Fusion), open-source frameworks (Dataproc, Dataflow), and template support spanning extract and load; extract, load, transform; and extract, transform, load.

The extract, transform, and load data pipeline pattern focuses on data being adjusted or transformed prior to being loaded into BigQuery. Google Cloud offers a variety of services to handle distributed data processing. For developers who prefer visual interfaces, there are user-friendly tools like Dataprep and Data Fusion. If you favor open-source frameworks, Dataproc and Dataflow are an option. And to streamline your workflows, template support is provided across various stages of data processing, from extraction and loading to full-fledged transformation.

In this section, you explore the Google Cloud GUI tools for ETL data pipelines.

Dataprep by Trifacta is a serverless, no-code solution to build data transformation flows. Dataprep connects to sources such as Cloud Storage, BigQuery, Google Sheets, and third-party systems. Predefined wranglers (aggregate, countpattern, deduplicate, delete, and more) are chained together into a recipe of transformation rules, for example: break into rows using a '\n' delimiter, split column1 into 246 columns on '/', convert row 1 to the header, change the EIN type to Integer, and create calendar_year_field by concatenating three fields separated by '-'. The resulting flow is executed serverlessly on Dataflow, with scheduling and monitoring.

Dataprep by Trifacta simplifies data transformation by offering a serverless, no-code solution. It connects to various data sources, provides pre-built transformation functions, and allows users to chain these functions together into recipes. The resulting transformation flows can be executed seamlessly with Dataflow, while also providing capabilities for scheduling and monitoring.
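As an illustration only, recipe steps like those above can be pictured as the following rough Python equivalent using pandas; the file name, column names, and delimiter handling are hypothetical, and Dataprep generates and runs such steps for you without any code.

import pandas as pd

# Read the raw export (hypothetical file and layout).
df = pd.read_csv("warehouse_export.csv", header=None)

# "Convert row 1 to header"
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)

# "Split column1 into columns on '/'" (three parts assumed here)
df[["part_1", "part_2", "part_3"]] = df["column1"].str.split("/", expand=True)

# "Change EIN type to Integer"
df["EIN"] = df["EIN"].astype("int64")

# "Deduplicate", then "Create calendar_year_field" by concatenating
# three fields separated by '-'.
df = df.drop_duplicates()
df["calendar_year_field"] = df[["part_1", "part_2", "part_3"]].agg("-".join, axis=1)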
Dataprep lets you preview your transformations and gives recommendations for your flow. In the example, the Whse_Name column contains two categories (Whse-0 and Acme Warehouse). Suggestions include extracting values matching `Whse 0`, replacing `Whse 0` with "" in Whse_Name, and counting values matching `Whse 0`; the preview of the replace suggestion shows the stray Whse-0 values dropped and the column reduced to a single category.

Dataprep provides a visual interface where you can see the impact of your data transformations before applying them. It also offers intelligent suggestions to help you refine your data cleaning and preparation process, like extracting specific values or replacing patterns within your data.

Data Fusion is a GUI-based enterprise data integration service. Built on the open-source CDAP framework, it connects to on-premises, multicloud, SaaS, object store, RDBMS, and streaming sources. It provides connectors, plugins, Wrangler, transforms, and pipeline DAGs; it is code-free yet extensible using plugins, offers rich integrations, executes on Hadoop/Spark clusters, and includes monitoring and logging.

Data Fusion is a user-friendly, GUI-based tool designed for enterprise data integration. Data Fusion seamlessly connects to various data sources, both on-premises and in the cloud. You can build data pipelines without coding using its drag-and-drop interface and pre-built transformations. The platform is extensible, allowing for custom plugins, and it executes on powerful Hadoop/Spark clusters for efficient processing.

Create and deploy data integration pipelines in Data Fusion Studio. The example pipeline uses two SAP tables as sources, joins them with a Joiner plugin, writes one output leg to Cloud Storage (GCS), and sends the other leg through an Add-Datetime transformation into BigQuery, with a data preview available at every stage.

Data Fusion Studio allows the easy creation of data pipelines with visual tools. The example pipeline shows two SAP tables being used as data sources. The two tables are joined together, and then one outbound leg is written to a Cloud Storage bucket. The other leg undergoes an "Add-Datetime" transformation and is output to BigQuery. The pipeline also highlights the ability to preview data at different stages.

In this section, you explore batch data processing using Dataproc.

Run your Apache Hadoop and Spark workloads on Dataproc. HDFS data stored in Cloud Storage can be transformed by a Spark job running on Dataproc, with the results written to Cloud Storage, the BigQuery data warehouse, or the Bigtable NoSQL database.

Dataproc allows you to seamlessly run your Apache Hadoop and Spark workloads on Google Cloud. You can leverage HDFS data stored on Cloud Storage, and use Dataproc to perform transformations with Spark jobs. The results can then be easily stored in various destinations like Cloud Storage, BigQuery, or NoSQL databases like Bigtable, all within the Google Cloud ecosystem.
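To make this pattern concrete, here is a minimal PySpark sketch of an ETL job that reads from Cloud Storage, transforms the data, and loads BigQuery, assuming the spark-bigquery connector is available on the cluster or runtime. The bucket, dataset, table, and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read raw CSV files from Cloud Storage (hypothetical bucket and path).
raw = spark.read.option("header", True).csv("gs://my-bucket/raw/orders/*.csv")

# Transform: cast, filter, and aggregate.
orders = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
       .groupBy("order_date")
       .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result to BigQuery via the spark-bigquery connector.
(
    orders.write.format("bigquery")
          .option("table", "my_dataset.daily_orders")         # hypothetical table
          .option("temporaryGcsBucket", "my-staging-bucket")  # hypothetical staging bucket
          .mode("append")
          .save()
)

spark.stop()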
Dataproc is the managed service for data processing with Hadoop and Spark:
- Runtimes for GCE, GKE, and Serverless for Spark.
- Rich open-source ecosystem (Hadoop, Spark, Pig, Hive pre-installed).
- Workflow templates with dependencies.
- Reactive autoscaling policies.
- Permanent and ephemeral clusters with scheduled deletion.
- Integration with other storage services for HDFS replacement.

Dataproc is Google Cloud's managed service for data processing using Hadoop and Spark. It offers flexibility with runtimes on GCE, GKE, and Serverless Spark, and provides a rich open-source ecosystem. Dataproc simplifies cluster management with workflow templates, autoscaling, and the option for both permanent and ephemeral clusters. It also integrates seamlessly with other Google Cloud storage services, eliminating the need for disk-based HDFS.

Dataproc clusters can use HDFS on persistent disks or other Google Cloud storage services. A Dataproc cluster on GCE consists of manager nodes (1-3), primary workers on standard VMs, and secondary workers on Spot VMs, all running on Compute Engine. Cluster storage is HDFS on Persistent Disk (HDDs and SSDs), accessed via hdfs://; persistent storage options include Cloud Storage via gs://, BigQuery through the BigQuery connector, and Bigtable through the HBase connector.

Dataproc clusters on Google Compute Engine offer flexible storage options. Clusters can utilize HDFS on persistent disks for cluster storage, or leverage other Google Cloud storage services like Cloud Storage for persistent data. Additionally, Dataproc integrates with BigQuery and Bigtable using connectors, enabling seamless interaction with these data stores. This setup allows users to choose the most suitable storage solution for their specific needs while taking advantage of Dataproc's processing capabilities.

Manage and execute workflows with dependencies using Dataproc Workflow Templates. A template is a YAML file that describes a DAG of jobs with dependencies (for example, first_job > second_job), parameters (such as OUTPUT_DIR=gs://... and CLUSTER=...), and placement on either a managed cluster (ephemeral, deleted once the workflow finishes) or a cluster selector (an existing, pre-defined cluster):

jobs:
- hadoopJob:
    args: [...]
    mainJarFileUri: [...]
  stepId: first_job
- hadoopJob:
    args: [...]
    mainJarFileUri: [...]
  stepId: second_job
  prerequisiteStepIds:
  - first_job
placement: [...]

Jobs can also be added to a template from the command line, for example:

> gcloud dataproc workflow-templates add-job hadoop --step-id=first_job [...]

Dataproc Workflow Templates allow you to define and manage complex data processing workflows with dependencies between jobs. You can specify these workflows in a YAML file, providing details about the jobs like Hadoop or Spark, their order of execution, and any required parameters. These templates can then be submitted to Google Cloud using the 'gcloud' command-line tool, where they will be executed on either a managed, ephemeral cluster or an existing, pre-defined cluster.

Apache Spark is a popular framework for many data processing tasks. Spark Core underpins Spark SQL with Dataset and DataFrame, Structured Streaming, MLlib, and GraphX, with language support for R, SQL, Python, Scala, and Java, covering tasks such as data engineering, ML model training, batch inferencing, graph processing, geospatial analytics, time series analytics, and more.

Apache Spark is a versatile framework for data processing, offering various capabilities through its components like Spark SQL for structured data, Spark Streaming for real-time data, MLlib for machine learning, and GraphX for graph processing. Spark supports multiple languages including R, SQL, Python, Scala, and Java, making it accessible to a wide range of users. With these features, Spark excels in tasks like data engineering, machine learning, analytics, and many more.
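As a sketch, a template defined in a YAML file like the one in the workflow templates section above could be imported and run from the command line roughly as follows; the template name, region, and file name are hypothetical, and the exact commands and flags should be verified against the gcloud reference.

> gcloud dataproc workflow-templates import my-etl-template \
    --region=us-central1 \
    --source=template.yaml
> gcloud dataproc workflow-templates instantiate my-etl-template \
    --region=us-central1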
Run Spark workloads without cluster management using Dataproc Serverless for Spark.

Before (infrastructure management plus development):
- Manage clusters.
- Write code.
- Decide infrastructure.
- Pay while it's running.
- Wait for your job to run (queued up).

Using Dataproc Serverless for Spark:
- Auto-scaling, with no infrastructure to tune and no clusters to manage.
- Write code and execute.
- Only pay for the execution time (no startup or shutdown cost).
- No resource contention.
- Speed to production.
- Batches, interactive notebooks, and Vertex AI pipelines.

Dataproc Serverless for Spark simplifies Spark workload execution by eliminating cluster management. It offers automatic scaling, cost efficiency with pay-per-execution pricing, faster deployment, and no resource contention. Users can focus solely on writing and executing their code, making it ideal for various Spark use cases like batch processing, interactive notebooks, and Vertex AI pipelines.

Dataproc Serverless for Spark provides batches and interactive notebook session options. Batches are submitted with the command-line tool, for example:

> gcloud dataproc batches submit spark \
    --region= \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000

Interactive notebook sessions run in JupyterLab, either on a local machine with the Dataproc JupyterLab plugin or in the Google Cloud environment. The service also supports BigQuery external procedures, templates, custom containers, and pay-as-you-go pricing.

Dataproc Serverless for Spark offers two main execution modes: Serverless for batches and Serverless for interactive notebook sessions. Batches are submitted using the 'gcloud' command-line tool and are ideal for automated or scheduled jobs. Interactive sessions leverage JupyterLab, either locally or within the Google Cloud environment, for interactive development and exploration. The platform also supports features like BigQuery external procedures, templates, custom containers, and a pay-as-you-go pricing model.

Dataproc Serverless for Spark provides rich integrations with Google Cloud services. Auxiliary services include the Dataproc History Server, a persistent store for logs and information about completed Spark jobs, and Dataproc Metastore, which holds metadata for Hive tables. Batches and interactive notebook sessions integrate with BigQuery, Vertex AI Workbench, Cloud Storage, and other storage services. Under the hood, the service creates an ephemeral cluster using the Serverless for Spark GCE image, submits the job for execution, and deletes the cluster after execution.

Dataproc Serverless for Spark seamlessly integrates with various Google Cloud services, enhancing its functionality and usability. It leverages Dataproc History Server and Dataproc Metastore for persistent storage and metadata management. It interacts with BigQuery for data warehousing and analytics, and with Vertex AI Workbench for machine learning tasks. Additionally, it utilizes Cloud Storage and other storage services for data storage and retrieval. Behind the scenes, it creates and manages ephemeral clusters for efficient job execution.

The lifecycle of an interactive notebook session: creation is driven by a serverless runtime template (runtime version, custom container, network, Metastore, History Server, Spark properties, maximum idle time, and maximum session time). Once the session is active, the kernel moves between Starting, Idle, and Busy states as code is developed and executed. Shutdown occurs when the kernel is shut down or the maximum idle time is reached, after which the kernel state becomes Unknown.

The lifecycle of an interactive notebook session begins with its creation, where various configurations like runtime version and network settings are defined. Once active, the session allows for code development and execution, with the kernel transitioning between idle and busy states. The session eventually reaches a shutdown phase, either manually triggered or due to inactivity, leading to the kernel being shut down and its state becoming unknown.
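As a sketch of the batch mode, a PySpark script such as the ETL example earlier in this module could be submitted as a Serverless batch roughly like this; the file name, region, and staging bucket are hypothetical, and the exact flags should be checked against the gcloud reference.

> gcloud dataproc batches submit pyspark etl_example.py \
    --region=us-central1 \
    --deps-bucket=gs://my-staging-bucket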
Lab: Use Dataproc Serverless for Spark to Load BigQuery (45 min). Learning objectives: configure the environment; download lab assets; configure and execute the Spark code; view data in BigQuery.

In this lab, you use Dataproc Serverless for Spark to load BigQuery. First, you configure the environment. Next, you download lab assets. You then configure and execute the Spark code. Finally, you view the data in BigQuery.

In this section, you explore streaming data processing options.

Batch processing versus streaming data processing: in batch processing, processing and analysis happen on a set of stored data, as in a payroll or billing system. Streaming data processing handles a flow of data records generated by various data sources, as in fraud detection or intrusion detection.

Batch processing involves analyzing a fixed set of stored data, suitable for tasks like payroll or billing systems. On the other hand, streaming data processing handles a continuous flow of data from various sources, making it ideal for real-time applications like fraud or intrusion detection.

Streaming ETL workloads require continuous data ingestion, processing, and near real-time analytics: event data flows into the Pub/Sub messaging bus, is transformed by a streaming Dataflow job, and lands in the BigQuery data warehouse or the Bigtable NoSQL database.

Streaming ETL workloads on Google Cloud involve the continuous ingestion of event data, often through Pub/Sub. This data is often processed in real-time using Dataflow, allowing for transformations and enrichment. Finally, the processed data is loaded into various destinations like BigQuery for analytics, enabling near real-time insights, or Bigtable for NoSQL storage.

Use Pub/Sub to ingest high volumes of event data and distribute it to different consuming systems. Decoupled publishers, such as an HR system or a vendor office, publish events like "New employee" and "New contractor" to topics; decoupled subscribers, such as badge activation, the facilities system, and employee or contractor account provisioning, consume them through subscriptions. Pub/Sub is serverless and provides asynchronous messaging, at-least-once delivery, and filtering and sorting.

Pub/Sub can efficiently manage high volumes of event data. Pub/Sub acts as a central hub, receiving events like "New employee" or "New contractor" from various sources. Pub/Sub then distributes these events to relevant systems like badge activation, facilities, and account provisioning, ensuring reliable delivery and enabling decoupled, asynchronous communication between systems.
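A minimal sketch of publishing such an event with the Pub/Sub Python client library follows; the project ID, topic name, and event payload are hypothetical, and the topic is assumed to already exist.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic.
topic_path = publisher.topic_path("my-project", "hr-events")

event = {"type": "new_employee", "employee_id": "E-1234"}

# The message payload must be a bytestring; attributes can support filtering.
future = publisher.publish(
    topic_path,
    json.dumps(event).encode("utf-8"),
    event_type="new_employee",
)
print(future.result())  # Prints the server-assigned message ID.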
Dataflow can process both batch and streaming data using the Apache Beam programming model. Apache Beam provides a unified programming model for batch and stream in Java, Python, and Go; Dataflow acts as the pipeline runner and adds rich integrations with Google Cloud services, serverless execution, templates, and notebooks.

Dataflow leverages the Apache Beam programming framework to efficiently process both batch and streaming data. This unified approach simplifies development, allowing you to use languages like Java, Python, or Go. Dataflow seamlessly integrates with other Google Cloud services and offers features like a pipeline runner, serverless execution, templates, and notebooks for a streamlined experience.

Example: streaming and transforming messages from Pub/Sub to BigQuery. ReadFromPubSub() retrieves messages from a Pub/Sub topic or subscription, beam.Map() applies a specified transformation to each message, and WriteToBigQuery() writes the transformed message into a BigQuery table.

import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(streaming=True)  # Plus project, region, runner, etc.
table_schema = 'field_name:STRING'  # Placeholder schema for the target table.

with beam.Pipeline(options=pipeline_options) as pipeline:
    # Read messages from Pub/Sub
    messages = pipeline | 'read' >> ReadFromPubSub(topic='')

    # Parse messages
    def parse_message(message):
        return json.loads(message.decode('utf-8'))

    parsed_messages = messages | 'parse' >> beam.Map(parse_message)

    # Write parsed messages to BigQuery
    parsed_messages | 'write' >> WriteToBigQuery(
        table='',
        schema=table_schema,
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=BigQueryDisposition.WRITE_APPEND
    )
# The pipeline runs when the with block exits.

This code example demonstrates how to use Apache Beam to stream messages from Pub/Sub, transform them using a parsing function, and then write the results into BigQuery. The ReadFromPubSub function retrieves messages, beam.Map() applies the parsing transformation, and WriteToBigQuery loads the transformed data into a specified BigQuery table, creating the table if necessary and appending new data to it.

Use Dataflow templates for recurring pipeline executions with different parameters. Benefits of using templates:
- Pipeline design is separate from deployment.
- Parameters customize the pipeline and make it reusable with different inputs.
- Templated pipelines can be deployed from the Google Cloud console, command-line interface, or API.
- Use Google-provided pre-built templates for common scenarios.
- You can create your own custom templates.
In the console, CREATE JOB FROM TEMPLATE lets you pick a template such as "Cloud Storage Text to BigQuery (Stream)" and supply the required parameters, such as the Cloud Storage input file(s) (gs://...) and the BigQuery output table.

Dataflow templates allow you to create reusable pipelines for recurring tasks. You can separate the pipeline design from its deployment, making it easier to manage and update. By using parameters, you can customize the pipeline for different inputs, increasing its versatility. These templated pipelines can be easily deployed through various methods, and Google provides pre-built templates for common scenarios.
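As a sketch, a Google-provided template can also be launched from the command line with gcloud; the job name, region, template path, and parameter names below are assumptions to be checked against the chosen template's documentation (some templates require additional parameters).

> gcloud dataflow jobs run my-text-to-bq-job \
    --region=us-central1 \
    --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters=inputFilePattern=gs://my-bucket/input/*.json,outputTable=my-project:my_dataset.my_table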
In this section, you explore Bigtable and data pipelines.

Use Bigtable as a sink in your streaming data pipeline for millisecond latency analytics. Bigtable supports NoSQL queries over large datasets (petabytes), uses a wide-column model with column families, delivers high throughput with millisecond latency, uses the row key as the index for fast access, and integrates through the open-source HBase API. Typical applications include time-series data, IoT, financial data, and machine learning.

In the example, a table with the column families flight and aircraft stores arrivals keyed by row keys such as ATL#arrival#20240321-1121:

row key                      flight:origin  flight:destination  aircraft:make  aircraft:model
ATL#arrival#20240321-1121    ATL            LON                 B              737
ATL#arrival#20240321-1201    ATL            MEX                 B              737
ATL#arrival#20240321-1716    ATL            YVR                 B              757

The first row as HBase-style output:

ATL#arrival#20240321-1121  column=flight:origin,      timestamp=1711020073, value=ATL
ATL#arrival#20240321-1121  column=flight:destination, timestamp=1711020073, value=LON
ATL#arrival#20240321-1121  column=aircraft:make,      timestamp=1711020073, value=B
ATL#arrival#20240321-1121  column=aircraft:model,     timestamp=1711020073, value=737

Bigtable is an excellent choice for handling streaming data pipelines that require millisecond-level latency analytics. Bigtable utilizes a wide-column data model with column families, allowing for flexible schema design. Row keys serve as efficient indexes for quick data access. Bigtable's high throughput and low latency capabilities make it suitable for applications like time-series data, IoT, financial data, and machine learning, especially when dealing with large datasets.

Compare ETL processing options:
- Dataprep: data wrangling; integrates with Cloud Storage, BigQuery, and other OSS; batch; serverless.
- Data Fusion: data integration across hybrid and multicloud environments; open source (CDAP); batch and stream; not serverless.
- Dataproc: ETL workloads; integrates with Cloud Storage, BigQuery, and other OSS; open source (Apache Hadoop, Spark, other OSS); batch and stream; not serverless, except Serverless for Spark.
- Dataflow: ETL workloads; integrates with Pub/Sub, Cloud Storage, BigQuery, and others; open source (Apache Beam); batch and stream (recommended); serverless.

In summary, Google Cloud provides various services for ETL processing. Dataprep is ideal for data wrangling tasks and offers a serverless option. Data Fusion excels at data integration, particularly in hybrid and multi-cloud environments, utilizing the open-source CDAP framework. Dataproc handles ETL workloads with support for Hadoop, Spark, and other open-source tools, with Serverless Spark as a serverless option. Lastly, Dataflow, built on Apache Beam, is recommended for both batch and streaming ETL workloads and provides a serverless architecture.

Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow (45 min). Learning objectives: create a Dataflow job from a template; stream data via the Dataflow pipeline into BigQuery; monitor the Dataflow pipeline in BigQuery; analyze results with SQL; visualize key metrics in Looker Studio.

In this lab, you create a streaming data pipeline for a real-time dashboard with Dataflow. You create a Dataflow job from a template. You then monitor a pipeline loading data into BigQuery. After that, you examine the data loaded using SQL. Finally, you visualize key metrics using Looker Studio.
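Returning to the Bigtable sink described earlier in this module, here is a minimal sketch of writing and reading the example flight row with the Bigtable Python client; the project, instance, and table IDs are hypothetical, and the column families flight and aircraft are assumed to already exist.

from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # hypothetical project
instance = client.instance("flights-instance")   # hypothetical instance ID
table = instance.table("flight-arrivals")        # hypothetical table ID

# Write one wide-column row, keyed for fast point lookups.
row_key = b"ATL#arrival#20240321-1121"
row = table.direct_row(row_key)
row.set_cell("flight", "origin", "ATL")
row.set_cell("flight", "destination", "LON")
row.set_cell("aircraft", "make", "B")
row.set_cell("aircraft", "model", "737")
row.commit()

# Millisecond-latency point read by row key.
result = table.read_row(row_key)
print(result.cells["flight"][b"origin"][0].value.decode("utf-8"))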
