M3: The Extract and Load Data Pipeline Pattern

Summary

This document describes the extract and load data pipeline pattern used for managing and transferring data to BigQuery. It explains different methods for loading data from various sources into BigQuery, including the bq command-line tool and the BigQuery Data Transfer Service, along with BigLake as a non-extract-load pattern. The document also covers the data formats BigQuery supports for import and export.

Full Transcript


In this module, you learn to:

01 Explain the baseline extract and load architecture diagram.
02 Understand the options of the bq command line tool.
03 Explain the functionality and use cases for the BigQuery Data Transfer Service.
04 Explain the functionality and use cases for BigLake as a non-extract-load pattern.

In this module, first, you review the baseline extract and load architecture diagram. Second, you explore the options of the bq command line tool. Then, you review the functionality and use cases for the BigQuery Data Transfer Service. Finally, you look at the functionality and use cases for BigLake as a non-extract-load pattern.

In this section, you explore: Extract and Load Architecture.

Extract and load ingests data into BigQuery without the need for transformation. Multiple data sources feed BigQuery tables through bq load and the Data Transfer Service, which offer scheduling capabilities, or are exposed in place through external tables (BigLake), where no data copy is needed. (The architecture diagram contrasts extract and load with the extract, load, transform and extract, transform, load patterns.)

The extract and load data pipeline pattern focuses on the tools and options to bring data into BigQuery by eliminating the need for upfront transformation. Extract and load greatly simplifies data ingestion into BigQuery. It leverages tools like bq load and the Data Transfer Service to directly load data from various sources, or uses external tables and BigLake tables to make data accessible via BigQuery. This pattern also offers scheduling capabilities and eliminates the need for data copying, promoting efficiency in data pipelines.

BigQuery supports importing from and exporting to multiple formats. Loading data: CSV, Avro, JSON, Parquet, ORC, Google Sheets, and Firestore exports. Table export (of query results and permanent tables): CSV, JSON, Avro, and Parquet.

BigQuery provides extensive flexibility in data handling. It supports loading data from various formats like Avro, Parquet, ORC, CSV, and JSON, as well as Google Cloud Firestore exports. Similarly, you can export BigQuery artifacts, including query results and table data, into formats like CSV, JSON, Avro, and Parquet, facilitating easy integration with other tools and systems.

Load data into BigQuery by using the UI (Create table from: Upload > Select file > File format: CSV > Schema: Auto detect) or the LOAD DATA statement:

# create a new table by loading data from
# CSV files using schema auto-detect
LOAD DATA INTO dataset_name.table_name
FROM FILES(
  format = 'CSV',
  uris = ['gs://mybucket/*.csv']
)

Use LOAD DATA INTO to append to an existing table. Use LOAD DATA OVERWRITE to overwrite the existing table.

BigQuery offers two ways to load data: through its friendly user interface for file uploads or via the LOAD DATA SQL statement. The UI simplifies the process, allowing you to select files, specify formats, and even auto-detect the schema. LOAD DATA provides more control, ideal for automation and for appending to or overwriting existing tables.
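The statement above appends to (or creates) a table using schema auto-detection. As a point of comparison, the following is a minimal sketch of the overwrite variant with an explicit schema; the table, column, and bucket names are placeholders, not values from this course.

# a sketch of LOAD DATA OVERWRITE: replace the contents of a
# table from CSV files, declaring the schema explicitly instead
# of auto-detecting it (all names below are placeholders)
LOAD DATA OVERWRITE dataset_name.table_name (
  customer_id INT64,
  customer_name STRING,
  signup_date DATE
)
FROM FILES(
  format = 'CSV',
  skip_leading_rows = 1,  # skip the CSV header row
  uris = ['gs://mybucket/customers/*.csv']
)

Declaring the schema explicitly is useful in automated pipelines, where a silently changed auto-detected schema could break downstream queries.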
In this section, you explore: The bq Command Line Tool.

Interact with BigQuery by using the bq command:

> bq mk --location=US --dataset dataset-name

> bq load \
    --source_format=CSV \
    --skip_leading_rows=2 \
    dataset-name.table_name \
    "gs://mybucket/00/*.csv","gs://mybucket/01/*.csv" \
    ./table_schema.json

The Cloud SDK's bq command offers a programmatic way to interact with BigQuery. You can create BigQuery objects like datasets and tables with the familiar Linux-like command bq mk. The bq load command efficiently loads data into BigQuery tables. Key parameters for bq load include specifying the source format such as CSV, skipping header rows, and defining the target dataset and table. You can load data from multiple files in Cloud Storage using wildcards and optionally provide a schema file for the table structure. These options provide flexibility and control for loading data into BigQuery from various sources.

In this section, you explore: BigQuery Data Transfer Service.

Use the BigQuery Data Transfer Service to load data from other structured data sources. Sources include SaaS applications, object stores, data warehouses, and third-party services. A transfer is configured with data source details, a schedule (repeating or on-demand, with a start time), destination settings, and notifications. The service is managed, serverless, and a no-code solution.

The BigQuery Data Transfer Service enables seamless loading of structured data from diverse sources, like SaaS applications, object stores, and other data warehouses, into BigQuery. The service provides scheduling options for recurring or on-demand transfers, along with configuration options for data source details and destination settings. It is a managed and serverless solution, eliminating infrastructure management overhead. In addition, its no-code approach simplifies data transfer setup and management.

In this section, you explore: BigLake.

Your data isn't stored in BigQuery? You can still query it! External tables cover Cloud Storage, Google Sheets, and Bigtable; BigLake tables cover Cloud Storage and cross-cloud object stores.

BigQuery's data access capabilities extend beyond its own storage. BigQuery allows you to query data residing in sources like Cloud Storage, Google Sheets, and Bigtable using external tables. Additionally, BigLake tables provide a way to query data across Cloud Storage and even other cloud object stores, expanding BigQuery's reach and flexibility for data analysis.

There are three ways of analyzing structured data in BigQuery:

1. Permanent table: load data into BigQuery storage and obtain high performance, but data movement is required.
2. External table: skip BigQuery storage, but receive low performance, with no data movement required.
3. BigLake table: skip BigQuery storage and receive high performance, with no data movement required.

BigQuery offers flexibility in analyzing structured data. You can load data into permanent BigQuery tables for high-performance analytics, but with data movement involved. External tables allow you to query data directly in Cloud Storage without loading it into BigQuery, which is suitable for less frequent access. BigLake tables provide the best of both worlds: high-performance analytics on data in Cloud Storage without the need to load it into BigQuery, and without data movement.
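To make the second option concrete, here is a minimal sketch of defining an external table over CSV files in Cloud Storage; the dataset, table, column, and bucket names are hypothetical.

# a sketch of an external table over CSV files in Cloud Storage;
# dataset, column, and bucket names are placeholders
CREATE EXTERNAL TABLE dataset_name.sales_external (
  order_id INT64,
  order_total NUMERIC
)
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://mybucket/sales/*.csv']
);

Queries against this table read the CSV files at query time, which is why performance is lower than for a permanent table, but no data movement is required.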
External tables let you query Google Sheets data directly in BigQuery. On the table's details page, the external data configuration shows the source URI (https://docs.google.com/spreadsheets/…) and the source format (GOOGLE_SHEETS), and you query it like any other table:

# Query the Google Sheets table
SELECT ...
FROM `your-project-id.sheets.sheets_table`;

Note that no query cost estimation, table preview, or query caching is available, and performance is slower than querying a permanent table in BigQuery.

BigQuery external tables bridge the gap between Google Sheets and BigQuery, enabling direct querying of Sheets data within BigQuery. By specifying the Google Sheets URL and format, users can treat the sheet as a table in BigQuery, simplifying data analysis across platforms. However, be aware that querying external tables may have limitations, like slower performance and the unavailability of cost estimation, table preview, and query caching.

BigLake: BigQuery on your data lake... and much more! BigLake is a unified storage API used by query engines such as BigQuery, Spark, Hive, and others (via BigQuery connectors). It uses Apache Arrow for efficient data representation, processing, and interchange; provides fine-grained security; and uses metadata caching to accelerate performance. It supports multiple data locations (BigQuery tables, Cloud Storage, cross-cloud object stores) and formats including Avro, CSV, JSON, ORC, Parquet, Delta Lake, BigQuery, and others, offering a unified experience across data lakes and data warehouses by reading data directly from where it is stored.

BigLake extends BigQuery's capabilities, providing a unified interface to query data directly from your data lake and other sources without moving or copying it. BigLake leverages Apache Arrow for efficient data handling and offers fine-grained security and metadata caching. With BigLake, you can seamlessly access data across data lakes and data warehouses using familiar BigQuery tools.

A BigLake table behaves the same way as a permanent table:

# Query the BigLake table
SELECT ...
FROM `your-project-id.biglake.csv_table`;

Use BigLake if loading data into BigQuery isn't an option for your use case. Queries are performant due to metadata caching, but no query cost estimation or table preview is available.

BigLake tables provide a seamless querying experience, allowing you to interact with data stored in external sources like Cloud Storage just like you would with data in native BigQuery tables. You can use standard SQL queries to access and analyze the data within BigLake tables, including SELECT statements and joins. Behind the scenes, BigLake leverages metadata caching to enhance query performance, even though the data physically resides outside BigQuery. However, some features like query cost estimation and table preview are not available for BigLake tables due to the external nature of the data.
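Creating a BigLake table looks like creating an external table, plus a connection that carries the service account used for delegated access. A minimal sketch, assuming a previously created connection resource; the project, connection, dataset, and bucket names are placeholders.

# a sketch of a BigLake table definition; the connection,
# dataset, and bucket names are placeholders, and the schema
# is auto-detected from the CSV file
CREATE EXTERNAL TABLE dataset_name.biglake_table
WITH CONNECTION `your-project-id.us.my-connection`
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-bucket/your_file.csv']
);

The WITH CONNECTION clause is what distinguishes a BigLake table from a plain external table: queries read Cloud Storage through the connection's service account rather than through the user's own credentials.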
BigLake uses metadata caching to accelerate performance. BigLake maintains a metadata cache that stores details about external data, such as Parquet files in Cloud Storage. An entry might look like:

File: file1.parquet | Partition: 2023-05-01 | Size: 94371840 | RowCount: 4096 | ColumnStats: colA_minmax: ("Alice", "Bob"), colB_minmax: (18, 24)

Advantages for the BigQuery Analytics Engine: skip the object list, faster file and partition pruning, and dynamic predicate pruning. Advantage for Spark: it can read the metadata statistics using the spark-bigquery connector. Cache properties: staleness between 30 minutes and 7 days, with automatic or manual refresh.

For example, the cache can contain details about Parquet files stored in Cloud Storage, such as file size, row count, and column statistics like minimum and maximum values. This cache allows queries via BigQuery to skip listing all objects, prune files and partitions faster, and enable dynamic predicate pushdown, resulting in improved query performance. The cache also allows queries via Spark to access metadata statistics that the spark-bigquery connector can leverage to speed up queries. The metadata cache has a configurable staleness from 30 minutes to 7 days, and it can be refreshed automatically or manually.

Compare security approaches for non-natively stored data. External tables: the user needs separate permissions to access the table and the data source. BigLake: access is delegated using a service account, which decouples access to the table from the underlying data source.

External tables in BigQuery require users to have separate permissions for both the table itself and the underlying data source. This can lead to more complex access management. BigLake tables offer a streamlined approach. Access is delegated through a service account, decoupling table access from the data source. This simplifies permission management and enhances security.

Compare external and BigLake tables:

Query engine — External tables: BigQuery. BigLake tables: BigQuery, plus other engines using the BigQuery connector.
Data types — External tables: Parquet, Avro, JSON, ORC, CSV; Datastore and Firestore exports; Google Sheets. BigLake tables: Parquet, Avro, JSON, ORC, CSV; Apache Iceberg, Delta, Hudi; Hive-partitioned data; any unstructured data (via object tables).
Locations — External tables: Cloud Storage, Bigtable, Google Drive. BigLake tables: Cloud Storage, Amazon S3, Azure Blob Storage.
Access pattern — External tables: direct link to the source; the user needs to have access to the source. BigLake tables: external connection with access delegation; the service account has access to the source.
Security — External tables: none. BigLake tables: column-level security, data masking, and row-level security.

In summary, both external and BigLake tables enable querying data residing outside of BigQuery, but BigLake offers broader capabilities. BigLake supports a wider range of data formats and storage locations, including object stores across multiple cloud providers, and provides advanced security features like column-level and row-level security. External tables are simpler to set up but lack fine-grained security controls. BigLake tables offer enhanced performance, security, and flexibility for querying external data, making them suitable for enterprise data lake use cases.
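Tying the caching and connection pieces together, here is a minimal sketch of a BigLake table with metadata caching enabled; all names are placeholders, and the staleness value must fall within the 30-minute-to-7-day window described above.

# a sketch of a BigLake table with metadata caching enabled;
# project, connection, dataset, and bucket names are placeholders
CREATE EXTERNAL TABLE dataset_name.cached_biglake_table
WITH CONNECTION `your-project-id.us.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://your-bucket/parquet/*.parquet'],
  max_staleness = INTERVAL 4 HOUR,   # tolerate cached metadata up to 4 hours old
  metadata_cache_mode = 'AUTOMATIC'  # refresh the cache automatically
);

A longer max_staleness favors query speed over freshness; MANUAL mode instead lets a pipeline refresh the cache explicitly after it writes new files.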
Lab: BigLake: Qwik Start (45 min)

Learning objectives:
- Create and view a connection resource.
- Set up access to a Cloud Storage data lake.
- Create a BigLake table.
- Query a BigLake table through BigQuery.
- Set up access control policies.
- Upgrade an external table to be a BigLake table.

In this lab, you use BigLake to connect to various external data sources. You configure a connection resource and set up access to a Cloud Storage data lake. You create and query a BigLake table and set up access control policies. Finally, you upgrade an existing external table to be a BigLake table.
