Google Cloud Data Engineering Tasks and Components

Summary

This document is a Google Cloud presentation about data engineering tasks and components. It covers topics such as the role of a data engineer, differences between data sources and sinks, data formats, and storage options on Google Cloud. It also explains how to share datasets using Analytics Hub.

Full Transcript


Data Engineering Tasks and Components

In this module, you learn to:

- Explain the role of a data engineer.
- Understand the differences between a data source and a data sink.
- Explain the different types of data formats.
- Explain the storage solution options on Google Cloud.
- Understand the metadata management options on Google Cloud.
- Understand how to share datasets with ease using Analytics Hub.

In this module, first, you learn about the role of a data engineer. Second, we cover the differences between a data source and a data sink. Then, we review the different types of data formats that a data engineer will encounter. The next topic addresses the options for storing data on Google Cloud. Then, we cover the choices available for metadata management. Finally, we look at the features of Analytics Hub that let you easily share datasets both within and outside your organization.

The Role of a Data Engineer

A data engineer builds data pipelines to enable data-driven decisions:

- Get the data to where it can be useful: raw data ingestion and storage.
- Get the data into a usable condition: data transformation.
- Add new value to the data: data provisioning and enrichment.
- Manage the data: security, privacy, discovery, governance.
- Productionize data processes: pipeline monitoring and automation.

What does a data engineer do? At a basic level, a data engineer builds data pipelines. Why? Because they want to get their data into a place, such as a dashboard, report, or machine learning model, where the business can make data-driven decisions. The data has to be in a usable condition so that someone can use it to make decisions; many times, the raw data is, by itself, not very useful. Once data becomes useful, the data engineer will often apply updates or transformations to add new value to it. Of course, new data environments require data management practices to ensure currency and accuracy. Finally, data engineers create processes and operations to move data usage into production settings.

Data engineering tasks revolve around ingesting, transforming, and storing data. In the most basic sense, a data engineer moves data from data sources to data sinks in four stages:

- Replicate and migrate: transfer raw data into Google Cloud.
- Ingest: raw data is available in a data source.
- Transform: process data using EL, ELT, or ETL tools.
- Store: processed data is available in a data sink.

The replicate and migrate stage of a data pipeline focuses on the tools and options for bringing data from external or internal systems into Google Cloud for further refinement. Replication and migration services that onboard your data into Google Cloud include gcloud storage, Datastream, Storage Transfer Service, and Transfer Appliance, offering online or offline transfer, scheduling capabilities, and change data capture. A wide variety of tools and options are at your disposal; they are covered in more detail throughout this course.
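As an illustration, here is a minimal sketch of onboarding files from the command line; the bucket and file names are placeholders, and the Storage Transfer Service job assumes AWS credentials are supplied in a local file:

    # Copy local export files into a Cloud Storage landing bucket.
    gcloud storage cp ./exports/*.csv gs://my-landing-bucket/raw/

    # For large or recurring transfers, Storage Transfer Service can run
    # managed jobs; here, a one-off transfer from an S3 bucket.
    gcloud transfer jobs create s3://my-source-bucket gs://my-landing-bucket \
        --source-creds-file=aws-creds.json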
Data Sources Versus Data Sinks

Data sources are the origin point of your raw data on Google Cloud. Examples include Cloud Storage (unstructured or structured data), Pub/Sub (asynchronous messaging), Spanner (relational databases), and many more.

The ingest stage of a data pipeline is the point where data becomes a data source and is available for usage downstream. Think of a data source as the starting point of your data journey: raw, unprocessed data waiting to be transformed into valuable insights. Any system, application, or platform that creates, stores, or shares data can be considered a data source. Two examples of Google Cloud products used in the ingest phase are Cloud Storage, a data lake holding various types of data sources, and Pub/Sub, an asynchronous messaging system delivering data from external systems.

Transformation services, such as Dataproc, Dataflow, and Dataform, add new value to your data. The transform stage of a data pipeline represents action taken on a data source to adjust, modify, join, or customize it so that it matches a specific downstream data or reporting requirement. There are three main transformation patterns: extract and load (EL); extract, load, and transform (ELT); and extract, transform, and load (ETL). You explore each of these patterns in its own module later in the course.

Data sinks, such as BigQuery (streaming and batch load), Bigtable, Dataplex (security and governance), and Analytics Hub (simple data sharing), store your processed data on Google Cloud. The store stage of a data pipeline represents the last step, where we deposit data in its final form. A data sink is the final stop in the data journey: it's where processed and transformed data is stored for future use, analysis, and decision-making. Think of it as the reservoir at the end of the river, where valuable information is collected and readily available. Two examples of Google Cloud products used in the store phase are BigQuery, a serverless data warehouse, and Bigtable, a highly scalable NoSQL database.

Data Formats

Data exists in two primary formats: unstructured and structured. Unstructured data is information stored in a non-tabular form, such as documents, images, and audio files. Unstructured data is usually suited for Cloud Storage, but BigQuery also offers the capability to store unstructured data via object tables. Structured data represents information stored in tables, rows, and columns.
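To make the object table idea concrete, here is a hedged sketch of the DDL; the dataset, connection, and bucket names are placeholders, and the connection must be a Cloud resource connection with read access to the bucket:

    # Create a BigQuery object table over unstructured files in Cloud Storage.
    bq query --use_legacy_sql=false '
    CREATE EXTERNAL TABLE mydataset.document_objects
    WITH CONNECTION `us.my_connection`
    OPTIONS (
      object_metadata = "SIMPLE",
      uris = ["gs://my-unstructured-bucket/documents/*"]
    );'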
Storage Solution Options on Google Cloud

There are several key products on Google Cloud that are used by data engineers. One main product is Cloud Storage, which holds your unstructured data. Cloud Storage has four primary storage classes, differentiated by the expected frequency of object access:

- Standard storage: frequently accessed ("hot") data.
- Nearline storage: data accessed about once per month.
- Coldline storage: data accessed about once every 90 days.
- Archive storage: data accessed about once a year.

Typical workloads across these classes include application data, database backups, log files, and compliance data.

Within Cloud Storage, objects are accessed by using HTTP requests, including ranged GETs to retrieve portions of the data. The only key is the object name; there is object metadata, but the object itself is treated as unstructured bytes. The scale of the system allows for serving large static content and accepting user-uploaded content, including videos, photos, and files. Objects can be up to 5 TB each. Cloud Storage is built for availability, durability, scalability, and consistency. It's an ideal solution for hosting static websites and storing images, videos, objects, blobs, and any other unstructured data.

For structured data, you have a full range of cost-effective storage services to choose from when developing with Google Cloud. No one size fits all, and your choice of storage and database solutions will depend on your application and workload:

- Cloud SQL: Google Cloud's managed relational database service for transactional workloads with local or regional scalability.
- AlloyDB: a fully managed, high-performance PostgreSQL database service with high scalability.
- Spanner: Google Cloud's fully managed relational database service that offers both strong consistency and horizontal, global scalability.
- Firestore: a fast, fully managed, serverless NoSQL document database built for automatic scaling, high performance, and ease of application development.
- BigQuery: a fully managed, serverless enterprise data warehouse for analytical workloads.
- Bigtable: a high-performance NoSQL database service built for fast key-value lookup, with consistent sub-10-millisecond latency.

Two key concepts in data engineering are the data lake and the data warehouse. They compare as follows:

- Data format: a data lake stores data in its native format (unstructured, semi-structured, or structured); a data warehouse imposes a schema (structured or semi-structured).
- Data type: a lake holds raw data; a warehouse holds data pre-processed and aggregated from multiple sources.
- Purpose: a lake supports data science, applications, and business decisions; a warehouse supports long-term business analysis.
- Dependencies: a lake relies on tools and processes to enable data discovery, governance, security, and metadata management; a warehouse is standalone.
- Service: Cloud Storage for the lake; BigQuery for the warehouse.

A data lake is a vast repository for storing raw, unprocessed data in various formats. It serves as a centralized storage solution for diverse data types, enabling flexible use cases like data science, applications, and business decision-making. A data warehouse is a structured repository designed for storing pre-processed and aggregated data from multiple sources. Primarily used for long-term business analysis, it enables efficient querying and reporting for informed decision-making. Data warehouses often operate as standalone systems, independent of other data storage solutions.
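As a hedged illustration of this pairing, the commands below create a Cloud Storage bucket for the lake side and a BigQuery dataset for the warehouse side; all names and locations are placeholders:

    # Create a Cloud Storage bucket to act as the data lake.
    gcloud storage buckets create gs://my-data-lake \
        --location=US --default-storage-class=STANDARD

    # Create a BigQuery dataset to act as the data warehouse.
    bq --location=US mk --dataset my_project:analytics_dw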
BigQuery is a serverless, fully managed data warehouse. Its features include security at the dataset, table, column, and row level; a rich ecosystem for data transformations; built-in machine learning and geographic information system capabilities; integration with other storage services; scalable storage and analytics services; and real-time analytics on streaming data. BigQuery can scan terabytes in seconds and petabytes in minutes. It is a great solution for online analytical processing (OLAP) workloads and big data exploration and processing, and it is well suited for reporting with business intelligence tools.

Connecting to BigQuery is easy. BigQuery has several easy-to-use options for accessing data: the Google Cloud console's SQL editor, the bq command line tool, which is part of the Cloud SDK, and a robust REST API that supports calls in seven programming languages. For example, from the command line:

    bq query --use_legacy_sql=false '
    # Get count of comments by user for articles with "Google" in the title
    SELECT
      contributor_username,
      COUNT(comment) AS comments
    FROM
      `bigquery-public-data.samples.wikipedia`
    WHERE
      title LIKE "%Google%"
      AND contributor_username IS NOT NULL
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100;'

BigQuery organizes data tables into units called datasets, which are scoped to your Google Cloud project. A project contains datasets, and a dataset contains tables, views, ML models, and routines. When you reference a table from the command line, in SQL queries, or in code, you refer to it by using the construct project.dataset.table.

You can secure your BigQuery resources on multiple levels. Access control is through IAM and is applied at the dataset, table, view, or column level, and row-level security is also available; for example, a table's contact.email column can be masked with column-level security while row-level security restricts which countries' rows each user sees. In order to query data in a table or view, you need at least read permissions on the table or view.
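To make these controls concrete, here is a hedged sketch using BigQuery's SQL DCL statements; the table, column, and user names are hypothetical:

    # Grant read access on a single table.
    bq query --use_legacy_sql=false '
    GRANT `roles/bigquery.dataViewer`
    ON TABLE mydataset.customers
    TO "user:analyst@example.com";'

    # Row-level security: the granted user sees only rows where country = "US".
    bq query --use_legacy_sql=false '
    CREATE ROW ACCESS POLICY us_only
    ON mydataset.customers
    GRANT TO ("user:analyst@example.com")
    FILTER USING (country = "US");'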
Metadata Management Options on Google Cloud

Metadata is a key element to making data more manageable and useful across an organization. Dataplex lets you centrally discover, manage, monitor, and govern distributed data: it provides unified metadata, auto-discovery, data quality, data classification, data lifecycle and organization, insights and semantic search, unified security and governance, data discovery with Data Catalog, and end-to-end data lineage across analytics services such as BigQuery, Dataproc, Dataflow, and Vertex AI, and across storage in Cloud Storage (with multi-cloud, on-premises, and streaming support as future capabilities). With Dataplex, you can break down data silos, centralize security and governance while enabling distributed ownership, and easily search and discover data based on business context. Dataplex also offers built-in data intelligence, support for open-source tools, and a robust partner ecosystem, helping you to trust your data and accelerate time to insights.

As an example, you can group and share your data based on readiness using Dataplex: a landing zone holds ingested data with limited access, a raw zone holds cleaned, immutable data, and a curated zone holds processed data that serves as the source of trust, with each zone backed by Cloud Storage buckets and BigQuery datasets. Dataplex lets you standardize and unify metadata, security policies, governance, classification, and data lifecycle management across this distributed data. Another common use case is when your data is accessible only to data engineers and is later refined and made available to data scientists and analysts. In this case, you can set up a lake with a raw zone for data accessed by data engineers and data scientists, and a curated zone for data accessed by all users.
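The commands below are a hedged sketch of creating such a lake with raw and curated zones; the names, region, and flags are placeholders, and exact flags may differ by gcloud version:

    # Create a Dataplex lake.
    gcloud dataplex lakes create my-lake --location=us-central1

    # Add a raw zone and a curated zone to the lake.
    gcloud dataplex zones create raw-zone \
        --lake=my-lake --location=us-central1 \
        --type=RAW --resource-location-type=SINGLE_REGION

    gcloud dataplex zones create curated-zone \
        --lake=my-lake --location=us-central1 \
        --type=CURATED --resource-location-type=SINGLE_REGION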
Share Datasets Using Analytics Hub

How about sharing data outside your organization? This is challenging. Exporting and copying data between organizations raises questions about data freshness, pipeline management and ETL processes, onboarding external users in IAM, complex permissions, and the lack of visibility into data usage. You need to consider security and permissions, destination options for data pipelines, data freshness and accuracy, and, finally, usage monitoring. Analytics Hub was created to meet these data sharing challenges.

Sharing data across organizations with Analytics Hub is easy. A publisher project (1) publishes a shared dataset as a public or private listing on a data exchange. Subscribers (2) search the listings, (3) subscribe, which creates a linked dataset in their own project, and (4) query that linked dataset directly with BigQuery. Analytics Hub helps organizations unlock the value of data sharing, leading to new insights and business value. With Analytics Hub, you create a rich data ecosystem by publishing and subscribing to analytics-ready datasets. Because data is shared in place, data providers are able to control and monitor how their data is being used. Analytics Hub provides a self-service way to access valuable and trusted data assets, including data provided by Google. Finally, Analytics Hub provides an opportunity to monetize data assets, and it removes the tasks of building the infrastructure required for monetization.

Lab: Loading Data into BigQuery (30 min)

Learning objectives:

- Load data into BigQuery from various sources.
- Load data into BigQuery using the CLI and the Google Cloud console.
- Use DDL to create tables.

In this lab, you practice loading data into BigQuery. The primary objective of this lab is to load data into BigQuery using both the command-line interface and the Google Cloud console. You also experience loading several datasets into BigQuery and using Data Definition Language, or DDL.
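For reference before the lab, here is a hedged sketch of the two approaches the objectives mention; the dataset, table, and bucket names are placeholders:

    # Load a CSV file from Cloud Storage, letting BigQuery infer the schema.
    bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
        mydataset.mytable gs://my-bucket/data.csv

    # Create a table with a DDL statement instead.
    bq query --use_legacy_sql=false '
    CREATE TABLE mydataset.sales (
      item STRING,
      quantity INT64,
      sale_date DATE
    );'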
