Data Replication and Migration

Summary

This document provides an overview of the data replication and migration services on Google Cloud. It describes the tools and options available for moving data from on-premises and multicloud systems into Google Cloud, including the gcloud storage command, Storage Transfer Service, Transfer Appliance, and Datastream.

Full Transcript


In this module, you learn to:
01 Explain the baseline Google Cloud data replication and migration architecture.
02 Understand the options and use cases for the gcloud command line tool.
03 Explain the functionality and use cases for Storage Transfer Service.
04 Explain the functionality and use cases for Transfer Appliance.
05 Understand the features and deployment of Datastream.

In this module, first, you review the baseline Google Cloud data replication and migration architecture. Second, you cover the options and use cases for the gcloud command line tool. Then you review the functionality and use cases for Storage Transfer Service. The next topic addresses the functionality and use cases for Transfer Appliance. Finally, you look at the features and deployment of Datastream. The module covers four topics: Replication and Migration Architecture, the gcloud Command Line Tool, Moving Datasets, and Datastream.

Replication and Migration Architecture

Replication and migration services onboard your data into Google Cloud. The replicate and migrate stage of a data pipeline focuses on the tools and options that bring data from external or internal systems into Google Cloud for further refinement. Google Cloud provides a comprehensive suite of tools for this stage: the gcloud storage command, Transfer Appliance, Storage Transfer Service, and Datastream. After ingesting data with one of these tools, you can transform it as needed before finally storing it within Google Cloud.

Common replication and migration scenarios cover data that originates from on-premises or multicloud environments, including file systems, object stores, HDFS, and relational databases. Google Cloud offers options for one-off transfers, scheduled replications, and change data capture, ultimately landing the data in Cloud Storage or BigQuery.

Google Cloud provides additional options for migrating database workloads. Use Database Migration Service for seamless transitions from Oracle, MySQL, PostgreSQL, and SQL Server. For other data formats or more complex migrations, use ETL tools such as Dataflow, which offers a wide range of templates that handle NoSQL and other non-relational sources. Your target destination can be Cloud SQL, AlloyDB, or BigQuery, depending on your needs.

Choosing between migration options depends on your amount of data and your network bandwidth. With 1 TB of data, a 100 Gbps network takes about 2 minutes to complete the transfer, while the same dataset on a 100 Mbps network takes about 30 hours. The gcloud storage command and Storage Transfer Service are suitable for smaller datasets; for larger datasets, consider Transfer Appliance for faster offline transfer.
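As a back-of-the-envelope check of those figures (the quoted times presumably allow for protocol overhead and less-than-full link utilization, since raw line-rate arithmetic comes out somewhat faster):

    1 TB ≈ 8 × 10^12 bits
    At 100 Gbps (10^11 bits/s): 8 × 10^12 / 10^11 = 80 seconds, on the order of the 2 minutes quoted
    At 100 Mbps (10^8 bits/s):  8 × 10^12 / 10^8  = 80,000 seconds ≈ 22 hours, on the order of the 30 hours quoted

Real transfers are slower than the raw line rate because of TCP overhead, disk throughput limits, and retries, which is why the quoted figures are higher than the idealized calculation.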
The gcloud Command Line Tool

Use the gcloud storage command to move small amounts of data to Cloud Storage. It handles small to medium-sized transfers executed on an as-needed basis, with data originating from various on-premises sources such as file systems, object stores, or HDFS. The cp command facilitates these ad hoc transfers directly to Cloud Storage, for example:

> gcloud storage cp *.csv gs://mybucket
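For a slightly larger ad hoc transfer, a minimal sketch using gcloud storage follows. The bucket name and local paths are placeholders, and it assumes the recursive copy and rsync subcommands are available in your installed gcloud version:

    # Recursively copy a local directory tree into Cloud Storage
    # (placeholder bucket name and local path).
    gcloud storage cp --recursive ./exports gs://my-example-bucket/exports

    # On later runs, copy only what changed by syncing the local directory
    # with the bucket prefix.
    gcloud storage rsync --recursive ./exports gs://my-example-bucket/exports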
Moving Datasets

Move large amounts of data with Storage Transfer Service. It efficiently moves large datasets from on-premises or multicloud file systems, object stores (including Amazon S3 and Azure Blob Storage), and HDFS into Cloud Storage. It supports transfer speeds of up to several tens of Gbps and offers scheduling capabilities, making it well suited to medium and large-sized transfers.
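As a sketch of how such a transfer might be set up from the command line, the following assumes the gcloud transfer command group is available in your project and uses placeholder bucket names; consult the Storage Transfer Service documentation for scheduling and filtering options.

    # Create a transfer job from an Amazon S3 bucket into Cloud Storage
    # (placeholder bucket names; S3 credentials must be configured separately).
    gcloud transfer jobs create s3://source-example-bucket gs://destination-example-bucket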
Use Transfer Appliance to move large amounts of data in offline mode. Transfer Appliance is Google's solution for moving massive datasets offline: Google-owned hardware is shipped to your data center, you transfer your data onto the appliance, and then you ship it back to Google. It is ideal for large transfers with limited bandwidth, and it comes in multiple appliance sizes to suit your needs.

Datastream

Datastream continuously replicates your relational databases into Google Cloud for analytics. It replicates on-premises or multicloud relational databases such as Oracle, MySQL, PostgreSQL, and SQL Server, with the data landing in Cloud Storage or BigQuery. Its change data capture options let you perform a historical backfill plus new changes, or propagate new changes only. You have flexibility in public or private connectivity options and can selectively include or exclude data at the schema, table, or column level.

Datastream enables real-time data replication from source systems for several use cases: direct replication into BigQuery for analytics, custom data processing in Dataflow before loading into BigQuery, and event-driven architectures. Additionally, Datastream can be used with Dataflow templates for database replication and migration tasks, for example capturing changes with Datastream, staging them in Cloud Storage, normalizing them in Dataflow, and writing them to Spanner. This makes it a versatile tool for integrating data into Google Cloud.

Datastream uses the database's write-ahead log (WAL) to process change events. It taps into the logging mechanism of the specific source database to capture changes for propagation downstream: LogMiner for Oracle, the binary log for MySQL, logical decoding for PostgreSQL, and transaction logs for SQL Server. Change events such as inserts, updates, and deletes are processed by Datastream and transformed into structured formats such as Avro or JSON, ready for storage in Google Cloud, typically in BigQuery tables, enabling near real-time data replication for analytics and other use cases.

Datastream event messages contain two main sections: generic metadata and a payload. The metadata provides context about the data, such as the source table or object, when Datastream read the record (read_timestamp), and when the record changed on the source (source_timestamp). The payload contains the actual data change as key-value pairs, one for each column and its corresponding value. This structure allows for efficient, organized replication and tracking of changes. For example:

{
  "stream_name": "[...]",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "d7989206-380f-0e81-8056-240501101100",
  "read_timestamp": "2023-11-07T07:37:16.808Z",
  "source_timestamp": "2023-11-07T02:15:39",
  "payload": {
    "USER_ID": "1231535353",
    "FIRST_NAME": "Jane",
    "LAST_NAME": "Smith"
  }
}

Event messages also contain source-specific metadata in addition to the generic metadata and payload. This metadata provides context about the data's origin within the source system, including the database, schema, and table associated with the event, the change type (the DML operation, such as INSERT), and other system-specific identifiers. This additional information helps track data lineage and understand the context of changes replicated from the source database. An Oracle example:

"source_metadata": {
  "log_file": "[...]",
  "scn": 15869116216871,
  "row_id": "AAAPwRAALAAMzMBABD",
  "is_deleted": false,
  "database": "DB1",
  "schema": "ROOT",
  "table": "SAMPLE",
  "change_type": "INSERT",
  "tx_id": "[...]",
  "rs_id": "0x0073c9.000a4e4c.01d0",
  "ssn": 67
}

Datastream uses unified types to map source data types to destination data types. Regardless of whether the source column is an Oracle NUMBER, a MySQL DECIMAL, a PostgreSQL NUMERIC, or a SQL Server DECIMAL, Datastream represents it with a unified decimal type during replication. When the data lands in Google Cloud, it is written with the format-specific type of the destination, such as decimal in Avro files, number in JSON files, or NUMERIC in native BigQuery tables. This ensures data type consistency and compatibility across different database systems and streamlines the replication process.

Comparing the migration and replication options:

gcloud storage: online transfer; sources are on-premises or Google Cloud; recommended for less than 1 TB; batch velocity; any data format.
Storage Transfer Service: online transfer; sources are on-premises, Google Cloud, or multicloud; recommended for more than 1 TB; batch velocity (hourly at minimum); any data format.
Transfer Appliance: offline transfer; sources are on-premises; appliance sizes of 7 TB, 40 TB, and 300 TB; batch velocity; any data format.
Datastream: online transfer; sources are on-premises, Google Cloud, or multicloud; up to 10,000 tables per stream; batch and streaming velocity; structured data only.

In summary, Google Cloud offers several data migration and replication options. The gcloud storage command is suitable for smaller, online transfers. Storage Transfer Service handles larger online transfers efficiently. Transfer Appliance is ideal for massive offline data migrations. And Datastream provides continuous, online replication of structured data, supporting both batch and streaming velocities. Choose the option that best fits your data size, transfer type, and data availability requirements.

Lab: Datastream: PostgreSQL Replication to BigQuery (45 min)

Learning objectives:
Prepare a Cloud SQL for PostgreSQL instance using the Google Cloud console.
Import data into the Cloud SQL instance.
Create a Datastream connection profile for the PostgreSQL database.
Create a Datastream connection profile for the BigQuery destination.
Create a Datastream stream and start replication.
Validate that the existing data and changes are replicated correctly into BigQuery.

In this lab, you use Datastream to replicate data from PostgreSQL to BigQuery. You prepare and load a Cloud SQL for PostgreSQL instance, create Datastream connection profiles for the source and destination, and then create a Datastream stream and start replication. Finally, you validate the replication in BigQuery.
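As an illustrative check for the validation step, one might compare row counts between the source table and the replicated BigQuery table. The project, dataset, and table names below are placeholders, not the lab's actual identifiers:

    # Count rows in the replicated BigQuery table (placeholder names).
    bq query --use_legacy_sql=false \
      'SELECT COUNT(*) AS row_count FROM `my_project.datastream_dataset.example_table`'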
