(Delta) Ch 9 Delta Sharing.pdf
CHAPTER 9
Delta Sharing

The data-centric nature of today’s economy necessitates extensive data exchange among organizations and their customers, suppliers, and partners. While efficiency and immediate accessibility are crucial, they often clash with security concerns. Organizations require an open and secure approach to data sharing to thrive in the digital economy.

Data sharing is often required within an organization as well. Organizations have geographically dispersed locations with local cloud solutions. These companies often seek to implement a data mesh architecture, where ownership is decentralized and data management is distributed and federated. Efficient and secure data sharing is a critical enabler for sharing data products across the organization. The different business groups across an enterprise need access to data to make critical business decisions. Data teams want to integrate their solutions to create a comprehensive enterprise view of the business.

Conventional Methods of Data Sharing

Sharing data across various platforms, companies, and clouds has always been a complex challenge. Organizations were reluctant to share data due to concerns about security risks, competition, and the considerable costs associated with implementing data-sharing solutions.

Conventional data-sharing technologies struggle to meet modern requirements, such as compatibility with multiple cloud environments and support for open formats, while still delivering the required performance. Many data-sharing solutions are tied to a specific vendor, creating problems for data providers and consumers operating on incompatible platforms.

Data-sharing solutions have been developed in three forms: legacy and homegrown (custom-built) solutions, modern cloud object storage, and proprietary commercial solutions. Each of these approaches has its pros and cons.
Legacy and Homegrown Solutions

Organizations have built homegrown systems to implement data-sharing solutions based on legacy technology like email, SFTP, or custom APIs, as shown in Figure 9-1.

Figure 9-1. Homegrown solutions to data sharing

Advantages of these solutions:

Vendor agnostic
    FTP, email, and APIs are well-documented protocols, enabling data consumers to utilize a variety of clients to access the data provided to them.
Flexibility
    Many custom-built solutions are based on open source technologies, allowing them to function both on premises and in cloud environments.

Disadvantages of these solutions:

Data movement
    Extracting data from cloud storage, transforming it, and hosting it on an FTP server for different recipients requires significant effort. This approach also leads to data duplication, which prevents organizations from accessing real-time data instantly.
Complexity of data sharing
    Custom-built solutions often involve complex architectures due to replication and provisioning. This complexity adds substantial time to data-sharing activities and can result in outdated data for end consumers.
Operational overhead for data recipients
    Data recipients need to perform data extraction, transformation, and loading (ETL) for their specific use cases, further delaying the time to gain insights. Whenever providers update the data, consumers must rerun the ETL pipelines.
Security and governance
    As modern data requirements become more stringent, securing and governing homegrown and legacy technologies becomes increasingly challenging.
Scalability
    Managing and maintaining such solutions is costly, and they lack the scalability to accommodate large datasets.

Proprietary Vendor Solutions

Commercial data-sharing solutions are widely chosen by companies seeking alternatives to building in-house solutions.
These solutions strike a balance for companies that do not want to allocate extensive time and resources to building their own solution, yet want greater control than cloud object storage can provide, as shown in Figure 9-2.

Advantages of this solution:

Simplicity
    Commercial solutions offer an easy way for users to share data with others on the same platform.

Disadvantages of this solution:

Vendor lock-in
    Commercial solutions often lack interoperability with other platforms, making it difficult to share data with users of competing solutions. This limitation reduces the accessibility of data and results in vendor lock-in. Additionally, platform disparities between data providers and recipients introduce complexities in data sharing.
Data movement
    Data needs to be loaded onto a specific platform, which involves additional steps, such as ETL and creating copies of the data.
Scalability
    Commercial data-sharing solutions may have limitations on scaling imposed by the vendors.
Cost
    The aforementioned challenges contribute to additional costs for sharing data with potential customers, as data providers need to replicate data for different recipients across various cloud platforms.

Cloud Object Storage

Object storage is well suited to cloud environments because it is elastic and scales seamlessly, allowing it to handle vast amounts of data and accommodate continued growth. Leading cloud providers offer cost-effective object storage services, such as Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), that deliver exceptional scalability and reliability.

One noteworthy feature of cloud object storage is the capability to generate signed URLs. These URLs provide time-limited permissions for downloading specific objects.
Anyone who holds a pre-signed URL can access the designated objects, which makes data sharing efficient.

Advantages of this solution:

Sharing data in place
    Object storage can be shared in place, allowing consumers access to the latest available data.
Scalability
    Cloud object storage benefits from availability and durability guarantees that typically cannot be achieved on premises. Data consumers retrieve data directly from the cloud providers, saving bandwidth for the data providers.

Disadvantages of this solution:

Limited to a single cloud provider
    Recipients have to be on the same cloud to access the objects.
Cumbersome security and governance
    There is complexity associated with assigning permissions and managing access. Custom application logic is needed to generate signed URLs.
Complexity
    Personas managing data sharing (database administrators, analysts) find it difficult to understand identity and access management policies and how data is mapped to underlying files. For companies with large volumes of data, sharing via cloud storage is time-consuming, cumbersome, and nearly impossible to scale.
Operational overhead for data recipients
    The data recipients must run ETL pipelines on the raw files before consuming them for their end use cases.

The lack of a comprehensive solution makes it a struggle for data providers and consumers to share data easily. Cumbersome and incomplete data sharing also constrains the development of business opportunities from shared data.

Open Source Delta Sharing

Unlike proprietary solutions, open source data sharing is not associated with a vendor-specific technology that introduces unnecessary limitations and financial burdens. Open source Delta Sharing is readily available to anyone who needs to share data at scale.
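The signed URLs described under Cloud Object Storage above reappear later in this chapter, so it may help to see the idea in miniature. The following is a toy sketch using only the Python standard library; it is not the signing scheme of any cloud provider. The secret key, the `storage.example.com` host, and the function names are all hypothetical and chosen for illustration. It shows the two properties that matter: the URL stops working after its expiry time, and changing the object path invalidates the signature.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical secret known only to the storage service (assumption for this sketch).
SECRET_KEY = b"demo-secret"


def _signature(path: str, expires_at: int) -> str:
    """Compute an HMAC-SHA256 over the object path and the expiry time."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_signed_url(base_url: str, expires_at: int) -> str:
    """Append an expiry timestamp and a signature to the object URL."""
    path = urlsplit(base_url).path
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"{base_url}?{query}"


def is_valid(signed_url: str, now: int) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parts = urlsplit(signed_url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parts.path, expires_at)
    return hmac.compare_digest(sig, expected) and now <= expires_at


url = make_signed_url("https://storage.example.com/bucket/part1.parquet", expires_at=1_000)
print(is_valid(url, now=900))    # True: signature matches, not yet expired
print(is_valid(url, now=1_100))  # False: past the expiry time
```

Real services sign more fields (HTTP method, headers, credentials scope) and use their own formats, but the design choice is the same: the holder of the URL needs no account or credentials, only the URL itself, for a limited time.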
Delta Sharing Goals

Delta Sharing is an open source protocol designed with the following objectives:

Open cross-platform data sharing
    Delta Sharing provides an open source, cross-platform solution that avoids vendor lock-in. It allows data sharing in Delta Lake and Apache Parquet formats with any platform, whether on premises or another cloud.
Share live data without data movement
    Data recipients can directly connect to Delta Sharing without replicating the data. This feature enables the easy and real-time sharing of existing data without unnecessary data duplication or movement.
Support a wide range of clients
    Delta Sharing supports a diverse range of clients, including popular tools like Power BI, Tableau, Apache Spark, pandas, and Java. It offers flexibility for consuming data using the tools of choice for various use cases, such as business intelligence, machine learning, and AI. Implementing a Delta Sharing connector is quick and straightforward.
Centralized governance
    Delta Sharing provides robust security, auditing, and governance capabilities. Data providers have granular control over data access, allowing them to share an entire table or specific versions or partitions of a table. Access to shared data is managed and audited from a single enforcement point, ensuring centralized control and compliance.
Scalability for massive datasets
    Delta Sharing is designed to handle massive structured datasets, and supports sharing unstructured data and future data derivatives such as machine learning models, dashboards, notebooks, and tabular data. Delta Sharing enables the economical and reliable sharing of large-scale datasets by leveraging the cost effectiveness and scalability of cloud storage systems.

Delta Sharing Under the Hood

Delta Sharing is an open protocol that defines REST API endpoints that enable secure access to specific portions of a cloud dataset.
It leverages the capabilities of modern cloud storage systems like Amazon S3, ADLS, or GCS to ensure the reliable transfer of large datasets. The process involves two key parties: data providers and recipients, as depicted in Figure 9-3.

Figure 9-3. Overview of the Delta Sharing protocol (a data provider shares a Delta Lake table from any cloud or on premises, without replication; data recipients read it with clients such as Power BI, Apache Spark, pandas, Tableau, Java, and many more)

Data Providers and Recipients

As the data provider, you can use Delta Sharing to share existing tables or parts thereof (e.g., specific table versions or partitions) stored on your cloud data lake in Delta Lake format. The data provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. Open source Delta Lake includes a reference sharing server, and Databricks provides one for its platform; other vendors are expected to follow soon.

As a data recipient, you only need one of the many Delta Sharing clients supporting the protocol. Open source Delta Lake has released open source connectors for pandas, Apache Spark, Rust, and Python, and is working with partners on more clients.

The actual exchange is carefully designed to be efficient by leveraging the functionality of cloud storage systems and Delta Lake. The Delta Sharing protocol works as follows (see Figure 9-4):

1. The recipient’s client authenticates to the sharing server (via a bearer token or other method) and asks to query a specific table. The client can also provide filters on the data (e.g., “country = US”) as a hint to read just a subset of the data.
2. The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in ADLS (on Azure), S3 (on AWS), or GCS (on GCP) that make up the table.
3. To transfer the data, the server generates short-lived pre-signed URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel with massive bandwidth, without streaming through the sharing server. This powerful feature, available in all the major clouds, makes it fast, cheap, and reliable to share very large datasets.

Figure 9-4. Delta Sharing protocol details (the client requests to read table "accounts"; the sharing server checks access permissions and returns short-lived URLs such as https://adls.azure.com/part1?sig=...; the client then reads the Parquet objects directly from ADLS)

Benefits of the Design

The Delta Sharing design provides many benefits for both providers and consumers:

• Data providers can easily share an entire table, or just a version or a partition of the table, because clients are only given access to a specific subset of the objects in it.
• Data providers can update data reliably in real time using the ACID transactions on Delta Lake, and recipients will always see a consistent view.
• Data recipients don’t need to be on the same platform as the provider, or even in the cloud at all; sharing works across clouds and even from cloud to on-premises users.
• The Delta Sharing protocol is simple for clients to implement if they already leverage Parquet.
• Data transfer using the underlying cloud system is fast, cheap, reliable, and parallelizable.

The delta-sharing Repository

You can find the delta-sharing GitHub repository online.
It contains the following components:

• The Delta Sharing protocol specification.
• The Python connector. This is a Python library that implements the Delta Sharing protocol to read shared tables as pandas or PySpark DataFrames.
• An Apache Spark connector. This connector implements the Delta Sharing protocol to read shared tables from a Delta Sharing Server. You can then use SQL, Python, Scala, Java, or R to access the tables.
• A reference implementation of the Delta Sharing protocol in a Delta Sharing Server. Users can deploy this server to share existing tables in Delta Lake and Parquet format on Azure, AWS, or GCP storage systems.

Next, let’s use the Python connector to access Delta tables in an example Delta Sharing Server, hosted by delta-io.

Step 1: Installing the Python Connector

The Python connector is offered as a PyPI library named delta-sharing, so we just need to add this library to our cluster, as shown in Figure 9-5.

Figure 9-5. Installing the delta-sharing PyPI library on our cluster

Step 2: Installing the Profile File

The Python connector accesses shared tables based on profile files. You can download the profile file for the example Delta Sharing Server by following the link. It downloads as a file named open-datasets.share. This is a simple JSON file with the credentials for the server (the bearer token in this example is obfuscated):

    {
      "shareCredentialsVersion": 1,
      "endpoint": "https://sharing.delta.io/delta-sharing/",
      "bearerToken": "faaiexxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }

Upload the share file to a dbfs:/ location in the Databricks filesystem using the dbfs cp command:

    C:\Users\bhael\Downloads>dbfs cp open-datasets.share dbfs:/mnt/.../delta-sharing/
    C:\Users\bhael\Downloads>dbfs ls dbfs:/mnt/datalake/book/delta-sharing
    open-datasets.share

Step 3: Reading a Shared Table

In the “01 - Sharing Example” notebook we can then reference the file:

    import delta_sharing

    # Point to the profile file. It can be a file on the local
    # file system or remote file system. In this case, we have
    # uploaded the file to dbfs
    profile_path = "/dbfs/mnt/datalake/book/delta-sharing/open-datasets.share"

Depending on how you will access a shared table, you will have to use a different path syntax. The profile_path specified here will work when you access the table as a pandas DataFrame. If you want to access the table with Spark, you will have to use the dbfs:/ syntax instead of the /dbfs/ prefix.

Next, we can create a SharingClient, passing it the profile path, and list all shared Delta tables:

    # Create a SharingClient and list all shared tables
    client = delta_sharing.SharingClient(profile_path)
    client.list_all_tables()

This produces the following output:

    Out: [Table(name='COVID_19_NYT', share='delta_sharing', schema='default'),
     Table(name='boston-housing', share='delta_sharing', schema='default'),
     Table(name='flight-asa_2008', share='delta_sharing', schema='default'),
     Table(name='lending_club', share='delta_sharing', schema='default'),
     Table(name='nyctaxi_2019', share='delta_sharing', schema='default'),
     Table(name='nyctaxi_2019_part', share='delta_sharing', schema='default'),
     Table(name='owid-covid-data', share='delta_sharing', schema='default')]

To create a URL to a shared table, we use the following syntax:

    <profile-path>#<share-name>.<schema-name>.<table-name>

We can now build the URL and read the contents of the shared Delta table as a pandas DataFrame:

    # Create a URL to access a shared table
    # A table path is the profile path followed by `#` and the
    # fully qualified table name (<share-name>.<schema-name>.<table-name>)
    # Here, we are loading the table as a pandas DataFrame
    table_url = profile_path + "#delta_sharing.default.boston-housing"
    df = delta_sharing.load_as_pandas(table_url, limit=10)
    df.head()

Output (only showing relevant portions):

    +--+-------+----+-----+----+-----+-----+
    |ID|crim   |zn  |indus|chas|nox  |rm   |
    +--+-------+----+-----+----+-----+-----+
    |1 |0.00632|18  |2.31 |0   |0.538|6.575|
    |2 |0.02731|0   |7.0  |0   |0.469|6.421|
    |4 |0.03237|0   |2.18 |0   |0.458|6.998|
    |5 |0.06905|0   |2.18 |0   |0.458|7.147|
    |7 |0.08829|12.5|7.87 |0   |0.524|6.012|
    +--+-------+----+-----+----+-----+-----+

If we want to load the table as a standard PySpark DataFrame, we can use the load_as_spark() method:

    # We can also access the shared table with Spark. Note that we have to use the
    # dbfs:/ path prefix here
    profile_path_spark = "dbfs:/mnt/datalake/book/delta-sharing/open-datasets.share"
    table_url_spark = profile_path_spark + "#delta_sharing.default.boston-housing"
    df_spark = delta_sharing.load_as_spark(table_url_spark)
    display(df_spark.limit(5))

Notice the slight change in the URL, as discussed earlier. This will produce the same output as the pandas example.

Conclusion

Enabling data exchange using open source technology brings many benefits for both internal and external use. First, it offers significant flexibility, allowing the team to tailor the data exchange process to meet specific business use cases and requirements. Support from the active open source community ensures continuous improvements, bug fixes, and access to a vast amount of knowledge, further empowering the team and business users to stay at the forefront of data sharing practices.

Among the key benefits of using Delta Sharing for data providers and data recipients, the following are the most important: Scalability is critical for data teams working with ever-growing datasets and high-demand use cases. Interoperability is another significant benefit.
Delta Sharing, as an open source technology, is designed to work in harmony with other components of the data ecosystem, facilitating seamless integration. In addition, transparency and security are improved compared to the proprietary solutions, as the Delta Sharing source code is available for review, which allows for stronger security measures and the ability to respond to and proactively address identified vulnerabilities.

By using Delta Sharing, teams avoid vendor lock-in by having the freedom to switch between tools or vendors with no investment needed in adapting to the new architecture. The rapid pace of innovation in the open source community allows teams to embrace cutting-edge features and quickly adapt to new trends in data management and analytics.

The ability to share data using Delta Sharing allows for a more agile, cost-effective, and innovative data ecosystem by delivering better data-driven solutions and insights for organizational success in an ever-changing environment and data landscape.

Building on the foundational components you have learned about to this point, in Chapter 10 you will dive into the details of how to build a complete data lakehouse.