Questions and Answers
Commercial data-sharing solutions provide unlimited scalability.
False
Delta Sharing is a vendor-specific technology.
False
Object storage allows handling limited amounts of data.
False
Commercial data-sharing solutions do not involve data movement.
Delta Sharing allows data sharing only in Delta Lake format.
Commercial data-sharing solutions provide open cross-platform data sharing.
Delta Sharing replicates data for different recipients across various cloud platforms.
Cloud object storage is not scalable.
Delta Sharing supports a limited range of clients, including only Power BI and Tableau.
Delta Sharing provides robust security and governance capabilities for data providers.
Data providers need to convert their data to a proprietary format to share it using Delta Sharing.
As a data recipient, you need to implement the Delta Sharing protocol to access shared data.
The sharing server generates long-lived pre-signed URLs for data recipients to access shared data.
A shared table can be accessed using a Delta Sharing client by passing the table name directly.
Delta Sharing is a vendor-agnostic open protocol for secure data sharing.
Delta Sharing only supports SQL data types.
Delta Sharing focuses on ease of consumption, strong security, and scalability.
Delta Sharing is not part of the Delta Lake project.
Data providers need to implement the Delta Sharing protocol to share data.
Delta Sharing is scalable to sharing massive data sets.
Cloud object storage is not used in Delta Sharing for data transfer.
Delta Sharing supports fine-grained access control and auditing.
Delta Sharing allows sharing of static tables only.
Delta Sharing is a vendor-locked technology.
Delta Sharing requires data to be converted to a proprietary format.
Delta Sharing supports a limited range of clients, including only pandas and Spark.
Delta Sharing lacks robust security and governance capabilities for data providers.
Study Notes
Commercial Data-Sharing Solutions
- Companies choose commercial data-sharing solutions instead of building in-house ones: they avoid the time and resources a proprietary build demands while offering more control than plain cloud object storage.
- These solutions provide simplicity for users to share data with others on the same platform.
- Limitations of commercial solutions include:
Vendor Lock-in
- Lack of interoperability with other platforms, making it difficult to share data with users of competing solutions.
Data Movement
- Data needs to be loaded onto a specific platform, involving additional steps like ETL and creating copies of the data.
Scalability and Cost
- Commercial solutions may have limitations on scaling imposed by vendors.
- Additional costs arise from replicating data for different recipients across various cloud platforms.
Cloud Object Storage
- Object storage is suitable for cloud environments due to its elastic nature and seamless scalability.
- It allows handling vast amounts of data and effortlessly accommodating unlimited growth.
Open Source Delta Sharing
- Open source data sharing avoids vendor-specific limitations and financial burdens.
- Delta Sharing is an open source protocol designed for open cross-platform data sharing.
- It allows sharing data in Delta Lake and Apache Parquet formats with any platform, whether on premises or another cloud.
Delta Sharing Goals
- Open cross-platform data sharing to avoid vendor lock-in.
- Share live data without data movement, allowing data recipients to directly connect.
- Support a diverse range of clients, including popular tools like Power BI, Tableau, Apache Spark, pandas, and Java.
- Centralized governance with robust security, auditing, and governance capabilities.
Delta Sharing Protocol
- Data providers share existing tables or parts thereof stored on their cloud data lake in Delta Lake format.
- The data provider runs a sharing server that implements the Delta Sharing protocol and manages access for recipients.
- Recipients can use any Delta Sharing client supporting the protocol.
Delta Sharing Under the Hood
- The recipient's client authenticates to the sharing server and asks to query a specific table.
- The server verifies access, logs the request, and determines which data to send back.
- The server generates short-lived pre-signed URLs for the client to read Parquet files directly from the cloud provider.
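The three server-side steps above (verify access, log the request, hand back short-lived pre-signed URLs) can be sketched as a toy in-memory model. This is not the reference server: the signing key, tokens, grants, file names, storage host, and the `presign` and `query_table` helpers are all hypothetical, and a real deployment would ask the cloud provider to sign the URLs.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical in-memory state standing in for a real sharing server.
SIGNING_KEY = b"demo-secret"
TOKENS = {"token-abc": "recipient-1"}                  # bearer token -> recipient
GRANTS = {"recipient-1": {"share1.default.trips"}}     # recipient -> allowed tables
TABLE_FILES = {"share1.default.trips": ["part-0000.parquet", "part-0001.parquet"]}
AUDIT_LOG = []                                         # (timestamp, recipient, table)


def presign(path, expires_at):
    """Sign the path together with its expiry time (toy HMAC scheme)."""
    message = f"{path}:{expires_at}".encode()
    signature = hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "sig": signature})
    return f"https://storage.example.com/{path}?{query}"


def query_table(token, table, now=None, ttl_seconds=900):
    """Authenticate, authorize, log, then return one short-lived URL per file."""
    now = int(time.time()) if now is None else now
    recipient = TOKENS.get(token)
    if recipient is None:
        raise PermissionError("unknown bearer token")
    if table not in GRANTS.get(recipient, set()):
        raise PermissionError(f"{recipient} has no access to {table}")
    AUDIT_LOG.append((now, recipient, table))
    expires_at = now + ttl_seconds
    return [presign(f"{table}/{name}", expires_at) for name in TABLE_FILES[table]]


urls = query_table("token-abc", "share1.default.trips")
print(urls[0])
```

Because each URL carries its own expiry, the recipient reads the Parquet files straight from object storage and the server never has to proxy the data itself.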
Reading a Shared Table
- Create a Delta Sharing client by passing it the profile path, then list all shared Delta tables.
- Build the URL for a shared table as the profile path followed by the qualified table name: profile_path + "#<share>.<schema>.<table>".
- Read the shared table as a pandas DataFrame or a standard PySpark DataFrame using load_as_pandas() or load_as_spark().
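A minimal sketch of these steps with the open source `delta-sharing` Python package follows. The table URL is the profile path, a `#`, and then the share, schema, and table names joined by dots. The helper names `build_table_url` and `read_shared_table`, and any share, schema, or table names passed to them, are illustrative assumptions; the package is imported lazily so the URL helper can be tried without it installed.

```python
def build_table_url(profile_path, share, schema, table):
    """Join the profile path and the qualified table name with '#'."""
    return f"{profile_path}#{share}.{schema}.{table}"


def read_shared_table(profile_path, share, schema, table, as_spark=False):
    """List the visible tables, then load one as a pandas or PySpark DataFrame."""
    # Lazy import: requires `pip install delta-sharing`.
    import delta_sharing

    client = delta_sharing.SharingClient(profile_path)
    # Print every table that the profile's credentials can see.
    for shared_table in client.list_all_tables():
        print(shared_table)

    table_url = build_table_url(profile_path, share, schema, table)
    if as_spark:
        # Needs a Spark session with the Delta Sharing connector available.
        return delta_sharing.load_as_spark(table_url)
    return delta_sharing.load_as_pandas(table_url)
```

For a single-machine analysis the pandas path is usually enough; the Spark path is the better fit when the shared table is too large for one machine's memory.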
Data Sharing Challenges
- Current data sharing solutions are vendor-dependent, leading to vendor lock-in risks.
- Siloed systems make data hard to consume and manage.
Introducing Delta Sharing
- An open protocol for secure data sharing, enabling easy and secure data sharing between organizations.
- Supports multiple data types and languages, not just SQL.
- Focuses on ease of consumption, strong security, and scalability.
- Part of the Delta Lake project under the Linux Foundation.
How Delta Sharing Works
- Involves a data provider and a data recipient.
- Data provider sets up a Delta Sharing server, manages access permissions, and decides who gets access to which data subsets.
- Data recipient connects to the server using an open protocol client (e.g., Apache Spark, pandas, Tableau).
- Server filters data, checks access permissions, and generates short-lived URLs for secure transfer.
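On the recipient side, the connection details normally live in a small JSON profile file that the provider hands over. The sketch below assumes the profile fields `shareCredentialsVersion`, `endpoint`, and `bearerToken`, and builds (without sending) the authenticated request a client would use to list shares. The endpoint URL, the token placeholder, and the `list_shares_request` helper are made up for illustration.

```python
import json
import urllib.request


def list_shares_request(profile_json):
    """Build a GET request for the server's shares, authenticated by bearer token."""
    profile = json.loads(profile_json)
    endpoint = profile["endpoint"].rstrip("/")
    return urllib.request.Request(
        f"{endpoint}/shares",
        headers={"Authorization": f"Bearer {profile['bearerToken']}"},
        method="GET",
    )


# Hypothetical profile contents; a real file comes from the data provider.
EXAMPLE_PROFILE = json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token>",
})

request = list_shares_request(EXAMPLE_PROFILE)
print(request.get_method(), request.full_url)
```

Clients such as the pandas and Spark connectors wrap this kind of call, so recipients rarely issue the HTTP requests by hand.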
Benefits of Delta Sharing
- Systems that already read Parquet can implement a Delta Sharing client with little effort.
- Fast, cheap, reliable, and parallelizable data transfer using cloud object stores.
- Scalable to sharing massive data sets.
- Supports fine-grained access control, auditing, and compliance.
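The "parallelizable" point can be illustrated with a short sketch: since the server returns one independent URL per file, a client is free to download them concurrently. The `http_fetch` and `fetch_all` helpers are hypothetical and not part of any Delta Sharing client; the fetch function is injectable so the logic can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def http_fetch(url):
    """Download one file's bytes from its pre-signed URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=http_fetch, max_workers=8):
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Threads suit this workload because the time is spent waiting on object storage rather than on computation.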
Ecosystem and Adoption
- Multiple open-source projects and commercial systems support Delta Sharing.
- Leading data providers and vendors back the project.
- Databricks is implementing Delta Sharing, providing an integrated solution for secure data sharing.
Demo and Use Cases
- Demo: sharing vaccination data with the CDC using Delta Sharing.
- Use cases: data sharing across organizations, real-time data sharing, and secure data sharing.
Delta Sharing Syntax and Interface
- Create a share object using SQL commands.
- Add tables to the share.
- Grant permissions to recipients using standard grant statements.
- Use REST APIs for programmatic management.
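The sequence of SQL commands can be sketched as a small Python helper that assembles the statements. The statement shapes follow Databricks SQL as commonly documented and should be checked against the current documentation before use. The `CREATE RECIPIENT` step is an assumption not mentioned in the notes, and the share, table, and recipient names are hypothetical.

```python
def share_setup_statements(share, tables, recipient):
    """Return the SQL statements that create a share, fill it, and grant access."""
    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    # One ALTER SHARE per table added to the share.
    statements += [f"ALTER SHARE {share} ADD TABLE {table}" for table in tables]
    # Assumed step: the recipient object must exist before it can be granted access.
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient}")
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


for statement in share_setup_statements(
    "vaccine_share", ["main.health.vaccinations"], "cdc_recipient"
):
    print(statement + ";")
```

Each statement could then be run in a SQL editor or passed to a Spark session's `sql()` method on a platform that supports these commands.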
Key Features
- Allows access to data in Databricks directly through S3.
- Enables processing and counting of data with ease.
- Can connect to Delta Sharing using favorite tools like Spark, pandas, and more.
Connecting to Delta Sharing
- Can connect to Delta Sharing using pandas on a single machine.
- Load data from Delta Sharing into a pandas data frame.
- Perform analysis and visualization on the data frame.
Business Intelligence Tools
- Can connect Delta Sharing to business intelligence tools like Tableau and Power BI.
- Load data into Tableau and Power BI without setting up a separate data warehouse.
- Perform analysis and visualization on the data in real-time.
Features and Roadmap
- Envisions sharing of streams, machine learning models, table views, and arbitrary files.
- Working on governance capabilities, including time-limited sharing and restricted clean room analytics.
- Released a reference server and clients for pandas, Spark, and Rust.
- Working on open-source connectors and commercial connectors with partners.
Description
Commercial data-sharing solutions are chosen by companies as an alternative to building in-house solutions. They offer a balance between not wanting to allocate extensive time and resources to developing a proprietary solution and desiring greater control than what cloud object storage can provide.