CHAPTER 10
Building a Lakehouse on Delta Lake

Chapter 1 introduced the concept of a data lakehouse, which combines the best elements of a traditional data warehouse and a data lake. Throughout this book you have learned about the five key capabilities that help enable the lakehouse architecture: the storage layer, data management, SQL analytics, data science and machine learning, and the medallion architecture.

Before diving into building a lakehouse on Delta Lake, let's quickly review the industry's data management and analytics evolution:

Data warehouse
From the 1970s through the early 2000s, data warehouses were designed to collect and consolidate data into a business context, providing support for business intelligence and analytics. As data volumes grew, velocity, variety, and veracity also increased. Data warehouses had challenges with addressing these requirements in a flexible, unified, and cost-effective manner.

Data lake
In the early 2000s, increased volumes of data drove the development of data lakes (initially on premises with Hadoop and later with the cloud), a cost-effective central repository to store any format of data at any scale. But again, even with added benefits there were additional challenges. Data lakes had no transactional support, were not designed for business intelligence, offered limited data governance support, and still required other technologies (e.g., data warehouses) to fully support the data ecosystem. This led to overly complex environments with a patchwork of different tools, systems, technologies, and multiple copies of the data. The emergence of a coexisting data lake and data warehouse still leaves much to be desired. The incomplete support for use cases and incompatible security and governance models has led to increased complexity from disjointed and duplicative data silos that contain subsets of data across different tools and technologies.

Lakehouse
In the late 2010s, the concept of the lakehouse emerged. This introduced a modernized version of a data warehouse that provides all of the benefits and features without compromising the flexibility of a data lake. The lakehouse leverages a low-cost, flexible cloud storage layer, a data lake, combined with data reliability and consistency guarantees through technologies that feature open-table formats with support for ACID transactions. This flexibility helps support diverse workloads such as streaming, analytics, and machine learning under a single unified platform, which ultimately enables a single security and governance approach for all data assets.

With the advent of Delta Lake and the lakehouse, the paradigm of end-to-end data platforms has begun to shift due to the key features enabled by this architectural pattern. By combining the capabilities of the lakehouse with what you have learned in this book, you will learn how to enable the key features offered by a lakehouse architecture and be fully up and running with Delta Lake.

Storage Layer

The first step, or layer, in any well-designed architecture is deciding where to store your data. In a world with increasing volumes of data coming in different forms and shapes from multiple heterogeneous data sources, it is essential to have a system that allows for storing massive amounts of data in a flexible, cost-effective, and scalable manner. And that is why a cloud object store like a data lake is the foundational storage layer for a lakehouse.

What Is a Data Lake?
Previously defined in Chapter 1, a data lake is a cost-effective central repository to store structured, semi-structured, or unstructured data at any scale, in the form of files and blobs. This is possible in a data lake because it does not impose a schema when writing data, so data can be saved as is. A data lake uses a flat architecture and object storage to store data, unlike a data warehouse, which typically stores data in a hierarchical structure with directories and files while imposing a schema. Every object is tagged with metadata and a unique identifier so that applications can use it for easy access and retrieval.

Types of Data

One of the key elements of a data lake is that a cloud object store provides limitless scalability to store any type of data. These different types of data have been covered in this book, but it is best to define the three classifications of data to demonstrate how they are structured and where they originate:

Structured data
What is it? In structured data all the data has a predefined structure, or schema. This is most commonly relational data coming from a database in the form of tables with rows and columns.
What produces it? Data like this is typically produced by traditional relational databases and is often used in enterprise resource planning (ERP), customer relationship management (CRM), or inventory management systems.

Semi-structured data
What is it? Semi-structured data does not conform to a typical relational format like structured data. Rather, it is loosely structured with patterns or tags that separate elements of the data, such as key/value pairs. Examples of semi-structured data are Parquet, JSON, XML, and CSV files, and even emails or social feeds.
What produces it? Common data sources for this type of data can include nonrelational or NoSQL databases, IoT devices, apps, and web services.

Unstructured data
What is it? Unstructured data does not contain an organized structure; it is not arranged in any type of schema or pattern. It is often delivered as media files, such as photo (e.g., JPEG) or video files (e.g., MP4). The underlying video files might have an overall structure to them, but the data that forms the video itself is unstructured.
What produces it? A vast majority of an organization's data comes in the form of unstructured data, and is produced from things like media files (e.g., audio, video, photos), Word documents, log files, and other forms of rich text.

Unstructured and semi-structured data are often critical for AI and machine learning use cases, whereas structured and semi-structured data are critical for BI use cases. Because a data lake natively supports all three classifications of data, you can create a unified system that supports these diverse workloads. These workloads can complement each other in a well-designed processing architecture, which you will learn about further on in this chapter. A data lake helps solve many of the challenges related to data volumes, types, and cost, and while Delta Lake runs on top of a data lake, it is optimized to run best on a cloud data lake.
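To make this concrete, the sketch below shows how Spark can read all three classifications of data directly from a cloud object store. It is only an illustration: the s3://datalake/... paths are hypothetical, and an existing SparkSession named spark is assumed, as in the book's other examples.

%python
# A minimal sketch, assuming an existing SparkSession (`spark`) and
# hypothetical paths in a cloud object store such as Amazon S3.

# Structured data: relational exports, read with a schema inferred from the files
trips_csv = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://datalake/raw/trips/*.csv"))

# Semi-structured data: JSON events, where the structure comes from tags/keys
events_json = spark.read.json("s3://datalake/raw/events/*.json")

# Unstructured data: image files read as raw bytes plus path metadata
images = (spark.read
    .format("binaryFile")
    .load("s3://datalake/raw/images/*.jpg"))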
Key Benefits of a Cloud Data Lake

We've discussed how a data lake helps address some of the shortcomings of a data warehouse. A cloud data lake, as opposed to an on-premises data lake or a data warehouse, best supports a lakehouse architecture as the storage layer for a number of reasons:

Single storage layer
One of the most important features of a lakehouse is unifying platforms, and a cloud data lake helps eliminate and consolidate data silos and different types of data into a single object store. The cloud data lake allows you to process all data on a single system, which prevents creating additional copies of your data moving back and forth between different systems. Decreased movement of data across different systems also results in fewer integration points, which means fewer places for errors to occur. A single storage layer reduces the need for multiple security policies that cover different systems and helps resolve difficulties with collaboration across systems. It also offers data consumers a single place to look for all sources of data.

Flexible, on-demand storage layer
Whether it is velocity (streaming versus batch), volume, or variety (structured versus unstructured), cloud data lakes allow for the ultimate flexibility to store data. According to Rukmani Gopalan in The Cloud Data Lake [1], "these systems are designed to work with data that enters the data lake at any speed: real-time data emitted continuously as well as volumes of data ingested in batches on a scheduled basis." Not only is there flexibility with the data, but there is flexibility with the infrastructure as well. Cloud providers allow you to provision infrastructure on demand and quickly scale up or down elastically. Because of this level of flexibility, the organization can have a single storage layer that provides unlimited scalability.

[1] Gopalan, Rukmani (2022). The Cloud Data Lake, 1st ed. Sebastopol, CA: O'Reilly.

Decoupled storage and compute
Traditional data warehouses and on-premises data lakes have tightly coupled storage and compute. Storage is generally inexpensive, whereas compute is not. The cloud data lake allows you to decouple the two, independently scale your storage, and store vast amounts of data at very little cost.

Technology integration
Data lakes offer simple integration through standardized APIs, so organizations and users with completely different skills, tools, and programming languages (e.g., SQL, Python, Scala) can perform different analytics tasks all at once.

Replication
Cloud providers offer easy-to-set-up replication of your data lake to different geographical locations. The ease of enabling replication can make it useful for meeting compliance requirements, failover for business-critical operations, disaster recovery, and minimizing latency by storing objects closer to different locations.

Availability
Most cloud systems offer different types of data availability for cloud data lakes. This means that for data that is infrequently accessed or archived, compared to "hot" data that is accessed frequently, you can set up lifecycle policies. These lifecycle policies allow you to move data across different storage availability classes (with lower costs for infrequently accessed data) for compliance, business needs, and cost optimization. Availability can also be defined through service-level agreements (SLAs). In the cloud, this is the cloud provider's guarantee of a resource's minimal level of service.
Most cloud providers guarantee greater than 99.99% uptime for these business-critical resources.

Cost
With cloud data lakes you typically pay for what you use, so your costs always align with your data volumes. With only a single storage layer, less data movement across different systems, availability settings, and decoupled storage and compute, you have isolated and minimized costs for just data storage. For greater cost allocation, most cloud data lakes offer buckets, or containers (filesystems, not to be confused with application containers), to store different layers of the data (e.g., raw versus transformed data). These containers allow you to have finer-grained cost allocation for different areas of your organization. Since data sources and volumes are growing exponentially, it is extremely important to allocate and optimize costs without limiting the volume or variety of data that can be stored.

Figure 10-1 illustrates the different types of popular cloud-based data lakes at the time of writing, along with the different types of data that are stored in them.

Figure 10-1. Example of cloud-based data lakes (Amazon S3, Azure Data Lake Storage, Google Cloud Storage, IBM Cloud Object Storage) and the types of data they support: batch and streaming, structured, semi-structured, and unstructured

Overall, the storage layer is a critical component of a lakehouse architecture, as it enables organizations to store and manage massive amounts of data in a cost-effective and scalable manner. Now that you have a defined place to store your data, you also need to appropriately manage it.

Data Management

Although a cloud data lake allows you to elastically store data at scale in its native format, among other benefits, the next piece of the lakehouse foundation is facilitating data management. According to Gartner, data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes [2].

[2] "Data Management (DM)." Gartner, Inc. Accessed April 24, 2023. https://oreil.ly/xK4ws.

Data management is a key function for any organization and greatly hinges on the tools and technologies used in the access and delivery of data. Traditionally, data lakes have managed data simply as a "bunch of files" in semi-structured formats (e.g., Parquet), which makes it challenging to enable some of the key features of data management, such as reliability, due to lack of support for ACID transactions, schema enforcement, audit trails, and data integration and interoperability.

Data management on a data lake begins with a structured transactional layer for reliability. This reliability comes from a transaction layer that supports ACID transactions, open-table formats, integration between batch and streaming data processing, and scalable metadata management. This is where Delta Lake is introduced as a core component of the lakehouse architecture that supports data management. In Figure 10-2, you can see the different types of data that a cloud-based data lake supports, with Delta Lake built on top of the data lake, acting as a structured transactional layer.

Figure 10-2. The Delta Lake transactional layer: an open, transactional layer curated for data, offering ACID transactions, no infrastructure management, a change data feed, data sharing, integrations, and advanced performance tuning features, built on top of cloud-based data lake storage
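To illustrate what adding this structured transactional layer can look like in practice, the sketch below rewrites existing Parquet files in the data lake as a Delta table and registers it in the metastore. The paths are hypothetical, the taxidb.tripData table name is reused from this book's earlier examples, and an existing SparkSession with Delta Lake configured is assumed.

%python
# A minimal sketch, assuming an existing SparkSession with Delta Lake
# configured, and a hypothetical path containing raw Parquet files.
raw_path = "s3://datalake/raw/trips/parquet/"
delta_path = "s3://datalake/delta/trips/"

# Rewrite the raw Parquet data as a Delta table, which adds the
# transaction log (_delta_log) that enables the ACID guarantees
(spark.read.parquet(raw_path)
    .write.format("delta")
    .mode("overwrite")
    .save(delta_path))

# Register the Delta table in the metastore so it can be queried by name
spark.sql(
    f"CREATE TABLE IF NOT EXISTS taxidb.tripData USING DELTA LOCATION '{delta_path}'"
)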
Delta Lake brings durability, reliability, and consistency to your data lake. There are several key elements of Delta Lake that help facilitate data management:

Metadata
At the core of Delta Lake is the metadata layer. This layer provides extensive and scalable metadata tracking that enables most of the core features of Delta Lake. It provides a level of abstraction to implement ACID transactions and a variety of other management features. These metadata layers are also a natural place to begin enabling governance features, such as access control and audit logging.

ACID transactions
Delta Lake ensures that data is kept intact when concurrent transactions are active on a single table. Support for ACID transactions brings consistency and reliability to your data lake, which is made possible through the transaction log. The transaction log keeps track of all commits, and uses an isolation level to guarantee consistency and accurate views of the data.

Versioning and audit history
Since Delta Lake stores information about which files are part of a table in a transaction log, it allows users to query old versions of data and perform rollbacks, also referred to as time traveling. This provides full audit trails of your data for business needs or regulatory requirements, and can also be supportive of machine learning procedures.

Data integration
This can be defined as the consolidation of data from different sources into a single place. Delta Lake can handle concurrent batch and streaming operations on a table, providing a single place for users to view data. Not only this, but the ability to perform DML operations such as INSERTS, UPDATES, and DELETES allows you to perform effective data integration and maintain consistency on all tables.

Schema enforcement and evolution
Like traditional RDBMS systems, Delta Lake offers schema enforcement along with flexible schema evolution. This helps guarantee clean and consistent data quality through constraints, while providing flexibility for schema drift caused by other processes and upstream sources.

Data interoperability
Delta Lake is an open source framework that is cloud agnostic, and since it interacts seamlessly with Apache Spark by providing a set of APIs and extensions, there are a vast number of different integration capabilities across projects, other APIs, and platforms. This enables you to easily integrate with existing data management workflows and processes through different tools and frameworks. Delta UniForm offers format interoperability with Apache Iceberg, further expanding the set of systems and tools you can integrate with. Together with Delta Sharing, which makes it simple to securely share data across organizations, Delta Lake avoids vendor lock-in and enables interoperability.
Machine learning
Because machine learning (ML) systems often require processing large sets of data while using complex logic that isn't well suited for traditional SQL, they can now easily access and process data from Delta Lake using DataFrame APIs. These APIs allow ML workloads to directly benefit from the optimizations and features offered by Delta Lake. Data management across all the different workloads can now be consolidated on Delta Lake.

By adding a structured transactional layer better equipped to handle data management, the lakehouse simultaneously supports raw data, ETL/ELT processes that curate data for analytics, and ML workloads. The curation of data through ETL/ELT has traditionally been thought of and presented in the context of a data warehouse, but through the data management features offered by Delta Lake, you can bring those processes to a single place. That also allows ML systems to directly benefit from the features and optimizations offered by Delta Lake to complete the management and consolidation of data across different workloads. By combining all these efforts, you can bring greater reliability and consistency to your data lake and create a lakehouse that becomes the single source of truth across the enterprise for all types of data.
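Several of these elements can be demonstrated with a few statements against the taxidb.tripData table used throughout this book. This is only an illustrative sketch: the fare_amount and trip_distance columns and the added ingest_source column are assumptions, not part of the book's examples.

%sql
-- ACID DML: updates and deletes are committed atomically to the transaction log
UPDATE taxidb.tripData SET fare_amount = 0 WHERE fare_amount < 0;
DELETE FROM taxidb.tripData WHERE trip_distance IS NULL;

-- Versioning and audit history: inspect the commit history and time travel
-- back to an earlier version of the table
DESCRIBE HISTORY taxidb.tripData;
SELECT count(*) FROM taxidb.tripData VERSION AS OF 1;

-- Schema evolution: add a new column without rewriting existing data
ALTER TABLE taxidb.tripData ADD COLUMNS (ingest_source STRING);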
SQL Analytics

In the world of data analysis, business intelligence, and data warehousing, it is generally known that SQL is one of the most common and flexible languages. Part of the reason for this is not only that it offers an accessible learning curve with low barriers to entry, but also the complex data analysis operations it can perform. While allowing users to interact with data quickly, SQL also allows users of all skill levels to write ad hoc queries, prepare data for BI tools, create reports and dashboards, and perform a wide array of data analysis functions. For reasons like this, SQL has become the language of choice for business intelligence and data warehousing for everyone from data engineers to business users. This is why it is necessary that the lakehouse architecture achieves great SQL scalability and performance, and enables SQL analytics. Fortunately, a lakehouse architecture built around Delta Lake as the transactional layer for curated data has scalable metadata storage that is easily accessible through Apache Spark and Spark SQL.

SQL Analytics via Spark SQL

Spark SQL is Apache Spark's module, also referred to as a library, for working with structured data. It provides a programming interface to work with structured data using SQL and DataFrame APIs, all of which is underpinned by the Spark SQL engine. Similar to other SQL engines, the Spark SQL engine is responsible for generating efficient queries and compact code, and the resulting execution plan is adapted at runtime. The Apache Spark ecosystem consists of different libraries, where Spark Core and the Spark SQL engine are the substrate on which they are built (Figure 10-3).

Figure 10-3. The Apache Spark ecosystem, including Spark SQL: the Spark SQL (DataFrames and datasets, interactive queries), Spark Streaming (Structured Streaming), MLlib (machine learning), and GraphX (graph computations) libraries built on Spark Core and the Spark SQL engine, with language bindings for Scala, Python, Java, R, and SQL

The Spark SQL library supports ANSI SQL, which allows users to query and analyze data using SQL syntax that they are familiar with. As we have seen in previous chapters, Delta tables can easily be queried using Spark SQL and the sql() method in PySpark, for example:

%python
spark.sql("SELECT count(*) FROM taxidb.tripData")

Or, similar to how most queries in this book have been written, you can use magic commands and %sql in notebooks or some IDEs to specify the language reference and just write Spark SQL directly in the cell:

%sql
SELECT count(*) as count FROM taxidb.tripData

And not only SQL, but the Spark SQL library also allows you to use the DataFrame API to interact with datasets and tables:

%python
spark.table("taxidb.tripData").count()

Analysts, report builders, and other data consumers will typically interact with data through the SQL interface. This SQL interface means that users can leverage Spark SQL to perform their simple, or complex, queries and analysis on Delta tables, taking advantage of the performance and scalability that Delta tables offer, while also taking advantage of the Spark SQL engine, distributed processing, and optimizations. Since Delta Lake ensures serializability, there is full support for concurrent reads and writes. This means that all data consumers can confidently read the data even as data is updated through different ETL workloads.

In short, the Spark SQL engine generates an execution plan that is used to optimize and execute queries on your Spark cluster to make queries as fast as possible. Figure 10-4 illustrates that you can express SQL queries using the Spark SQL library, or you can use the DataFrame API to interact with datasets and leverage the Spark SQL engine and execution plan. Whether you are using Spark SQL or the DataFrame API, the Spark SQL engine will generate a query plan used to optimize and execute the command on the cluster.

Figure 10-4. Spark SQL execution plan

In Figure 10-4, it is important to note that the Resilient Distributed Datasets (RDDs) in the execution plan are not referring to traditional user-defined RDDs that leverage low-level APIs. Rather, the RDDs mentioned in the figure are Spark SQL RDDs, also referred to as DataFrames or datasets, that do not add additional overhead, as they are optimized specifically for structured data.

Generally, it is recommended to use the DataFrame API for your ETL and data ingestion processes [3] or machine learning workloads, whereas most of the data consumers (e.g., analysts, report builders, etc.) will interact with your Delta tables using Spark SQL.

[3] Damji, Jules S., et al. (2020). Learning Spark, 2nd ed. Sebastopol, CA: O'Reilly.

You can also interact with the SQL interface over standard JDBC/ODBC database connectors or using the command line. The JDBC/ODBC connectors mean that Spark SQL also provides a bridge to external tools such as Power BI, Tableau, and other BI tools, allowing them to interact with and consume Delta tables for analytics.
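If you want to see the plan that the Spark SQL engine produces for a query on a Delta table, you can ask for it explicitly. A minimal sketch, reusing the taxidb.tripData table from the examples above; the exact plan output will vary with your Spark version and cluster:

%sql
-- Show the plan the Spark SQL engine generated for a SQL query
EXPLAIN FORMATTED SELECT count(*) FROM taxidb.tripData

%python
# The same table accessed through the DataFrame API runs through the same
# engine and produces an equivalent plan
spark.table("taxidb.tripData").explain(mode="formatted")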
SQL Analytics via Other Delta Lake Integrations

While Delta Lake provides powerful integration with Spark SQL and the rest of the Spark ecosystem, it can also be accessed by a number of other high-performance query engines. Supported query engines include:

- Apache Spark
- Apache Hive
- Presto
- Trino

These connectors help bring Delta Lake to big-data SQL engines other than just Apache Spark and allow you to read and write (depending on the connector). The Delta Lake connectors repository includes:

- Delta Standalone, a native library for reading and writing Delta Lake metadata
- Connectors to popular big-data engines (e.g., Apache Hive, Presto, Apache Flink) and to common reporting tools like Microsoft Power BI

There are also several managed services that allow you to integrate and read data from Delta Lake, including:

- Amazon Athena and Redshift
- Azure Synapse and Stream Analytics
- Starburst
- Databricks

Please consult the Delta Lake website for a complete list of supported query engines and managed services.

When it comes to performing SQL analytics, it is important to leverage a high-performance query engine that can interpret SQL queries and execute them at scale against your Delta tables for data analysis. In Figure 10-5, you can see that the lakehouse is composed of three compounding layers, plus APIs that allow the different layers to communicate with one another:

Storage layer
A data lake used for scalable, low-cost cloud data storage for structured, semi-structured, and unstructured data.

Transactional layer
An ACID-compliant open-table format with scalable metadata, made possible through Delta Lake.

APIs
SQL APIs enable users to access Delta Lake and perform read and write operations, while metadata APIs help systems understand the Delta Lake transaction log so they can read the data appropriately.

High-performance query engine
A query engine that interprets SQL so users can perform data analysis through Apache Spark SQL or another query engine that Delta Lake integrates with.

Figure 10-5. Example of the lakehouse layered architecture, with high-performance query engines (e.g., Apache Spark, Databricks, Flink, Presto) used for BI and reports, accessing the open transactional layer through SQL and metadata APIs on top of scalable, low-cost cloud-based storage for batch and streaming structured, semi-structured, and unstructured data
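To make the query engine layer concrete with an engine other than Spark, here is what the earlier count query might look like from Trino, which can read Delta tables through its Delta Lake connector. This is a sketch only: it assumes a Trino catalog named delta has already been configured against the same metastore and storage, which is not shown here.

-- Run from the Trino CLI or another Trino client, not from Spark;
-- `delta` is an assumed catalog name backed by the Delta Lake connector
SELECT count(*) FROM delta.taxidb.tripdata;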
Data for Data Science and Machine Learning

Since Delta Lake helps provide reliable and simplified data management across your lakehouse, it brings a variety of benefits to the data and data pipelines leveraged for data science activities and machine learning.

Challenges with Traditional Machine Learning

Generally speaking, machine learning operations (MLOps) are the set of practices and principles involved in the end-to-end machine learning lifecycle. In MLOps, there are a number of common challenges that a vast majority of organizations and data scientists face when attempting to build and ultimately productionalize machine learning models in their organization:

Data silos
Data silos often start to develop as the gap between data engineering activities and data science activities begins to grow. Data scientists frequently spend the majority of their time creating separate ETL and data pipelines that clean and transform data and prepare it into features for their models. These silos usually develop because the tools and technologies used for data engineering don't support the same activities for data scientists.

Consolidating batch and streaming data
Machine learning models use historical data to train models in order to make accurate predictions on streaming data. The problem is that traditional architectures don't support reliably combining both historical and streaming data, which creates challenges in feeding both types of data into machine learning models.

Data volumes
Machine learning systems often need to process large sets of data while using complex code that isn't necessarily well suited for SQL. And if data scientists wish to consume data via ODBC/JDBC interfaces from tables created through data engineering pipelines, these interfaces can create a very inefficient process. This is largely because these interfaces are designed to work with SQL and offer limited support for non-SQL code logic, resulting in inefficient queries driven by data volumes, data conversions, and the complex data structures that non-SQL code logic can often include.

Reproducibility
MLOps best practices include the need to reproduce and validate every stage of the ML workflow. The ability to reproduce a model reduces the risk of errors and ensures the correctness and robustness of the ML solution. Consistent data is the most difficult challenge faced in reproducibility, and an ML model will only reproduce the exact same result if the exact same data is used. And since data is constantly changing over time, this can introduce significant challenges to ML reproducibility and MLOps.

Nontabular data
While typically we think of data as being stored in tables, there are growing use cases for machine learning workloads to support large collections of nontabular data, such as text, image, audio, video, or PDF files. This nontabular data requires the same governance, sharing, storage, and data exploration capabilities that tabular data requires. Without having some type of feature to catalog these collections of directories and nontabular data, it is very challenging to provide the same governance and management features used with tabular data.

Due to the challenges that traditional architectures create for productionalizing machine learning models, machine learning often becomes a very complex and siloed process. These siloed complexities introduce even more challenges for data management.

Delta Lake Features That Support Machine Learning

Fortunately, Delta Lake helps negate these data management challenges traditionally introduced by machine learning activities, and bridges the gap between the data and processes used for BI/reporting analytics and those used for advanced analytics. There are several different features of Delta Lake that help support the machine learning lifecycle:

Optimizations and data volumes
The benefits that Delta Lake offers all start with the fact that it is built on top of Apache Spark. Data science activities can access data directly from Delta Lake tables using DataFrame APIs via Spark SQL, which allows machine learning workloads to directly benefit from the optimization and performance enhancements offered by Delta Lake.

Consistency and reliability
Delta Lake provides ACID transactions, which ensures consistent and reliable data. This is important for machine learning and data science workflows because model training and predictions require this level of reliability to avoid negative impacts from inaccurate or inconsistent data.

Consolidating batch and streaming data
Delta Lake tables can seamlessly handle the continuous flow of data from both historical and streaming sources, made possible through Spark Streaming and Spark Structured Streaming.
This means that you can simplify the data flow process for machine learning models, since both types of data are consolidated into a single table.

Schema enforcement and evolution
By default, Delta Lake tables have schema enforcement, which means that better data quality can be enforced, thus increasing the reliability of data used for machine learning inputs. But Delta Lake also supports schema evolution, which means data scientists can easily add new columns to their existing machine learning production tables without breaking existing data models.

Versioning
You have learned in previous chapters that Delta Lake lets you easily version your data and perform time travel via the transaction log. This versioning helps mitigate the challenges often seen with ML reproducibility because it allows you to easily re-create machine learning experiments and model outputs given a specific set of data at a specific point in time. This helps significantly reduce some of the challenges seen in the MLOps process, since simplified versioning can provide greater traceability, reproducibility, auditing, and compliance for ML models.

Integrations
In Chapter 1, and in this chapter, you read about how Delta Lake can be accessed using the Spark SQL library in Apache Spark. You also read about Spark Streaming and Delta Lake's integration with that library in Chapter 8. An additional library in the Spark ecosystem is MLlib. MLlib gives you access to common learning algorithms, featurization, pipelines, persistence, and utilities. In fact, many machine learning libraries (e.g., TensorFlow, scikit-learn) can also leverage DataFrame APIs to access Delta Lake tables. The Spark ecosystem consists of multiple libraries running on top of Spark Core to provide multifunctional support for all types of data and analytics use cases (Figure 10-6). Outside of the Spark standard libraries, Spark also has integrations with other platforms such as MLflow, a popular open source platform for managing the end-to-end machine learning lifecycle, which allows Spark MLlib models to be tracked, logged, and reproduced.

Figure 10-6. Apache Spark ecosystem showing the Spark SQL, Spark Streaming, and MLlib libraries

As previously mentioned, machine learning models are typically built using libraries such as TensorFlow, PyTorch, scikit-learn, etc. And while Delta Lake is not directly a technology for building models, it does focus on addressing and providing valuable, foundational support for many of the challenges that machine learning activities face. MLOps and models are reliant on the data quality and integrity, reproducibility, reliability, and unification of data that Delta Lake provides. The robust data management and integration features offered by Delta Lake simplify MLOps and make it easier for machine learning engineers and data scientists to access and work with the data used to train and deploy their models.
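The sketch below shows, under stated assumptions, how an ML workload can read the same Delta table as a batch source, as a streaming source, and at a specific historical version for reproducibility. The taxidb.tripFeatures table name is hypothetical, and an existing SparkSession with Delta Lake configured is assumed.

%python
# A minimal sketch of ML-oriented reads against a single Delta table
feature_table = "taxidb.tripFeatures"   # hypothetical feature table

# Batch read for model training
train_df = spark.table(feature_table)

# Streaming read of the same table, for example for continuous scoring
stream_df = spark.readStream.table(feature_table)

# Time travel for reproducibility: train on the exact snapshot of the data
# that was used for an earlier experiment
train_v5 = spark.sql(f"SELECT * FROM {feature_table} VERSION AS OF 5")

# Hand a sample to a single-node library such as pandas or scikit-learn
pdf = train_v5.limit(100_000).toPandas()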
Putting It All Together

Through the features enabled by Delta Lake, it becomes much easier to unify both the machine learning and data engineering lifecycles. Delta Lake enables machine learning models to learn and predict from historical (batch) and streaming data, all sourced from a single place while natively leveraging Delta table optimizations. ACID transactions and schema enforcement help bring data quality, consistency, and reliability to the tables used for machine learning model inputs. And Delta Lake's schema evolution helps your machine learning outputs change over time without introducing breaking changes to existing processes. Time travel enables easy auditing or reproduction of your machine learning models, and the Spark ecosystem brings additional libraries and other machine learning lifecycle tools to further enable data scientists.

In Figure 10-7, you can see all of the resulting layers of a fully constructed lakehouse environment. Altogether, the features of Delta Lake help bridge the gap between data engineers and data scientists in an effort to reduce silos and unify workloads. With Delta Lake and a robust lakehouse architecture, organizations can start to build and manage machine learning models in a faster, more efficient way.

Figure 10-7. The lakehouse architecture with the addition of data science and machine learning workloads

Medallion Architecture

The lakehouse is centered around the idea of unification, combining the best elements of different technologies in a single place. This means it is also important that the data flow within the lakehouse itself supports this unification of data. In order to support all use cases, this requires merging batch and streaming data into a single data flow to support scenarios across the entire data lifecycle.

Chapter 1 introduced the idea of the medallion architecture, a popular data design pattern with Bronze, Silver, and Gold layers that is ultimately enabled via Delta Lake. This popular pattern is used to organize data in a lakehouse in an iterative manner to improve the structure and quality of data across your data lake, with each layer having specific functions and purposes for analytics while unifying batch and streaming data flows. An example of the medallion architecture is provided in Figure 10-8.

Figure 10-8. Data lakehouse solution architecture: the Bronze layer is the "landing zone" for raw data with no schema required, the Silver layer defines structure and enforces and evolves the schema as needed, and the Gold layer delivers continuously updated, clean data to downstream users and apps

The Bronze, Silver, and Gold layers shown in Figure 10-8 are summarized in Table 10-1. For each layer, you will see its business value, properties, and implementation details (how it's done).
Table 10-1. Medallion architecture summary

Bronze
Business value: Audit on exactly what was received from the source; ability to reprocess without going back to the source.
Properties: No business rules or transformations of any kind; should be fast and easy to get new data to this layer.
How it's done: Must include a copy of what was received; typically, data is stored in folders based upon the date received.

Silver
Business value: First layer that is useful to the business; enables data discovery, self-service, ad hoc reporting, advanced analytics, and ML.
Properties: Prioritize speed to market and write performance, with just enough transformations; quality data expected.
How it's done: Delta merge; can include light modeling (3NF, vaulting); data quality checks should be included.

Gold
Business value: Data is in a format that is easy for business users to navigate; highly performant.
Properties: Prioritize business use cases and user experience; precalculated, business-specific transformations; can have separate views of the data for different consumption use cases.
How it's done: Prioritize denormalized, read-optimized data models; fully transformed; aggregated.

In the following sections, you will get a more detailed look at the different layers that make up the medallion architecture.

The Bronze Layer (Raw Data)

Raw data from the data sources is ingested into the Bronze layer without any transformations or business rule enforcement. This layer is the "landing zone" for our raw data, so all table structures in this layer correspond exactly to the source system structure. The format of the data source is maintained, so when the data source is a CSV file, it is stored in Bronze as a CSV file, JSON data is written as JSON, etc. Data extracted from a database table typically lands in Bronze as a Parquet or Avro file. At this point, no schema is required.

As data is ingested, a detailed audit record is maintained, which includes the data source, whether a full or incremental load was performed, and detailed watermarks to support the incremental loads where needed. The Bronze layer includes an archival mechanism, so that data can be retained for long periods of time. This archive, together with the detailed audit records, can be used to reprocess data in case of a failure somewhere downstream in the medallion architecture.

The ingested data lands in the Bronze layer "source system mirrored," maintaining the structure and data types of the source system format, although it is often augmented with additional metadata, such as the date and time of the load and ETL process system identifiers. The goal of the ingestion process is to land the source data quickly and easily in the Bronze layer with just enough auditing and metadata to enable data lineage and reprocessing. The Bronze layer is often used as a source for a Change Data Capture (CDC) process, allowing newly arriving data to be immediately processed downstream through the Silver and Gold layers.
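A minimal sketch of such an ingestion step is shown below. The paths, table name, and audit columns are hypothetical, and an existing SparkSession with Delta Lake configured is assumed.

%python
from pyspark.sql import functions as F

# Land a batch of raw JSON files in the Bronze layer as-is, augmented only
# with audit metadata to support lineage and reprocessing
raw_path = "s3://datalake/landing/trips/2024-01-15/"          # hypothetical path

bronze_df = (spark.read.json(raw_path)
    .withColumn("_ingest_timestamp", F.current_timestamp())   # when it was loaded
    .withColumn("_source_file", F.input_file_name())          # where it came from
    .withColumn("_load_id", F.lit("load_2024_01_15_full")))   # hypothetical load identifier

(bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("taxidb.bronze_trips"))                       # hypothetical Bronze table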
The Silver Layer

In the Silver layer we first cleanse and normalize the data. We ensure that standard formats are used for constructs such as date and time, enforce the company's column naming standard, de-duplicate the data, and perform a series of additional data quality checks, dropping low-quality data rows when needed.

Next, related data is combined and merged. The Delta Lake MERGE capabilities work very well for this purpose. For example, customer data from various sources (sales, CRM, POS systems, etc.) is combined into a single entity. Conformed data, meaning those data entities that are reused across different subject areas, is identified and normalized across the views. In our previous example, the combined customer entity would be an example of such conformed data. At this point, the combined enterprise view of the data starts to emerge. Note that we apply a "just-enough" philosophy here, where we provide just enough detail with the least amount of effort possible, making sure that we maintain our agile approach to building our medallion architecture. At this point, we start enforcing schema, and allow the schema to evolve downstream. The Silver layer is also where we can apply GDPR and/or PII/PHI enforcement rules.

Because this is the first layer where data quality is enforced and the enterprise view is created, it serves as a useful data source for the business, especially for purposes such as self-service analytics and ad hoc reporting. The Silver layer also proves to be a great data source for machine learning and AI use cases. Indeed, these types of algorithms work best with the "less polished" data in the Silver layer rather than the consumption formats in the Gold layer.

The Gold Layer

In the Gold layer, we create business-level aggregates. This can be done through a standard Kimball star schema, an Inmon snowflake schema dimensional model, or any other modeling technique that fits the consumer business use case. The final layer of data transformations and data quality rules is applied here, resulting in high-quality, reliable data that can serve as the single source of truth in the organization.

The Gold layer continuously delivers business value by offering high-quality, clean data to downstream users and applications. The data model in the Gold layer often includes many different perspectives or views of the data, depending on the consumption use cases. The Gold layer will implement several Delta Lake optimization techniques, such as partitioning, data skipping, and Z-ordering, to ensure that we deliver quality data in a performant way. Curated for optimal consumption by BI tools, reporting, applications, and business users, this becomes the primary layer where data is read using a high-performance query engine.
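Following on from the Silver and Gold layer descriptions above, here is a minimal sketch of both steps: a Delta MERGE that folds newly arrived Bronze customer records into a conformed Silver entity, followed by a precalculated Gold aggregate. All table and column names are hypothetical.

%sql
-- Silver: merge newly arrived Bronze records into the conformed customer entity
MERGE INTO silver.customers AS target
USING bronze.crm_customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email = source.email,
             target.last_updated = source.extracted_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, last_updated)
  VALUES (source.customer_id, source.email, source.extracted_at);

-- Gold: precalculated, business-specific aggregate, read-optimized for BI
CREATE OR REPLACE TABLE gold.daily_sales_by_region
USING DELTA
AS SELECT region,
          order_date,
          SUM(order_total)            AS total_sales,
          COUNT(DISTINCT customer_id) AS active_customers
   FROM silver.orders
   GROUP BY region, order_date;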
The Complete Lakehouse

Once you have implemented the medallion architecture as part of your Delta Lake-based architecture, you can start seeing the full benefits and extensibility of the lakehouse. While building the lakehouse throughout this chapter, you have seen how the different layers complement each other in order to unify your entire data platform. Figure 10-9 illustrates the entire lakehouse, including how it looks with the medallion architecture. While the medallion architecture is certainly not the only design pattern for data flows in a lakehouse, it is one of the most popular, and for good reason. Ultimately, the medallion architecture, through features enabled by Delta Lake, supports the unification of data in a single data flow to support batch and streaming workloads, machine learning, business-level aggregates, and analytics as a whole for all personas.

Figure 10-9. The complete end-to-end lakehouse architecture: BI, reports, and data science workloads served by high-performance query engines over SQL, DataFrame, and metadata APIs, with Delta Lake as the open transactional layer on top of scalable, low-cost cloud-based data lake storage for batch and streaming structured, semi-structured, and unstructured data

Conclusion

Throughout this book you have learned how the emergence of an open-table format and open source standard in Delta Lake has brought reliability, scalability, and overall better data management to data lakes. By overcoming many of the limitations of traditional technologies, Delta Lake helps bridge the gap between traditional data warehouses and data lakes, bringing organizations a single, unified big data management platform in the form of a lakehouse.

Delta Lake continues to transform how organizations can store, manage, and process data. Its robust features, such as ACID transactions, data versioning, streaming and batch transaction support, schema enforcement, and performance tuning techniques, have made it the de facto open-table format of choice for data lakes. At the time of writing, Delta Lake is the world's most widely adopted lakehouse format, with millions of downloads per month and a strong community of growing contributors. As contributor strength and adoption of Delta Lake continue to grow, the Delta ecosystem continues to expand, and as the overall field of big data continues to evolve, naturally so will the functionality of Delta Lake, thanks to its open source format and the contributions of the open source community.

The continued rise of Delta Lake and the lakehouse paradigm marks a significant milestone in the evolution of data platforms and data management. Delta Lake provides the functionality and features to succeed at scale in today's data-driven world, while the lakehouse provides a unified, scalable architecture to support it. Delta Lake and the lakehouse will continue to play a critical role in simplifying architectures and driving innovation in a growing data and technologies ecosystem.
