
10_Delta Lake.pdf


Full Transcript

In this video, we will talk about Delta Lake. Delta Lake is an open-source storage framework that brings reliability to data lakes. As you may know, data lakes have many limitations, such as data inconsistency and performance issues, and Delta Lake technology helps overcome these challenges. Let us look at this comparison to better understand what Delta Lake is. As we said, Delta Lake is an open-source technology, not a proprietary one. It is a storage framework or storage layer, but it is not a storage format or a storage medium. It enables building a lakehouse architecture; a lakehouse is a platform that unifies data warehousing and advanced analytics. Delta Lake itself is not a data warehouse, and of course it is not a database service.

Let's go into more detail. Delta Lake is a component deployed on the cluster as part of the Databricks Runtime. When you create a Delta Lake table, it is stored on the storage in one or more data files in Parquet format. Along with these files, Delta stores a transaction log as well. But what is this transaction log? The Delta Lake transaction log, also known as the Delta Log, is an ordered record of every transaction performed on the table since its creation. It serves as a single source of truth, so every time you query the table, Spark checks this transaction log to retrieve the most recent version of the data. Each committed transaction is recorded in a JSON file. It contains the operation that was performed, for example an insert or an update, the predicates such as conditions and filters used during the operation, and all the files that were affected by it.

Let us see some concrete examples. In the first scenario, we have a writer process and a reader process. Once the writer process starts, it stores the Delta Lake table in two data files in Parquet format. As soon as the writer process finishes writing, it adds the transaction log 000.json to the _delta_log directory. A reader process always starts by reading the transaction log. In this case, it reads the 000.json transaction log, which contains information about files 1 and 2, so it can start reading them.

In our second scenario, the writer process wants to update a record that is present in file number 1. In Delta Lake, instead of updating the record in the file itself, it makes a copy of the file and applies the update in a new file, file number 3. It then updates the log by writing a new JSON file, which indicates that file number 1 is no longer part of the table. Now the reader process reads the transaction log, which tells it that only files 2 and 3 belong to the current table version, so it starts reading them.

Let us see one more scenario. Here, both processes want to work at the same time. The writer process starts writing file number 4. On the other hand, the reader process reads the transaction log, which only has information about files 2 and 3, and not about file number 4, as it is not fully written yet. So it starts reading those two files, 2 and 3, which represent the most recent data at the moment. As you can see, Delta Lake guarantees that you will always get the most recent version of the data; a read operation never ends up in a deadlock or conflicts with any ongoing operation on the table. Finally, the writer process finishes and adds a new entry to the log.
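As a rough illustration of the write and read path described above, here is a minimal PySpark sketch. It assumes a Databricks or local Spark environment with Delta Lake enabled; the table path /tmp/demo/people and the sample data are hypothetical, not from the video.

```python
# Minimal sketch of the writer/reader behaviour described above.
# Assumes a Spark session with Delta Lake support; path and data are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_path = "/tmp/demo/people"  # hypothetical table location

# Writer: saving in Delta format produces Parquet data files plus a
# _delta_log directory holding JSON commit files (000...0.json, ...).
df = spark.createDataFrame([(1, "Anna"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(table_path)

# Reader: Spark first consults the transaction log to find which data files
# make up the current table version, then reads only those files.
spark.read.format("delta").load(table_path).show()

# On Databricks you could inspect the commit files with, for example:
# display(dbutils.fs.ls(table_path + "/_delta_log"))
```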
Here is our last scenario. The writer process starts writing file number 5 to the data lake, but this time there is an error in the job, which leads to an incomplete file. Because of this failure, the Delta Lake module does not write any information to the log. Now the reader process reads the transaction log, which has no information about the incomplete file number 5, so it reads only files 2, 3 and 4. As you can see, Delta Lake guarantees that you will never read dirty data.

Great. So the transaction log is the magic behind the scenes. It allows Delta Lake to perform ACID transactions on data lakes, and it also allows metadata to be handled at scale. The log also provides a full audit trail of all the changes that have happened on the table. And as we saw, the underlying file formats for Delta are nothing but Parquet and JSON. Great. Let's now switch to Databricks to work with Delta Lake in notebooks.
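To connect the copy-on-write and audit-trail ideas to what a notebook session might look like, here is a hedged sketch using the delta-spark Python API. It reuses the hypothetical /tmp/demo/people table from the previous sketch; the update condition and column names are illustrative only.

```python
# Sketch of copy-on-write updates and the audit trail, continuing the
# hypothetical /tmp/demo/people table from the previous example.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

table_path = "/tmp/demo/people"
people = DeltaTable.forPath(spark, table_path)

# An update does not modify the affected Parquet file in place: Delta writes a
# new data file with the change applied and commits a new JSON entry to
# _delta_log that marks the old file as removed from the current table version.
people.update(condition="id = 1", set={"name": "'Anne'"})

# The transaction log doubles as a full audit trail of every change.
people.history().select("version", "operation", "operationParameters") \
    .show(truncate=False)
```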
