matt's blog

What is Delta Lake?

"Delta Lake" sounds more like a fun weekend hike than a part of the modern data stack. I've made my case before and I fully expect a data retreat to the Tetons in 2024 (yes, Databricks, I have room for a sponsorship).

Of course, Delta Lake is primarily an open-source lakehouse storage framework. It's designed to enable a lakehouse architecture with compute engines like Spark, PrestoDB, Flink, Trino, and Hive. It has APIs for Scala, Java, Rust, Ruby, & Python.

Storage frameworks like Delta have played a major role in lakehouse architectures, but I've found the technology behind them unapproachable. What is it? Git for data? (no) How does it work? Why should I use this instead of Parquet? (this is Parquet!)

As always, we'll break things down to the basics and give you a comprehensive picture of what Delta Lake is and how you can get started.

What is Delta Lake?

While Delta is an open-source framework, it's important to note that it also underpins the Databricks platform. That means Databricks uses Delta Lake for storing tables and other data. It should be no surprise they created the format & they're subsequently responsible for most work on the library and its APIs.

Notably, as we'll discuss in a follow-up post, Iceberg is a very similar format and powers some of Snowflake's offerings... If you know about the ongoing feud between Databricks and Snowflake you can probably guess where this is headed. In typical fashion, we now have folks proclaiming "Iceberg won this" or "Delta won that." In reality, they're storage formats. Extremely similar storage formats... but what does that mean?

If you peruse the Databricks/Delta docs, you'll probably find something like this:

I don't find this particularly helpful for understanding Delta Lake. For some reason, the majority of their docs describe a lakehouse framework with a medallion architecture, not the underlying technology. A lakehouse is a fancy buzzword, also created by Databricks, used to describe a solution combining data lakes and data warehouses, i.e. leveraging cloud storage with semi-structured data formats and traditional OLAP solutions.

The very simple truth is that Delta files are just Parquet files with a metadata layer on top. That's it. Not to understate the ingenuity and usefulness of Delta, but it's a pretty simple concept.

Now, if you have a bit of background with these technologies, you might remark "Hey, so is Iceberg" or "Huh, that sounds like Hudi" and you'd be right. Those formats are pretty much the same thing. They even use very similar marketing materials.

All of these formats originated at companies that needed production-grade data lakes at scale, specifically those with a heavy reliance on Spark for processing. They're designed to address shortcomings of Hive, primarily.

You might notice most features comprise reliability, quality, and disaster recovery. These companies needed to process massive amounts of data and be certain they could roll back changes, make simultaneous edits, and store reliable data: Hudi was started by Uber, Iceberg by Netflix, Delta by Databricks.

The key benefits of these formats are almost entirely obtained from those metadata layers— ACID guarantees, scalability, time travel, unified batch/streaming, DML ops, and audit histories, to name a few.

Now, this sounds like a whole lot, but remember this all comes from the metadata layer, so it can't be too complicated (or can it?)

I'll give a brief overview of key features, then dive into maybe the most important characteristic, the transaction log.

Key Features of Delta

SO let's talk about the underlying technology that makes most of these possible.

The Transaction Log

Here's the part where I save you from reading a 15,000-word technical document! See, 4 readers can't be wrong— we're just delivering value here, nonstop.

So if Delta files are Parquet + the transaction log, we know the transaction log must be pretty special… otherwise, we'd just have Parquet files. While I do love Parquet files, I'm not sure they would warrant as much hype.

Apache, I'm open to sponsorships.

The transaction log works by breaking down transactions into atomic commits, the atomicity in ACID. So, whenever you perform an operation that modifies a table, Delta Lake breaks it down into one of the following discrete steps:

In that sense, it's sort of like Git, but not really. There isn't the ability to branch, rollback, merge, or do actual Git stuff. You just get the versioning. If you're looking for a true "git for data lakes" consider Delta + lakeFS, which promises just that.

Some of you might be saying "That's great, but how is Delta storing all of that transaction data? That could get pretty cumbersome!"

That's a valid concern! Delta creates a _delta_log subdirectory within every Delta Lake table. As changes are made, each commit is recorded in a JSON file, starting with 0 and incrementing up.

Delta will automatically generate checkpoint files for good read performance on every 10th commit, e.g. 0000010.json. Checkpoint files save the entire state of the table at a point in time, in native Parquet.

Let's take a look (in real-time)!

Demo time

I recommend opening this in full-screen to see the code! Here's the full code from the demo, if you'd like to try it at home.

The transaction log sees all

UniForm

Up to this point, we've only discussed the fundamentals of Delta, but the team has been really innovating lately— at the Databricks Data+AI Summit a few weeks ago, Delta 3.0 was announced.

There are three headline features of the release, but we'll focus on numero uno: Delta Universal Format (UniForm). With UniForm, ANY tool that can read Iceberg or Hudi… can also read Delta Lake.

With UniForm, customers can choose Delta with confidence, knowing that by choosing Delta, they'll have broad support from any tool that supports lakehouse formats.

The new functionality is pretty impressive. Since all three formats are Parquet-based and solely differentiated by their metadata, UniForm provides the framework for translating this metadata. Have an Iceberg table? It's now effectively a Delta table (with some limitations).

Perhaps more impressive is that UniForm improves read performance for Iceberg. So not only does UniForm consolidate formats and unify infrastructure— it can result in performance enhancements for non-native formats, too!

This is some 4D chess-type stuff from Databricks. It's a way for them to say "Hey, we're unifying the data space," while continuing to build market share and keep their formats in production. It's a clever one, for sure. We saw the same theme at the Data+AI Summit, where a major focus was on providing services that were universally accessible (marketplace, apps, etc).

This parallels Snowflake's announcement of managed Iceberg Catalog support. My hunch is that Databricks saw the writing on the wall and wanted to avoid a situation where using Iceberg = using Snowflake.

WAIT! We don't talk about things without explaining them first, so what's a catalog? Let's hear it straight from Snowflake:

A key benefit of open table formats, such as Apache Iceberg, is the ability to use multiple compute engines over a single copy of the data. What enables multiple, concurrent writers to run ACID transactions is a catalog. A catalog tracks tables and abstracts physical file paths in queries, making it easier to interact with tables. A catalog serves as an option to ensure all readers have the same view, and to make it cleaner to manage the underlying files.

That's a great description! So the catalog "orchestrates" the transaction logging and functionality of individual Delta/Iceberg/Hudi tables across organizations with concurrent reads/writes.

If you thought, "Huh, I wonder if Databricks has a competing service," you'd be 100 percent correct. Unity Catalog is a much more complex topic, so we'll save that discussion for another post, but you can read more here.

While I'm largely against the Databricks/Snowflake beef, it's awesome to see competition producing something good for the community— interoperability is undoubtedly a win. Improving performance across formats benefits everyone. I applaud them on their decision and work— I think it's insanely cool.

Conclusion

That's pretty much it! Delta's a great example of powerful technology: simple at its core, but insanely functional in practice. With Delta 3.0, the team continues to innovate and takes a major step towards the unification of lakehouse storage frameworks.

Yes, the term lakehouse is annoying and every presentation makes a joke about it, but it's here to stay so I might as well use it. At the very least, we, as a data community, need to come up with better puns.

In a future post, we might dig into some more advanced features of Delta like Liquid Clustering (and ZORDER by relation), Unity Catalog, and Optimistic Concurrency Control.

If you have a topic you'd like to see here, drop me a note!

#data #delta #opinion #tutorial