matt's blog

OLAP vs. OLTP

The universe of data jargon is pretty large— today, we have data warehouses, lakes, and lakehouses. While I simply want to hike to Delta Lake, it’s now a storage technology, too.

Data warehouses are the OG of data jargon— coined by the IBM Systems Journal in 1988, a data warehouse is described as “a system used for the analysis and reporting of structured and semi-structured data.”

We’ll go even simpler than that— it’s just a relational database. Good, at least that part was easy.

Storage Wars

The biggest choice in data might just be database selection. If you’re starting a data team, whether to use a cloud-native data warehouse (BigQuery, Snowflake) or a cloud-hosted traditional database (Postgres, MySQL) will occupy a large portion of your time.

These are more commonly referred to as OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing) databases, respectively. We’ll talk about why you might choose one or the other and the future of data warehousing.

Why OLTP?

OLTP systems are engineered to handle transactional data originating from multiple users. This usually takes the form of a row-oriented database. Many traditional database systems are OLTP: Postgres, MySQL, etc.

If you currently work at a company and you hear engineers talk about a “prod” database (or you work on the prod database), it’s likely an OLTP store. Why? Because OLTP is great for inserting and retrieving transactional data (one row at a time).

To insert a transaction, you need to write an entire row. This is perfect, since OLTP systems operate by row. Data is written and retrieved one row at a time. Historically, databases were used to power production systems, so they were designed to effectively write and retrieve one thing at a time, really well. Cool, right?

Why OLAP?

As data became big, this started to present a problem. What happens when you need to query 1 million rows? 10 million rows? Since we just talked about how row-oriented databases work, let’s assume you want to sum revenue by month over a table with 10 million rows. Since OLTP sources read data one row at a time, you’ll have to process every column of every row in the table. Not good.

Furthermore, traditional OLTP databases require configuration. You’ll have to understand horizontal vs. vertical scaling, provisioning, and a whole bunch of DevOps stuff. That’s not a bad thing per se, but it’s not data. It’s DevOps. So now, as a data engineer, you’ll have to go through a DevOps team, which can be a huge bottleneck, or learn DevOps, which isn’t part of your competitive advantage (Hint: it’s data).

From this need, we saw the rise of “serverless” OLAP systems. These systems, think BigQuery, Snowflake, and Redshift allow for elastic scale in both storage and transactional volume. That means they can scale up to process a demanding query, then return to baseline once executed. Running low on space? You can add more storage with a single click. They’re highly optimized for analytic workloads.

What do we mean by “analytic workloads?” Well, tasks that are common in analytics: aggregating data, joining data, etc. These correspond to GROUP BY, JOIN, and WINDOW in SQL.

By definition, analytic workloads are not evenly distributed. You might have a complex query that takes 10 minutes to run, then 5 hours of downtime, then another complex query and an ETL pipeline— serverless makes sense to scale up/down according to uneven loads throughout the day (and hopefully no load at night).

Now let’s look back at our revenue query. Before, we were doing something like this:

No bueno. But now, we’re reading one column at a time— so that ten million row aggregation? You only need to scan two columns, date and revenue. In a table with 10 columns, that’s 20% of the original data! Now it looks something like:

So now we’re crazy efficient… and our systems can automatically scale. That’s the power of serverless OLAP systems.

OLAP systems are most commonly used by analytics and data science teams for their speed, stability, and low maintenance cost. To summarize:

The Future of Data

So, there are a lot of folks out there saying “data warehouses are dead,” but most of them have products that are built entirely on data lakes and competitors with data warehousing solutions.

“Data warehouses are dead. Oh and we sell data lakehouses as a service.”

Now, don’t get me wrong, data lakes are pretty cool, but most data teams still need a data warehouse. More than the architecture, they provide massive amounts of computing power with a SQL interface that anyone can understand. That means analysts and analytics engineers can write SQL workflows that parallelize and process relatively large amounts of data, with virtually zero upkeep.

Once ubiquitous, tech like Spark and Hadoop is now reserved for the largest data teams— Facebook, Netflix, etc. So, unless you’re working with petabyte-scale data, data warehouses are still very much useful.

But data is a fast moving space and the future is bright. Tools like DuckDB and metadata management like Delta, Hudi, and Iceberg are paving the way for systems that can read Parquet directly into a relational system or create virtualized databases that can leverage in-memory, relational operations.

If we’re going to do data puns, they might as well be ducking good.

Personally, I’m excited for the day we can have “virtual” (in quotes because, well, everything is virtual now) data warehouses that sit atop Parquet files and act nearly identically to current warehousing solutions. I welcome our new Duck overlords. I think we’re far from that point, but until then, a boy can dream.

#data #opinion