case when

Declarative & imperative code for data engineering

Intro

Programming paradigms classify languages based on common characteristics: some deal with execution, while others focus on how code is organized (e.g. object-oriented). An understanding of paradigms is useful for solution architecture, since knowing how code works is a prerequisite to selecting an efficient solution.

We can classify solutions according to the same paradigms. Today, we'll be concerned with two classes of code paradigms and how they relate to data engineering: imperative and declarative code.

💡 Note that we're not referencing a particular language; rather, we're using software engineering terms to understand patterns of data engineering solutions. These terms can be used to describe code, products, or entire architectures.

Understanding declarative & imperative

Imperative code tells a machine precisely how to produce a desired outcome. Think Python scripts, dbt macros, custom DAGs: detailed code that procedurally performs a complex task. It's often bespoke and written from scratch.

Declarative code merely describes a result; the calculation is left to some underlying process. To obtain a list of active users, I might run the SQL SELECT * FROM accounts.active_users. How we arrived at accounts.active_users is unspecified; I'm merely stating the values I'd like returned. Declarative code abstracts away underlying computations. I neither know nor care how active_users came to be, only that I can obtain the result.
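To make the contrast concrete, here's a minimal sketch that computes the same list of active users both ways. The sample rows, the column names, and the flat accounts table (standing in for accounts.active_users) are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical sample data: (user_id, is_active), with 1 meaning active.
accounts = [(1, 1), (2, 0), (3, 1)]


def active_users_imperative(rows):
    """Imperative: spell out every step of the computation ourselves."""
    result = []
    for user_id, is_active in rows:
        if is_active:
            result.append(user_id)
    return result


def active_users_declarative(rows):
    """Declarative: describe the result and let the SQL engine work out how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (user_id INTEGER, is_active INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT user_id FROM accounts WHERE is_active = 1 ORDER BY user_id"
    )
    ids = [row[0] for row in cursor]
    conn.close()
    return ids
```

Both functions return the same IDs. The difference is who decides how the filtering happens: me, in the loop, or the database engine, behind the query.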

The imperative approach

Many data engineering systems are imperative: Airflow DAGs begin as an empty canvas, and dbt projects are a clean slate. This introduces possibility and flexibility to the system.

Imperative code allows data engineers to write custom logic tailored to specific requirements: you might have an Airflow DAG that needs to interface with a unique data source... so unique that no prebuilt tool exists for the task! No problem, as a data engineer with an imperative tool, you whip up some Python!

As I'm sure we're all aware, no two datasets are alike. Hence, there is no one-size-fits-all solution to data processing. Imperative tooling, e.g. Python and SQL, allows us to build the most precise pipelines possible.

Once we have cleaned datasets, we need to apply analytics and ML logic to derive insight. Imperative code lets us define the exact logic we need for our analysis, regardless of the underlying data.

Sounds great, right? There's a catch...

Well, there are a few.

The first is implementation cost, which is often overlooked with open-source, imperative tools. There is a cost to implement any tool, regardless of its price. Building a data stack from scratch can wind up being more costly than purchasing one off-the-shelf. Labor is hella expensive these days.

By definition, imperative solutions don't generalize, i.e. they're difficult to reuse. This brings us further from DRY (don't repeat yourself) principles and means that you might be spending days/weeks writing very similar bits of code.

Lastly, the steep learning curve means that imperative tools impose a technological barrier to contribution. Need to make a small change to that dbt model? If you need to know Bash, Git, SQL, and Python, it's likely you (a) are on the data team or (b) need to ping someone on the data team. This creates a bottleneck to development.

The declarative approach

On the other hand, declarative solutions are attractive because they're typically more concise and have a gradual learning curve. Declarative code abstracts away implementation details and allows users to focus on defining the desired results: you only need to understand what you need, not how to get it.

This can be a huge win. Perhaps the most salient declarative solution in data engineering is ingestion. Sitting high atop our data thrones, we bequeath: "I want my Intercom data in Snowflake" and... voila! Fivetran makes it happen.

Ingestion is a perfect problem space for declarative solutions. There is a predefined set of inputs and outputs: sources and targets. By reusing common components and being intelligent about architecture, Fivetran was able to serve a declarative solution to an age-old problem.
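Here's a rough sketch of what "declare a source and a target" can look like in code. To be clear, this is not how Fivetran works internally; SOURCES, TARGETS, SyncSpec, and run_sync are all hypothetical names, and in-memory objects stand in for the real systems.

```python
from dataclasses import dataclass

# Hypothetical prebuilt connectors. In a real product these would call vendor APIs.
SOURCES = {"intercom": lambda: [{"id": 1, "email": "a@example.com"}]}
TARGETS = {"snowflake": []}  # a list standing in for a warehouse table


@dataclass(frozen=True)
class SyncSpec:
    """All the user declares: where the data comes from and where it goes."""
    source: str
    target: str


def run_sync(spec):
    """Generic engine. The user never writes or sees these steps."""
    if spec.source not in SOURCES:
        raise KeyError(f"no connector for source {spec.source!r}")
    if spec.target not in TARGETS:
        raise KeyError(f"no connector for target {spec.target!r}")
    records = list(SOURCES[spec.source]())
    TARGETS[spec.target].extend(records)
    return len(records)


# "I want my Intercom data in Snowflake."
loaded = run_sync(SyncSpec(source="intercom", target="snowflake"))
```

The spec is two strings. Everything else lives in the engine, which is what makes this convenient, right up until you ask for a source that isn't in the catalog and get a KeyError.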

The downside? What happens when Fivetran doesn't have the connector you need?

Furthermore, because declarative solutions abstract away implementation details, they can be harder to debug and maintain: it's not always apparent why something breaks. Without access to the underlying code, it can be impossible to triage the issue. For Fivetran, while you do have vendor support, you'd better be willing to fork over the 💰🤑.

In my experience, even vendor support isn't the most helpful for obscure pipelines... Though I'm not a die-hard Fivetran fan, it is a solution that works well enough.

Declarative or imperative? An analogy

If you've ever heard Enzo Ferrari speak about his cars, you might have confused his effervescence for that of a passionate lover, and with good reason: his drive and legacy for manufacturing live on today.

Each Ferrari is custom made, from start to finish. This begets quality, but also scarcity. Around 10,000 are produced per year, with prices ranging from $200,000-400,000 USD: they're inaccessible to all but a fortunate few. Furthermore, while a Ferrari might be beautiful and really good at one thing (going fast), it isn't exactly known for its utility, fuel efficiency, or carrying capacity.

Ferraris are like imperative tools: custom, expensive, and great at what they were designed for, but not much else!

By contrast, Toyota has a very different business model. They pioneered a system for reducing waste, improving efficiency, and mitigating errors swiftly. Over the years, they've focused on procuring the most cost-effective components and delivering vehicles that are durable and suitable for many use cases.

Toyota sold 536,740 cars in 2022 with several models under $30k (it's wild that this is a low price for a car these days, but talk to Jay Powell, not me). Despite this affordability, Toyota has become renowned for its quality and durability.

While you can't buy a Toyota that goes zero to sixty in under 3 seconds, their cars would be suitable for 95% of us. I think you can see where I'm going here... This is the declarative equivalent.

The problem with tooling today

We can think of solutions like Ferraris or Toyotas. Do I need an expensive, custom solution to solve my bleeding-edge problem? Or am I after the durable, extensible solution that doesn't break the bank? There is no right answer, but it's important to understand which path you're headed down.

Today, there is no middle ground in data engineering products.

Tools like Airflow and dbt come with hefty implementation costs, steep learning curves, and OH so much wasted energy (have you ever built a dbt project from scratch?)

By contrast, overly declarative GUI tools, e.g. Matillion, Informatica, and Wherescape, are tough to debug, mandate hacky workarounds, and have UIs reminiscent of the Vista rendition of Microsoft Minesweeper. The development experience is eerily similar, too.

If I have to click something more than 3 times to accomplish a task, I'm out.

However, I believe this is about to change. The next wave of great data engineering tools will be both declarative and imperative. Existing tools will adapt... or die. The ideal tool combines both paradigms: it handles the common case remarkably well, but also allows for robust solutions at the edge.

Synergy

Leveraging declarative and imperative components, tools like Meltano, Mage, and even Airflow (with some third-party integrations) can be incredibly powerful.

Take Meltano as an example: in addition to its declarative "marketplace," you can also build sources and targets. Functionally fungible, Meltano taps combine paradigms powerfully.

This is the pattern we'll focus on for the rest of the article: the hybrid declarative/imperative tool.
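As a rough illustration of the hybrid pattern, the sketch below keeps a declarative pipeline definition but lets you register a custom source when nothing prebuilt fits. This is not Meltano's API; the connector registry, the decorator, and both sources are hypothetical.

```python
# Hypothetical registry shared by prebuilt and custom connectors.
CONNECTORS = {}


def connector(name):
    """Register a function as a named source."""
    def register(func):
        CONNECTORS[name] = func
        return func
    return register


@connector("prebuilt_crm")
def prebuilt_crm():
    # Stands in for a connector pulled off the shelf.
    return [{"id": 1}]


@connector("legacy_ftp_dump")
def legacy_ftp_dump():
    # Imperative escape hatch: custom parsing for a source nobody else supports.
    raw_lines = ["7|alice", "8|bob"]  # made-up sample data
    for line in raw_lines:
        user_id, name = line.split("|")
        yield {"id": int(user_id), "name": name}


def extract(pipeline):
    """Run every source named in the declarative pipeline definition."""
    return [
        record
        for name in pipeline["sources"]
        for record in CONNECTORS[name]()
    ]


# The declarative part: name the sources, whether off-the-shelf or custom.
pipeline = {"sources": ["prebuilt_crm", "legacy_ftp_dump"]}
rows = extract(pipeline)
```

Whoever writes the pipeline definition doesn't need to know which sources were prebuilt and which were hand-rolled. They're all just names.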

Analytics engineers are paradigm ninjas

SQL is a great example of a language where imperative and declarative patterns are already used to construct high-level transformations. Many analytics engineers are familiar with the following pattern:

  1. Store common transformations in tables.
  2. Use common tables as inputs to queries.
  3. Leverage CTEs as the "building blocks" of calculations.
  4. Chain CTEs, tables, and aggregates to construct a query.
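The four steps above can be sketched with SQLite from Python. The orders table, its columns, and the "big spender" threshold are invented for this example, so treat it as one possible shape of the pattern rather than a recommendation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw data; table and column names are made up for this sketch.
conn.executescript("""
CREATE TABLE orders (user_id INTEGER, amount REAL, status TEXT);
INSERT INTO orders VALUES
    (1, 20.0, 'paid'),
    (1,  5.0, 'refunded'),
    (2, 12.5, 'paid'),
    (2,  7.5, 'paid');

-- Step 1: store a common transformation in a table.
CREATE TABLE paid_orders AS
SELECT user_id, amount FROM orders WHERE status = 'paid';
""")

# Steps 2-4: use the shared table as input, then chain CTEs and aggregates.
QUERY = """
WITH user_totals AS (
    SELECT user_id, SUM(amount) AS total_spend
    FROM paid_orders
    GROUP BY user_id
),
big_spenders AS (
    SELECT user_id, total_spend
    FROM user_totals
    WHERE total_spend >= 20
)
SELECT COUNT(*) AS n_big_spenders, SUM(total_spend) AS revenue
FROM big_spenders
"""

n_big_spenders, revenue = conn.execute(QUERY).fetchone()
conn.close()
```

The final SELECT reads declaratively, but getting there meant deciding, step by step, which intermediate results to build and in what order. That's the imperative part hiding in plain sight.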

One area where AEs fall short, in my experience, is recycling ♻️ at the query level. "Query libraries" remain an unsolved problem. SQL is written, stashed, and lost more than any company will admit. Worse, there's no marketplace to go find common SQL tidbits.

Surprisingly, the imperative/declarative framework and query libraries are absent from most SQL tooling. I find this odd since many seem enamored with the semantic layer, which one could argue is tangential to developing transformation at scale.

To be fair, Coalesce is pioneering a hybrid approach (they call it Data Architecture as a Service, or DAaaS for short), but their product is targeted at the Snowflake enterprise market.

Shared Resources

What we need is a transformation tool that allows users to share patterns. Not just for data transformation, but for orchestration and data engineering, too! One that democratizes data transformation in the most meaningful way possible: by making common code available in a marketplace-like setting.

A prime example? GitHub Actions.

GitHub Actions revolutionized CI/CD. I say this because I can remember a time when I knew absolutely nothing about CI/CD. While some claim that's still true, I have been able to build some pretty awesome (self-proclaimed) stuff with Actions. 😂

The innovation? GitHub open-sourced the CI/CD "job." Anyone can create one in the marketplace. Now, to create a pipeline, I'm defining my problem, grabbing pre-built code, and plugging it in. Do I need to know how to get a list of changed files on merge? Nope. Do I need to spend hours deploying to Kubernetes? Nope.

All I need to know is:

  1. What I want to accomplish.
  2. What Actions are available.
  3. (Possibly) how to build a custom component if I'm doing something obscure.

Thanks to Google, #2 is pretty easy. So really, all I need to understand is the solution and edge cases... That's insanely powerful.
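To picture what an Actions-style workflow could look like for transformation, here's a small sketch. The "uses"/"with" keys borrow the flavor of Actions workflow files, but CATALOG, the step names, and the sample rows are all hypothetical, and nothing here is a real marketplace.

```python
def drop_nulls(rows, column):
    """Shared step: remove rows where `column` is missing."""
    return [row for row in rows if row.get(column) is not None]


def rename(rows, old, new):
    """Shared step: rename a column."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


def dedupe_by_email(rows):
    """Custom step for an obscure need: keep the first row per email."""
    seen, unique = set(), []
    for row in rows:
        if row["email"] not in seen:
            seen.add(row["email"])
            unique.append(row)
    return unique


# Hypothetical catalog: two "marketplace" steps plus one written in-house.
CATALOG = {
    "drop_nulls": drop_nulls,
    "rename": rename,
    "dedupe_by_email": dedupe_by_email,
}

# The workflow only states what should happen and in which order.
WORKFLOW = [
    {"uses": "drop_nulls", "with": {"column": "email"}},
    {"uses": "dedupe_by_email"},
    {"uses": "rename", "with": {"old": "email", "new": "contact_email"}},
]


def run(workflow, rows):
    """Look up each step by name and apply it with its declared parameters."""
    for step in workflow:
        rows = CATALOG[step["uses"]](rows, **step.get("with", {}))
    return rows


raw = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]
clean = run(WORKFLOW, raw)
```

The three things I need to know map directly onto the code: the workflow is what I want to accomplish, the catalog is what's available, and dedupe_by_email is the custom component for the obscure case.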

Could you imagine if the same thing were true for data orchestration? Transformation? Analysis? The technical barrier to entry would be effectively zero.

How many times have you written the same code someone else wrote last month? What if we could capture 10% of those solutions and open-source them? 30%? 75%? That would revolutionize data transformation.

Conclusion

Declarative and imperative patterns both have their place in data engineering. Unfortunately, most tools in the Modern Data Stack are either declarative or imperative, but not both, resulting in fragmented implementations and the need for far too many tools.

A hybrid approach leverages the best qualities of both solutions and nicely complements collaborative implementations. Architects can "build" imperative solutions, which can then be implemented with declarative language. This promotes knowledge sharing while eliminating bottlenecks.

We're at a crossroads in data tooling. The MDS giants of the future will leverage both declarative and imperative patterns, with code and GUIs, to create tooling that not only democratizes data transformation but open-sources common code via an Actions-like marketplace. Some innovative teams are already building the start of these solutions.

Until then, I advocate leveraging declarative frameworks atop imperative tools (e.g. AstroSDK) or seeking out solutions that have flexibility built-in, like Meltano or Mage.

Data/analytics engineering is currently limited by a lack of solution-sharing. We need a tool that enables us to share solutions and a place to do so. Until then, we'll be confined to only what our teams can accomplish rather than building on the work of engineers before us.

#collaboration #data #meta #opinion