matt's blog

The State of Streaming

Disclaimer: This was assembled from a hodgepodge of personal research, conferences, and chats with industry professionals. It's my best attempt at understanding wtf is going on with streaming, which, given how hard this was to piece together, must be pretty difficult even for streaming aficionados. A special thank you goes out to Zander Matheson of Bytewax for his help!

Background

The landscape of streaming data has seen a rapid evolution in the past decade, driven by increasing data volumes and the necessity for real-time processing (IoT, sensor data, real-time consumer applications). This has given rise to complex platforms and tools that, while powerful, often carry a steep learning curve.

Streaming was once thought of as a low-latency but inaccurate solution, a view typified by the popularity of the Lambda Architecture, a system where batch and streaming pipelines run in parallel, with the batch output "correcting" the streaming output.
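To make the "correcting" idea concrete, here is a toy sketch of the Lambda pattern in Python. Everything in it is made up for illustration (the function names, the event shape, the duplicate-delivery scenario); it only shows the general shape of a fast-but-approximate path being overridden by a slower, exact one.

```python
from collections import Counter

def speed_layer(arrived_events):
    """Fast path: count events per key as they arrive, duplicates and all."""
    return Counter(e["key"] for e in arrived_events)

def batch_layer(history):
    """Slow path: recompute exact counts from stored history, deduplicating by id."""
    seen_ids = set()
    counts = Counter()
    for e in history:
        if e["id"] in seen_ids:
            continue
        seen_ids.add(e["id"])
        counts[e["key"]] += 1
    return counts

def serving_view(batch_counts, speed_counts):
    """Use the batch result wherever it exists; fall back to the speed layer."""
    merged = dict(speed_counts)
    merged.update(batch_counts)
    return merged

# Hypothetical events: id 1 was delivered twice, id 3 has not reached the batch job yet.
e1 = {"id": 1, "key": "a"}
e2 = {"id": 2, "key": "b"}
e3 = {"id": 3, "key": "c"}

speed_counts = speed_layer([e1, e1, e2, e3])   # "a" is over-counted here
batch_counts = batch_layer([e1, e1, e2])       # exact, but missing "c"
view = serving_view(batch_counts, speed_counts)
print(view)
```

The cost of this design is that the same logic lives in two code paths, which is a big part of why people went looking for streaming systems that are accurate on their own.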

Today, well-designed streaming systems commonly provide the same quality of output as their batch counterparts, in no small part thanks to the robust development and community around streaming platforms and stream processing. Before we jump in, a quick note on what streaming data is.

Bounded and Unbounded Data

Data is typically described as either "batch" or "streaming." While we commonly use those terms to describe the underlying data, they actually refer to how the data is processed.

Any data can be processed as a batch or as a stream. When we say "streaming data," we usually mean the characteristics of the data we would like to stream. More appropriate terms for the data itself are bounded and unbounded.
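A quick way to see that batch and streaming describe the processing rather than the data: the same readings can go through either style. This is a minimal sketch with made-up function names and values.

```python
from typing import Iterable, Iterator

def batch_total(readings: list[float]) -> float:
    """Batch style: wait for the whole dataset, then compute one answer."""
    return sum(readings)

def streaming_totals(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the answer as each reading arrives."""
    total = 0.0
    for reading in readings:
        total += reading
        yield total

readings = [1.0, 2.0, 3.5]

final_total = batch_total(readings)
running_totals = list(streaming_totals(iter(readings)))

print(final_total)     # 6.5
print(running_totals)  # [1.0, 3.0, 6.5]
```

The batch version needs the input to end before it can answer. The streaming version never does, which is exactly what makes it usable on data that keeps arriving.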

Bounded data

Bounded data describes a finite pool of resources that is transformed from one state to another. We extract, load, and transform it in our desired system. Easy, manageable, finite.

Unbounded data

Unbounded data represents most datasets. It's a little unnerving, but the data never really ends, does it? Transactions and sensor data will continue to flow (so long as the lights remain on). Today, unbounded data is the norm. There are a number of ways to process unbounded data:

It's crucial to differentiate between event time (when something actually happened) and processing time (when our system gets around to handling it), since the two often disagree.
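Here is a small sketch of that distinction. The `Event` class, the `lag` and `is_late` helpers, and the 30-second threshold are all assumptions for illustration, not any particular framework's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Event:
    payload: str
    event_time: datetime       # when it happened at the source
    processing_time: datetime  # when our system observed it

def lag(event: Event) -> timedelta:
    """How far processing trailed behind the real-world event."""
    return event.processing_time - event.event_time

def is_late(event: Event, allowed_lateness: timedelta) -> bool:
    """True if the event showed up later than we are willing to tolerate."""
    return lag(event) > allowed_lateness

happened_at = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
allowed = timedelta(seconds=30)

on_time = Event("sensor ok", happened_at, happened_at + timedelta(seconds=2))
delayed = Event("sensor ok", happened_at, happened_at + timedelta(seconds=90))

print(lag(delayed))               # 0:01:30
print(is_late(on_time, allowed))  # False
print(is_late(delayed, allowed))  # True
```

Both events describe the same moment in the real world, but a system that groups by processing time would put them 88 seconds apart.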

This demonstrates an important point:

Most solutions described above are actually batch solutions! We should think about most problems as batch problems, as complexity grows geometrically as we require "real-time" solutions, for now at least.

Even streaming unbounded data technically involves processing finite chunks through windowing.
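As a rough illustration of windowing, the sketch below assigns events to fixed-size (tumbling) windows by event time and counts them. The function name and the sample events are made up, and a real stream processor would also need to decide when a window is complete enough to emit, which this toy ignores.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size window.

    events: iterable of (event_time_in_seconds, value) pairs.
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for event_time, _value in events:
        # Round the timestamp down to the start of its window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Hypothetical events, timestamps in seconds since some starting point.
events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]

window_counts = tumbling_window_counts(events, window_seconds=60)
print(window_counts)  # {0: 3, 60: 1, 120: 1}
```

Each window is a small, finite chunk, so the work done per window looks a lot like a tiny batch job.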

With that in mind, we can continue onto streaming platforms.

Streaming Platform

When we say "platforms," we refer to "transportation layers" or systems for communicating and transporting streaming data.

Stream Processing

Stream processing is about analyzing and acting on real-time data (streams). Given Kafka's longevity, the two most popular and well-known stream processing tools are:

Simplifying Stream Processing

Several newcomers are attempting to simplify stream processing by offering Python-native, open-source streaming clients, with a focus on performance and simple development cycles.

Unification of Stream Models

Several streaming services seek to unify stream models by providing an API or platform for creating a model compatible across these technologies:

Streaming DBs

Also seeking to simplify the complexity of streaming data stacks, streaming databases attempt to combine stream processing and storage.

They’re built to handle real-time data and offer continuous insights without the need to transfer data to another processing system.

Streaming databases are particularly attractive for reducing dependencies and improving ease-of-use. RisingWave & Materialize are examples of modern streaming databases.
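To give a feel for "processing and storage in one place," here is a toy, in-memory stand-in for an incrementally maintained view. It is not how RisingWave or Materialize work internally; the class name, methods, and data are all hypothetical and only illustrate the idea of keeping a query result up to date as events arrive.

```python
class RunningAverageView:
    """Toy incrementally maintained view: average order amount per customer."""

    def __init__(self):
        self._sums = {}
        self._counts = {}

    def apply(self, customer: str, amount: float) -> None:
        """Fold one new event into the stored state (the 'processing' part)."""
        self._sums[customer] = self._sums.get(customer, 0.0) + amount
        self._counts[customer] = self._counts.get(customer, 0) + 1

    def query(self, customer: str) -> float:
        """Read the current answer straight from state (the 'storage' part)."""
        return self._sums[customer] / self._counts[customer]

view = RunningAverageView()
view.apply("ada", 10.0)
view.apply("ada", 20.0)
view.apply("bob", 5.0)

print(view.query("ada"))  # 15.0
print(view.query("bob"))  # 5.0
```

The answer is always ready to read, and nothing has to be shipped to a second system to compute it, which is the dependency reduction described above.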

Wrap

The streaming landscape, while complex, has seen strides in simplification and user-friendliness. As the demand for real-time data processing continues to grow, we can anticipate even more tools and platforms that cater to a broader range of users and use-cases.

I’m pretty excited to see what the future of streaming looks like. I anticipate that systems which abstract away the complexity of streaming data will continue to proliferate to the point where, one day, all data will be streamed.

#data #opinion #streaming