A Crash Course in Streaming

Processing infinite data sets

James Collerton
7 min read · Jan 28, 2022
Living the stream

Audience

This article is aimed at engineers with a basic understanding of when we may want to collect and analyse large amounts of data. You don’t necessarily need hands-on experience of designing such systems, but you will need an appreciation of when these sorts of problems occur.

Argument

Let’s begin with why we need streaming.

Motivation

We first explore a problem we are trying to solve. Imagine we work for a company, and we are running a number of servers all over the world. We would like to collect and parse the logs from these servers to see how many requests are succeeding or failing. How would we go about doing so?

We imagine an architecture similar to the below.

Example log collection architecture

All of our servers send their logs to a queue using an agent. A leader node then pulls from the queue and distributes the logs between the nodes responsible for aggregating them. The results are then written to storage.

However, the logs keep coming, and there are thousands a second! The problem is simple to state, but tough to solve logistically.

Let’s clarify the properties we need from a streaming engine.

  1. We want data and insights in near real-time, even if it is not 100% accurate (low-latency, approximate results).
  2. We want to deal with data that is continuously, infinitely generated (unbounded data).
  3. We want to process data consistently as it comes in, spreading our processing more evenly over time (unbounded data processing).

Introducing Streaming

Now we’ve got our heads around when we might need streaming, let’s talk about how we define it. This article, written by a Google engineer, boils it down to a type of data processing engine that is designed with infinite data sets in mind. In terms of the properties above, this means we can deal with unbounded data.

Note that this definition doesn’t constrain the other two points. A streaming engine doesn’t have to process data the instant it arrives, and it is allowed to provide absolutely correct results (i.e. they’re not always approximate; we can get strongly consistent data).

Streaming vs Batch Processing vs Micro-Batching

Just in case you haven’t heard of batch processing, it entails chopping your data into groups in order to carry out some work on it. We collect data over time, then process it in chunks. As we are generally collecting it over a window, this can mean some latency in gathering results (i.e. we have to wait for the window to close before we process anything).

An example of batch processing might be payroll. We collect work data over a month, then process it at the end to calculate how much people should be paid. This is a long window, but you get the idea.
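
To make that concrete, here is a minimal, made-up sketch of the batch pattern in Java: records are buffered until a (here count-based) window closes, and nothing gets processed before then, which is exactly where the latency comes from.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of batch processing: buffer records as they arrive,
// then process the whole chunk once the window (here, a fixed count) closes.
public class BatchExample {

    private static final int WINDOW_SIZE = 1000;
    private final List<String> buffer = new ArrayList<>();

    public void onRecord(String record) {
        buffer.add(record);
        if (buffer.size() >= WINDOW_SIZE) {
            processBatch(buffer);   // nothing happens until the window is full
            buffer.clear();
        }
    }

    private void processBatch(List<String> batch) {
        // e.g. aggregate, then write the results to storage
        System.out.println("Processing " + batch.size() + " records");
    }
}
```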

On the other hand, we may want to implement fraud detection for a bank. If we can analyse in real time where purchases are coming from (say one comes from the UK, then immediately another comes from the US), it’s easy to see that something odd is happening to that account.
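
As a toy illustration (not a real fraud model, and using a Purchase type invented purely for the example), a streaming rule for this might look something like the following, applied to each purchase as it arrives:

```java
import java.time.Duration;
import java.time.Instant;

// A hypothetical, simplified fraud rule: flag an account if two purchases
// come from different countries within a short interval of one another.
record Purchase(String accountId, String country, Instant eventTime) {}

class FraudRule {
    private static final Duration SUSPICIOUS_GAP = Duration.ofSeconds(30);

    boolean isSuspicious(Purchase previous, Purchase current) {
        boolean differentCountry = !previous.country().equals(current.country());
        boolean closeTogether = Duration.between(previous.eventTime(), current.eventTime())
                                        .abs()
                                        .compareTo(SUSPICIOUS_GAP) < 0;
        return differentCountry && closeTogether;
    }
}
```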

Choosing batch vs streaming depends on when you need your results and the complexity of your analysis. If you need results instantly, then streaming is a good solution. If you can wait, or you can’t implement streaming, then batch may be for you. Additionally if you need to do complex analysis (say over an extended timeframe) then batch processing is also helpful.

Something that frequently comes up in the streaming world is the idea of micro-batching. This is when the batch window is so small that it essentially becomes streaming, and is worth noting for later.

Use Cases for Streaming and Batching

So when would we use these technologies? Let’s delineate some examples:

  1. Filtering is a good start. Say we have a lot of logs coming into our imaginary system and we only want to look at some of them. We can remove uninteresting ones from our stream (there’s a sketch of this one just after the list).
  2. Joining data is another good area. If we have a start log and an end log to tell us when a transaction has completed, we could store the start in some persistent storage, wait for the end, join the two and send on the aggregated result.
  3. Approximation algorithms allow us to take data in, do some computation and then return a rough result (remember our quick, but not necessarily accurate data insights?).
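
As promised, here is a small, framework-agnostic sketch of the filtering use case: each log line is inspected as it arrives and only the interesting ones are passed downstream. The ERROR/5xx predicate is just an example.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// A sketch of use case 1: filter a stream of log lines as they flow through,
// only forwarding the ones we care about (here, failures).
class LogFilter {
    private final Predicate<String> keep = line -> line.contains("ERROR") || line.contains("5xx");
    private final Consumer<String> downstream;

    LogFilter(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    // Called once per log line as it arrives; uninteresting lines are dropped.
    void onLog(String line) {
        if (keep.test(line)) {
            downstream.accept(line);
        }
    }
}
```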

All of the above are great as long as we aren’t worried about when events occurred (they are time-agnostic, or stateless). The minute we introduce a temporal element we can run into issues with both streaming and batching. This is partly where the idea of streaming giving approximate results comes from.

To explain this, let’s mark the difference between event times and processing times. Event time is when the data was actually generated. Processing time is when it comes through our engine. They are by no means the same: network latency, slow machines, servers dropping out and a host of other things can mean that they end up quite different.

Returning to our banking example: our fraudulent transactions may happen milliseconds apart, but be processed seconds apart. How do we detect this?
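
To make the gap concrete, here is a hypothetical record carrying both timestamps: the two purchases happened a fraction of a second apart, but reached our engine several seconds apart, so any logic keyed purely on processing time could easily miss the pattern.

```java
import java.time.Instant;

// A hypothetical record carrying both timestamps. Event time is stamped where
// the purchase happened; processing time is stamped when our engine sees it.
record TimedPurchase(String accountId, String country,
                     Instant eventTime, Instant processingTime) {}

class EventVsProcessingTime {
    void illustrate() {
        // The two purchases happened 200 ms apart (event time)...
        TimedPurchase uk = new TimedPurchase("acc-1", "UK",
                Instant.parse("2022-01-28T10:00:00.000Z"),
                Instant.parse("2022-01-28T10:00:01.000Z"));
        TimedPurchase us = new TimedPurchase("acc-1", "US",
                Instant.parse("2022-01-28T10:00:00.200Z"),
                Instant.parse("2022-01-28T10:00:07.500Z"));
        // ...but reached us over six seconds apart (processing time). A rule
        // keyed on processing time alone could miss the pattern entirely.
    }
}
```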

Dealing with Temporal Information

Batch processing tends to split our data down into windows. This is great for things like the payroll example, where our window is fixed. However, if we don’t know the start and the end time then things become complex.

Streaming tries to deal with this problem in a more flexible way. Initially, we can still use the fixed window method of batching. However, we can also look into using sliding windows (where we have some overlap), or sessions (where we define a window as having ended when there is a certain level of inactivity).
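
To give a flavour of how these look in practice, here is how the three window shapes are expressed in Kafka Streams (which we’ll meet properly later); Flink and friends expose very similar constructs. The exact method names below assume a recent Kafka version.

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

// A sketch of the three window shapes as they appear in Kafka Streams.
public class WindowShapes {
    // Fixed (tumbling) window: back-to-back five-minute buckets.
    TimeWindows fixed = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));

    // Overlapping (hopping) window: five-minute buckets that start every minute.
    TimeWindows overlapping = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))
                                         .advanceBy(Duration.ofMinutes(1));

    // Session window: closed after two minutes of inactivity on a key.
    SessionWindows session = SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(2));
}
```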

Unfortunately we can only do the above based on processing time. We have no idea when an event will make its way to our system, so it’s incredibly hard to guarantee we have all events within a given period!

We can introduce more complex techniques like watermarks, triggers and accumulation, but this is outside the scope of the article.

Streaming Technologies

So far we have introduced the idea behind streaming, compared it to batching, and delineated some of the problems we might experience when using it. However, to get a solid understanding it may help to introduce some of the relevant technologies. We will begin with some batch processing technologies, then move onto streaming.

For batch processing there is Spring Batch, AWS Batch and MapReduce. We won’t dwell on them too long, as they are essentially frameworks for running jobs based on conditions, with some functionality for starts, stops and retries.

The more pertinent technologies are those concerned with streaming, and there’s a load! AWS Kinesis, Apache Flink, Kafka, Spark Streaming and Apache Storm to name just a few.

Kinesis and Flink I won’t cover in detail as I have a whole article using them here. It will really help your understanding to go check it out then come back. I’ll be waiting!

Kinesis comes in a few flavours, but the most relevant two are Kinesis Data Streams and Kinesis Data Analytics. Circling all the way back to our logging example, Kinesis Data Streams acts as our queue. It allows multiple producers to write to it, and multiple consumers to read from it. Kinesis Data Analytics then lets us host Apache Flink jobs for doing things like aggregating our data, whilst also remaining stateful (i.e. we can deal with event times).
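
As a rough sketch (with a made-up stream name and partition key), writing one of our log lines to a Kinesis Data Stream with the AWS SDK for Java v2 looks something like this:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

// A minimal sketch of a producer writing a log line to a Kinesis Data Stream.
public class KinesisLogProducer {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("server-logs")                 // hypothetical stream name
                    .partitionKey("server-eu-west-1a")         // groups records onto shards
                    .data(SdkBytes.fromUtf8String("GET /health 200"))
                    .build();
            kinesis.putRecord(request);
        }
    }
}
```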

Kafka is similar. It allows us to read, write, store and process events as they happen, or in retrospect. In terms of producers, consumers, topics etc. it acts as a message broker in a way that is conceptually very similar to Kinesis. The Kafka ‘getting started’ documentation is excellent and can be found here.
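
The equivalent sketch for Kafka, again with a made-up topic name, is a producer sending a log line to a topic that consumers elsewhere can read from:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// A minimal sketch of a Kafka producer writing a log line to a topic.
public class KafkaLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = server identifier, value = the log line itself.
            producer.send(new ProducerRecord<>("server-logs", "server-eu-west-1a", "GET /health 200"));
        }
    }
}
```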

One interesting thing to note is Kafka Streams. This does processing, but is embedded within your application. This is different to Kinesis, where processing is generally hosted in a Kinesis Data Analytics application. Kafka Streams is just a dependency for your Java project, which you build and deploy as with any other application.
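
To show what “embedded within your application” means, here is a minimal, hypothetical Kafka Streams topology that filters error logs from one topic to another (topic names invented for illustration). Note that streams.start() runs the processing inside your own JVM rather than on a separate processing service:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// A sketch of a Kafka Streams job living inside your own application.
public class ErrorLogFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "error-log-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from "server-logs", keep only error lines, write to "error-logs".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("server-logs");
        logs.filter((key, line) -> line.contains("ERROR")).to("error-logs");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start(); // runs in this JVM, alongside the rest of your code
    }
}
```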

The final technology we will touch on is Spark Streaming (sorry, Apache Storm). I once wrote a test project in Spark, which you can find here. Spark Streaming is a component of the wider Spark ecosystem. It is marginally different to the others in that it uses micro-batching, and different to Kafka Streams in that it is hosted standalone rather than embedded in your application.
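
For completeness, here is a small sketch using Spark’s Structured Streaming Java API (the newer streaming API in the Spark ecosystem): it reads lines from a socket, keeps a running count of each distinct line, and under the hood Spark executes this as a series of micro-batches. The host and port are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

// A sketch of Spark Structured Streaming: socket in, running counts out.
public class SparkLogCounts {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("LogCounts")
                .getOrCreate();

        // Each incoming line becomes a row in the "value" column.
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // Count occurrences of each line and print the running totals.
        StreamingQuery query = lines.groupBy("value").count()
                .writeStream()
                .outputMode("complete")
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```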

Hosting Streaming

This final short section will focus on how we host streaming. We have briefly touched on it in the previous section, but we can delve into more detail here.

Kinesis is simple, as it is a managed AWS service. We can use Kinesis Data Analytics to host Flink as well. The other AWS service that is widely used for this sort of thing is Amazon EMR (Elastic MapReduce), which can be used to run things like Spark Streaming.

Finally we can use AWS MSK to host Kafka (with the caveats we spoke about earlier).

Conclusion

In conclusion, we have gone over the problem streaming tries to solve, some of the issues involved with streaming, some of the technologies, and how we might host a solution.

