Batch vs Stream Processing

A central design choice in data systems is whether to process data in batches or as a stream. Batch processing collects a large, bounded set of records and runs a job over the whole set at once, typically on a schedule. This is the model of MapReduce and of the original Apache Spark engine, which excel at chewing through large fixed datasets where some delay before results appear is acceptable.

Stream processing instead handles data continuously, record by record, as it arrives. Apache Flink frames this in terms of stream boundedness: an unbounded stream “has a start but no defined end” and must be processed as events flow in, while a bounded stream is one with a defined end. Engines such as Flink and the Kafka Streams library compute results in near real time, which matters for fraud detection, monitoring, and live dashboards where waiting for a nightly batch is too slow.
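The contrast can be made concrete with a minimal sketch in plain Python. It does not use the Flink, Kafka Streams, or Spark APIs; the function names `batch_count` and `stream_count` and the sample events are hypothetical, chosen only for illustration. The batch function sees the whole bounded dataset and returns one result at the end, while the streaming function updates its state per record and emits a fresh result as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(records: list[str]) -> Counter:
    """Batch style: the full bounded dataset is available up front,
    and a single result is produced once the whole set is processed."""
    return Counter(records)


def stream_count(records: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Stream style: keep running state, update it per record,
    and emit the updated count for that record's key immediately."""
    counts: Counter = Counter()
    for record in records:
        counts[record] += 1
        yield record, counts[record]


# Hypothetical sample events, used only for illustration.
events = ["login", "purchase", "login"]

batch_result = batch_count(events)                  # one answer, after the fact
stream_updates = list(stream_count(iter(events)))   # one update per event
```

In the batch case nothing is known until the job finishes; in the streaming case a consumer such as a dashboard or fraud check could react to each update as it is emitted.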

The line between the two has blurred. Flink’s site presents batch as a special case of streaming: a bounded stream processed by the same engine. Apache Spark, coming from the batch side, added streaming support so that batch and real-time data can be processed by one engine. As a result, a single platform can often serve both styles.
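The idea of batch as a bounded stream can be sketched in plain Python, again without any real engine's API. The operator `running_total` and the sample inputs are assumptions made for illustration. The same operator runs over a finite input, where its last emitted value is the batch answer, and over an endless input, where only a prefix can ever be observed.

```python
from itertools import accumulate, count, islice


def running_total(amounts):
    """One operator for both cases: emit the cumulative sum after each record."""
    return accumulate(amounts)


# Bounded stream: the input ends, so the final emitted value is the batch result.
bounded = [5, 10, 20]
bounded_updates = list(running_total(bounded))
batch_answer = bounded_updates[-1]

# Unbounded stream: count(1) never ends, so we can only look at a prefix
# of the results while the operator keeps running.
unbounded = count(start=1)
first_updates = list(islice(running_total(unbounded), 4))
```

The operator itself never needs to know whether its input ends; only the caller's ability to ask for a final answer differs.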

Understanding this axis explains much of the shape of a modern data stack: where latency requirements are loose, scheduled batch jobs over a data lake suffice; where freshness is essential, a streaming pipeline carries events continuously from source to result.
