DDIA Chapter 11: Stream Processing
Tony Duong
Jun 2, 2026 ・ 7 min
Chapter 11 of Designing Data-Intensive Applications picks up where batch left off. If batch processing answers "what happened over this huge dataset?", stream processing asks "what is happening right now, as events arrive?" This chapter connects messaging systems, databases, and real-time computation into one coherent picture.
Where stream processing sits
Recall the three-system model from Chapter 10:
- Services (online): respond to user requests; latency in milliseconds.
- Batch (offline): process a fixed dataset; latency in minutes or hours.
- Stream (near-real-time): process events as they flow in; latency in seconds — not instant like a request/response, but far faster than batch.
Stream processing is the glue between "something happened" and "react to it soon."
Events and event streams
An event is a small, immutable, self-contained record of something that happened at a point in time — a page view, a sensor reading, a payment.
Events differ from commands (do this) and state (current value). Events are facts about the past.
An event stream is an unbounded sequence of events, usually ordered by time. Producers append; consumers read forward (and sometimes replay from an earlier offset).
Encoding matters: plain JSON is easy to produce and debug, but it carries no enforced schema, which makes evolving events painful. Avro, Protobuf, and schema registries help producers and consumers evolve independently.
Message brokers and logs
Message brokers (RabbitMQ, Kafka, Pulsar, etc.) decouple producers from consumers. It is the same decoupling pattern you see with queues in system design interviews, but oriented around streams of events rather than one-off jobs.
Two broad models:
| Model | Behavior | Examples |
|---|---|---|
| Queue (AMQP/JMS style) | Message deleted after ack; competing consumers load-balance | RabbitMQ, SQS |
| Log (Kafka style) | Messages retained; consumers track offset; replay possible | Kafka, Pulsar |
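Partitioned logs scale throughput: each partition is an ordered sub-stream, and ordering is guaranteed only within a partition, not across partitions. Producers pick a partition key, so all events with the same key land in the same partition; consumers in a group each own some partitions.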
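To make the partition-key idea concrete, here is a minimal in-memory sketch. It is not how Kafka or Pulsar is implemented; `PartitionedLog` and its methods are hypothetical names, and CRC32 stands in for whatever hash a real client uses.

```python
import zlib


class PartitionedLog:
    """Toy in-memory log: each partition is an append-only list of events."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reading never deletes anything; a consumer can re-read from any offset.
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
for i in range(4):
    log.append("user-1", f"click-{i}")
log.append("user-2", "click-0")

# All of user-1's events sit in one partition, in the order they were appended.
p = log.partition_for("user-1")
user1_events = [value for key, value in log.read(p, 0) if key == "user-1"]
```

Because a key always hashes to the same partition, per-key ordering is preserved even though there is no global order across partitions.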
Design tricks that keep brokers fast:
- Batching small writes into larger ones
- Sequential disk writes (append-only logs) instead of random I/O
- Zero-copy transfer where possible
Consumer patterns:
- Load balancing: each message to one consumer in the group
- Fan-out: same message to many independent consumer groups (e.g. indexing + analytics + billing from one stream)
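The two patterns can be combined, as this small sketch shows: within a group, each partition is owned by exactly one member (load balancing), while separate groups keep separate offsets and so each see every message (fan-out). `ConsumerGroup` and the round-robin assignment are illustrative assumptions, not a real broker's rebalancing protocol.

```python
class ConsumerGroup:
    """Toy consumer group with its own per-partition offsets."""

    def __init__(self, name, members, num_partitions):
        self.name = name
        # Round-robin: each partition is owned by exactly one member of the group.
        self.assignment = {p: members[p % len(members)] for p in range(num_partitions)}
        self.offsets = {p: 0 for p in range(num_partitions)}

    def poll(self, partitions):
        """Return {member: [messages]} for messages this group has not consumed yet."""
        delivered = {}
        for p, messages in enumerate(partitions):
            new = messages[self.offsets[p]:]
            if new:
                delivered.setdefault(self.assignment[p], []).extend(new)
            self.offsets[p] = len(messages)
        return delivered


partitions = [["a1", "a2"], ["b1"], ["c1", "c2"]]
indexing = ConsumerGroup("indexing", ["idx-0", "idx-1"], num_partitions=3)
billing = ConsumerGroup("billing", ["bill-0"], num_partitions=3)

indexing_batch = indexing.poll(partitions)  # split across two members
billing_batch = billing.poll(partitions)    # same messages, independent offsets
second_poll = indexing.poll(partitions)     # nothing new for this group
```

Each message reaches one member per group, and every group sees the full stream.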
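Acknowledgements and redelivery mirror queue semantics: if a consumer crashes before acknowledging a message, the broker delivers it again. At-least-once delivery is therefore the common default, and consumers must be idempotent so that a duplicate has no extra effect.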
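One common way to get idempotence is to deduplicate on a message ID. The sketch below assumes every message carries a unique `id` field (an assumption, not something brokers guarantee); `IdempotentLedger` is a hypothetical name.

```python
class IdempotentLedger:
    """Applies each message at most once, even if the broker redelivers it."""

    def __init__(self):
        self.balance = 0
        self.seen_ids = set()

    def handle(self, message):
        if message["id"] in self.seen_ids:
            return False  # duplicate delivery: ignore it
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True


deliveries = [
    {"id": "m1", "amount": 50},
    {"id": "m2", "amount": 20},
    {"id": "m1", "amount": 50},  # redelivered after a missed acknowledgement
]

ledger = IdempotentLedger()
applied = [ledger.handle(m) for m in deliveries]
```

In a real system the set of seen IDs and the state it protects would need to be persisted together, otherwise a crash between the two updates reintroduces the duplicate.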
Databases and streams — two sides of the same coin
The chapter's central insight: a database's write-ahead log (WAL) is already a stream of events. Replication often works by tailing that log. Stream processing extends the idea outward.
Change Data Capture (CDC)
CDC turns database row changes into a stream of events (insert/update/delete). Tools like Debezium tail the DB transaction log.
Use cases: invalidate caches, update search indexes, sync read replicas, feed analytics — all without polling the database.
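As a rough illustration of the consumer side, the sketch below applies a stream of change events to a derived store such as a cache or search index. The event shape (`op`, `key`, `after`) is a simplified, hypothetical format, not Debezium's actual envelope.

```python
def apply_change(index, change):
    """Apply one row-level change event to a derived key-value store."""
    op = change["op"]
    key = change["key"]
    if op in ("insert", "update"):
        index[key] = change["after"]  # upsert the new row image
    elif op == "delete":
        index.pop(key, None)          # tolerate deletes for unknown keys
    else:
        raise ValueError(f"unknown operation: {op}")


changes = [
    {"op": "insert", "key": 1, "after": {"name": "Ada", "city": "London"}},
    {"op": "insert", "key": 2, "after": {"name": "Linus", "city": "Helsinki"}},
    {"op": "update", "key": 1, "after": {"name": "Ada", "city": "Paris"}},
    {"op": "delete", "key": 2, "after": None},
]

search_index = {}
for change in changes:
    apply_change(search_index, change)
```

Because each event carries the full new row image, applying the same event twice leaves the index unchanged, which fits at-least-once delivery.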
Event sourcing
Instead of storing current state and overwriting it, store the sequence of events that led to the current state. State is derived by replaying events (or snapshots + replay).
Advantages:
- Complete audit trail
- Temporal queries ("what did the account look like on March 1?")
- Easier evolution — new consumers can rebuild state from the log
- Aligns with immutable, append-only thinking from batch processing
Drawbacks:
- Schema evolution on stored events is hard
- Replaying long histories is slow without snapshots
- "Deleting" data becomes awkward (tombstone events, compaction)
- Not every domain fits naturally as a pure event log
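A small sketch of the idea, using a hypothetical bank-account domain: state is a fold over events, a temporal query is a replay that stops early, and a snapshot is a saved intermediate result. The `Event` type and function names are illustrative assumptions, and the history is assumed to be ordered by day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew"
    amount: int
    day: str     # ISO date, so string comparison matches date order


def apply(balance, event):
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, snapshot=0, up_to_day=None):
    """Derive the balance by folding events onto a starting snapshot."""
    balance = snapshot
    for event in events:
        if up_to_day is not None and event.day > up_to_day:
            break
        balance = apply(balance, event)
    return balance


history = [
    Event("deposited", 100, "2024-02-10"),
    Event("withdrew", 30, "2024-02-20"),
    Event("deposited", 50, "2024-03-05"),
]

current = replay(history)                             # full replay
on_march_1 = replay(history, up_to_day="2024-03-01")  # temporal query
snapshot = replay(history[:2])                        # saved after two events
from_snapshot = replay(history[2:], snapshot=snapshot)
```

Replaying from the snapshot gives the same answer as replaying everything, which is why snapshots are a safe optimization for long histories.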
State, streams, and immutability
Batch philosophy (immutable inputs, deterministic functions) carries over: immutable events + derived state is a powerful default. The database holds the current snapshot; the stream holds the history.
Uses of stream processing
Stream processors (Kafka Streams, Flink, Spark Structured Streaming, ksqlDB) apply functions to incoming events:
- Complex Event Processing (CEP): detect patterns across events ("three failed logins then a large transfer")
- Stream analytics: rolling counts, averages, percentiles over windows
- Maintaining materialized views: keep an aggregated table up to date incrementally
- Search/index updates: push documents into Elasticsearch as they change
- Notifications and alerting: react within seconds, not after the nightly batch
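The CEP example above ("three failed logins then a large transfer") can be sketched as a small per-user state machine. The event fields, the `LARGE_TRANSFER` threshold, and the reset-after-transfer rule are all assumptions made for illustration; real CEP engines express such patterns declaratively.

```python
from collections import defaultdict

LARGE_TRANSFER = 10_000  # hypothetical threshold


def detect_suspicious(events, failures_needed=3):
    """Flag a large transfer that follows enough failed logins by the same user."""
    failures = defaultdict(int)
    alerts = []
    for event in events:
        user = event["user"]
        if event["type"] == "login_failed":
            failures[user] += 1
        elif event["type"] == "transfer":
            if failures[user] >= failures_needed and event["amount"] >= LARGE_TRANSFER:
                alerts.append((user, event["amount"]))
            failures[user] = 0  # pattern consumed; start counting again
    return alerts


events = [
    {"user": "alice", "type": "login_failed"},
    {"user": "alice", "type": "login_failed"},
    {"user": "bob", "type": "login_failed"},
    {"user": "alice", "type": "login_failed"},
    {"user": "bob", "type": "transfer", "amount": 50_000},
    {"user": "alice", "type": "transfer", "amount": 25_000},
    {"user": "alice", "type": "transfer", "amount": 25_000},
]

alerts = detect_suspicious(events)
```

Only Alice's first transfer matches: Bob has a single failed login, and Alice's counter resets once the pattern fires.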
Reasoning about time
This is one of the chapter's hardest topics — and one interviewers love.
Two clocks:
- Event time: when the event actually happened (timestamp in the event)
- Processing time: when the processor sees it
Network delays, retries, and mobile offline sync mean event time and processing time diverge. Stream systems must handle out-of-order events.
Windows
Aggregations usually happen over windows of time:
| Window type | Description |
|---|---|
| Tumbling | fixed, non-overlapping (e.g. every 1 minute) |
| Hopping | fixed size, overlapping (e.g. 5-min window every 1 min) |
| Sliding | all events that occur within N minutes of each other |
| Session | gap-based — group events with pauses < threshold into one session |
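The window types in the table differ mainly in how an event timestamp maps to windows. A minimal sketch, assuming integer timestamps in seconds and half-open windows `[start, end)`; the function names are hypothetical.

```python
def tumbling_window(ts, size):
    """The single fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts, size, hop):
    """Every window of length `size`, starting at a multiple of `hop`, that contains ts."""
    windows = []
    start = ts - ts % hop  # latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause of `gap` or more starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


one_minute = tumbling_window(125, size=60)
five_min_every_minute = hopping_windows(425, size=300, hop=60)
sessions = session_windows([0, 10, 25, 100, 110, 300], gap=30)
```

With a 5-minute window hopping every minute, each event belongs to five overlapping windows, whereas a tumbling window assigns it to exactly one.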
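Watermarks estimate how far event time has progressed. They tell the processor "we probably won't see events older than X anymore", so windows can close and emit results. Stragglers that arrive after the watermark has passed their window must then be either dropped or handled by emitting a corrected result.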
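A minimal sketch of watermark-driven window closing, assuming a simple heuristic watermark (largest event time seen so far minus a fixed allowed lateness) and a policy of dropping late events. Real engines offer more sophisticated watermark generators; `windowed_counts` is a hypothetical name.

```python
def windowed_counts(timestamps, window_size, allowed_lateness):
    """Count events per tumbling window, closing windows as the watermark advances.

    `timestamps` are event times listed in arrival order.
    Returns (emitted, dropped, still_open).
    """
    open_windows = {}
    emitted, dropped = [], []
    watermark = float("-inf")

    for ts in timestamps:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            dropped.append(ts)  # straggler: its window has already been emitted
            continue
        open_windows[start] = open_windows.get(start, 0) + 1

        # Heuristic watermark: assume nothing arrives more than
        # `allowed_lateness` behind the newest event seen so far.
        watermark = max(watermark, ts - allowed_lateness)
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                emitted.append((s, open_windows.pop(s)))

    return emitted, dropped, open_windows


arrivals = [5, 20, 61, 50, 75, 30, 130]  # 50 and 30 arrive out of order
emitted, dropped, still_open = windowed_counts(arrivals, window_size=60, allowed_lateness=10)
```

The event at time 50 arrives late but before its window closes, so it is counted; the event at time 30 arrives after the watermark has passed 60 and is dropped.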
Stream joins
Joins in streams are trickier than in batch because the input is unbounded and arrives continuously, so the processor must decide how much state to keep and for how long.
- Stream–stream join: two streams, events matched if they fall in the same window (e.g. clicks joined with impressions within 30 minutes)
- Stream–table join: enrich events with lookup data from a changelog stream of a table (CDC makes this natural)
- Table–table join: both sides are changelog streams; equivalent to maintaining a materialized join view incrementally
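As an illustration of the stream–table case, the sketch below keeps a local copy of a table up to date from its changelog and uses it to enrich click events. The record shapes and the name `stream_table_join` are assumptions; the two inputs are shown already interleaved in arrival order.

```python
def stream_table_join(records):
    """Enrich click events with the latest known profile for the same user.

    `records` holds ("profile", user_id, profile_or_None) changelog entries
    and ("click", user_id, url) events, in arrival order.
    """
    table = {}      # local materialized copy of the profile table
    enriched = []
    for kind, user_id, payload in records:
        if kind == "profile":
            if payload is None:
                table.pop(user_id, None)  # None marks a deleted row
            else:
                table[user_id] = payload
        elif kind == "click":
            profile = table.get(user_id)
            enriched.append({
                "user": user_id,
                "url": payload,
                "country": profile["country"] if profile else None,
            })
    return enriched


records = [
    ("profile", "u1", {"country": "DE"}),
    ("click", "u1", "/home"),
    ("profile", "u1", {"country": "FR"}),
    ("click", "u1", "/pricing"),
    ("click", "u2", "/home"),
]

enriched = stream_table_join(records)
```

The result depends on the order in which the two inputs are interleaved: the first click sees the old profile and the second sees the updated one.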
Fault tolerance
Stream jobs run for days or months. When a node crashes, state must survive.
Approaches:
- Microbatching (Spark Streaming legacy model): chop the stream into small batch intervals; checkpoint each batch. Simple but higher latency.
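- Checkpointing (Flink): periodically snapshot operator state + input offsets to durable storage. On recovery, resume from the last checkpoint. Kafka Streams gets a similar effect by backing its state stores with changelog topics and committing input offsets.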
- Idempotent writes + at-least-once delivery: same safety net as message queues — design outputs to tolerate retries.
Determinism and replayability (from batch) apply here too: if you can replay events from a log and get the same result, recovery is straightforward.
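The interplay of checkpointing and replay can be sketched in a few lines. Here a plain dict stands in for durable storage and a list stands in for a replayable log; `run` is a hypothetical function, and the "crash" is simulated by stopping early and discarding in-memory state.

```python
import copy


def run(log, store, stop_at, checkpoint_every=3):
    """Resume from the checkpoint in `store` and process log[offset:stop_at]."""
    counts = copy.deepcopy(store["counts"])
    offset = store["offset"]
    while offset < stop_at:
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # State and input offset are saved together, so they always agree.
            store["counts"] = copy.deepcopy(counts)
            store["offset"] = offset
    return counts


log = ["a", "b", "a", "c", "a", "b", "b"]

durable = {"counts": {}, "offset": 0}
run(log, durable, stop_at=5)               # "crash" here; in-memory counts are lost
offset_after_crash = durable["offset"]     # last checkpoint was taken at offset 3
recovered = run(log, durable, stop_at=len(log))

uninterrupted = run(log, {"counts": {}, "offset": 0}, stop_at=len(log))
```

Events 3 and 4 are processed twice overall, but because the job restarts from a consistent snapshot and the counting is deterministic, the recovered result matches an uninterrupted run.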
Batch vs stream — and unifying them
Historically:
- Lambda architecture: batch layer (accurate, slow) + speed layer (approximate, fast) + serving layer merging both. Powerful but operationally painful — two pipelines, two codebases.
- Kappa architecture: treat everything as a stream; reprocessing is just replaying the log from an earlier offset through the same code. Conceptually simpler, provided your log retains enough history.
Modern systems blur the line: Spark Structured Streaming, Flink batch mode, and Kafka's log retention mean one engine, two modes — stream for low latency, replay the same log for backfill.
Key takeaways
- Events are immutable facts; streams are unbounded, ordered sequences of them.
- Message brokers decouple producers and consumers; log-based systems (Kafka) add retention and replay, and let independent consumer groups fan out from the same log at their own pace.
- CDC and event sourcing connect databases to streams — the WAL was always a stream underneath.
- Event time vs processing time and watermarks are essential for correct windowed aggregations.
- Stream joins (stream–stream, stream–table, table–table) extend relational thinking to unbounded data.
- Checkpointing and idempotent consumers make long-running stream jobs fault-tolerant.
- Batch and stream are partners, not opposites — same immutability and determinism ideas, different latency targets.