Batch vs Streaming Windowing in Apache Beam: A Mental Model for BigQuery Sinks · Ismail Ait Bahammou — Portfolio

The problem most tutorials skip

Most introductions to Apache Beam explain windowing as a syntax feature: you call beam.WindowInto(), pick FixedWindows, SlidingWindows, or Session windows, and move on. What they rarely explain is why windowing exists in the first place, and why the concept only becomes non-trivial once your source is unbounded.

When you're processing a batch — a fixed export of order records sitting in Cloud Storage, for example — there's no ambiguity about when the data is "done." The job reads the file, processes every record, and finishes. A window in this context is mostly a grouping mechanism: a way to bucket records by time for aggregation, no different conceptually from a GROUP BY DATE(event_timestamp) in BigQuery.

Streaming breaks that assumption entirely. In an unbounded pipeline — say, clickstream events arriving continuously from a mobile app — there is no natural "end" of the data. Beam has to answer a much harder question for every window: how do I know when I've seen enough data to emit a result, given that events can arrive late, out of order, or not at all? That question is what windowing, watermarks, and triggers exist to solve, and it's the piece that trips up most engineers coming from a batch-SQL background.

Watermarks: Beam's best guess at "now"

A watermark is Beam's estimate of event-time progress — a moving line that says "I believe I have seen all data with an event timestamp before this point." It is explicitly a heuristic, not a guarantee. The watermark is computed differently depending on the source: for Pub/Sub, it's derived from publish timestamps and per-key backlog; for Kafka, from broker offsets and timestamps.

This matters because the watermark is what triggers window completion by default. A FixedWindows(60) window doesn't fire the instant 60 seconds of wall-clock time passes — it fires when the watermark crosses the end of that window, which depends on how far behind the source is and how much out-of-order arrival Beam is accounting for. If your ingestion pipeline has a client that buffers events for several minutes before uploading (common with mobile apps in poor connectivity conditions), your watermark estimate needs to account for that lag, or you'll be closing windows before the relevant data has actually arrived.

Triggers and allowed lateness: deciding when "good enough" is good enough

Once you accept that the watermark is a guess, the next question is what to do about events that arrive after their window has already fired. Beam gives you three levers to control this:

Triggers decide when a window emits a pane of results. The default is watermark-based (fire once, when the watermark passes the window end), but you can add early triggers (emit speculative results every N seconds while the window is still open) or late triggers (re-emit when a late record arrives).
Allowed lateness defines how long after the watermark passes a window's end you're still willing to accept and incorporate late data into that window, after which the window's state is discarded.
Accumulation mode determines whether a re-fired window emits only the new data since the last pane (DISCARDING) or the full, updated result including everything seen so far (ACCUMULATING).
This combination is the actual design surface of a streaming pipeline. A batch job never has to think about "what if this record shows up three hours after I already reported the daily total" — a streaming job has to make that decision explicitly, and the decision has direct downstream consequences for correctness.

Where this collides with BigQuery as a sink

This is the part that matters most in practice, because it's where an elegant Beam windowing strategy can silently produce a messy, duplicate-riddled BigQuery table if you don't think it through.

When a windowed streaming pipeline writes to BigQuery, each firing of a window produces a write, not necessarily a final, immutable record. If you're using ACCUMULATING mode with early and late triggers, a single logical window (say, "orders placed between 14:00 and 14:05") might write to BigQuery three or four times as more data arrives — first a speculative partial result, then a watermark-triggered "complete" result, then one or two late-data corrections. If your sink is a plain WriteToBigQuery using streaming inserts with WRITE_APPEND, every one of those firings becomes a new row, not an update. You end up with multiple rows representing the same logical window, and your downstream BigQuery consumers — dashboards, aggregation jobs — need to know to deduplicate by taking the latest pane, not sum everything blindly.

This is precisely the same shape of problem as deduplicating a batch table keyed by some natural identifier, just moved earlier in the pipeline and driven by trigger semantics instead of upstream data duplication. The fix pattern is familiar too: write with an explicit window-close timestamp or pane-index as part of the row, and either use MERGE/QUALIFY ROW_NUMBER() in a downstream batch step to collapse to one row per window, or configure Beam to only use the default single, watermark-triggered firing with no early triggers and a DISCARDING accumulation mode if your use case can tolerate a few extra minutes of latency in exchange for a clean one-row-per-window write.

The other collision point is table design. Streaming inserts land in BigQuery's streaming buffer before being finalized into the table's storage, which affects how quickly MERGE and UPDATE operations can target those rows — recently streamed rows aren't immediately mutable. If your architecture depends on correcting late-arriving windows via MERGE, you need to account for that buffer delay, or design the correction step to run on a short delay itself rather than immediately after ingestion.

A practical mental model

The cleanest way to reason about this, especially if most of your instincts come from batch SQL, is:

Batch windowing is a GROUP BY you control completely. You know exactly what data exists before you group it, so the "window" is just a bucketing key.

Streaming windowing is a GROUP BY where you don't control when the group is considered final. The watermark, trigger, and allowed-lateness settings are your negotiation with reality about how long you're willing to wait for a group to be "done" before you report on it — and every time you report early, you're implicitly signing up to correct that report later.

Once that reframe clicks, the BigQuery sink design follows naturally: decide up front whether your downstream consumers need speculative low-latency results (accept multiple writes per window, dedupe downstream) or a single authoritative write per window (accept higher latency, use watermark-only triggering). Trying to get both — low latency and single-write correctness — without a downstream dedup step is the mistake that produces the "why do I have three rows for the same five-minute bucket" debugging session almost everyone runs into eventually.

Takeaway

Windowing in Beam isn't a syntax detail — it's the mechanism by which an unbounded, out-of-order stream gets reshaped into something a bounded, ACID-adjacent system like BigQuery can reason about. The moment you design a Dataflow pipeline that writes to BigQuery, you're implicitly choosing a tradeoff between latency and write-cardinality, whether or not you realize it. Making that choice explicit — and handling it with the same deduplication discipline you'd already apply to a batch table — is what separates a pipeline that "usually works" from one that produces trustworthy numbers under real-world, out-of-order conditions.