Fast ingestion for near real-time querying
Data analytics, as we know it, is undergoing a seismic shift. SQL query engines traditionally designed for batch workloads on Lakehouse architectures like Apache Iceberg face unprecedented pressure to deliver near real-time querying capabilities with strict Service Level Agreements (SLAs).
What once took a day (processing and querying large datasets in batches) has evolved to hourly updates. Now, businesses expect data to be queryable within minutes of its generation at the source. This evolution from leisurely batch processing to near-instantaneous insights introduces challenges for systems streaming data into Iceberg tables.
This isn’t about merely speeding things up; it requires fundamentally rethinking ingestion strategies, with the queryability SLA as the final non-negotiable goal. With that said, let’s explore these challenges and arrive at innovative solutions to meet the demands of real-time analytics on Apache Iceberg tables.
Historically, batch query systems operated on relaxed timelines; daily reports were the norm. As organizations sought fresher insights, this cadence tightened to hourly. Today, the expectation has escalated dramatically: data needs to be queryable within minutes of its event time (when it happened). Consider the demands this places on ingestion pipelines.
This requirement for insights mere minutes after event generation clashes fiercely with traditional ingestion and query patterns designed for batch efficiency.
Transitioning Iceberg from its batch-optimized roots to handling near real-time query demands isn’t a simple tweak. It introduces fundamental challenges for streaming ingestion systems:
Data rarely arrives smoothly; it often comes in unpredictable bursts. Imagine an e-commerce platform during a flash sale: order events spike dramatically. Traditional ingestion systems rely on fixed commit intervals (e.g., every 5 minutes or 1 million records) using internal barriers to synchronize and create new Iceberg snapshots. During a spike, this rigidity causes a backlog. Events queue up faster than they can be committed, and the event time of the data waiting gets increasingly stale compared to the commit time. Fixed intervals cannot adapt to this variability, leading to missed SLAs.
Example Scenario: An e-commerce site typically handles 100 orders/minute, which is fine for a 5-minute commit interval. A sale triggers 1,000 orders/minute. The fixed interval can’t keep up; the queue grows, and data generated at 10:01 AM might not be queryable until 10:05 AM or later, breaking a “within minutes” SLA.
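The staleness floor imposed by a fixed interval can be sketched with back-of-the-envelope arithmetic. The function below is illustrative only (the 5-minute interval and the 10:01 AM order come from the example above): with fixed-interval commits, an event's wall-clock wait is bounded below by the time remaining until the next commit boundary, regardless of load.

```python
import math

def minutes_until_next_commit(event_minute: float, interval: float = 5.0) -> float:
    """Wall-clock minutes an event waits for the next fixed-interval commit.

    Commits fire at interval multiples (5, 10, 15, ...); an event arriving
    between boundaries cannot become queryable before the next one.
    """
    return interval * math.ceil(event_minute / interval) - event_minute

# The 10:01 AM order (minute 1 of the interval) waits until the 10:05 commit:
print(minutes_until_next_commit(1))  # → 4.0
# An order at 10:04 waits only 1 minute, but during a spike the queue
# behind each commit keeps growing, pushing real staleness well past this floor.
print(minutes_until_next_commit(4))  # → 1.0
```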
Systems might commit very frequently (e.g., every minute) to lower latency. This creates a rapid succession of snapshots, leading to multiple issues: manifest and metadata bloat, a proliferation of small data files, slower query planning, and constant compaction pressure.
Meeting these challenges requires a shift toward specialized, intelligent ingestion systems for Iceberg. Here are the key innovations:
The Idea: Associate each Iceberg snapshot with an explicit event-time watermark – metadata guaranteeing that all events with an event time less than or equal to this watermark (W) are included in this snapshot or earlier. Unlike watermarks in streaming engines (which pause computation waiting for late data), this Iceberg watermark can be applied retroactively by the ingestion system. Unlike processing-time watermarks, which reflect when data is processed, event-time watermarks ensure all events up to a specific time are captured, regardless of ingestion delays.
Implementation: The ingestion system tracks event times. Upon committing a snapshot, it calculates and stores this watermark W alongside the snapshot metadata, potentially using Puffin files, dedicated metadata fields, or an external metastore.
How It Works: A query needing data up to 10:00 AM event time can now ask the Iceberg metadata: “What’s the earliest snapshot with a watermark >= 10:00 AM?”. The engine can then target this specific snapshot (or an earlier one), knowing it contains all necessary historical data without including irrelevant newer data. This enables precise, efficient time travel based on event time.
Example: An IoT system commits snapshot S5 at 10:05 AM processing time but calculates its event-time watermark as 10:00 AM. A query for sensor readings strictly before 9:55 AM can confidently use an earlier snapshot (e.g., S4 with watermark 9:55 AM) identified via its watermark, avoiding the larger, latest snapshot S5.
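The watermark mechanics above can be sketched in a few lines. This is a minimal, self-contained model, not real Iceberg API calls: the `Snapshot` dataclass and the in-memory snapshot list stand in for snapshot metadata that could actually live in snapshot summary properties, Puffin files, or an external metastore.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Snapshot:
    snapshot_id: int
    commit_time: datetime   # processing time when the snapshot was committed
    watermark: datetime     # guarantee: all events with event_time <= watermark are included

def earliest_snapshot_covering(snapshots: list[Snapshot],
                               event_time: datetime) -> Optional[Snapshot]:
    """Earliest snapshot whose watermark guarantees coverage up to `event_time`."""
    covering = [s for s in snapshots if s.watermark >= event_time]
    return min(covering, key=lambda s: s.snapshot_id) if covering else None

# The IoT example from the text: S5 committed at 10:05 AM carries watermark
# 10:00 AM; S4 carries watermark 9:55 AM. A query needing data up to 9:55 AM
# resolves to S4 and avoids scanning the larger S5.
s4 = Snapshot(4, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 9, 55))
s5 = Snapshot(5, datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 0))
hit = earliest_snapshot_covering([s4, s5], datetime(2024, 1, 1, 9, 55))
print(hit.snapshot_id)  # → 4
```

The key design point is that the watermark is a per-snapshot guarantee written at commit time, so the query engine can resolve event-time predicates purely from metadata, without scanning data files.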
The Idea: Replace rigid, fixed commit intervals with a dynamic approach driven by the target SLA. The ingestion system must constantly monitor the lag between the latest committed event-time watermark and the event times of data currently arriving. The system could monitor metrics like maximum event-time lag across partitions, trigger more frequent commits, or scale resources when lag exceeds the SLA threshold.
Implementation: If this lag exceeds the SLA threshold (e.g., > 5 minutes), the system must react, either by committing more frequently or by scaling ingestion resources.
How it Works: This creates a responsive, SLA-driven ingestion paradigm. During the e-commerce flash sale spike, the system detects the growing lag and automatically shifts from a 5-minute to a 30-second commit interval, scaling workers as needed. As traffic normalizes, it reverts, balancing freshness with efficiency. Dynamic commits may demand more compute during spikes, requiring careful resource management to maintain cost efficiency.
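A minimal sketch of such an SLA-driven commit policy follows. The thresholds and intervals are illustrative (the 5-minute SLA and 30-second burst interval echo the flash-sale example above), not values from a real product; a production controller would likely also factor in small-file cost and available workers.

```python
def choose_commit_interval(event_time_lag_s: float,
                           sla_s: float = 300.0,
                           normal_interval_s: float = 300.0,
                           burst_interval_s: float = 30.0) -> float:
    """Pick the next commit interval from the observed event-time lag.

    `event_time_lag_s` = latest arriving event time minus the latest
    committed event-time watermark, in seconds.
    """
    if event_time_lag_s >= sla_s:           # SLA breached: commit as fast as allowed
        return burst_interval_s
    if event_time_lag_s >= 0.5 * sla_s:     # lag building: tighten the cadence
        return max(burst_interval_s, normal_interval_s / 4)
    return normal_interval_s                # healthy: keep the efficient cadence

print(choose_commit_interval(60))    # → 300.0  (normal traffic)
print(choose_commit_interval(200))   # → 75.0   (lag building up)
print(choose_commit_interval(400))   # → 30.0   (flash-sale burst)
```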
The Idea: Query engines need Parquet file statistics (min/max, null counts, etc.) for optimization. Calculating these after the snapshot commit introduces a delay. Near real-time ingestion requires calculating these statistics during the data-writing process.
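A hedged sketch of this inline approach: accumulate column statistics as micro-batches stream through the writer, so they are ready to publish at commit time instead of requiring a second pass over the files. The `RunningColumnStats` class is a hypothetical stand-in for the per-column stats a Parquet writer maintains.

```python
class RunningColumnStats:
    """Accumulates min/max and null counts for one column across micro-batches."""

    def __init__(self):
        self.min = None
        self.max = None
        self.null_count = 0
        self.value_count = 0

    def update(self, values):
        for v in values:
            self.value_count += 1
            if v is None:
                self.null_count += 1
                continue
            self.min = v if self.min is None else min(self.min, v)
            self.max = v if self.max is None else max(self.max, v)

# Statistics stay current as each micro-batch is written, so the commit
# can attach them immediately with no post-commit scan.
stats = RunningColumnStats()
stats.update([3, None, 7])   # first micro-batch
stats.update([1, 9])         # second micro-batch
print(stats.min, stats.max, stats.null_count)  # → 1 9 1
```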
The essence of these solutions points to a clear need: achieving genuine real-time queryability on Iceberg requires specialized ingestion systems, architected to commit dynamically against freshness SLAs, embed event-time guarantees in snapshot metadata, and compute file statistics inline during writes.
The intensifying pressure for near real-time insights from Iceberg lakehouses is undeniable. While challenges like bursty traffic, snapshot overhead, and the lack of event-time context are significant, they are solvable. By embracing dynamic, SLA-driven commits, embedding event-time guarantees within snapshots, and calculating metadata proactively, we can bridge the gap between Iceberg’s batch-oriented heritage and the demands of real-time analytics.
This evolution requires purpose-built ingestion systems, heralding a new era for lakehouses—one where Apache Iceberg fully delivers on the promise of instant, reliable, and time-precise insights.
As real-time demands grow, how will your ingestion pipeline adapt to keep Iceberg queryable within minutes? We would love to hear how your organization has navigated these challenges and to brainstorm solutions together.