Fast ingestion for near real-time querying
Data analytics, as we know it, is undergoing a seismic shift. SQL query engines traditionally designed for batch workloads on Lakehouse architectures like Apache Iceberg face unprecedented pressure to deliver near real-time querying capabilities with strict Service Level Agreements (SLAs).
What once took a day (processing and querying large datasets in batches) has evolved to hourly updates. Now, businesses expect data to be queryable within minutes of its generation at the source. This evolution from leisurely batch processing to near-instantaneous insights introduces challenges for systems streaming data into Iceberg tables.
This isn’t about merely speeding things up; it requires fundamentally rethinking ingestion strategies, with the queryability SLA as the final non-negotiable goal. With that said, let’s explore these challenges and arrive at innovative solutions to meet the demands of real-time analytics on Apache Iceberg tables.
Historically, batch query systems operated on relaxed timelines; daily reports were the norm. As organizations sought fresher insights, this cadence tightened to hourly. Today, the expectation has escalated dramatically: data needs to be queryable within minutes of its event time (when it happened). Consider the demands this places on ingestion pipelines.
This requirement for insights mere minutes after event generation clashes fiercely with traditional ingestion and query patterns designed for batch efficiency.
Transitioning Iceberg from its batch-optimized roots to handling near real-time query demands isn’t a simple tweak. It introduces fundamental challenges for streaming ingestion systems:
Data rarely arrives smoothly; it often comes in unpredictable bursts. Imagine an e-commerce platform during a flash sale: order events spike dramatically. Traditional ingestion systems rely on fixed commit intervals (e.g., every 5 minutes or 1 million records) using internal barriers to synchronize and create new Iceberg snapshots. During a spike, this rigidity causes a backlog. Events queue up faster than they can be committed, and the event time of the data waiting gets increasingly stale compared to the commit time. Fixed intervals cannot adapt to this variability, leading to missed SLAs.
Example Scenario: An e-commerce site typically handles 100 orders/minute, which is fine for a 5-minute commit interval. A sale triggers 1,000 orders/minute. The fixed interval can’t keep up; the queue grows, and data generated at 10:01 AM might not be queryable until 10:05 AM or later, breaking a “within minutes” SLA.
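The staleness floor imposed by a fixed interval can be sketched with back-of-the-envelope arithmetic. The function below is illustrative only (the 5-minute interval and the 10:01 AM order come from the example above): with fixed-interval commits, an event's wall-clock wait is bounded below by the time remaining until the next commit boundary, regardless of load.

```python
import math

def minutes_until_next_commit(event_minute: float, interval: float = 5.0) -> float:
    """Wall-clock minutes an event waits for the next fixed-interval commit.

    Commits fire at interval multiples (5, 10, 15, ...); an event arriving
    between boundaries cannot become queryable before the next one.
    """
    return interval * math.ceil(event_minute / interval) - event_minute

# The 10:01 AM order (minute 1 of the interval) waits until the 10:05 commit:
print(minutes_until_next_commit(1))  # → 4.0
# An order at 10:04 waits only 1 minute, but during a spike the queue
# behind each commit keeps growing, pushing real staleness well past this floor.
print(minutes_until_next_commit(4))  # → 1.0
```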
Systems might commit very frequently (e.g., every minute) to lower latency. This creates a rapid succession of snapshots, leading to multiple issues: manifest and metadata bloat, a proliferation of small data files, slower query planning, and constant compaction pressure.
Meeting these challenges requires a shift toward specialized, intelligent ingestion systems for Iceberg. Here are the key innovations:
The Idea: Associate each Iceberg snapshot with an explicit event-time watermark – metadata guaranteeing that all events with an event time less than or equal to this watermark (W) are included in this snapshot or earlier. Unlike watermarks in streaming engines (which pause computation waiting for late data), this Iceberg watermark can be applied retroactively by the ingestion system. Unlike processing-time watermarks, which reflect when data is processed, event-time watermarks ensure all events up to a specific time are captured, regardless of ingestion delays.
Implementation: The ingestion system tracks event times. Upon committing a snapshot, it calculates and stores this watermark W alongside the snapshot metadata, potentially using Puffin files, dedicated metadata fields, or an external metastore.
How It Works: A query needing data up to 10:00 AM event time can now ask the Iceberg metadata: “What’s the earliest snapshot with a watermark >= 10:00 AM?”. The engine can then target this specific snapshot (or an earlier one), knowing it contains all necessary historical data without including irrelevant newer data. This enables precise, efficient time travel based on event time.
Example: An IoT system commits snapshot S5 at 10:05 AM processing time but calculates its event-time watermark as 10:00 AM. A query for sensor readings strictly before 9:55 AM can confidently use an earlier snapshot (e.g., S4 with watermark 9:55 AM) identified via its watermark, avoiding the larger, latest snapshot S5.
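The watermark mechanics above can be sketched in a few lines. This is a minimal, self-contained model, not real Iceberg API calls: the `Snapshot` dataclass and the in-memory snapshot list stand in for snapshot metadata that could actually live in snapshot summary properties, Puffin files, or an external metastore.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Snapshot:
    snapshot_id: int
    commit_time: datetime   # processing time when the snapshot was committed
    watermark: datetime     # guarantee: all events with event_time <= watermark are included

def earliest_snapshot_covering(snapshots: list[Snapshot],
                               event_time: datetime) -> Optional[Snapshot]:
    """Earliest snapshot whose watermark guarantees coverage up to `event_time`."""
    covering = [s for s in snapshots if s.watermark >= event_time]
    return min(covering, key=lambda s: s.snapshot_id) if covering else None

# The IoT example from the text: S5 committed at 10:05 AM carries watermark
# 10:00 AM; S4 carries watermark 9:55 AM. A query needing data up to 9:55 AM
# resolves to S4 and avoids scanning the larger S5.
s4 = Snapshot(4, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 9, 55))
s5 = Snapshot(5, datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 0))
hit = earliest_snapshot_covering([s4, s5], datetime(2024, 1, 1, 9, 55))
print(hit.snapshot_id)  # → 4
```

The key design point is that the watermark is a per-snapshot guarantee written at commit time, so the query engine can resolve event-time predicates purely from metadata, without scanning data files.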
The Idea: Replace rigid, fixed commit intervals with a dynamic approach driven by the target SLA. The ingestion system must constantly monitor the lag between the latest committed event-time watermark and the event times of data currently arriving. The system could monitor metrics like maximum event-time lag across partitions, trigger more frequent commits, or scale resources when lag exceeds the SLA threshold.
Implementation: If this lag exceeds the SLA threshold (e.g., > 5 minutes), the system must react, either by committing more frequently or by scaling ingestion resources.
How it Works: This creates a responsive, SLA-driven ingestion paradigm. During the e-commerce flash sale spike, the system detects the growing lag and automatically shifts from a 5-minute to a 30-second commit interval, scaling workers as needed. As traffic normalizes, it reverts, balancing freshness with efficiency. Dynamic commits may demand more compute during spikes, requiring careful resource management to maintain cost efficiency.
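A minimal sketch of such an SLA-driven commit policy follows. The thresholds and intervals are illustrative (the 5-minute SLA and 30-second burst interval echo the flash-sale example above), not values from a real product; a production controller would likely also factor in small-file cost and available workers.

```python
def choose_commit_interval(event_time_lag_s: float,
                           sla_s: float = 300.0,
                           normal_interval_s: float = 300.0,
                           burst_interval_s: float = 30.0) -> float:
    """Pick the next commit interval from the observed event-time lag.

    `event_time_lag_s` = latest arriving event time minus the latest
    committed event-time watermark, in seconds.
    """
    if event_time_lag_s >= sla_s:           # SLA breached: commit as fast as allowed
        return burst_interval_s
    if event_time_lag_s >= 0.5 * sla_s:     # lag building: tighten the cadence
        return max(burst_interval_s, normal_interval_s / 4)
    return normal_interval_s                # healthy: keep the efficient cadence

print(choose_commit_interval(60))    # → 300.0  (normal traffic)
print(choose_commit_interval(200))   # → 75.0   (lag building up)
print(choose_commit_interval(400))   # → 30.0   (flash-sale burst)
```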
The Idea: Query engines need Parquet file statistics (min/max, null counts, etc.) for optimization. Calculating these after the snapshot commit introduces a delay. Near real-time ingestion requires calculating these statistics during the data-writing process.
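A hedged sketch of this inline approach: accumulate column statistics as micro-batches stream through the writer, so they are ready to publish at commit time instead of requiring a second pass over the files. The `RunningColumnStats` class is a hypothetical stand-in for the per-column stats a Parquet writer maintains.

```python
class RunningColumnStats:
    """Accumulates min/max and null counts for one column across micro-batches."""

    def __init__(self):
        self.min = None
        self.max = None
        self.null_count = 0
        self.value_count = 0

    def update(self, values):
        for v in values:
            self.value_count += 1
            if v is None:
                self.null_count += 1
                continue
            self.min = v if self.min is None else min(self.min, v)
            self.max = v if self.max is None else max(self.max, v)

# Statistics stay current as each micro-batch is written, so the commit
# can attach them immediately with no post-commit scan.
stats = RunningColumnStats()
stats.update([3, None, 7])   # first micro-batch
stats.update([1, 9])         # second micro-batch
print(stats.min, stats.max, stats.null_count)  # → 1 9 1
```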
The essence of these solutions points to a clear need: achieving genuine real-time queryability on Iceberg requires specialized ingestion systems, architected to commit dynamically against freshness SLAs, embed event-time guarantees in snapshot metadata, and compute file statistics inline during writes.
The intensifying pressure for near real-time insights from Iceberg lakehouses is undeniable. While challenges like bursty traffic, snapshot overhead, and the lack of event-time context are significant, they are solvable. By embracing dynamic, SLA-driven commits, embedding event-time guarantees within snapshots, and calculating metadata proactively, we can bridge the gap between Iceberg’s batch-oriented heritage and the demands of real-time analytics.
This evolution requires purpose-built ingestion systems, heralding a new era for lakehouses—one where Apache Iceberg fully delivers on the promise of instant, reliable, and time-precise insights.
As real-time demands grow, how will your ingestion pipeline adapt to keep Iceberg queryable within minutes? We would love to hear how your organization has navigated these challenges and to brainstorm solutions together.