Snappy Vs. ZSTD in Apache Iceberg: A Debate on Fast Writes Amid Small-File Challenges

Choosing the correct compression codec is crucial when optimizing Apache Iceberg tables for fast writes and reads. Iceberg often ends up writing many small Parquet files in streaming or incremental data pipelines, raising the question: is it better to use Snappy or ZSTD compression for these initial writes? 

Snappy offers blazing-fast compression/decompression at the cost of larger files, while Zstandard (ZSTD) achieves much smaller files with a heavier compute footprint. This analysis compares Snappy vs. ZSTD across several key dimensions relevant to Iceberg: performance trade-offs (CPU vs. file size), considerations for initial streaming writes (where small files are inevitable), and query efficiency before compaction (when thousands of tiny files must be read). This conversational, debate-style blog aims to offer a balanced conclusion so the Iceberg community can weigh the pros and cons for their specific workloads.

Snappy Vs. ZSTD in Apache Iceberg: A Fast-Writes Debate (When Small Files Rule)

Moderator: Welcome to the debate on parquet compression codecs in Apache Iceberg! We focus on comparing Snappy and ZSTD for fast writes, particularly in scenarios where small files are unavoidable during streaming ingestion. We will hear two perspectives: one advocating for Snappy's speed and the other supporting ZSTD's compression efficiency. Together, they will tackle key questions related to their respective advantages. Let’s dive in!

Compute Vs. File Size: Speed or Compression – Which Accelerates Writes?

Moderator: First up: the classic trade-off of compute vs. file size. Should we prioritize faster compression (Snappy) or smaller output files (ZSTD)? Which actually makes writing to an Iceberg table faster?

Snappy Advocate: “Use CPU sparingly and keep things moving!” The argument is that Snappy’s lightweight compression minimizes CPU overhead, allowing writes to be completed quickly. Snappy became popular in big data systems precisely for its speed. Starting from version 2.1.0, Spark even made Snappy the default Parquet codec to favor performance over file size. Snappy typically provides the best overall read/write throughput among standard codecs at the cost of a lower compression ratio. Snappy might be optimal if your priority is raw write speed, even at the expense of larger files.  

The philosophy is simple: spend less time compressing and more time writing data out. In practice, that means less CPU per MB of data, which can reduce latency for each write batch. The cumulative CPU saved can be significant, especially when dealing with thousands of small file writes. Snappy focuses on high compression and decompression speed rather than maximum compression.

The saved compute can be redirected to other tasks, and you can avoid making your ingest pipeline CPU-bound. After all, the main objective of compression is to spend CPU to reduce I/O, and Snappy chooses to spend minimal CPU, accepting bigger I/O. In many fast-write scenarios, that’s a win. 
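
For concreteness, here's a minimal sketch of what a Snappy-first setup looks like in practice. Iceberg controls the Parquet codec through the write.parquet.compression-codec table property; the catalog, database, and table names below are hypothetical.

```python
# Minimal sketch: an Iceberg table configured for Snappy-compressed Parquet
# writes. Assumes a Spark session with an Iceberg catalog named `demo`;
# the database/table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-ingest").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        payload  STRING,
        ts       TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')
""")

# Every subsequent append writes Snappy-compressed Parquet data files.
spark.sql("INSERT INTO demo.db.events VALUES (1, 'click', current_timestamp())")
```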

ZSTD Advocate: “Shrink those files and save on I/O – it pays off!” The counterpoint is that smaller files can speed up writes in a distributed write scenario, because writing less data often reduces the time spent on storage operations.

Zstandard (ZSTD) offers much higher compression ratios than Snappy. This reduces data size on both disk storage and network transmission. If your primary bottlenecks are network or disk throughput—issues commonly encountered with cloud object storage—then ZSTD's smaller file sizes can help accelerate overall write completion times.

For instance, in a benchmark comparing data conversion from Snappy to ZSTD, the file size decreased by approximately 39%. This significant reduction leads to immediate storage savings and less strain on I/O. Current recommendations often suggest ZSTD because it strikes a good balance: solid read and write performance without sacrificing compression efficiency. In other words, you get decent speed and much smaller files. Unlike old-school gzip, ZSTD is designed for both speed and effective compression, and you can tune its compression level toward faster performance if needed. The key takeaway: compress more aggressively up front to minimize I/O issues later.
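
If you want to sanity-check the ratio gap on your own data, a quick PyArrow comparison like the one below can measure it. The synthetic table is just a stand-in; actual savings depend entirely on how redundant your data is.

```python
# Sketch: write the same data as Snappy- and ZSTD-compressed Parquet and
# compare on-disk sizes. Paths and the sample table are placeholders.
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": list(range(100_000)) * 2,
    "event":   ["click", "view", "purchase", "view"] * 50_000,
})

pq.write_table(table, "/tmp/events_snappy.parquet", compression="snappy")
pq.write_table(table, "/tmp/events_zstd.parquet", compression="zstd")

snappy_size = os.path.getsize("/tmp/events_snappy.parquet")
zstd_size = os.path.getsize("/tmp/events_zstd.parquet")
print(f"snappy: {snappy_size:,} B, zstd: {zstd_size:,} B, "
      f"savings: {1 - zstd_size / snappy_size:.1%}")
```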

Moderator: At this point, one side highlights compute savings (Snappy’s speed) and the other emphasizes I/O savings (ZSTD’s smaller files). It’s the classic CPU vs. bandwidth trade-off in compression. Next, let’s consider what happens in streaming write scenarios.

Initial Streaming Writes: Compress Now or Later?

Moderator: Imagine streaming ingestion into Iceberg, producing lots of small files. Should we compress aggressively upfront (with ZSTD) or use a lightweight approach since we’ll compact those files later anyway?

Snappy Advocate: “In streaming, every millisecond counts – compress later!” During initial ingestion, the primary focus is usually maximizing throughput. When data is continuously generated, such as from Kafka or another streaming source, writing it out with minimal delay is essential. A fast codec like Snappy ensures that each micro-batch or file write completes quickly.

This approach is based on the fact that these small files are temporary; Apache Iceberg will later consolidate them into larger files through a compaction job. Therefore, spending extra CPU resources on highly compressing data that will be rewritten soon doesn’t make sense. Many practitioners use a ‘write fast now, optimize later’ approach in streaming pipelines. 

For instance, one might write all incoming data using Snappy for speed and then periodically run Spark jobs to merge the small files. This compaction process could employ ZSTD or another heavy compression technique on the larger merged files. By deferring the cost of compression to an offline job, the strategy resembles lazy compression, i.e., applying the bare minimum (Snappy) during ingestion while allowing maintenance tasks to handle optimization later. This method effectively keeps ingestion latency low. It’s also important to note that high CPU utilization during the ingestion of small files might delay compute jobs and create back pressure within the streaming ingestion system, especially when dealing with extremely high data volumes from systems like Kafka.
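
As a rough sketch of that pattern, a periodic maintenance job might flip the table’s write codec to ZSTD and then invoke Iceberg’s rewrite_data_files procedure, on the assumption that the rewrite picks up the table’s current write properties. Catalog and table names are placeholders.

```python
# Sketch of the 'write fast now, optimize later' pattern: ingest writes use
# Snappy (configured earlier); a maintenance job switches the codec and
# compacts, so merged files come out ZSTD-compressed.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.compression-codec' = 'zstd'
    )
""")

spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- ~512 MB
    )
""")
```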

For context, Apache Iceberg sets the default Parquet write compression to ZSTD. However, not everyone follows this default for streaming. Spark (since its 2.1.0 release) and Delta Lake, another table format, both default to Snappy to prioritize initial write speed. The underlying assumption is that small, fragmented files don’t require maximum compression because they will be short-lived. The priority is to ingest data rapidly, with the option to recompress during compaction.

ZSTD Advocate: “Compress once, benefit twice – even during streaming writes, it’s feasible.” While many organizations choose to delay compression, a compelling argument exists for using ZSTD from the beginning, even with small streaming writes. Not all organizations can perform compaction frequently; if small files remain uncompressed for hours or days, they consume more storage and degrade query performance.

Compressing upfront with ZSTD helps mitigate these issues. You gain immediate storage savings, which can be significant in a high-volume stream, and you reduce the backlog of data that will need to be rewritten later. Additionally, modern CPUs are quite powerful: running ZSTD at a reasonable level (e.g., level 1 or 3) may not slow writes noticeably. ZSTD can even operate in a faster mode; at its lowest settings, it approaches the speeds of LZ4 and Snappy while achieving better compression, so the ‘compression cost’ can be minimized at the outset. Moreover, the advantages of starting with smaller files are clear: there is less data to upload to cloud storage and less to read back if queries access uncompacted partitions.
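
Iceberg also exposes a compression-level property, so a low-level-ZSTD setup is easy to express; a minimal sketch follows (table name hypothetical):

```python
# Sketch: ZSTD from the start, but at a low compression level so per-write
# CPU cost stays close to Snappy's. ZSTD levels run roughly 1-22; lower is
# faster. Table name is a placeholder.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.compression-codec' = 'zstd',
        'write.parquet.compression-level' = '1'
    )
""")
```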

It’s important to note that Iceberg’s default compression method is already ZSTD, indicating that the designers have faith in its efficiency for most writes. Cloud services are moving in the same direction; for example, Amazon Athena, which supports Iceberg, uses ZSTD as the default compression for Iceberg tables to balance speed and size. This reflects confidence that ZSTD is viable even for continuous data ingestion.

In summary, if your pipeline can accommodate the slight additional CPU usage, adopting ZSTD from the start will keep your data lake efficient and avoid postponing compression tasks. It’s a ‘do it right the first time’ approach.

Moderator: This boils down to ingest speed vs. immediate optimization. Snappy’s camp says, “Write fast now, fix files later,” while ZSTD’s camp says, “Compress now to save time and space down the line.” Both acknowledge Iceberg’s compaction will eventually merge files, but they differ on when to pay the compression cost. Now, what about reading those small files before compaction? That’s our next topic.

Query Performance Before Compaction: CPU Vs. I/O on Many Small Files

Moderator: Until compaction happens, we might have thousands of tiny Parquet files. When querying that un-compacted data, is it better that they were compressed with Snappy or ZSTD? Does Snappy’s speed reduce CPU cost when reading a pile of small files, or do ZSTD’s smaller files improve I/O efficiency?

Snappy Advocate: “Lighter compression means lighter CPU work for queries.” When a query engine such as e6data, Spark, or Trino needs to read many small files, the overhead of opening and closing files and managing metadata is already high. In this context, a lightweight codec like Snappy helps ensure that decompression does not become an additional bottleneck. Snappy’s decompression is extremely fast, typically exceeding 500 MB/s per core in various tests, especially compared to codecs like gzip or LZO.

For example, even if you have 1,000 small files to read, the CPU cost of decompressing them is minimal. Each file decodes quickly, allowing the CPU cycles to be focused on actual query processing tasks, such as filters and aggregations, rather than on decompression algorithms. If these files had been compressed using a heavier codec, the CPU impact for each file could be significantly higher. This added overhead could accumulate across thousands of files, leading to noticeable latency.

Moreover, consider the baseline overhead: a small file might contain only a few MB of data, so the size difference between Snappy and ZSTD amounts to a couple of MB at most, while the difference in decode time under a slower codec can matter far more. Snappy’s design philosophy is particularly beneficial in this scenario, ensuring minimal impact on query performance even when confronted with numerous small file reads.

Many engineers have reported that they prefer Snappy to keep the CPU from being overloaded with decompression work when dealing with many small objects, trading somewhat higher I/O for lower CPU cost. Additionally, if the files are particularly small, they may fit entirely in memory or cache, reducing the I/O penalty; in such cases, minimizing CPU usage becomes even more valuable.

Thus, when reading thousands of Snappy-compressed files, there is significant I/O activity with very little decompression work required. Since query engines often utilize multiple CPU cores, keeping each core engaged in lightweight decompression can help avoid stragglers during parallel processing. Therefore, before file compaction, using Snappy can help mitigate the challenges posed by the ‘small files problem.’ While it’s impossible to eliminate the need for multiple file openings, each file can be quickly decoded, making Snappy a practical choice.

ZSTD Advocate: “Fewer bytes read = faster scans – even with more CPU.” Focusing on I/O efficiency is crucial with numerous small files, since reading many headers and footers already amplifies I/O and network overhead. ZSTD’s smaller files reduce the total data transferred from disk or S3. For instance, scanning 10,000 files averaging 5 MB each means reading ~50 GB with Snappy versus roughly ~30 GB with ZSTD, a saving that often outweighs the extra decompression cost. Modern CPUs decompress ZSTD efficiently, often approaching Snappy speeds.
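
To make that trade-off concrete, here’s a toy back-of-envelope model; every number in it is an illustrative assumption, not a measurement, and it ignores the overlap of I/O and compute that real engines achieve.

```python
# Toy model of scan cost: time to read the bytes vs. time to decompress
# them. All constants are illustrative assumptions; decode throughput is
# applied to compressed bytes as a simplification.
NUM_FILES = 10_000
SNAPPY_MB, ZSTD_MB = 5.0, 3.0                 # assumed avg file size (MB)
NET_MBPS = 2_000.0                            # assumed aggregate read bandwidth
SNAPPY_DECODE, ZSTD_DECODE = 1_500.0, 800.0   # assumed decompress MB/s per core

for name, size_mb, decode in [("snappy", SNAPPY_MB, SNAPPY_DECODE),
                              ("zstd", ZSTD_MB, ZSTD_DECODE)]:
    total_mb = NUM_FILES * size_mb
    io_s = total_mb / NET_MBPS                # time spent moving bytes
    cpu_s = total_mb / decode                 # time spent decompressing
    print(f"{name}: {total_mb / 1024:.1f} GB read, "
          f"I/O {io_s:.0f}s, decompress {cpu_s:.0f}s")
```

Under these made-up numbers, ZSTD reads 20 GB less at the cost of more decode time per byte; whichever side dominates in your cluster decides the winner.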

Interactive query engines such as e6data, Trino, or Presto benefit from smaller compressed files because they spend their time on CPU work rather than waiting for data. Ultimately, ZSTD’s reduced I/O tends to outweigh its CPU overhead, making data scans more efficient.

Moderator: Great points. Snappy shines when CPU is at a premium (and I/O is maybe plentiful), whereas ZSTD shines when I/O is the bottleneck (CPU can be spent to save bytes). In real clusters, it often depends on whether your queries are CPU-bound or I/O-bound. Iceberg users have seen both scenarios: some favor Snappy to cut CPU costs for reading many small files, and others report better performance by reducing bytes scanned.

Conclusion

In this debate, we’ve heard that Snappy prioritizes raw speed and low CPU usage, making it great for fast writes and quick reads at the expense of larger file sizes. ZSTD, on the other hand, offers far smaller files and can improve overall I/O efficiency and query performance at the cost of more compression work per write.

Ultimately, there is no one-size-fits-all answer – it comes down to your specific workload and priorities. It’s worth experimenting at your own scale to determine whether CPU is slowing down your writes or the network is: the critical bottleneck varies by environment and workload. Running your own benchmarks on your own data will give you the insight needed to optimize effectively.
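
As a starting point, even a small single-machine micro-benchmark can be revealing. The sketch below uses PyArrow on a synthetic stand-in table; swap in a representative slice of your own data, since results vary widely with schema and redundancy.

```python
# Sketch: time Parquet write/read and compare sizes under each codec.
# The sample table is a placeholder for a representative slice of real data.
import io
import time
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"v": list(range(1_000_000))})  # stand-in for real data

for codec in ("snappy", "zstd"):
    buf = io.BytesIO()
    t0 = time.perf_counter()
    pq.write_table(table, buf, compression=codec)
    write_s = time.perf_counter() - t0

    data = buf.getvalue()
    t0 = time.perf_counter()
    pq.read_table(io.BytesIO(data))
    read_s = time.perf_counter() - t0

    print(f"{codec}: {len(data) / 1e6:.1f} MB, "
          f"write {write_s:.2f}s, read {read_s:.2f}s")
```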

The debate remains open-ended, and I invite the Iceberg community (engineers, data architects, and enthusiasts) to weigh in. What’s your experience been? Have you found Snappy a lifesaver for streaming writes, or has ZSTD delivered better overall performance for your queries? Perhaps you have benchmark results or case studies from your company’s Iceberg deployment. Share your insights!

References:

  1. AWS Iceberg Best Practices – Compression Codec Guidance
  2. Uber Engineering – Parquet Compression Benchmark (Snappy vs. ZSTD)
  3. Snappy vs. Zstd for Parquet in PyArrow – DEV Community
  4. Compression Methods in MongoDB: Snappy vs. Zstd
  5. Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files
