
The Lakehouse Evolution: Making Object Storage the Backbone of Modern Analytics

By Karthic Rao and Vishnu Vasanth on 12 Nov 2024

Data analysts using lakehouses to analyze large volumes of data

Lakehouses and Modern Data Analytics

A lakehouse is built on what is often called an open table format, such as Apache Iceberg. An open table format is a specification that helps achieve two key things:

1. Representing a table using data and metadata files.
2. Performing operations on this table while maintaining consistency and data integrity.

But hold on: haven't we always known how to represent tables inside data systems, and haven't these systems provided transactional guarantees to perform operations safely? So why do we need these new, seemingly complex specifications like Apache Iceberg, Delta Lake, or Hudi?

It all comes down to one crucial shift: with open lakehouses, the underlying storage has moved to object storage. Unlike traditional databases, object stores are designed to store unstructured data; on their own, they cannot represent structured, tabular data or offer transactional guarantees when operating on such data. But why are we asking object storage systems, built to handle unstructured data like files, documents, and videos, to store tabular, structured information in the first place? Why would you want to do that? Let's explore the thought process that led to this novel idea of retrofitting structured table data, with transactional guarantees, into object storage, the idea that gave us the world of lakehouses.
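
To make this more concrete, here is a minimal sketch of an open table format in action, using PySpark with the Apache Iceberg runtime. The catalog name, warehouse path, and school-attendance table are illustrative assumptions (the attendance example anticipates the ledger analogy later in this post), not something prescribed by the Iceberg specification itself.

```python
# A minimal sketch, assuming a Spark installation with the Iceberg runtime JAR
# and S3 filesystem libraries available; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-table-format-sketch")
    # Register a Hadoop-style Iceberg catalog named "demo" backed by object storage.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# The table is nothing more than data files plus metadata files under the warehouse path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.school.attendance (
        student_id BIGINT,
        class_date DATE,
        present    BOOLEAN
    ) USING iceberg
""")

# Each write adds new data files and commits a new metadata snapshot atomically.
spark.sql("""
    INSERT INTO demo.school.attendance
    VALUES (1, DATE '2024-11-12', true), (2, DATE '2024-11-12', false)
""")
```

Under the warehouse path, Iceberg lays the table out as a data directory of Parquet files and a metadata directory of table metadata, manifest lists, and manifest files; that layered metadata is what turns a pile of objects into a table.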

The Shift from Ledgers to Databases

Let's go back in time to understand why the concept of lakehouses became necessary. Do you remember the ledger in our school classrooms in the 1980s or early 1990s, where teachers recorded attendance? Each student's name was listed, and attendance was marked every day to indicate whether a student was present or absent. These ledgers were the bedrock of record-keeping, allowing teachers to store information and perform basic analytics. The structure was inherently tabular, with rows representing students and columns for attendance. The concept of a schema was already present, defining what data to record and how to organize it. This made the data easy to manage, understand, and analyze, as the mental model of the table format was intuitive for those maintaining the records.

So, you see, the concept of a table and its associated schema existed even in manual record-keeping. It allowed for structured storage and effective analytics. In this case, consistency-related issues rarely occurred because only one person, the teacher, managed the ledger, ensuring accurate and conflict-free updates. The schema represented just one table, and operations on the data were sequential and managed by a single person, making consistency easy to achieve. Now, let's move into the digital age, where these systems of record-keeping and analytics transitioned to databases.

The Emergence of a Database World

Unlike manual ledgers, databases had to ensure consistency despite multiple users reading and writing data simultaneously, often across various tables. This required carefully maintaining data integrity. To achieve it, databases needed to provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees, ensuring that data remained accurate and reliable in a multi-user environment. Transactional databases like Postgres handled this safe record-keeping and querying exceptionally well, even in highly concurrent environments. While this approach was practical for record-keeping, databases struggled with large-scale analytics as data volumes grew. Due to limitations in scalability and performance, analytical queries on transactional databases became increasingly difficult. This led to the emergence of new solutions tailored specifically for analytics.
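
As a quick illustration of what those guarantees mean in practice, here is a minimal sketch of a multi-statement transaction against Postgres using psycopg2. The attendance table, its columns, and the connection string are illustrative assumptions carried over from the ledger analogy, not details from a real system.

```python
# A minimal sketch of an atomic, multi-statement update in Postgres via psycopg2.
# The database, table, and credentials are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=school user=teacher password=secret host=localhost")
try:
    with conn:  # one atomic transaction: committed on success, rolled back on error
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE attendance SET present = %s "
                "WHERE student_id = %s AND class_date = %s",
                (True, 1, "2024-11-12"),
            )
            cur.execute(
                "UPDATE attendance SET present = %s "
                "WHERE student_id = %s AND class_date = %s",
                (False, 2, "2024-11-12"),
            )
    # Concurrent readers either see both updates or neither, never a half-applied state.
finally:
    conn.close()
```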

The Rise and Stagnation of Data Warehouses

Data warehouses are specialized systems that store vast amounts of structured data and efficiently perform complex queries. These systems were designed from the ground up to be efficient at analytical workloads. The data world entered an era where traditional databases handled typical transactional workloads while analytical workloads moved to warehouses. However, it was only a short time before warehouses hit their ceiling as data volumes exploded. Despite being purpose-built to handle analytical workloads on large volumes of data, warehouses could not keep up with the unprecedented rate of data growth within organizations. The cost of scaling them became increasingly high, and they faced limitations in efficiently handling the sheer volume and variety of modern data workloads.

Object Storage Systems and Data Lakes

Scaling and cost concerns were becoming evident across the data analytics world. Object stores like S3 and MinIO were gaining prominence as the de facto destination for storing petabytes or even exabytes of data cheaply and reliably. It made sense to separate the storage layer, which holds the data, from a compute layer that performs analytics and needs significant computational capacity to process large amounts of data. While these layers were tightly coupled in databases and data warehouses, a new architecture emerged in which compute and storage were decoupled.

Hence the concept of the data lake was born, with object storage becoming the primary storage layer as a cheap, reliable, and scalable alternative. Data lakes are built to store all kinds of structured, semi-structured, and unstructured data in a centralized object storage system. This storage layer acts as a flexible repository, allowing data to be ingested in its raw form without imposing a strict schema upfront. By decoupling storage and compute, data lakes let specialized compute engines process and analyze the data, providing scalability and cost-efficiency.

Despite these advantages, however, data lakes struggle with structured data, which needs additional abstraction and management before compute engines can query it efficiently with consistency and integrity. Object storage provides scale and rich API-based access for reading and writing objects, but it was never designed to understand structured data, schemas, or tabular formats. Object storage systems exist to store unstructured data, such as files, documents, and videos, with no inherent notion of tables or schema definitions, and they cannot enforce transactional guarantees like ACID. As a result, it is hard to modify table data safely while maintaining consistency and integrity. Yet adopting object storage as the storage layer was inevitable, and that inevitability drove the innovation needed to manage and analyze structured data effectively on top of it. That's how lakehouses were born.
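
To see why this is hard, here is a minimal sketch of what object storage actually exposes, using boto3 against S3; the bucket and key names are illustrative assumptions. The API deals only in opaque objects: it will happily store bytes that happen to be a Parquet file, but it has no notion of the table those bytes belong to, its schema, or a transaction spanning several objects.

```python
# A minimal sketch of raw object storage access with boto3.
# Bucket, key, and file names are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

# To object storage, a Parquet file is just a blob of bytes under a key.
with open("attendance-2024-11-12.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-data-lake",
        Key="attendance/data/attendance-2024-11-12.parquet",
        Body=f,
    )

# Reading it back is equally schema-unaware: you get bytes, not rows.
obj = s3.get_object(
    Bucket="my-data-lake",
    Key="attendance/data/attendance-2024-11-12.parquet",
)
raw_bytes = obj["Body"].read()

# There is no API for "update these rows" or "commit these two files together";
# any table semantics have to be layered on top by something else.
```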

The Rise of Lakehouses

A lakehouse is a specification, or a set of rules, that helps represent tabular data using data and metadata files and safely manipulate that table data concurrently, with full ACID guarantees. It bridges the gap between the scalability of data lakes and the reliability of data warehouses. A key reason for moving to object storage was that warehouses and databases struggled to analyze tabular data at massive scale. Lakehouses provide table abstractions and ACID guarantees, and they introduce features like partitioning to help distributed compute engines analyze data efficiently at scale.

While the lakehouse specification defines how data should be stored and safely manipulated, it is the job of compute engines like Spark, E6Data, and Trino to perform analytics on the data. The specification focuses on building table abstractions while maintaining data consistency and integrity; it does not define how to analyze the data. However, many of its design primitives, like partitions, exist to make compute engines more efficient at data analysis. After all, the goal of moving to object storage was never just to store tabular data or support transactions, but to enable better analytics at scale.
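
Continuing the earlier PySpark sketch, here is how partitioning might look on an Iceberg table; the partition transform, table names, and query are illustrative assumptions, and the `spark` session with the `demo` catalog is assumed to be configured as in the first example. The table format records the partition layout in metadata, and the compute engine (Spark here, though Trino and others work the same way) uses that metadata to prune files instead of scanning everything.

```python
# A minimal sketch of partitioning, assuming the SparkSession "spark" and the
# "demo" Iceberg catalog configured in the earlier example; names are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.school.attendance_by_month (
        student_id BIGINT,
        class_date DATE,
        present    BOOLEAN
    ) USING iceberg
    PARTITIONED BY (months(class_date))
""")

# Rows land in partitions derived from class_date, recorded in table metadata.
spark.sql("""
    INSERT INTO demo.school.attendance_by_month
    SELECT student_id, class_date, present FROM demo.school.attendance
""")

# A filter on the partition column lets the engine skip whole partitions by
# consulting Iceberg metadata rather than opening every data file.
absences = spark.sql("""
    SELECT class_date, count(*) AS absences
    FROM demo.school.attendance_by_month
    WHERE class_date >= DATE '2024-11-01' AND present = false
    GROUP BY class_date
""")
absences.show()
```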