DuckDB's DuckLake Proposal: A Radical Rethink of Lakehouse Architecture and Industry Reactions
Excitement over DuckLake, but momentum stays with Iceberg as AWS and Snowflake weigh in
The landscape of data architecture, particularly the burgeoning field of data lakehouses, is rarely static. A year after Databricks, creator of Delta Lake, made its headline acquisition of Tabular, the company founded by the creators of the Apache Iceberg table format, a new proposal has emerged to challenge the established norms. DuckDB, known for its fast, in-process analytics database, has introduced 'DuckLake', a table format and architectural approach that has ignited debate and drawn reactions from major players like AWS and Snowflake.
For years, the industry has grappled with bridging the gap between data lakes (vast, unstructured or semi-structured data stored cheaply in formats like Parquet or ORC on cloud storage like S3) and data warehouses (structured data in proprietary systems optimized for SQL analytics). The lakehouse architecture emerged as a promising hybrid, aiming to bring the reliability, performance, and management features of data warehouses (like ACID transactions, schema enforcement, and time travel) to the cost-effectiveness and flexibility of data lakes. This is primarily achieved through 'Open Table Formats' (OTFs) such as Delta Lake (originally from Databricks) and Apache Iceberg (developed at Netflix, now an Apache project with significant backing from companies like Snowflake and Apple).
These formats work by adding a metadata layer on top of the data files stored in cloud storage. This metadata, often stored as manifest files or transaction logs alongside the data, tracks which data files belong to a table, manages schema evolution, handles partitioning, and enables features like time travel and ACID compliance. Query engines then read this metadata to understand the table structure and efficiently plan queries, avoiding the need to scan entire directories of files.
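To make that concrete, here is a minimal sketch of a query engine consuming that metadata layer, using DuckDB's iceberg extension from Python. The table path is a placeholder, S3 credential setup is omitted, and the extension's function names may vary between versions:

```python
import duckdb

# In-process DuckDB with the Iceberg reader loaded (httpfs/credential setup
# for real S3 access is omitted here).
con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Placeholder path to an Iceberg table laid out in object storage.
table_path = "s3://my-bucket/warehouse/events"

# The engine reads the table's metadata (snapshots, manifests) to plan the
# scan, rather than listing every file under the prefix.
snapshots = con.execute(
    f"SELECT * FROM iceberg_snapshots('{table_path}')"
).fetchall()

row_count = con.execute(
    f"SELECT count(*) FROM iceberg_scan('{table_path}')"
).fetchone()[0]
```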
Databricks' acquisition of Tabular in 2024, reportedly worth more than $1 billion, was a major event, consolidating two key players in the OTF space and prompting talk of potential collaboration or convergence between the Delta Lake and Iceberg formats. Just as this dynamic was unfolding, DuckDB Labs, the company supporting the open-source DuckDB project, proposed a fundamentally different approach.
The DuckLake Proposal: A Database-Centric Lakehouse
DuckDB's proposal centers on DuckLake, a new table format and an extension to the DuckDB database itself. The core idea is to flip the traditional lakehouse model. Instead of using file-based metadata managed by the table format layer, DuckLake proposes using a dedicated database to store and manage all table metadata. This database acts as the central catalog and transaction manager for data stored in blob storage like S3.
The DuckDB extension would allow DuckDB to function not just as an in-process analytical engine but also as a client-server system capable of interacting directly with data stored externally in S3 or other object stores, using the metadata managed within its own database instance. This is a significant departure from how most query engines interact with Iceberg or Delta Lake, which typically involves reading metadata files from the data lake storage itself, or going through a separate catalog service that still ultimately relies on file-based metadata.
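As a rough illustration of what this looks like from the client side, the sketch below follows the syntax shown in DuckDB Labs' announcement; the catalog file name, data path, and table are placeholders, and exact parameter names may differ between releases:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog: table metadata lives in the attached catalog
# database, while data is written as Parquet files under DATA_PATH
# (a placeholder object-store location; credential setup omitted).
con.execute(
    "ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/')"
)
con.execute("USE my_lake")

# Ordinary SQL from here on; the extension records snapshots, schemas, and
# file lists in the catalog database rather than in manifest files on S3.
con.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, payload VARCHAR)")
con.execute("INSERT INTO events VALUES (1, 'hello'), (2, 'world')")
print(con.execute("SELECT count(*) FROM events").fetchone())
```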
The DuckDB team argues that managing metadata within a highly optimized analytical database like DuckDB offers significant performance advantages. They point out that existing OTFs, while providing crucial data management features, still involve considerable I/O overhead and complexity in managing and querying the metadata itself. By centralizing metadata in a database, DuckLake aims to leverage database-native optimizations for metadata access, indexing, and transactional consistency, potentially leading to faster query planning and execution, especially for operations involving scanning large numbers of files or complex partition pruning.
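To see the kind of optimization the DuckDB team has in mind, consider planning a filtered query when per-file statistics live in catalog tables. The schema below is invented for illustration (it is not DuckLake's actual catalog layout): pruning becomes an indexed SQL lookup rather than a walk through manifest files in object storage.

```python
import duckdb

con = duckdb.connect()

# Illustrative catalog table (not DuckLake's real schema): one row per data
# file, with the partition value and column statistics tracked by the catalog.
con.execute("""
    CREATE TABLE catalog_data_files (
        table_name   VARCHAR,
        file_path    VARCHAR,
        partition_dt DATE,
        min_id       BIGINT,
        max_id       BIGINT,
        row_count    BIGINT
    )
""")

# Planning a query such as
#   SELECT ... FROM events WHERE dt = DATE '2025-06-01' AND id BETWEEN 100 AND 200
# reduces to one SQL lookup over the catalog instead of reading manifests:
files_to_scan = con.execute("""
    SELECT file_path
    FROM catalog_data_files
    WHERE table_name = 'events'
      AND partition_dt = DATE '2025-06-01'
      AND min_id <= 200 AND max_id >= 100
""").fetchall()
```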
This approach also simplifies the architecture from DuckDB's perspective. Instead of building complex connectors to parse and interact with the file-based metadata structures of Iceberg or Delta Lake, DuckDB can rely on its own internal database mechanisms to manage the metadata for DuckLake tables.
Industry Reactions: Excitement, Skepticism, and Momentum
The DuckLake proposal has not gone unnoticed by the data community and industry leaders. Speaking to The Register, Andy Warfield, a VP and distinguished engineer at AWS, expressed considerable enthusiasm. He noted that the announcement was widely circulated within AWS engineering teams and that people have been experimenting with it.
"People have been playing with it. It's captured people's imaginations for sure..."
Warfield acknowledged that DuckLake highlights some inherent performance challenges in the current implementation of Open Table Formats like Iceberg and Delta Lake. He explained that starting with a serialized, persistent format for metadata (like files) can limit performance compared to a database's I/O layer. DuckLake's approach of replacing the file-based metadata management with a database schema and backend directly addresses this.
However, Warfield also pointed out that the Iceberg community and cloud storage providers like AWS (with S3) are already actively working to address these same performance bottlenecks. Initiatives within Iceberg, such as proposed scan APIs, and techniques like aggressive client-side caching (which DuckDB already employs for Iceberg) are aimed at reducing round trips to storage and improving metadata access speed. He suggested that existing OTFs are likely to evolve by adding performance-focused mid-layers and improving their APIs, effectively closing the gap that DuckLake's demonstration has exposed.
The reaction from other corners of the industry was more cautious. Jake Ye, a software engineer at AI database company LanceDB and an AWS veteran, shared his perspective on LinkedIn. While acknowledging the interesting idea of using a SQL database for metadata, he highlighted a broader industry trend towards JSON-based protocols for interoperability in the catalog space.
Ye pointed to standards like the Iceberg REST Catalog (IRC), Polaris, Gravitino, and Unity Catalog, as well as trends in the AI domain with MCP and A2A, and the Lance Namespace specification, all of which are built on JSON-based foundations. He argued that defining a specification in SQL, while interesting, could face adoption challenges due to potential limitations in structured extensibility, versioning, and transport-layer separation compared to JSON-based approaches.
Russell Spitzer, a principal engineer at Snowflake, a company that has been a strong proponent of Iceberg since its early days, also weighed in. He emphasized that many projects and companies are already deeply invested in Iceberg, having come "pretty far along the road" with its adoption and integration.
Spitzer agreed that DuckDB's proposal addresses valid performance concerns, but like Warfield, he believes the Iceberg community is already working on solutions. He views the actual storage mechanism for metadata — whether it's in the file system, a catalog, or a relational store — as less critical than the APIs used to interact with it. He highlighted the importance of the Iceberg REST specification, which allows the underlying metadata storage to be abstracted away, potentially residing in various systems or even cached in memory, independent of relational semantics.
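That abstraction is easiest to see in the Iceberg REST Catalog itself: a client asks the catalog for a table and receives JSON metadata back, with no visibility into how the catalog persists it. A rough sketch over plain HTTP, with the host, namespace, and authentication details left as placeholders (endpoint and field names follow the published REST spec, which should be consulted for specifics):

```python
import requests

# Placeholder catalog endpoint; real deployments add auth headers and an
# optional prefix segment in the path.
CATALOG = "https://catalog.example.com/v1"

# Load a table: the response describes schemas, snapshots, and the current
# metadata location as JSON. Whether the catalog keeps this in files, a
# relational store, or an in-memory cache is invisible to the client.
resp = requests.get(f"{CATALOG}/namespaces/analytics/tables/events")
resp.raise_for_status()
table = resp.json()

print(table["metadata-location"])
print([s["snapshot-id"] for s in table["metadata"].get("snapshots", [])])
```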
A key concern raised by Spitzer regarding DuckLake's SQL-based metadata approach is the potential for users to have too much direct access to the underlying persistence layer. He drew an analogy to modifying Iceberg files manually instead of using the established SDK. With SQL, users could potentially modify metadata in ways that might not adhere to the transactional semantics or integrity constraints required for a reliable lakehouse, whereas Iceberg's APIs are designed to enforce these rules.
Spitzer also underscored that Iceberg is not a static target. He mentioned the recent release of Iceberg v3, which introduces new features like variant type support. This feature, developed through collaboration within the Iceberg community (including major players like Snowflake), allows tables to handle data with evolving or semi-structured schemas, which is particularly useful for data sources like IoT sensors where new fields might appear unexpectedly. This demonstrates the ongoing innovation and responsiveness within the established Iceberg ecosystem.
Comparing the Approaches: DuckLake vs. Iceberg/Delta Lake
To better understand the debate, it's helpful to compare the core architectural differences:
Iceberg/Delta Lake (Traditional OTFs)
- Metadata Storage: Metadata (manifest files, transaction logs) is stored as files alongside the data files in the data lake (e.g., S3, ADLS, GCS).
- Metadata Management: Handled by the table format library/SDK and potentially a separate catalog service (like Hive Metastore, AWS Glue Catalog, Unity Catalog, Polaris, IRC) which points to the location of the metadata files.
- Query Engine Interaction: Query engines read metadata files (often via a catalog) to discover table structure, partitions, and data files. This can involve multiple I/O operations to storage.
- ACID & Time Travel: Achieved through atomic commits recorded in the metadata logs/manifests.
- Interoperability: Designed for interoperability across multiple query engines (Spark, Trino, Presto, Flink, DuckDB, etc.) that implement the specific table format specification. Metadata format is typically file-based (Parquet, Avro, JSON) or relies on well-defined APIs (like the Iceberg REST Catalog).
- Momentum & Ecosystem: Significant industry adoption, large communities, and extensive tooling support from cloud providers and data vendors.
DuckLake (DuckDB's Proposal)
- Metadata Storage: Metadata is stored within a dedicated DuckDB database instance.
- Metadata Management: Handled directly by the DuckDB database engine, leveraging its internal mechanisms for storage, indexing, and transactions.
- Query Engine Interaction: DuckDB interacts directly with its internal metadata database to understand table structure and locate data files in external storage (S3, etc.). This aims to reduce external I/O for metadata access.
- ACID & Time Travel: Leverages DuckDB's native transactional capabilities for metadata operations.
- Interoperability: Primarily designed for DuckDB. Interoperability with other engines would require them to understand and interact with the DuckLake metadata stored in the DuckDB database, which is a significant hurdle compared to reading standardized file formats or using standard APIs.
- Momentum & Ecosystem: Currently a new proposal with limited adoption, primarily tied to the DuckDB ecosystem.
The Performance Debate: I/O, Caching, and APIs
DuckLake's central performance argument is that managing metadata within a database is inherently faster than reading and processing file-based metadata from object storage, which can involve significant latency and throughput limitations, especially for operations that require scanning large portions of the metadata graph (like listing all files in a large, partitioned table). Databases are optimized for structured data access, indexing, and query processing, which could theoretically make metadata operations much quicker.
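A back-of-envelope model makes the shape of that argument concrete. The figures below are assumptions chosen purely for illustration (roughly 50 ms per object-store GET, a few milliseconds for an indexed catalog lookup), not benchmarks of either approach:

```python
# Toy cost model for query-planning latency. All figures are assumptions
# chosen for illustration, not measurements of Iceberg, Delta Lake, or DuckLake.
GET_LATENCY_S = 0.050   # assumed object-store GET round trip
DB_QUERY_S = 0.005      # assumed indexed catalog lookup

def file_based_planning(num_manifests: int, parallelism: int = 8) -> float:
    """Table metadata + manifest list read sequentially, manifests fetched in waves."""
    sequential_reads = 2                                # table metadata, manifest list
    manifest_waves = -(-num_manifests // parallelism)   # ceiling division
    return (sequential_reads + manifest_waves) * GET_LATENCY_S

def database_planning() -> float:
    """One round trip to the metadata database."""
    return DB_QUERY_S

for manifests in (4, 64, 1024):
    print(manifests, round(file_based_planning(manifests), 3), database_planning())
```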
However, as AWS's Warfield noted, the existing OTF communities are not ignoring these issues. They are actively developing solutions:
- Improved APIs: Iceberg's proposed scan API aims to provide a more efficient way for query engines to get the necessary file information without reading entire manifest lists.
- Caching: Aggressive client-side caching of metadata is being implemented by query engines (including DuckDB itself when reading Iceberg/Delta Lake) and catalog services to reduce the need for repeated trips to object storage (see the sketch after this list).
- Catalog Services: Dedicated catalog services (like AWS Glue, Unity Catalog, Polaris) act as a layer between query engines and the metadata files, often caching metadata and providing optimized APIs for access, mitigating some of the direct file-access overhead.
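As a sketch of what client-side metadata caching amounts to in practice, the toy example below memoizes manifest fetches by path; real engines layer on TTLs, size limits, and invalidation when new commits land. The storage stand-in and helper names are invented for the example:

```python
from functools import lru_cache

# Toy stand-in for object storage: manifest path -> bytes. In a real engine
# this would be an S3/GCS/ADLS GET plus Avro/JSON parsing.
_FAKE_STORE = {"s3://bucket/meta/manifest-001.avro": b"...manifest bytes..."}
_fetch_count = 0

def fetch_manifest_from_storage(path: str) -> bytes:
    global _fetch_count
    _fetch_count += 1
    return _FAKE_STORE[path]

@lru_cache(maxsize=1024)
def cached_manifest(path: str) -> bytes:
    # Manifest files are immutable once written (new commits create new
    # files), so caching by path is safe across repeated planning passes.
    return fetch_manifest_from_storage(path)

for _ in range(100):
    cached_manifest("s3://bucket/meta/manifest-001.avro")
print(_fetch_count)  # 1: the other ninety-nine round trips were avoided
```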
These efforts suggest that the performance gap DuckLake highlights might be closing as OTF ecosystems mature and optimize their metadata access layers. The question becomes whether a database-native approach offers a fundamental, insurmountable advantage or if optimized file-based and API-driven methods can achieve comparable performance in practice.
Interoperability and Ecosystem Lock-in
One of the key selling points of Open Table Formats like Iceberg and Delta Lake is interoperability. They aim to provide a common abstraction layer over data files in cloud storage, allowing various query engines and processing frameworks (Spark, Trino, Presto, Flink, DuckDB, Dremio, etc.) to read and write to the same tables reliably. This avoids vendor lock-in at the processing layer and allows organizations to choose the best tool for a specific task while working with a single copy of their data.
DuckLake, in its current proposal, appears to be tightly coupled with the DuckDB database. For other query engines to interact with DuckLake tables, they would need to implement support for reading metadata from a DuckDB database instance, which is a much higher barrier to entry than implementing support for a standardized file format or a REST API. This raises concerns about potential vendor lock-in to the DuckDB ecosystem for users who adopt DuckLake.
While DuckDB is open source, relying on a specific database instance for metadata management could complicate architectures that currently leverage multiple processing engines accessing the same data lake. The industry trend highlighted by LanceDB's Jake Ye towards JSON-based interoperability protocols reinforces the value placed on open, transport-layer-separated standards that are not tied to a specific database technology.
Transactional Semantics and Data Integrity
Snowflake's Russell Spitzer raised a critical point about data integrity and transactional semantics. A primary goal of lakehouse formats is to provide ACID properties (Atomicity, Consistency, Isolation, Durability) over data lake storage, which was traditionally difficult with raw files. OTFs achieve this through controlled metadata updates and commit protocols.
Spitzer's concern is that using a general-purpose SQL interface to manage metadata, as proposed by DuckLake, could potentially allow users to perform operations that violate the intended transactional model of the lakehouse. While a database offers powerful transactional capabilities, directly exposing the metadata tables via SQL might give users too much low-level control, potentially leading to inconsistent states if not managed carefully through a strict API layer on top of the SQL interface.
Iceberg and Delta Lake, by contrast, expose metadata operations through specific APIs (like adding files, committing transactions, rolling back) that are designed to maintain the integrity of the table format and enforce transactional guarantees. This controlled access layer is crucial for ensuring data reliability in a multi-user, multi-engine environment.
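The distinction is roughly the difference between handing users the metadata tables and handing them a commit API. The hypothetical sketch below shows the latter: writers stage changes and the catalog applies them only through an atomic compare-and-swap on the table's current snapshot, so a stale writer fails cleanly instead of corrupting state. The class and method names are invented for illustration, though the pattern mirrors how Iceberg-style commits work conceptually:

```python
import threading

class TableCatalog:
    """Illustrative commit API: metadata changes happen only via atomic operations."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current_snapshot = 0
        self._snapshots = {0: []}  # snapshot id -> list of data file paths

    def current_snapshot(self) -> int:
        return self._current_snapshot

    def commit_append(self, expected_snapshot: int, new_files: list[str]) -> int:
        """Append files atomically; fails if another writer committed first."""
        with self._lock:
            if expected_snapshot != self._current_snapshot:
                raise RuntimeError("conflict: table changed since snapshot was read")
            new_id = self._current_snapshot + 1
            self._snapshots[new_id] = self._snapshots[expected_snapshot] + new_files
            self._current_snapshot = new_id
            return new_id

catalog = TableCatalog()
snap = catalog.current_snapshot()
catalog.commit_append(snap, ["s3://bucket/data/part-0001.parquet"])
# A writer still holding the old snapshot id now gets a clean conflict error
# rather than silently overwriting someone else's commit.
```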
Momentum and Adoption Challenges
Perhaps the biggest challenge for DuckLake is the significant momentum already behind Iceberg and Delta Lake. These formats have large, active communities, extensive documentation, and are integrated into numerous data processing platforms and cloud services. Companies like Databricks, Snowflake, AWS, Google Cloud, Microsoft Azure, Apple, Netflix, and many others are investing heavily in supporting and developing these formats.
AWS, for instance, has introduced S3 Tables, which builds managed, Iceberg-based table capabilities directly into S3 buckets. Snowflake has built its lakehouse strategy heavily around Iceberg. Databricks continues to develop Delta Lake while also embracing Iceberg following the Tabular acquisition.
For organizations that have already invested in Iceberg or Delta Lake, adopting DuckLake would mean migrating their metadata and potentially changing their data processing workflows. This is a significant undertaking, and the perceived benefits of DuckLake would need to be substantial to justify the cost and effort, especially when the existing formats are actively addressing their limitations.
DuckDB's strength lies in its ease of use, speed for analytical queries, and ability to run in-process. Its popularity among data scientists and analysts for local data exploration is undeniable. DuckLake could potentially gain traction within this existing user base, offering a DuckDB-native way to manage external data. However, competing in the broader enterprise lakehouse market, dominated by the likes of Databricks and Snowflake and built around the interoperability of Iceberg and Delta Lake, will be an uphill battle.
The Future of Lakehouse Architecture
The DuckLake proposal, despite the skepticism from some corners, serves as a valuable contribution to the ongoing evolution of lakehouse architecture. It forces the community to consider alternative approaches to metadata management and highlights the performance bottlenecks that still exist in current implementations.
Whether DuckLake gains widespread adoption or remains a niche solution within the DuckDB ecosystem, its core idea — leveraging database principles for metadata management — might influence the development of future lakehouse technologies. The competition and innovation spurred by proposals like DuckLake ultimately benefit users by driving improvements in performance, reliability, and usability across the entire data landscape.
The data industry is clearly not settling on a single, static lakehouse model. The interplay between formats like Iceberg and Delta Lake, the emergence of new ideas like DuckLake, and the continuous optimization efforts by cloud providers and data companies ensure that the architecture for managing and analyzing vast datasets in the cloud will continue to evolve rapidly. While the momentum is currently firmly with Iceberg and Delta Lake, the DuckLake proposal is a reminder that fundamental rethinking can still challenge established paradigms and push the boundaries of what's possible.
As the industry moves forward, the focus will likely remain on achieving the best possible performance, reliability, and interoperability for data stored in cost-effective cloud storage. The debate sparked by DuckLake contributes to this goal by highlighting potential areas for improvement and offering a fresh perspective on the challenges of building the ideal data lakehouse.
With multi-billion-dollar revenue organizations such as AWS, Snowflake, and even Databricks trying to steer the future of Iceberg, DuckDB and DuckLake are going to be paddling furiously to build momentum of their own. ®