
What Is a Data Lakehouse?

A data lakehouse is a data architecture that combines the low-cost, scalable storage of a data lake with the structured data management, query performance, and governance capabilities traditionally associated with a data warehouse. The term was popularized by Databricks in a 2021 paper and has since become the architectural framework underlying major platforms, including Databricks (built on Delta Lake) and Apache Iceberg-based deployments.

The lakehouse emerged to solve a structural problem: organizations needed both the scale and flexibility of data lakes (for raw data, unstructured data, and machine learning workloads) and the reliability and performance of data warehouses (for analytics, BI, and business-critical reporting). Running and synchronizing both systems created cost, complexity, and consistency challenges. The lakehouse collapses them into a single architecture.

TL;DR

A data lakehouse stores data in open formats (Parquet, ORC) on cloud object storage, adds a metadata layer (Delta Lake, Apache Iceberg, Apache Hudi) that enables ACID transactions and schema enforcement, and delivers data warehouse performance through caching and query optimization. Governance — cataloging, lineage, access control, quality — is especially critical in lakehouse architectures because raw data, curated data, and ML artifacts coexist in the same storage layer.

The Lake and Warehouse Problem

To understand the lakehouse, it helps to understand what it replaced — or more accurately, what it unified.

Data Warehouses: Performance and Governance, Limited Flexibility

Traditional data warehouses (Teradata, IBM Netezza, later cloud equivalents) offered strong query performance, ACID transactions, schema enforcement, and well-understood governance models. But they were expensive at scale, required structured data, didn't handle unstructured or semi-structured data well, and had limited support for machine learning workloads. Every byte of data cost money to store, which meant careful curation was necessary before data landed in the warehouse.

Data Lakes: Scale and Flexibility, Governance Challenges

Data lakes (typically Hadoop HDFS, then cloud object storage: S3, GCS, ADLS) offered cheap, scalable storage for any data format. Raw data, logs, images, JSON, Parquet — everything could land in the lake. But data lakes had a governance problem: without schema enforcement, transaction support, or access control on individual files, they became "data swamps" — large, cheap, and nearly impossible to use reliably for analytics. Query performance was poor, data quality was unpredictable, and lineage was opaque.

The Two-Tier Problem

Organizations often ran both: a data lake for raw storage, ML data, and unstructured content, and a data warehouse for analytics and BI. This created the "two-tier" problem: ETL pipelines moved data from lake to warehouse, creating synchronization delays, inconsistencies, and operational overhead. The same data lived in two places, potentially at different states of freshness, maintained by different teams with different governance practices.

Lakehouse Architecture

The lakehouse solves the two-tier problem with three components working together:

  1. Open file formats on cloud object storage — Data is stored in Parquet or ORC files on S3, GCS, or ADLS. This is cheap, scalable, and accessible to any compute engine (SQL engines, Spark, Python, R).
  2. Open table format metadata layer — Delta Lake, Apache Iceberg, or Apache Hudi add a transaction log on top of the file layer. This log enables ACID transactions (concurrent reads and writes without data corruption), time travel (query the table as it was 30 days ago), schema evolution (add or change columns safely), and efficient metadata-based query pruning.
  3. High-performance query engines — SQL engines (Trino, Spark SQL, DuckDB, Snowflake on Iceberg) read the metadata layer to execute efficient queries without full file scans. Caching and data skipping bring warehouse-grade query performance to the file layer.
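
To make these three components concrete, here is a minimal sketch in PySpark. It assumes a Spark session with the delta-spark package on the classpath and a hypothetical bucket s3://example-lakehouse/; it is an illustration of the pattern, not a production setup. Data engineering writes Parquet data files plus a Delta transaction log to object storage, and the same engine (or any other compatible one) queries those files through the metadata layer.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session configured for Delta Lake
# (assumes the delta-spark package is available on the classpath).
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical table location on cloud object storage.
path = "s3://example-lakehouse/curated/orders"

# Components 1 and 2: Parquet data files plus a Delta transaction log
# land in object storage as a governed table.
orders = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 4.50)],
    ["order_id", "item", "price"],
)
orders.write.format("delta").mode("append").save(path)

# Component 3: a query engine reads the metadata layer and prunes files
# instead of scanning the whole bucket.
spark.read.format("delta").load(path).createOrReplaceTempView("orders")
spark.sql("SELECT item, SUM(price) AS revenue FROM orders GROUP BY item").show()
```

Nothing is copied into a separate warehouse tier; the Parquet files and the transaction log in the bucket are the single source of truth for every engine that reads them.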

Key Lakehouse Capabilities

ACID Transactions

The metadata layer tracks all changes atomically. A write either completes fully or is rolled back — no partial writes that leave tables in inconsistent states. This is the foundation for concurrent analytics and data engineering workloads on the same storage.
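
Continuing that hypothetical table, a sketch of an upsert through Delta Lake's Python API: the whole MERGE commits as one entry in the transaction log, so concurrent readers see either the previous snapshot or the fully applied result, never a partial write.

```python
from delta.tables import DeltaTable

# Continuing the sketch above: `spark` and `path` refer to the
# hypothetical Delta table on object storage.
updates = spark.createDataFrame(
    [(1, "widget", 8.99), (3, "gizmo", 19.00)],
    ["order_id", "item", "price"],
)

target = DeltaTable.forPath(spark, path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()      # update existing orders
    .whenNotMatchedInsertAll()   # insert new ones
    .execute()                   # one atomic commit in the transaction log
)
```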

Time Travel

Because the transaction log records every change, you can query any table as it existed at any point in the log's retention window. "Show me the state of this dataset as of last Tuesday" is a SQL query, not a restore operation. Time travel is critical for auditing, debugging, and regulatory requirements.
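
For example, with Delta Lake's standard read options (the version number and timestamp below are illustrative, and the table is the hypothetical one from earlier):

```python
# Read the table as of an earlier version or wall-clock time.
as_of_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
as_of_time = (
    spark.read.format("delta")
    .option("timestampAsOf", "2026-01-06 00:00:00")
    .load(path)
)

# The commit history itself is queryable.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show()
```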

Schema Enforcement and Evolution

The table format enforces schema — writes that don't match the table's schema are rejected. Schema changes (adding columns, changing nullability) are versioned and tracked in the log, enabling safe evolution without pipeline breakage.
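
A sketch of both behaviours with Delta Lake, reusing the hypothetical orders table: an append that introduces an unknown column is rejected by default, and opting in to schema evolution records the new column in the log.

```python
# A DataFrame with a column the table doesn't have yet.
with_region = spark.createDataFrame(
    [(4, "widget", 9.99, "EU")],
    ["order_id", "item", "price", "region"],
)

# Schema enforcement: this append fails with a schema-mismatch error.
try:
    with_region.write.format("delta").mode("append").save(path)
except Exception as err:
    print("rejected by schema enforcement:", err)

# Schema evolution: opting in records the new column in the transaction log.
(
    with_region.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path)
)
```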

Figure: Data Lakehouse — Architecture Layers. A data catalog and governance layer (metadata, lineage, access control, quality, classification, ownership, stewardship) spans a consumption layer (SQL/BI tools, Spark/Python, ML/AI workloads, streaming queries, APIs), an open table format metadata layer (Delta Lake, Apache Iceberg, Apache Hudi: ACID, time travel, schema enforcement, efficient pruning), and cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage holding Parquet/ORC, JSON/CSV, images/video, logs/events, ML artifacts). All compute engines access the same data in open formats, with no data movement between lake and warehouse.

Lakehouse Platforms

The lakehouse architecture is now supported by multiple platforms:

  • Databricks — The originator of the term. Uses Delta Lake as its native table format, with Unity Catalog for governance. Strong in ML and AI workloads.
  • Apache Iceberg — The open-source table format gaining the broadest ecosystem support. Supported by Snowflake (including through Snowflake Open Catalog), AWS (Glue, Athena, EMR), Azure, and many others; see the short read sketch after this list.
  • Snowflake — Traditional cloud data warehouse that now supports Iceberg tables and hybrid lakehouse architectures. Continues to excel in SQL and BI workloads.
  • Apache Hudi — Specializes in streaming data ingestion and incremental processing, often used for CDC-heavy workloads.
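
To illustrate how broad that Iceberg ecosystem is, a lightweight Python client such as pyiceberg can read a table directly from a catalog with no warehouse engine in between. This is a sketch only; the catalog name, endpoint, credentials, and table identifier are hypothetical.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog endpoint; real deployments might point at
# AWS Glue, a Polaris / Snowflake Open Catalog endpoint, and so on.
catalog = load_catalog(
    "demo",
    uri="https://catalog.example.com",
    token="<access-token>",
)

orders = catalog.load_table("sales.orders")   # hypothetical identifier
arrow_table = orders.scan().to_arrow()        # plain Arrow, no Spark required
print(arrow_table.num_rows)
```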

Governance in the Lakehouse

The lakehouse's flexibility — any data, any format, any compute engine — makes governance both more important and more challenging than in a traditional data warehouse. In a warehouse, the structured schema enforced governance implicitly; in a lakehouse, governance must be explicitly implemented.

A lakehouse without governance becomes an expensive data swamp. The raw zone, the curated zone, and the ML zone look identical from a storage perspective — only metadata and governance policies distinguish a trusted analytics dataset from an experimental scratch file. A data catalog is not optional in a lakehouse; it's the mechanism that makes the architecture usable.

The critical governance requirements:

  • Data cataloging — Every table, partition, and file that business users or AI systems should be able to discover needs a catalog entry with schema, ownership, quality, and usage information.
  • Fine-grained access control — Column-level and row-level security must be enforced through the query engine, not just at the storage layer. Unity Catalog, Ranger, and similar tools provide this (see the sketch after this list).
  • Data classification — Identifying which data contains PII, sensitive business information, or regulated content — and applying the appropriate access and handling policies.
  • Lineage — Tracking how data flows from raw zone through transformations to curated zones and downstream reports is more complex in the lakehouse than in a warehouse, but more important: the paths are more varied and the risk of consuming raw instead of curated data is higher.
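
For the access-control requirement, here is an illustrative, Unity Catalog-style snippet issued through Spark SQL. The catalog, schema, group, and mask-function names are hypothetical, and the exact syntax varies by engine and governance tool.

```python
# Table-level grant: analysts can read the curated table, nothing else.
# (Hypothetical catalog/schema/group names; Unity Catalog-style syntax.)
spark.sql("GRANT SELECT ON TABLE main.curated.orders TO `analysts`")

# Column-level control: mask the email column via a masking function
# defined elsewhere, enforced by the query engine rather than storage.
spark.sql("""
    ALTER TABLE main.curated.customers
    ALTER COLUMN email SET MASK main.governance.mask_email
""")
```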

When to Use a Lakehouse

The lakehouse is the right choice when:

  • You need to support both SQL analytics and ML/AI workloads from the same data
  • You have significant volumes of semi-structured or unstructured data alongside structured data
  • You need to retain raw data for regulatory, audit, or reprocessing reasons
  • You're building on cloud object storage and want to avoid the cost and complexity of a separate data warehouse tier
  • Your team has dbt/Spark expertise and prefers open formats and interoperable tooling

A traditional cloud data warehouse may still be preferable if your workloads are purely SQL-based, your data is all structured, and your team is optimized for BI and reporting rather than data engineering.

Conclusion

The data lakehouse has become the dominant architectural paradigm for enterprise data platforms in 2026 — combining the economics of object storage with the governance and performance capabilities needed for reliable analytics and AI. The critical insight is that the lakehouse shifts the governance challenge: instead of schema enforcement at ingest time (as in a warehouse), governance must be applied explicitly through cataloging, access control, quality monitoring, and lineage tracking. Organizations that invest in this governance layer turn the lakehouse into a trusted enterprise asset. Those that don't end up with an expensive, opaque data swamp.
