What Is Delta Lake?
Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, time travel, and efficient updates to data lakes. Open-sourced by Databricks in 2019 and moved under the Linux Foundation later that same year, Delta Lake has become one of the three major open table formats, alongside Apache Iceberg and Apache Hudi, that make modern lakehouse architectures possible.
This guide covers what Delta Lake is, how its transaction log works, how it compares to Iceberg and Hudi, and where it fits in 2026 enterprise architectures including Microsoft Fabric, Databricks, and Snowflake.
Delta Lake is an open table format that wraps Apache Parquet files with a JSON-based transaction log (the _delta_log/ directory). The log adds ACID transactions, schema enforcement, time travel, MERGE/UPDATE/DELETE operations, and Z-ordering — capabilities that raw Parquet lakes lack. Delta is the default format on Databricks and Microsoft Fabric, with Iceberg interop via Delta UniForm. The current major version, Delta Lake 4.0, shipped in 2025 with first-class Iceberg compatibility and the new Delta Connect client/server architecture.
What Is Delta Lake?
A Delta Lake table is a directory of Apache Parquet files, plus a JSON transaction log directory called _delta_log/. Conceptually it is "Parquet with a transaction log." That small addition is what enables every higher-level capability — and why "just use Parquet" no longer suffices for production data lakes.
Delta Lake is supported by:
- Apache Spark — the original and most complete integration. Read, write, and all DML operations (MERGE, UPDATE, DELETE) work natively.
- Databricks Runtime — Delta is the default storage format; the platform's Photon engine optimizes Delta queries.
- Microsoft Fabric — OneLake stores data in Delta format by default. Direct Lake reads Delta directly into Power BI's VertiPaq engine.
- Trino, Presto, and Apache Flink — increasingly first-class Delta support since 2023.
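- Delta Rust / delta-rs — a Rust-native implementation with Python bindings (the deltalake package) that lets tools such as Polars read and write Delta tables without Spark. DuckDB also reads Delta tables without Spark, through its own delta extension.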
- Snowflake — queries Delta tables through external tables, in many cases by way of the Iceberg compatibility layer.
Why Delta Lake Was Created
Before open table formats, "data lake" meant a directory of files in S3, HDFS, or ADLS — usually Parquet, sometimes ORC or Avro. This raw approach worked for read-only analytical workloads but suffered from a known set of problems:
- No atomicity. A multi-file write that crashed midway left the lake in an inconsistent state. Readers could see half-written data.
- No isolation. Concurrent writers and readers could clash. Race conditions corrupted partitions.
- Costly updates. Updating a single row meant rewriting an entire Parquet file. There was no efficient MERGE.
- Schema drift. Files written by different jobs could have different schemas. Readers crashed when columns disappeared or types changed.
- No history. Once a file was overwritten, the previous version was gone. There was no point-in-time query.
- Small files. Streaming pipelines created millions of tiny files that destroyed query performance.
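Delta Lake addresses these problems by introducing a transaction log that records every change as an immutable, ordered list of metadata operations. Atomicity, isolation, schema checks, and history all follow from the log; compaction commands built on top of it deal with small files.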
Core Features
ACID Transactions
Every write to a Delta table is atomic: either all files in the commit become visible or none do. Optimistic concurrency control handles concurrent writers. If two commits conflict, the first to land wins, and the other is rejected and must retry against the new table version. Readers always see a consistent snapshot.
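The commit race described above can be sketched in plain Python. This is a minimal illustration, not Delta's implementation: it assumes that exclusive file creation on a local filesystem stands in for the put-if-absent primitive an object store would offer, and the helper names `try_commit`, `latest_version`, and `commit_with_retry` are hypothetical.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Attempt to write commit `version`; return False if another writer won."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists, which stands
        # in for the put-if-absent primitive of a real storage system.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(n.split(".")[0]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Retry against the newest version until the commit lands."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        # A fuller implementation would also check here whether the winning
        # commit logically conflicts with this one before retrying.
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, [{"add": {"path": "part-0.parquet"}}])
# Two writers race for version 1: only one file creation succeeds.
writer_a = try_commit(log_dir, 1, [{"add": {"path": "part-a.parquet"}}])
writer_b = try_commit(log_dir, 1, [{"add": {"path": "part-b.parquet"}}])
# The losing writer retries and lands on the next version.
retried = commit_with_retry(log_dir, [{"add": {"path": "part-b.parquet"}}])
```

Because a version number can be claimed by only one writer, readers never observe two competing definitions of the same version.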
Time Travel
Every commit creates a new version. Past versions remain queryable until they are vacuumed:
-- Query the table as it looked at version 12
SELECT * FROM sales VERSION AS OF 12;
-- Query the table as it looked at a specific time
SELECT * FROM sales TIMESTAMP AS OF '2026-04-15 09:00:00';
Time travel is essential for debugging, reproducible ML training, regulatory audit, and rollback after bad writes.
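One way to picture a TIMESTAMP AS OF query is as a lookup from a timestamp to the latest version committed at or before it. The sketch below assumes a hypothetical in-memory list of commit times; the real engine derives these from the log, and `version_as_of` is an illustrative name, not a Delta API.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit times, indexed by version (version 0 first).
commit_times = [
    datetime(2026, 4, 14, 8, 0),
    datetime(2026, 4, 15, 8, 30),
    datetime(2026, 4, 15, 9, 45),
]


def version_as_of(timestamp: datetime) -> int:
    """Return the latest version committed at or before `timestamp`."""
    idx = bisect_right(commit_times, timestamp) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return idx


# 09:00 on 2026-04-15 falls between versions 1 and 2, so version 1 is read.
resolved = version_as_of(datetime(2026, 4, 15, 9, 0))
```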
Schema Enforcement and Evolution
Delta Lake rejects writes that violate the table schema. Strict enforcement is the default. Schema evolution is opt-in, through WITH SCHEMA EVOLUTION or table properties, and lets new columns be added without a manual migration.
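The difference between enforcement and evolution can be sketched as a check that runs before a write is accepted. This is a simplified model under stated assumptions: schemas are plain name-to-type dictionaries, `check_write` is a hypothetical helper, and the real rules (nullability, type widening, nested fields) are more detailed.

```python
def check_write(table_schema: dict, batch_schema: dict,
                allow_evolution: bool = False) -> dict:
    """Return the table schema after the write, or raise if it is rejected."""
    for column, dtype in batch_schema.items():
        # A type change on an existing column is rejected in both modes.
        if column in table_schema and table_schema[column] != dtype:
            raise TypeError(
                f"type mismatch for {column}: {table_schema[column]} vs {dtype}"
            )
    new_columns = {c: t for c, t in batch_schema.items() if c not in table_schema}
    if new_columns and not allow_evolution:
        raise ValueError(f"unexpected columns: {sorted(new_columns)}")
    # With evolution enabled, new columns are appended to the schema.
    return {**table_schema, **new_columns}


table = {"id": "bigint", "amount": "decimal(18,2)"}
batch = {"id": "bigint", "amount": "decimal(18,2)", "currency": "string"}
evolved = check_write(table, batch, allow_evolution=True)
```

Calling `check_write(table, batch)` without `allow_evolution=True` raises, which mirrors the strict default.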
MERGE / UPDATE / DELETE
SQL DML operations that raw Parquet cannot do efficiently:
MERGE INTO customers AS target
USING staging AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET email = source.email
WHEN NOT MATCHED THEN INSERT *;
Delta rewrites only the affected files (data files containing matched/updated rows), not entire partitions. DELETE for GDPR right-to-erasure is a single SQL statement.
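The file-level behavior of the MERGE above can be sketched with dictionaries standing in for Parquet files. Everything here is illustrative: the file names, the `merge` helper, and the in-memory rows are assumptions, and the sketch covers only the copy-on-write path.

```python
def merge(files: dict, source: list, key: str = "id"):
    """Upsert `source` into `files`; return (new_files, rewritten_paths)."""
    updates = {row[key]: row for row in source}
    new_files = {}
    rewritten = []
    matched = set()
    for path, rows in files.items():
        hits = updates.keys() & {r[key] for r in rows}
        if not hits:
            new_files[path] = rows  # untouched file is carried over as-is
            continue
        matched |= hits
        rewritten.append(path)
        # WHEN MATCHED THEN UPDATE SET email = source.email
        new_files[f"rewritten-{path}"] = [
            {**r, "email": updates[r[key]]["email"]} if r[key] in updates else r
            for r in rows
        ]
    # WHEN NOT MATCHED THEN INSERT *
    inserts = [row for k, row in updates.items() if k not in matched]
    if inserts:
        new_files["inserted.parquet"] = inserts
    return new_files, rewritten


files = {
    "a.parquet": [{"id": 1, "email": "a@x"}, {"id": 2, "email": "b@x"}],
    "b.parquet": [{"id": 3, "email": "c@x"}],
}
source = [{"id": 2, "email": "new@x"}, {"id": 9, "email": "z@x"}]
new_files, rewritten = merge(files, source)
```

Only `a.parquet` contains a matched key, so it is the only file rewritten; `b.parquet` is referenced unchanged and the unmatched row lands in a new file.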
OPTIMIZE and Z-Order
Streaming pipelines create many small files. OPTIMIZE compacts them into larger ones. ZORDER BY co-locates rows by specified columns to enable file pruning on those columns:
OPTIMIZE sales ZORDER BY (customer_id, region);
On Microsoft Fabric, a related technique is V-Order, a Microsoft-developed write-time optimization of the Parquet file layout that is tuned for Direct Lake. It complements OPTIMIZE and Z-Order rather than replacing them.
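To see why Z-ordering helps file pruning on more than one column, the sketch below compares a plain sort with a bit-interleaved (Morton) ordering over a small grid. It is a toy model, not Delta's implementation: the two integer columns, the 16-row files, and the helper names are all assumptions made for illustration.

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows, rows_per_file, sort_key):
    """Sort rows, cut them into files, and record per-file min/max stats."""
    ordered = sorted(rows, key=sort_key)
    files = []
    for i in range(0, len(ordered), rows_per_file):
        chunk = ordered[i:i + rows_per_file]
        files.append({
            "min": (min(r[0] for r in chunk), min(r[1] for r in chunk)),
            "max": (max(r[0] for r in chunk), max(r[1] for r in chunk)),
        })
    return files


def files_to_scan(files, column: int, value: int) -> int:
    """Count files whose min/max range could contain `value` in `column`."""
    return sum(1 for f in files if f["min"][column] <= value <= f["max"][column])


rows = [(x, y) for x in range(16) for y in range(16)]
linear = write_files(rows, 16, sort_key=lambda r: r)  # sorted by x, then y
zordered = write_files(rows, 16, sort_key=lambda r: z_value(*r))

linear_scans = (files_to_scan(linear, 0, 5), files_to_scan(linear, 1, 5))
z_scans = (files_to_scan(zordered, 0, 5), files_to_scan(zordered, 1, 5))
```

In this grid the linear sort prunes well on the first column but must scan all 16 files for a filter on the second, while the Z-ordered layout scans 4 files for a filter on either column.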
History
DESCRIBE HISTORY tablename returns the commits recorded in the table's log: who made each change, when, with what operation, and from which cluster. History is available for as long as the corresponding log entries are retained. Combined with audit logging in Unity Catalog or another catalog, this provides forensics-grade change tracking.
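Conceptually, a history listing is a projection of the commit metadata stored in the log. The sketch below assumes a hypothetical in-memory log with one JSON action per line and a `commitInfo` action per commit; the field values are invented, and real commits carry more fields than shown.

```python
import json

# Hypothetical log contents keyed by version; field values are invented.
log = {
    0: '{"commitInfo": {"timestamp": 1000, "operation": "CREATE TABLE"}}',
    1: '{"commitInfo": {"timestamp": 2000, "operation": "WRITE"}}\n'
       '{"add": {"path": "part-00000.parquet"}}',
    2: '{"commitInfo": {"timestamp": 3000, "operation": "UPDATE"}}\n'
       '{"remove": {"path": "part-00000.parquet"}}\n'
       '{"add": {"path": "part-00001.parquet"}}',
}


def describe_history(log: dict) -> list:
    """Return one row per commit, newest first, built from commitInfo actions."""
    rows = []
    for version in sorted(log, reverse=True):
        for line in log[version].splitlines():
            action = json.loads(line)
            if "commitInfo" in action:
                rows.append({"version": version, **action["commitInfo"]})
    return rows


history = describe_history(log)
```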
How Delta Lake Works
The transaction log lives in a _delta_log/ directory next to the data files. Each commit creates a new JSON file:
my_table/
  _delta_log/
    00000000000000000000.json   ← commit 0 (initial CREATE)
    00000000000000000001.json   ← commit 1 (first INSERT)
    00000000000000000002.json   ← commit 2 (UPDATE)
    ...
    00000000000000000010.checkpoint.parquet
  part-00000-...snappy.parquet
  part-00001-...snappy.parquet
  ...
Each JSON file contains the actions taken in that commit: which Parquet files were added (add), which were removed (remove), schema changes (metaData), and protocol updates. To read the table at version N, a client replays the JSON files from 0 to N or, more efficiently, reads the most recent checkpoint (a Parquet snapshot of the entire log state) and only the JSON files written since.
This design means the source of truth for table state is the log, not the file directory. Files outside the log are invisible to Delta. Files referenced by the log but not yet vacuumed are still recoverable via Time Travel.
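The replay logic can be sketched in a few lines. This is a simplified model, assuming each commit is a string of newline-delimited JSON actions and that only `add` and `remove` affect the set of live files; the file names and the `replay` helper are hypothetical, and a checkpoint is modeled as a precomputed set of live files.

```python
import json
from typing import Optional


def replay(commits: list, checkpoint: Optional[set] = None) -> set:
    """Apply add/remove actions in commit order and return the live data files."""
    live = set(checkpoint or ())
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other actions (metaData, protocol) do not change the file set.
    return live


commits = [
    '{"metaData": {"id": "t"}}\n{"add": {"path": "part-00000.parquet"}}',
    '{"add": {"path": "part-00001.parquet"}}',
    '{"remove": {"path": "part-00000.parquet"}}\n{"add": {"path": "part-00002.parquet"}}',
]

version_1 = replay(commits[:2])  # state after commits 0 and 1
version_2 = replay(commits)      # state after all three commits
# Starting from a checkpoint taken at version 1 yields the same result.
from_checkpoint = replay(commits[2:], checkpoint=version_1)
```

Reading an older version is the same replay stopped earlier, which is why a removed file stays reachable through time travel until it is physically deleted.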
Delta Lake vs Apache Iceberg
Apache Iceberg, originated at Netflix and donated to Apache in 2018, is the main alternative open table format. The two formats solve overlapping problems with different design choices.
Where Delta wins:
- Maturity on Databricks and Microsoft Fabric — first-class everywhere in those ecosystems.
- Slightly simpler operational model (single transaction log per table).
- Z-Order multi-dimensional clustering.
Where Iceberg wins:
- Broader ecosystem of independent engines (Snowflake, AWS Athena, Trino, BigQuery, Dremio, ClickHouse all read Iceberg natively).
- Hidden partitioning — partition columns can be transformations of data columns, evolved over time without table rewrites.
- More mature spec for partition evolution and schema evolution at scale.
- Catalog-agnostic by design (REST catalog spec, AWS Glue, Hive metastore, Polaris).
The two communities have converged in 2024–2026: Iceberg added concepts inspired by Delta, Delta added Iceberg interop via UniForm. For most organizations the choice now depends more on the engine ecosystem in use than on format-level differences.
Delta Lake vs Apache Hudi
Apache Hudi (originated at Uber, 2017) is the third major format. Hudi is optimized for streaming and CDC use cases with two table types:
- Copy-on-Write (CoW) — similar to Delta. Updates rewrite affected files.
- Merge-on-Read (MoR) — updates write deltas to row-based log files, merged at read time. Faster writes, slower reads.
Hudi's MoR model is designed for high-frequency streaming updates: writes are cheaper than with copy-on-write, and reads pay the merge cost. In 2026, Hudi remains strongest in streaming-heavy environments, such as shops built around AWS Kinesis or Apache Flink, while Delta and Iceberg dominate analytical lakehouse use cases.
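The trade-off between the two table types can be sketched with a dictionary standing in for a base file and a list standing in for a row-based log. The function names and data are hypothetical, and compaction (folding the log back into the base file) is left out.

```python
base_file = {1: "a@example.com", 2: "b@example.com", 3: "c@example.com"}


def cow_update(base: dict, key, value) -> dict:
    """Copy-on-write: produce a full new copy of the file with the change applied."""
    return {**base, key: value}


def mor_update(log: list, key, value) -> None:
    """Merge-on-read: append the change to a log; the base file is untouched."""
    log.append((key, value))


def mor_read(base: dict, log: list) -> dict:
    """Readers merge the base file with the log, later entries winning."""
    merged = dict(base)
    for key, value in log:
        merged[key] = value
    return merged


cow_result = cow_update(base_file, 2, "new@example.com")

delta_log = []
mor_update(delta_log, 2, "new@example.com")
mor_result = mor_read(base_file, delta_log)
```

Both paths return the same rows; the write cost sits in `cow_update`, which copies every row, while the read cost sits in `mor_read`, which replays the log on each read.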
Implementation with Spark
The most common way to use Delta Lake is through Apache Spark. Basic operations:
# Assumes an existing SparkSession `spark` configured for Delta and a DataFrame `df`
# Create or write
df.write.format("delta").save("/lakehouse/sales")   # path-based table
df.write.format("delta").saveAsTable("sales")       # table registered in the catalog
# Read
spark.read.format("delta").load("/lakehouse/sales")
spark.table("sales")
# Time travel
spark.read.format("delta").option("versionAsOf", 12).load("/lakehouse/sales")
spark.read.format("delta").option("timestampAsOf", "2026-04-15").load("/lakehouse/sales")
SQL operations work the same as on a managed table:
CREATE TABLE sales (id BIGINT, amount DECIMAL(18,2), order_date DATE) USING DELTA;
INSERT INTO sales VALUES (1, 99.99, '2026-05-01');
UPDATE sales SET amount = 109.99 WHERE id = 1;
DELETE FROM sales WHERE order_date < '2024-01-01';
OPTIMIZE sales ZORDER BY (order_date);
VACUUM sales RETAIN 168 HOURS;
Delta UniForm and Iceberg Compatibility
Introduced with Delta Lake 3.0 in 2023, Delta UniForm (Universal Format) lets a single Delta table be read by Iceberg-compatible engines without migration. UniForm writes Iceberg metadata alongside the Delta log, so engines like Snowflake or AWS Athena can read the table as if it were Iceberg, while Spark and Databricks read it as Delta.
UniForm is a practical answer to the format wars. A Delta table written from Databricks can be queried from Snowflake. In the other direction, an Iceberg table written from Snowflake can be read in Databricks through Databricks Runtime's Iceberg read path. Format choice no longer locks the consumer in.
Format choice matters less than it did two years ago. Delta UniForm and Iceberg's growing adoption mean both formats are interoperable in most engines. Pick the format your primary engine optimizes for — Delta if you live on Databricks or Fabric; Iceberg if you live on Snowflake or run a heterogeneous engine zoo. Then expose the other format through UniForm or Iceberg compat where consumers need it.
Use Cases
- Lakehouse architecture — Delta is the foundation of medallion architecture on Databricks and Fabric. Bronze, Silver, Gold layers are all Delta tables.
- Streaming + batch unification — the same Delta table can be read by Structured Streaming and batch jobs simultaneously.
- GDPR / compliance deletes — efficient row-level DELETE with audit history.
- Slowly Changing Dimensions — MERGE handles SCD Type 1 and Type 2 patterns natively.
- Reproducible ML training — Time Travel pins training data to a specific version.
- CDC ingestion — pair Delta with Debezium, Kafka, or Snowflake CDC streams to land changes incrementally.
- Power BI on lakehouse — Direct Lake mode in Microsoft Fabric reads Delta tables directly into VertiPaq.
Delta Lake's success is the success of the lakehouse pattern. By bringing transactional guarantees to lake storage, it eliminated the historical reason organizations duplicated data into proprietary warehouses. Combined with strong governance — through Unity Catalog, Microsoft Purview, or Dawiso — Delta provides the storage substrate for a single lakehouse that can replace separate warehouses, lakes, and operational stores. The format itself is small; what it enables is large.