
Data Vault: Complete Guide to Scalable Data Warehouse Modeling

Data Vault is a database modeling methodology designed for building enterprise data warehouses that are auditable, scalable, flexible, and resilient to change. Developed by Dan Linstedt in the late 1990s and first published as a formal modeling standard in 2002, Data Vault has become one of the leading approaches for large-scale enterprise data warehouse construction — particularly in industries with complex audit requirements, multiple source systems, and frequent schema evolution.

TL;DR

Data Vault organizes a data warehouse into three table types — Hubs (business keys), Links (relationships), and Satellites (descriptive attributes over time) — using an insert-only, append-only pattern that preserves complete history and traces every row to its source system. It excels where audit requirements are strict, source systems are numerous and volatile, and large teams need to develop in parallel.

What Is Data Vault?

Unlike dimensional modeling (Kimball star schema) or the normalized approach (Inmon 3NF), Data Vault separates structural concerns from interpretive concerns: the Raw Vault layer captures everything that ever happened in source systems without business interpretation, while a Business Vault or data mart layer applies business rules to produce consumable analytics. This separation makes Data Vault uniquely suited to environments where the single version of truth is contested, evolving, or requires full historical traceability.

The three fundamental building blocks are:

  • Hubs — unique lists of business keys (Customer ID, Product Code, Account Number)
  • Links — relationships, transactions, or associations between Hubs
  • Satellites — descriptive attributes about Hubs or Links, versioned over time

Every table in the Raw Vault carries three mandatory metadata columns: a Load Date (when the row was inserted), a Record Source (which source system provided it), and a Hash Key derived from the business key. Rows are never updated or deleted — only new rows are inserted. This insert-only discipline is the foundation of Data Vault's auditability.
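To make the mandatory metadata concrete, here is a minimal Python sketch of deriving a hash key from a business key and assembling a Raw Vault row. MD5 is shown as one common choice; the normalization rule, column names, and `hub_row` helper are illustrative assumptions, not part of any standard.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Derive a deterministic surrogate key from a business key (MD5 shown)."""
    # Normalize before hashing so ' cust-1001 ' and 'CUST-1001' map to the same key.
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def hub_row(business_key: str, record_source: str) -> dict:
    """Assemble a row carrying the three mandatory metadata columns."""
    return {
        "hk_customer": hash_key(business_key),
        "customer_id": business_key,
        "load_date": datetime.now(timezone.utc),  # when the row was inserted
        "record_source": record_source,           # which source system provided it
    }

row = hub_row("CUST-1001", "CRM")
```

Because the hash is computed from the normalized business key alone, the same entity arriving from two different source systems yields the same hash key — which is what lets Hubs integrate keys across systems.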

[Diagram: Data Vault core components] A Customer Hub (HK_Customer, CustomerID, LoadDate, RecSrc) and an Order Hub (HK_Order, OrderID, LoadDate, RecSrc) are connected by a Customer_Order Link (HK_Link, HK_Customer, HK_Order, LoadDate). A Customer_Details Satellite (Email, Country, Phone, HashDiff, RecordSource) hangs off the Customer Hub, and an Order_Details Satellite (Status, Amount, ShipDate, HashDiff, RecordSource) off the Order Hub. Legend: Hub = business key, Link = relationship, Satellite = attributes over time.

History and Origins of Data Vault

Dan Linstedt began developing the Data Vault methodology while working with the US Intelligence Community in the late 1990s. The requirements of that environment — nothing can ever be deleted, source systems change constantly, and data volumes demand concurrent processing — shaped a fundamentally different approach to warehouse modeling.

Linstedt published the Data Vault modeling standard in 2002. The methodology gained traction through the 2000s in financial services, insurance, telecommunications, and government sectors — all domains with strict audit requirements and complex, heterogeneous source system landscapes.

In 2013, Linstedt published Data Vault 2.0, which extended the original methodology to incorporate NoSQL storage, big data platforms, Agile delivery practices, and expanded virtualization patterns. Data Vault 2.0 also formalized the Business Vault and added new table types (Point-in-Time tables, Bridge tables) for improving query performance.

Core Components: Hubs, Links, and Satellites

Data Vault uses three fundamental table types that together capture the full complexity of enterprise data — its business keys, its relationships, and its descriptive attributes over time.

Hubs

A Hub represents a unique list of business keys for a core business concept. Every distinct business entity (Customer, Product, Account, Order) has a corresponding Hub table. A Hub contains exactly:

  • Hash Key: A surrogate key derived from a hash of the business key (MD5, SHA-1, or SHA-256)
  • Business Key: The natural key from the source system (customer ID, product code, account number)
  • Load Date: When this business key was first seen
  • Record Source: Which source system provided this key

Hubs never contain descriptive attributes — those live in Satellites. Hubs never contain foreign keys to other Hubs — those live in Links. This extreme separation is what makes the Data Vault model resilient to change: adding a new source system or relationship requires new Satellites and Links, not modifications to existing structures.
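The insert-only Hub load can be sketched in a few lines of Python. The in-memory dict standing in for the Hub table and the column names are illustrative assumptions; the point is the pattern: a business key is inserted the first time it is seen and never touched again.

```python
import hashlib

def md5_key(business_key: str) -> str:
    return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

def load_hub(hub: dict, staged_keys: list, load_date: str, record_source: str) -> int:
    """Insert-only Hub load: add a row only for business keys not yet present.

    `hub` maps hash key -> row; an in-memory stand-in for the Hub table.
    """
    inserted = 0
    for bk in staged_keys:
        hk = md5_key(bk)
        if hk not in hub:  # existing keys are never updated or re-inserted
            hub[hk] = {
                "business_key": bk,
                "load_date": load_date,        # first time this key was seen
                "record_source": record_source,
            }
            inserted += 1
    return inserted

hub = {}
load_hub(hub, ["C-1", "C-2"], "2024-01-01", "CRM")
# Re-loading the same keys from a second source system inserts nothing:
added = load_hub(hub, ["C-1", "C-2"], "2024-02-01", "ERP")
```

Note that the second load leaves the original LoadDate and RecordSource intact — the Hub records which system saw each key first, while the ERP's descriptive data would land in its own Satellite.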

Links

A Link represents a relationship, transaction, or association between two or more Hubs. A Link table contains:

  • Hash Key: A surrogate key for this relationship instance
  • Foreign Hash Keys: Hash keys referencing the participating Hubs
  • Load Date: When this relationship was first observed
  • Record Source: Which source system provided this relationship

Links are insert-only — relationships are never deleted, only superseded. This preserves the full history of every association: a customer linked to an account that was later closed will have the original link recorded forever alongside the closing event.
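A common convention (an assumption here, not mandated by the source) is to derive the Link hash key from the participating Hub hash keys joined in a fixed column order with a separator:

```python
import hashlib

def md5_key(business_key: str) -> str:
    return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

def link_hash_key(*hub_hash_keys: str, sep: str = "||") -> str:
    """Link hash key: hash of the participating Hub hash keys in a fixed order.

    The separator guards against concatenation ambiguity ('AB'+'C' vs 'A'+'BC').
    """
    return hashlib.md5(sep.join(hub_hash_keys).encode()).hexdigest()

hk_customer = md5_key("CUST-1001")
hk_order = md5_key("ORD-7")
hk_link = link_hash_key(hk_customer, hk_order)
```

Because the column order is fixed by the Link's definition, the same Customer–Order pair always produces the same Link hash key, making duplicate-relationship detection a simple key lookup.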

Satellites

A Satellite stores all descriptive attributes about a Hub or a Link across time. Satellites are where the actual business data lives. Each new version of a record creates a new Satellite row — old rows are never deleted or updated. A Satellite table contains:

  • Parent Hash Key: FK to the Hub or Link it describes
  • Load Date: When this version of the attributes was loaded
  • Load End Date (optional): When this version was superseded
  • Record Source: Source system origin
  • Hash Diff: A hash of all attribute values used to detect changes efficiently
  • Attributes: The actual descriptive columns (name, address, status, amount, etc.)

This insert-only pattern provides a complete temporal history of every attribute change — making Data Vault naturally compliant with audit requirements in financial, healthcare, and government contexts.
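The Hash Diff mechanism can be sketched as follows — a minimal Python example with an in-memory list standing in for the Satellite table. The sorted-column convention and helper names are illustrative assumptions; the pattern is what matters: a new version row is inserted only when the hash of the attributes differs from the latest version.

```python
import hashlib

def hash_diff(attributes: dict, sep: str = "||") -> str:
    """Hash of all attribute values, in a fixed column order, for change detection."""
    payload = sep.join(str(attributes[k]).strip().upper() for k in sorted(attributes))
    return hashlib.md5(payload.encode()).hexdigest()

def load_satellite(sat_rows: list, parent_hk: str, attrs: dict,
                   load_date: str, record_source: str) -> bool:
    """Insert a new version only when the HashDiff differs from the latest row."""
    new_diff = hash_diff(attrs)
    latest = next((r for r in reversed(sat_rows) if r["parent_hk"] == parent_hk), None)
    if latest is not None and latest["hash_diff"] == new_diff:
        return False  # no change detected: nothing inserted, history untouched
    sat_rows.append({
        "parent_hk": parent_hk,
        "load_date": load_date,
        "record_source": record_source,
        "hash_diff": new_diff,
        **attrs,
    })
    return True

sat = []
load_satellite(sat, "HK1", {"email": "a@x.com", "country": "CZ"}, "2024-01-01", "CRM")
changed = load_satellite(sat, "HK1", {"email": "a@x.com", "country": "CZ"}, "2024-01-02", "CRM")
load_satellite(sat, "HK1", {"email": "b@x.com", "country": "CZ"}, "2024-01-03", "CRM")
```

Comparing one pre-computed hash per row, rather than every attribute column, is what makes change detection efficient even for wide Satellites.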

Data Vault 2.0 Additions

Data Vault 2.0 (DV2) extended the original methodology with several important additions that address performance, business logic, and modern platform support.

Business Vault

The Business Vault is a layer of computed Satellites and Links that apply business rules to raw vault data. While the Raw Vault captures everything from source systems without interpretation, the Business Vault applies calculations, classifications, and derivations that reflect agreed-upon business logic. Business rule changes are localized to the Business Vault, preserving the integrity of the audit layer.

Point-in-Time (PIT) Tables

Point-in-Time (PIT) tables solve one of Data Vault's main query challenges: joining multiple Satellites for a single Hub to reconstruct a point-in-time snapshot. A PIT table pre-computes, for every Hub key and every snapshot date, which Satellite row was current at that time — dramatically simplifying and accelerating historical queries.
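The core of a PIT build is a "latest load date at or before the snapshot" lookup per Hub key. A minimal Python sketch, using ISO-date strings and in-memory rows as stand-ins for real tables (one Satellite shown; a real PIT would record this per Satellite):

```python
def pit_rows(hub_keys: list, sat_rows: list, snapshot_dates: list) -> list:
    """For each Hub key and snapshot date, record the load date of the
    Satellite row that was current (latest load_date <= snapshot)."""
    pit = []
    for hk in hub_keys:
        versions = sorted(r["load_date"] for r in sat_rows if r["parent_hk"] == hk)
        for snap in snapshot_dates:
            current = max((d for d in versions if d <= snap), default=None)
            pit.append({"hk": hk, "snapshot": snap, "sat_load_date": current})
    return pit

sat = [
    {"parent_hk": "HK1", "load_date": "2024-01-01"},
    {"parent_hk": "HK1", "load_date": "2024-03-01"},
]
pit = pit_rows(["HK1"], sat, ["2024-02-01", "2024-04-01"])
```

With these pointers materialized, a historical query becomes an equi-join from the PIT table to each Satellite on (hash key, load date) instead of a correlated "latest version before X" subquery per Satellite.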

Bridge Tables

Bridge tables are pre-computed denormalized snapshots of multi-hop relationship paths in the Data Vault. Where a fact query would require traversing Hub → Link → Hub → Link → Hub chains at query time, a Bridge table materializes this traversal as a flat table — particularly valuable for multi-tier distribution hierarchies and organizational structures.
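The traversal a Bridge table replaces can be sketched in Python. The example flattens a hypothetical Customer → Order → LineItem chain (two Link tables joined on the shared Order hash key); the entity names and one-customer-per-order assumption are illustrative.

```python
def build_bridge(link_customer_order: list, link_order_item: list) -> list:
    """Materialize Customer -> Order -> LineItem paths as one flat table.

    Joins two Link tables on the shared Order hash key, so fact queries can
    hit a single flat structure instead of walking the Hub/Link chain.
    """
    # Assumes each order belongs to exactly one customer (illustrative).
    cust_by_order = {r["hk_order"]: r["hk_customer"] for r in link_customer_order}
    return [
        {"hk_customer": cust_by_order[r["hk_order"]],
         "hk_order": r["hk_order"],
         "hk_line_item": r["hk_line_item"]}
        for r in link_order_item
        if r["hk_order"] in cust_by_order
    ]

bridge = build_bridge(
    [{"hk_customer": "C1", "hk_order": "O1"}],
    [{"hk_order": "O1", "hk_line_item": "L1"},
     {"hk_order": "O1", "hk_line_item": "L2"}],
)
```

In production the same flattening is done in SQL and refreshed on a schedule, trading storage and refresh cost for join-free query paths.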

NoSQL and Big Data Integration

Data Vault 2.0 formalized patterns for applying the Hub-Link-Satellite model to NoSQL stores and big data platforms (Hadoop, Spark, cloud data lakes), extending its applicability beyond RDBMS environments to modern lakehouse architectures.

Data Vault vs Kimball vs Inmon

The three dominant data warehouse modeling approaches represent fundamentally different philosophies.

Kimball (Dimensional Modeling)

Kimball's dimensional modeling (star schema / snowflake schema) is optimized for analytical query performance and business user accessibility. Fact tables contain measurable events; dimension tables provide context. Kimball warehouses are fast to query and easy to understand, but they apply business rules at load time — making them less flexible when business definitions change. History is typically tracked through slowly changing dimensions (SCDs) rather than full audit trails.

Inmon (Third Normal Form)

Inmon's approach uses a fully normalized (3NF) integrated enterprise data warehouse as the single source of truth, from which denormalized data marts are built for specific analytical needs. Inmon warehouses are well-integrated and avoid redundancy, but they are complex to build, slow to query without data mart layers, and difficult to adapt as source systems change.

When to Choose Each

  • Kimball: Smaller teams, faster time-to-first-value, well-defined and stable business requirements, query performance is paramount
  • Inmon: Large enterprises with strong data governance, clear single-source-of-truth requirements, long-term integration investment
  • Data Vault: Complex multi-source environments, strict audit requirements, frequent source system changes, large teams enabling parallel development, regulated industries

Benefits of Data Vault

Full Auditability

Data Vault's insert-only pattern means every change to every attribute is preserved forever with its load timestamp and source system. This creates a complete, immutable audit trail — the primary reason Data Vault dominates in regulated industries (banking, insurance, healthcare, government).

Resilience to Source System Changes

When a source system adds, removes, or renames a column, Data Vault absorbs the change gracefully. New attributes become new Satellite columns or tables; removed attributes stop being populated; renamed attributes can be mapped without restructuring existing data. This contrasts sharply with Kimball models where schema changes often require expensive backfill migrations.

Parallel Loading

The strict separation between Hubs, Links, and Satellites eliminates inter-table dependencies within the Raw Vault. Hub loads, Link loads, and Satellite loads for different business entities are fully independent and can run concurrently. Large Data Vault implementations routinely parallelize hundreds of concurrent loads, achieving throughput that sequentially dependent dimensional model loads cannot match.

Scalability and Source System Agnosticism

New business concepts are added as new Hubs with their Satellites; new relationships as new Links; new source systems as new Satellites on existing Hubs and Links. The existing model is never restructured, only extended. When multiple source systems provide contradictory values for the same business concept, both values are preserved — business rules for reconciliation live in the Business Vault layer, not in the Raw Vault.

When to Use Data Vault

Data Vault is the right choice when:

  • Audit requirements are non-negotiable: Regulations or internal policies require complete historical traceability of every data change
  • Source systems are numerous and heterogeneous: Many source systems feeding the warehouse, each with different keys, schemas, and change cadences
  • Business definitions are contested or evolving: Different departments have different definitions of "customer," "revenue," or "active account"
  • Large teams work in parallel: The independence of Hub/Link/Satellite loads enables multiple teams to build without stepping on each other
  • Agile delivery is required: Data Vault's incremental, non-destructive extension model fits Agile delivery — new sprints add new structures rather than modifying existing ones

Data Vault may be over-engineered when teams are small, requirements are stable and well-defined, time-to-delivery is paramount, and audit requirements do not demand full historization.

Implementing Data Vault with dbt

dbt (data build tool) pairs particularly well with Data Vault. The dbt ecosystem includes packages specifically designed for Data Vault implementation:

  • AutomateDV (formerly dbtvault): An open-source dbt package providing macros for generating Hub, Link, Satellite, PIT, and Bridge tables using dbt's templating system
  • DataVault4dbt: Another dbt package for Data Vault 2.0 implementation with broad platform support

A typical dbt Data Vault project structure:

  1. Staging layer: dbt staging models rename, cast, and hash columns from source extracts, adding hash keys and load dates
  2. Raw Vault: AutomateDV macros generate Hub, Link, and Satellite models from staging layer sources
  3. Business Vault: Custom dbt models apply business rules over Raw Vault tables
  4. Information Marts: dbt models join Business Vault structures (with PIT and Bridge helpers) into dimensional or flat tables for BI tools

This approach is version-controlled, testable, and integrates with standard dbt lineage visualization — tracking the full transformation chain from source to mart.

Metadata and Lineage in Data Vault

Data Vault has inherently rich metadata. Every row carries Record Source (which system produced this data), Load Date (when it was loaded), and Hash Keys (which entity it belongs to). This embedded metadata makes Data Vault one of the most governable warehouse methodologies. The challenge is surfacing this rich metadata in a way that business users and governance teams can navigate — this is where a data catalog adds significant value.

Dawiso catalogs all Data Vault layers — Raw Vault, Business Vault, and Information Marts — with business descriptions, ownership, and classification. It captures full data lineage from source system tables through staging, Raw Vault, Business Vault, to Information Mart. Dawiso's business glossary links canonical business term definitions to their physical representations: the "Customer" Hub is linked to the "Customer" glossary entry, documenting the agreed-upon definition and the source system keys that map to it. Quality metrics on Data Vault structures — null rates in Satellite attributes, record source distribution anomalies, unexpected gap patterns in Load Dates — are surfaced alongside each model's catalog entry, giving data consumers confidence in data quality.

© Dawiso s.r.o. All rights reserved