Data Lineage with dbt: Complete Guide
Data lineage is the documented record of how data moves, transforms, and flows from its origin to its final destination — which sources feed which models, which models feed which dashboards, and which columns derive from which upstream fields. Good lineage answers the questions that matter most when something goes wrong: "where did this number come from?" and "if I change this model, what breaks?"
dbt generates lineage automatically as a side-effect of normal development. Every ref() and source() call encodes a dependency edge, and dbt collects those edges into a DAG (directed acyclic graph) serialized in manifest.json. This table-level lineage is always in sync with the code, can be extended to column-level and cross-system lineage, and is consumable by governance platforms like Dawiso to provide enterprise-grade visibility.
What Is Data Lineage?
Data lineage operates at multiple levels of granularity, each answering different questions:
- Table-level (dataset) lineage: "Which tables are used to build fct_revenue?" — this is what dbt's DAG shows natively
- Column-level lineage: "The revenue column in fct_revenue — which source columns contribute to it?" — requires parsing SQL expressions
- Job-level lineage: "Which pipeline run produced this table, and when?" — captured by orchestration and observability tools
- Business-level lineage: "Which KPI in which dashboard depends on which data asset?" — connects technical lineage to business impact
dbt natively handles table-level lineage and provides the metadata foundation for the others. Understanding which level you need determines how far you need to go beyond dbt's built-in capabilities.
How dbt Generates Lineage Automatically
dbt's lineage is not manually maintained — it is computed from code. When you write {{ ref('stg_orders') }} in a model file, you are simultaneously:
- Telling dbt to replace that expression with the correct schema-qualified table name at runtime
- Declaring a dependency: the current model cannot be built until stg_orders is built
dbt collects all these dependency declarations across every model file and compiles them into the DAG. The DAG is serialized in the manifest.json artifact produced by dbt compile or any dbt run. This means the lineage is always in sync with the code — you cannot have a model that uses a table without that dependency appearing in the DAG.
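Because the DAG is serialized to disk, lineage can be inspected programmatically without re-running dbt. Here is a minimal Python sketch, assuming a manifest produced into the default target/ directory; parent_map is a documented top-level manifest key mapping each node's unique ID to its upstream node IDs:
# print_edges.py: list every lineage edge recorded in the manifest
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# parent_map: unique node ID -> list of upstream node IDs,
# e.g. "model.my_project.fct_orders" -> ["model.my_project.stg_orders", ...]
for node_id, parents in manifest["parent_map"].items():
    for parent in parents:
        print(f"{parent} -> {node_id}")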
ref() Creates Model-to-Model Edges
-- models/marts/fct_orders.sql
-- This single file declares two upstream dependencies
select
    o.order_id,
    o.amount,
    c.country
from {{ ref('stg_orders') }} as o       -- edge: fct_orders → stg_orders
join {{ ref('stg_customers') }} as c    -- edge: fct_orders → stg_customers
using (customer_id)

source() Creates Source-to-Model Edges
-- models/staging/stg_orders.sql
-- Declares the lineage edge from the raw source table
select
    order_id,
    customer_id,
    status,
    amount_cents / 100.0 as amount
from {{ source('raw', 'orders') }}      -- edge: stg_orders → raw.orders

Combined, these declarations produce a lineage chain: raw.orders → stg_orders → fct_orders. dbt assembles this chain automatically from the declarations in every model file in the project.
The dbt DAG
The DAG (directed acyclic graph) is the central data structure dbt computes from all ref() and source() declarations. It is "directed" because dependencies have a direction (parent → child, upstream → downstream), and "acyclic" because circular dependencies are not allowed (dbt will error on cycles).
The DAG drives not just lineage visualization but also execution order (topological sort), selective rebuilds (dbt run --select +fct_orders runs the model and all its ancestors), and impact analysis (identifying which downstream models are affected by a change to an upstream model).
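Impact analysis can be scripted against the same artifact. A minimal sketch, again assuming a compiled manifest in target/, that walks the documented child_map key (node ID to direct downstream node IDs) to collect every transitive descendant of a model:
# impact_analysis.py: everything downstream of a given node
import json
from collections import deque

with open("target/manifest.json") as f:
    manifest = json.load(f)

def downstream(node_id):
    # Breadth-first walk over child_map (node ID -> direct children).
    seen, queue = set(), deque([node_id])
    while queue:
        for child in manifest["child_map"].get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# "model.my_project.stg_orders" is a hypothetical node ID for illustration;
# real IDs follow the pattern "model.<project>.<model_name>".
for node in sorted(downstream("model.my_project.stg_orders")):
    print(node)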
raw.orders ──────► stg_orders ─────────────────────► fct_orders ──► rpt_monthly_revenue
                                                        ▲
raw.customers ──► stg_customers ──► int_customers_ltv ──┤
                                                        │
raw.products ───► stg_products ─────────────────────────┘

dbt Docs Site: Interactive Lineage Visualization
The dbt docs generate command produces a static documentation website that includes an interactive lineage graph for every node in the project. Running dbt docs serve launches a local server where you can:
- Search for any model, source, or column by name
- Read the description, column definitions, and test configuration for each model
- Click into the lineage graph to visualize the DAG centered on any selected node — showing both upstream sources and downstream dependents
- Filter the lineage view to show only the selected node, all ancestors, all descendants, or the full graph
The docs site can be deployed as a static site (Netlify, S3, GitHub Pages) and shared with the entire organization, giving data consumers self-service access to lineage and documentation without requiring access to the dbt project itself.
Column-Level Lineage in dbt
dbt's native lineage is at the table/model level. It tells you that fct_revenue depends on stg_orders, but not which columns of fct_revenue derive from which columns of stg_orders. Column-level lineage requires parsing the SQL SELECT expressions.
dbt Cloud Native Column Lineage
dbt Cloud (the managed service) has introduced native column-level lineage in its Explorer interface. It uses static SQL analysis to parse SELECT expressions and trace how individual columns flow between models — without requiring additional packages. As of 2025–2026, this is available for Snowflake, BigQuery, Databricks, and Redshift adapters.
SQL Parsing Approach
Tools that implement column-level lineage parse the compiled SQL and identify column derivations:
- Direct references: select order_id from stg_orders → fct.order_id derives from stg_orders.order_id
- Renamed columns: select order_id as id from stg_orders → fct.id derives from stg_orders.order_id
- Expressions: select amount_cents / 100.0 as amount → stg.amount derives from raw.amount_cents
- Aggregations: select sum(amount) as revenue → fct.revenue derives from all source amount rows
Column-level lineage is significantly more valuable for impact analysis (which consumers are affected when a source column changes?) and for regulatory compliance (where exactly does this personal data field come from?).
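As an illustration of the parsing approach, here is a minimal sketch using the open-source sqlglot parser (a third-party library, not part of dbt) to map each output column of a single SELECT to the input columns its expression references. Resolving table aliases, CTEs, and * expansion across whole model chains is what dedicated tools add on top of this idea:
# column_refs.py: naive column-level derivation for one SELECT statement
import sqlglot
from sqlglot import exp

sql = """
select
    o.order_id,
    amount_cents / 100.0 as amount,
    sum(o.amount) as revenue
from raw.orders as o
group by 1, 2
"""

for select in sqlglot.parse_one(sql).find_all(exp.Select):
    for projection in select.expressions:
        # Collect every source column referenced by this projection.
        inputs = sorted({col.sql() for col in projection.find_all(exp.Column)})
        print(f"{projection.alias_or_name} <- {inputs}")

Running this prints, for example, amount <- ['amount_cents'] and revenue <- ['o.amount'], matching the direct-reference, expression, and aggregation cases listed above.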
OpenLineage Integration with dbt
OpenLineage is an open standard for lineage metadata — a vendor-neutral specification for how pipeline jobs emit lineage events (job started, dataset read, dataset written, job completed). It enables lineage to be collected from many different tools (Spark, Airflow, dbt, Flink) into a single unified lineage graph.
The openlineage-dbt Adapter
The openlineage-dbt package emits OpenLineage events for dbt run and dbt test executions. For each model run, it emits a START event (job begins, input datasets identified), a COMPLETE event (job finishes, output dataset identified, row count facet, schema facet), and a FAIL event if the model errors.
# Install
pip install openlineage-dbt
# Run dbt through the dbt-ol wrapper with OpenLineage enabled
OPENLINEAGE_URL=http://marquez:5000 dbt-ol run

Events are sent to an OpenLineage-compatible backend — Marquez (the open-source reference implementation), Apache Atlas, or enterprise platforms that support the standard, including Dawiso. This enables cross-tool lineage: the same lineage graph that shows dbt model dependencies can be extended to show that raw.orders was written by an Airbyte sync job triggered by an Airflow DAG, and that the resulting models are consumed downstream by a Tableau workbook — all in one unified view.
OpenLineage Facets
OpenLineage events carry typed metadata "facets" that enrich lineage beyond simple edge lists:
- Schema facet: column names and types of the dataset at time of run
- DataQualityMetrics facet: row count, null counts, assertion results
- ColumnLineage facet: column-to-column derivation edges (when the emitter supports it)
- SourceCode facet: the compiled SQL of the dbt model
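To make the event structure concrete, here is an abridged sketch of a COMPLETE event carrying a schema facet. The field names follow the OpenLineage specification, but every identifier and value below is hypothetical, and required facet fields such as _producer and _schemaURL are omitted for brevity:
# Abridged OpenLineage COMPLETE event (illustrative values only)
event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-01-15T08:30:00Z",
    "run": {"runId": "5f6c7a1e-1234-4abc-9def-000000000000"},
    "job": {"namespace": "dbt", "name": "my_project.fct_orders"},
    "inputs": [{"namespace": "snowflake://acme", "name": "analytics.stg_orders"}],
    "outputs": [
        {
            "namespace": "snowflake://acme",
            "name": "analytics.fct_orders",
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "order_id", "type": "number"},
                        {"name": "amount", "type": "number"},
                    ]
                }
            },
        }
    ],
}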
Exposures: Documenting Downstream Consumers
dbt exposures extend the lineage graph beyond the transformation layer to the tools and processes that consume dbt models. An exposure is a YAML declaration of a downstream artifact — a BI dashboard, a data science notebook, an operational application, a reverse ETL job — and its dependencies on specific dbt models.
# models/exposures.yml
version: 2

exposures:
  - name: finance_monthly_dashboard
    label: "Finance Monthly Dashboard"
    type: dashboard
    maturity: high
    url: https://company.tableau.com/views/FinanceMonthly
    description: >
      Primary finance dashboard showing monthly P&L, revenue by region,
      and order metrics. Used in board-level reporting.
    owner:
      name: Finance Analytics Team
      email: finance-analytics@company.com
    depends_on:
      - ref('fct_revenue')
      - ref('dim_customers')

  - name: churn_prediction_model
    label: "Customer Churn ML Model"
    type: ml
    maturity: medium
    description: "Scikit-learn churn model trained on mart layer features."
    owner:
      name: Data Science Team
      email: ds@company.com
    depends_on:
      - ref('dim_customers')
      - ref('fct_orders')

Once declared, exposures appear in the dbt docs lineage graph as leaf nodes — you can see exactly which dashboards and applications will be affected if an upstream model changes. This is critical for impact analysis before schema migrations or model renames.
Connecting dbt Lineage to Enterprise Data Catalogs
The manifest.json produced by dbt is the most information-dense artifact in the dbt ecosystem. It contains the complete DAG, every model's compiled SQL, column definitions, test configurations, descriptions, tags, and metadata — all in a single machine-readable JSON file. This makes it the primary integration point between dbt and enterprise data governance platforms.
A typical integration flow (a parsing sketch follows the list):
- The dbt CI/CD pipeline runs dbt compile or dbt run, producing manifest.json and run_results.json
- Artifacts are uploaded to object storage (S3, GCS, Azure Blob) or a dedicated artifact store
- The governance platform polls for new artifacts or receives a webhook notification
- The platform parses the manifest, extracting model nodes, source nodes, exposure nodes, edges (lineage), column definitions, test configurations, and descriptions
- These are merged into the platform's data catalog, creating or updating catalog entries with fresh lineage, documentation, and quality metadata
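A minimal sketch of the parsing step, using only documented manifest keys (nodes, exposures, parent_map); a real governance platform would map this output into its own catalog model:
# harvest_manifest.py: extract catalog-relevant metadata from a dbt manifest
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

entries = []
for node_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    entries.append(
        {
            "name": node["name"],
            "description": node.get("description", ""),
            "columns": {
                name: col.get("description", "")
                for name, col in node.get("columns", {}).items()
            },
            "upstream": manifest["parent_map"].get(node_id, []),
            "tags": node.get("tags", []),
        }
    )

exposures = [e["name"] for e in manifest.get("exposures", {}).values()]
print(f"extracted {len(entries)} models and {len(exposures)} exposures")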
Best practices for lineage governance:
- Always use ref() and source() — never hardcode table names. Hardcoded names are invisible to the DAG and create lineage gaps (see the linter sketch after this list).
- Declare all sources explicitly. Every raw table used in staging models should appear in a _sources.yml file.
- Use exposures for downstream consumers. Dashboards, ML models, and operational applications should be represented in the DAG.
- Store manifest.json artifacts after every run. Keep them in a versioned artifact store for comparison (state-aware CI) and for governance integrations.
- Add descriptions to models and columns. Lineage without context is a graph of names. Descriptions transform it into documented data contracts.
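As a lightweight guard for the first practice, here is a sketch of a heuristic check that flags schema-qualified names in model SQL that are not wrapped in ref() or source(). A regex pass like this catches obvious lineage gaps, but it is not a substitute for review or a real SQL parse:
# lint_hardcoded_refs.py: flag tables referenced outside ref()/source()
import pathlib
import re

JINJA = re.compile(r"{{.*?}}", re.DOTALL)  # strips ref()/source() expressions
HARDCODED = re.compile(r"\b(?:from|join)\s+([a-z_]+\.[a-z_]+)", re.IGNORECASE)

for path in pathlib.Path("models").rglob("*.sql"):
    text = JINJA.sub("", path.read_text())
    for match in HARDCODED.finditer(text):
        print(f"{path}: hardcoded table reference '{match.group(1)}'")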
Dawiso and dbt Lineage
Dawiso ingests dbt manifests to provide end-to-end data lineage visibility that spans the entire data stack — from source systems through dbt transformations to downstream BI tools and data products. This goes significantly beyond what the dbt docs site alone can show.
Key capabilities Dawiso adds on top of raw dbt lineage:
- Cross-system lineage stitching: dbt models connect to upstream ingestion (Fivetran, Airbyte, custom pipelines) and downstream consumers (Power BI, Tableau, Looker, custom applications) in a single unified lineage graph. The dbt layer is one layer in a longer chain, not a standalone island.
- Business-context enrichment: Dawiso's data catalog associates dbt models with business glossary terms, data owners, data domains, and compliance classifications. A column marked as containing PII in the catalog propagates that classification through the lineage graph to all downstream consumers.
- Column-level lineage across systems: Where dbt provides table-level lineage and optional column-level lineage within the transformation layer, Dawiso extends column tracking to source-system fields and downstream BI report fields — supporting column-level impact analysis across the full data journey.
- Quality and freshness context: dbt test results and source freshness checks from run_results.json are surfaced in Dawiso alongside each model's catalog entry, giving data consumers confidence in the currency and validity of what they are consuming.
- Governance workflows: Lineage in Dawiso powers change impact notifications (alerting downstream owners when an upstream model's schema changes), data access governance (who can see which lineage paths), and regulatory reporting (producing lineage documentation for GDPR, BCBS 239, or internal data governance audits).
The practical implication is that data teams can continue to work entirely within dbt — writing models, adding schema.yml descriptions, running tests — while Dawiso automatically harvests the governance-relevant metadata produced as a by-product, keeping the enterprise catalog current without requiring manual data stewardship effort. See also how Dawiso supports active metadata management to keep lineage continuously up-to-date.