Data Lineage with dbt: Complete Guide
Data lineage is the documented record of how data moves, transforms, and flows from its origin to its final destination — which sources feed which models, which models feed which dashboards, and which columns derive from which upstream fields. Good lineage answers the questions that matter most when something goes wrong: "where did this number come from?" and "if I change this model, what breaks?"
dbt generates lineage automatically as a side-effect of normal development. Every ref() and source() call encodes a dependency edge, and dbt collects those edges into a DAG (directed acyclic graph) serialized in manifest.json. This table-level lineage is always in sync with the code, can be extended to column-level and cross-system lineage, and is consumable by governance platforms like Dawiso to provide enterprise-grade visibility.
What Is Data Lineage?
Data lineage operates at multiple levels of granularity, each answering different questions:
- Table-level (dataset) lineage: "Which tables are used to build fct_revenue?" — this is what dbt's DAG shows natively
- Column-level lineage: "The revenue column in fct_revenue — which source columns contribute to it?" — requires parsing SQL expressions
- Job-level lineage: "Which pipeline run produced this table, and when?" — captured by orchestration and observability tools
- Business-level lineage: "Which KPI in which dashboard depends on which data asset?" — connects technical lineage to business impact
dbt natively handles table-level lineage and provides the metadata foundation for the others. Understanding which level you need determines how far you need to go beyond dbt's built-in capabilities.
How dbt Generates Lineage Automatically
dbt's lineage is not manually maintained — it is computed from code. When you write {{ ref('stg_orders') }} in a model file, you are simultaneously:
- Telling dbt to replace that expression with the correct schema-qualified table name at runtime
- Declaring a dependency: the current model cannot be built until stg_orders is built
dbt collects all these dependency declarations across every model file and compiles them into the DAG. The DAG is serialized in the manifest.json artifact produced by dbt compile or any dbt run. This means the lineage is always in sync with the code — you cannot have a model that uses a table without that dependency appearing in the DAG.
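Because the DAG is serialized to disk, lineage can be inspected programmatically without re-running dbt. Here is a minimal Python sketch, assuming a manifest produced into the default target/ directory; parent_map is a documented top-level manifest key mapping each node's unique ID to its upstream node IDs:
# print_edges.py: list every lineage edge recorded in the manifest
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# parent_map: unique node ID -> list of upstream node IDs,
# e.g. "model.my_project.fct_orders" -> ["model.my_project.stg_orders", ...]
for node_id, parents in manifest["parent_map"].items():
    for parent in parents:
        print(f"{parent} -> {node_id}")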
ref() Creates Model-to-Model Edges
-- models/marts/fct_orders.sql
-- This single file declares two upstream dependencies
select
    o.order_id,
    o.amount,
    c.country
from {{ ref('stg_orders') }} as o       -- edge: fct_orders → stg_orders
join {{ ref('stg_customers') }} as c    -- edge: fct_orders → stg_customers
using (customer_id)

source() Creates Source-to-Model Edges
-- models/staging/stg_orders.sql
-- Declares the lineage edge from the raw source table
select
    order_id,
    customer_id,
    status,
    amount_cents / 100.0 as amount
from {{ source('raw', 'orders') }}      -- edge: stg_orders → raw.orders

Combined, these declarations produce a lineage chain: raw.orders → stg_orders → fct_orders. dbt assembles this chain automatically from the declarations in every model file in the project.
The dbt DAG
The DAG (directed acyclic graph) is the central data structure dbt computes from all ref() and source() declarations. It is "directed" because dependencies have a direction (parent → child, upstream → downstream), and "acyclic" because circular dependencies are not allowed (dbt will error on cycles).
The DAG drives not just lineage visualization but also execution order (topological sort), selective rebuilds (dbt run --select +fct_orders runs the model and all its ancestors), and impact analysis (identifying which downstream models are affected by a change to an upstream model).
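Impact analysis can be scripted against the same artifact. A minimal sketch, again assuming a compiled manifest in target/, that walks the documented child_map key (node ID to direct downstream node IDs) to collect every transitive descendant of a model:
# impact_analysis.py: everything downstream of a given node
import json
from collections import deque

with open("target/manifest.json") as f:
    manifest = json.load(f)

def downstream(node_id):
    # Breadth-first walk over child_map (node ID -> direct children).
    seen, queue = set(), deque([node_id])
    while queue:
        for child in manifest["child_map"].get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# "model.my_project.stg_orders" is a hypothetical node ID for illustration;
# real IDs follow the pattern "model.<project>.<model_name>".
for node in sorted(downstream("model.my_project.stg_orders")):
    print(node)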
raw.orders ──────► stg_orders ─────────────────────► fct_orders ──► rpt_monthly_revenue
                                                        ▲
raw.customers ──► stg_customers ──► int_customers_ltv ──┤
                                                        │
raw.products ───► stg_products ─────────────────────────┘

dbt Docs Site: Interactive Lineage Visualization
The dbt docs generate command produces a static documentation website that includes an interactive lineage graph for every node in the project. Running dbt docs serve launches a local server where you can:
- Search for any model, source, or column by name
- Read the description, column definitions, and test configuration for each model
- Click into the lineage graph to visualize the DAG centered on any selected node — showing both upstream sources and downstream dependents
- Filter the lineage view to show only the selected node, all ancestors, all descendants, or the full graph
The docs site can be deployed as a static site (Netlify, S3, GitHub Pages) and shared with the entire organization, giving data consumers self-service access to lineage and documentation without requiring access to the dbt project itself.
Column-Level Lineage in dbt
dbt's native lineage is at the table/model level. It tells you that fct_revenue depends on stg_orders, but not which columns of fct_revenue derive from which columns of stg_orders. Column-level lineage requires parsing the SQL SELECT expressions.
dbt Cloud Native Column Lineage
dbt Cloud (the managed service) has introduced native column-level lineage in its Explorer interface. It uses static SQL analysis to parse SELECT expressions and trace how individual columns flow between models — without requiring additional packages. As of 2025–2026, this is available for Snowflake, BigQuery, Databricks, and Redshift adapters.
SQL Parsing Approach
Tools that implement column-level lineage parse the compiled SQL and identify column derivations:
- Direct references: select order_id from stg_orders → fct.order_id derives from stg_orders.order_id
- Renamed columns: select order_id as id from stg_orders → fct.id derives from stg_orders.order_id
- Expressions: select amount_cents / 100.0 as amount → stg.amount derives from raw.amount_cents
- Aggregations: select sum(amount) as revenue → fct.revenue derives from all source amount rows
Column-level lineage is significantly more valuable for impact analysis (which consumers are affected when a source column changes?) and for regulatory compliance (where exactly does this personal data field come from?).
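As an illustration of the parsing approach, here is a minimal sketch using the open-source sqlglot parser (a third-party library, not part of dbt) to map each output column of a single SELECT to the input columns its expression references. Resolving table aliases, CTEs, and * expansion across whole model chains is what dedicated tools add on top of this idea:
# column_refs.py: naive column-level derivation for one SELECT statement
import sqlglot
from sqlglot import exp

sql = """
select
    o.order_id,
    amount_cents / 100.0 as amount,
    sum(o.amount) as revenue
from raw.orders as o
group by 1, 2
"""

for select in sqlglot.parse_one(sql).find_all(exp.Select):
    for projection in select.expressions:
        # Collect every source column referenced by this projection.
        inputs = sorted({col.sql() for col in projection.find_all(exp.Column)})
        print(f"{projection.alias_or_name} <- {inputs}")

Running this prints, for example, amount <- ['amount_cents'] and revenue <- ['o.amount'], matching the direct-reference, expression, and aggregation cases listed above.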
OpenLineage Integration with dbt
OpenLineage is an open standard for lineage metadata — a vendor-neutral specification for how pipeline jobs emit lineage events (job started, dataset read, dataset written, job completed). It enables lineage to be collected from many different tools (Spark, Airflow, dbt, Flink) into a single unified lineage graph.
The openlineage-dbt Adapter
The openlineage-dbt package emits OpenLineage events for dbt run and dbt test executions. For each model run, it emits a START event (job begins, input datasets identified), a COMPLETE event (job finishes, output dataset identified, row count facet, schema facet), and a FAIL event if the model errors.
# Install
pip install openlineage-dbt
# Run dbt through the dbt-ol wrapper with OpenLineage enabled
OPENLINEAGE_URL=http://marquez:5000 dbt-ol run

Events are sent to an OpenLineage-compatible backend — Marquez (the open-source reference implementation), Apache Atlas, or enterprise platforms that support the standard, including Dawiso. This enables cross-tool lineage: the same lineage graph that shows dbt model dependencies can be extended to show that raw.orders was written by an Airbyte sync job triggered by an Airflow DAG, and that the resulting models are consumed downstream by a Tableau workbook — all in one unified view.
OpenLineage Facets
OpenLineage events carry typed metadata "facets" that enrich lineage beyond simple edge lists:
- Schema facet: column names and types of the dataset at time of run
- DataQualityMetrics facet: row count, null counts, assertion results
- ColumnLineage facet: column-to-column derivation edges (when the emitter supports it)
- SourceCode facet: the compiled SQL of the dbt model
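To make the event structure concrete, here is an abridged sketch of a COMPLETE event carrying a schema facet. The field names follow the OpenLineage specification, but every identifier and value below is hypothetical, and required facet fields such as _producer and _schemaURL are omitted for brevity:
# Abridged OpenLineage COMPLETE event (illustrative values only)
event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-01-15T08:30:00Z",
    "run": {"runId": "5f6c7a1e-1234-4abc-9def-000000000000"},
    "job": {"namespace": "dbt", "name": "my_project.fct_orders"},
    "inputs": [{"namespace": "snowflake://acme", "name": "analytics.stg_orders"}],
    "outputs": [
        {
            "namespace": "snowflake://acme",
            "name": "analytics.fct_orders",
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "order_id", "type": "number"},
                        {"name": "amount", "type": "number"},
                    ]
                }
            },
        }
    ],
}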
Exposures: Documenting Downstream Consumers
dbt exposures extend the lineage graph beyond the transformation layer to the tools and processes that consume dbt models. An exposure is a YAML declaration of a downstream artifact — a BI dashboard, a data science notebook, an operational application, a reverse ETL job — and its dependencies on specific dbt models.
# models/exposures.yml
version: 2

exposures:
  - name: finance_monthly_dashboard
    label: "Finance Monthly Dashboard"
    type: dashboard
    maturity: high
    url: https://company.tableau.com/views/FinanceMonthly
    description: >
      Primary finance dashboard showing monthly P&L, revenue by region,
      and order metrics. Used in board-level reporting.
    owner:
      name: Finance Analytics Team
      email: finance-analytics@company.com
    depends_on:
      - ref('fct_revenue')
      - ref('dim_customers')

  - name: churn_prediction_model
    label: "Customer Churn ML Model"
    type: ml
    maturity: medium
    description: "Scikit-learn churn model trained on mart layer features."
    owner:
      name: Data Science Team
      email: ds@company.com
    depends_on:
      - ref('dim_customers')
      - ref('fct_orders')

Once declared, exposures appear in the dbt docs lineage graph as leaf nodes — you can see exactly which dashboards and applications will be affected if an upstream model changes. This is critical for impact analysis before schema migrations or model renames.
Connecting dbt Lineage to Enterprise Data Catalogs
The manifest.json produced by dbt is the most information-dense artifact in the dbt ecosystem. It contains the complete DAG, every model's compiled SQL, column definitions, test configurations, descriptions, tags, and metadata — all in a single machine-readable JSON file. This makes it the primary integration point between dbt and enterprise data governance platforms.
A typical integration flow (a parsing sketch follows the list):
- The dbt CI/CD pipeline runs dbt compile or dbt run, producing manifest.json and run_results.json
- Artifacts are uploaded to object storage (S3, GCS, Azure Blob) or a dedicated artifact store
- The governance platform polls for new artifacts or receives a webhook notification
- The platform parses the manifest, extracting model nodes, source nodes, exposure nodes, edges (lineage), column definitions, test configurations, and descriptions
- These are merged into the platform's data catalog, creating or updating catalog entries with fresh lineage, documentation, and quality metadata
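A minimal sketch of the parsing step, using only documented manifest keys (nodes, exposures, parent_map); a real governance platform would map this output into its own catalog model:
# harvest_manifest.py: extract catalog-relevant metadata from a dbt manifest
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

entries = []
for node_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    entries.append(
        {
            "name": node["name"],
            "description": node.get("description", ""),
            "columns": {
                name: col.get("description", "")
                for name, col in node.get("columns", {}).items()
            },
            "upstream": manifest["parent_map"].get(node_id, []),
            "tags": node.get("tags", []),
        }
    )

exposures = [e["name"] for e in manifest.get("exposures", {}).values()]
print(f"extracted {len(entries)} models and {len(exposures)} exposures")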
Best practices for lineage governance:
- Always use ref() and source() — never hardcode table names. Hardcoded names are invisible to the DAG and create lineage gaps (see the linter sketch after this list).
- Declare all sources explicitly. Every raw table used in staging models should appear in a _sources.yml file.
- Use exposures for downstream consumers. Dashboards, ML models, and operational applications should be represented in the DAG.
- Store manifest.json artifacts after every run. Keep them in a versioned artifact store for comparison (state-aware CI) and for governance integrations.
- Add descriptions to models and columns. Lineage without context is a graph of names. Descriptions transform it into documented data contracts.
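As a lightweight guard for the first practice, here is a sketch of a heuristic check that flags schema-qualified names in model SQL that are not wrapped in ref() or source(). A regex pass like this catches obvious lineage gaps, but it is not a substitute for review or a real SQL parse:
# lint_hardcoded_refs.py: flag tables referenced outside ref()/source()
import pathlib
import re

JINJA = re.compile(r"{{.*?}}", re.DOTALL)  # strips ref()/source() expressions
HARDCODED = re.compile(r"\b(?:from|join)\s+([a-z_]+\.[a-z_]+)", re.IGNORECASE)

for path in pathlib.Path("models").rglob("*.sql"):
    text = JINJA.sub("", path.read_text())
    for match in HARDCODED.finditer(text):
        print(f"{path}: hardcoded table reference '{match.group(1)}'")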
Dawiso and dbt Lineage
Dawiso ingests dbt manifests to provide end-to-end data lineage visibility that spans the entire data stack — from source systems through dbt transformations to downstream BI tools and data products. This goes significantly beyond what the dbt docs site alone can show.
Key capabilities Dawiso adds on top of raw dbt lineage:
- Cross-system lineage stitching: dbt models connect to upstream ingestion (Fivetran, Airbyte, custom pipelines) and downstream consumers (Power BI, Tableau, Looker, custom applications) in a single unified lineage graph. The dbt layer is one layer in a longer chain, not a standalone island.
- Business-context enrichment: Dawiso's data catalog associates dbt models with business glossary terms, data owners, data domains, and compliance classifications. A column marked as containing PII in the catalog propagates that classification through the lineage graph to all downstream consumers.
- Column-level lineage across systems: Where dbt provides table-level lineage and optional column-level lineage within the transformation layer, Dawiso extends column tracking to source-system fields and downstream BI report fields — supporting column-level impact analysis across the full data journey.
- Quality and freshness context: dbt test results and source freshness checks from run_results.json are surfaced in Dawiso alongside each model's catalog entry, giving data consumers confidence in the currency and validity of what they are consuming.
- Governance workflows: Lineage in Dawiso powers change impact notifications (alerting downstream owners when an upstream model's schema changes), data access governance (who can see which lineage paths), and regulatory reporting (producing lineage documentation for GDPR, BCBS 239, or internal data governance audits).
The practical implication is that data teams can continue to work entirely within dbt — writing models, adding schema.yml descriptions, running tests — while Dawiso automatically harvests the governance-relevant metadata produced as a by-product, keeping the enterprise catalog current without requiring manual data stewardship effort. See also how Dawiso supports active metadata management to keep lineage continuously up-to-date.