
dbt Models: Complete Guide to Data Transformations

dbt (data build tool) has become the standard framework for analytics engineering — and the dbt model is its fundamental building block. At its core, a dbt model is nothing more than a single SELECT statement saved in a .sql file. dbt takes that SELECT, wraps it in the appropriate DDL, and materializes it in your data warehouse or lakehouse as a table, view, or incremental table. This deceptively simple idea unlocks a full software engineering workflow — version control, testing, documentation, dependency management — on top of plain SQL that analysts already know.

TL;DR

A dbt model is a .sql file containing a single SELECT statement. dbt compiles it into CREATE TABLE/VIEW statements and executes them in dependency order based on ref() and source() calls. Models are organized into staging, intermediate, and mart layers — and every run produces machine-readable metadata (manifest.json) that feeds data governance platforms like Dawiso.

What Is dbt?

dbt is an open-source command-line tool (and a managed cloud service, dbt Cloud) that handles the T in ELT — the transformation of data that has already been loaded into your warehouse. Instead of maintaining complex stored procedures or brittle custom scripts, dbt lets your team write SELECT statements and express dependencies between them. dbt compiles those statements, resolves the dependency graph, and executes transformations in the correct order against your target database (Snowflake, BigQuery, Databricks, Redshift, DuckDB, and many others).

Key capabilities dbt brings to transformation work:

  • Automatic DAG (directed acyclic graph) resolution from ref() and source() calls
  • Built-in testing framework for data quality assertions
  • Documentation generation with interactive lineage graphs
  • Jinja templating for DRY SQL (macros, conditionals, loops)
  • A growing ecosystem of community packages (dbt_utils, dbt_expectations, AutomateDV)

Because all transformation logic lives in .sql and .yml files, dbt slots naturally into Git — enabling code review, branching, CI/CD, and a full audit trail of every change.

What Is a dbt Model?

A dbt model is a file with a .sql extension containing exactly one SELECT statement. The file name becomes the model name, which becomes the name of the object dbt creates in your warehouse.

-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    status,
    cast(created_at as timestamp) as created_at,
    amount_cents / 100.0          as amount
from {{ source('raw', 'orders') }}

When you run dbt run, dbt compiles this into a CREATE TABLE AS or CREATE VIEW AS statement (depending on materialization) and executes it. The double-curly-brace syntax is Jinja templating — it lets dbt inject correct schema names, evaluate macros, and resolve references at compile time.

Models can optionally start with a Jinja config block that sets model-level configuration:

-- models/marts/fct_orders.sql
{{
  config(
    materialized = 'incremental',
    unique_key    = 'order_id',
    on_schema_change = 'sync_all_columns'
  )
}}

select
    o.order_id,
    o.customer_id,
    c.country,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }}      as o
left join {{ ref('stg_customers') }} as c using (customer_id)

{% if is_incremental() %}
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}

The ref() and source() Functions

Two Jinja functions are central to every dbt project: ref() and source().

ref() — Referencing Other Models

{{ ref('model_name') }} tells dbt that the current model depends on another model. At compile time, dbt replaces it with the fully qualified schema and table name for that environment. Two critical things happen:

  • Dependency tracking: dbt adds an edge in the DAG, ensuring the referenced model is built first
  • Environment portability: the resolved name changes automatically between dev, staging, and production schemas — no hardcoded schema names
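The build-order guarantee from ref() can be illustrated with a small sketch. This is not dbt's actual implementation — it is a minimal Python model of the idea, using the standard library's topological sorter on a hypothetical mini-project where each model maps to the models it ref()s:

```python
from graphlib import TopologicalSorter

# Hypothetical mini-project: each model lists the models it ref()s.
deps = {
    "stg_orders": [],
    "stg_customers": [],
    "fct_orders": ["stg_orders", "stg_customers"],
}

# static_order() yields each node only after all of its dependencies,
# which is exactly the guarantee dbt makes when executing the DAG.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # staging models first, then fct_orders
```

If a model ref()s itself (directly or through a chain), the sorter raises a cycle error — mirroring how dbt rejects circular dependencies at compile time.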

source() — Referencing Raw Data

{{ source('source_name', 'table_name') }} references raw tables not produced by dbt. Sources are declared in .yml files:

# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: analytics
    schema: raw_data
    tables:
      - name: orders
        description: "Raw orders from the transactional database"
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: customers
        description: "Raw customer records"

Declaring sources unlocks source freshness checks (dbt source freshness), lineage tracking back to raw tables, and documentation of the ingestion layer alongside transformation models.
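The warn_after/error_after thresholds above amount to a simple staleness classification on max(loaded_at_field). A hedged Python sketch of that logic (illustrative only, not dbt's internal code):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at, now, warn_after, error_after):
    """Classify source staleness against warn/error thresholds,
    the way `dbt source freshness` reports pass / warn / error."""
    age = now - loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 1, 20, tzinfo=timezone.utc)  # 16 hours old
# 16h exceeds the 12h warn threshold but not the 24h error threshold
print(freshness_status(loaded, now, timedelta(hours=12), timedelta(hours=24)))  # warn
```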

Figure: dbt model layer pattern — raw source tables (raw.orders, raw.customers, raw.products) each feed one staging view (stg_orders, stg_customers, stg_products); staging models feed an intermediate join/enrichment model (int_orders_enriched); the intermediate layer feeds mart-layer fact and dimension tables (fct_revenue, dim_customers).

Model Materializations

A materialization determines how dbt persists a model's SELECT result in the warehouse. dbt ships with four built-in materializations.

View

The lightest option: dbt creates a database view. No data is stored — queries against the view always re-execute the underlying SQL. Views are the default materialization.

  • Best for: staging models, simple transformations, models queried infrequently
  • Trade-off: query latency equals the cost of re-running the SELECT every time

Table

dbt drops and recreates a physical table on every run. All rows are recomputed from scratch.

  • Best for: mart-layer models, medium-sized datasets where a full refresh is acceptable
  • Trade-off: build time and compute scale with data volume; no partial updates

Incremental

On the first run, dbt builds the full table. On subsequent runs, it processes only new or changed rows and merges them into the existing table.

{{
  config(
    materialized  = 'incremental',
    unique_key    = 'event_id',
    incremental_strategy = 'merge'
  )
}}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ source('raw', 'events') }}

{% if is_incremental() %}
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}

  • Best for: large fact tables, event streams, append-heavy datasets
  • Trade-off: more complex logic; incorrect filter predicates can cause missed records or duplicates
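The is_incremental() filter above is a high-water-mark pattern: keep only rows newer than the maximum timestamp already in the table. A minimal Python sketch of that predicate (assumed row/column names are illustrative):

```python
def incremental_filter(existing_max, new_rows, key="occurred_at"):
    """Keep only rows past the table's current high-water mark --
    the same predicate the is_incremental() block expresses in SQL."""
    if existing_max is None:  # first run: no existing table, full load
        return new_rows
    return [r for r in new_rows if r[key] > existing_max]

rows = [
    {"event_id": 1, "occurred_at": 10},
    {"event_id": 2, "occurred_at": 20},
    {"event_id": 3, "occurred_at": 30},
]
print(incremental_filter(20, rows))  # only event_id 3 survives
```

The failure modes mentioned above follow directly: if late-arriving rows have timestamps below the high-water mark they are silently skipped, and if the comparison were `>=` instead of `>`, rows at the boundary would be loaded twice (which is why unique_key plus a merge strategy matters).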

Ephemeral

Ephemeral models are not materialized at all. dbt inlines their SQL as a Common Table Expression (CTE) in any downstream model that references them.

  • Best for: intermediate calculation logic you want to reuse without storing
  • Trade-off: cannot be queried directly; CTE bloat in complex downstream models

Model Organization: Staging, Intermediate, and Marts

The most widely adopted dbt project pattern organizes models into three layers, keeping transformation logic modular, testable, and easy to reason about.

Staging Layer (models/staging/)

One staging model per source table. The job of a staging model is to clean and standardize raw data — rename columns, cast data types, handle nulls, apply light business rules. Staging models reference source() only, never other models. Typically materialized as views.

-- models/staging/stg_customers.sql
select
    id                           as customer_id,
    lower(trim(email))           as email,
    first_name,
    last_name,
    cast(created_at as timestamp) as created_at,
    is_active
from {{ source('raw', 'customers') }}

Intermediate Layer (models/intermediate/)

Intermediate models join and enrich staging models. They encode business logic shared across multiple marts but not yet ready to be exposed directly, and they reference staging (or other intermediate) models via ref().

-- models/intermediate/int_orders_enriched.sql
select
    o.order_id,
    o.amount,
    o.created_at,
    c.customer_id,
    c.email,
    c.country,
    p.product_name,
    p.category
from {{ ref('stg_orders') }}    as o
join {{ ref('stg_customers') }} as c using (customer_id)
join {{ ref('stg_products') }}  as p using (product_id)

Marts Layer (models/marts/)

Marts are the final, analysis-ready outputs organized by business domain (finance, marketing, product). They expose clean, wide tables optimized for BI tools and ad-hoc queries. Typically materialized as tables or incremental models.

-- models/marts/finance/fct_revenue.sql
select
    date_trunc('month', created_at) as month,
    country,
    category,
    sum(amount)                     as revenue,
    count(distinct order_id)        as order_count
from {{ ref('int_orders_enriched') }}
group by 1, 2, 3

Naming Conventions

Consistent naming is one of the highest-leverage practices in dbt projects. The widely adopted convention uses prefixes to signal a model's layer and type:

  • stg_ — staging models (one per source table): stg_orders, stg_customers
  • int_ — intermediate models: int_orders_enriched, int_session_attribution
  • fct_ — fact tables (transactional, additive metrics): fct_orders, fct_page_views
  • dim_ — dimension tables (descriptive attributes): dim_customers, dim_products
  • rpt_ — report-layer models for specific BI use cases (optional extra layer)

Documentation with schema.yml

dbt generates a browsable documentation site from .yml descriptor files co-located with models. Each model and each column can have a description, which appears alongside automatically derived lineage graphs.

# models/marts/finance/_finance.yml
version: 2

models:
  - name: fct_revenue
    description: >
      Monthly revenue aggregated by country and product category.
      Primary source of truth for the Finance dashboard.
    columns:
      - name: month
        description: "First day of the calendar month (UTC)."
      - name: revenue
        description: "Sum of order amounts in USD for the period."
        tests:
          - not_null
      - name: order_count
        description: "Distinct count of orders contributing to revenue."
        tests:
          - not_null

See the dbt tests guide for a full walkthrough of configuring quality assertions in schema.yml.

Running Models

The primary commands for working with models:

# Run all models
dbt run

# Run a specific model
dbt run --select stg_orders

# Run a model and all its upstream dependencies
dbt run --select +fct_revenue

# Run a model and all its downstream dependents
dbt run --select fct_revenue+

# Build models + run tests in one command (recommended for CI)
dbt build --select +fct_revenue

dbt build is the recommended command for CI/CD pipelines — it runs models, seeds, snapshots, and tests in dependency order, failing fast if a test fails before downstream models are built.

Lineage and Governance

One of the most underappreciated aspects of dbt is the machine-readable metadata it generates as a by-product of normal development work. Every dbt run produces a manifest.json (the full compiled graph) and a run_results.json (execution results including timing and row counts). dbt docs generate produces a catalog.json describing every column in every model.

These artifacts are the foundation for integration with data catalog and governance platforms. Dawiso ingests dbt manifests to automatically populate its data catalog with model descriptions, column definitions, lineage relationships, and test coverage — turning the documentation your team writes in schema.yml into a governed, searchable enterprise catalog without duplicate effort. Beyond cataloging, dbt metadata enables Dawiso to surface column-level lineage (which source columns flow into which mart columns), data quality coverage (which models have test assertions), and freshness SLAs (when was each model last successfully built).
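Concretely, lineage extraction from manifest.json comes down to walking the depends_on edges of each node. A hedged sketch against a trimmed-down, hypothetical manifest (real manifests carry many more fields, but the nodes/depends_on shape is the one shown here):

```python
import json

# Hypothetical, heavily trimmed manifest.json produced by a dbt run.
manifest = json.loads("""{
  "nodes": {
    "model.shop.fct_revenue": {
      "depends_on": {"nodes": ["model.shop.int_orders_enriched"]}},
    "model.shop.int_orders_enriched": {
      "depends_on": {"nodes": ["model.shop.stg_orders"]}}
  }
}""")

def upstream(manifest, node_id):
    """Collect every transitive upstream node by walking depends_on edges."""
    seen, stack = set(), [node_id]
    while stack:
        nid = stack.pop()
        for dep in manifest["nodes"].get(nid, {}).get("depends_on", {}).get("nodes", []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(upstream(manifest, "model.shop.fct_revenue")))
# int_orders_enriched and stg_orders are both upstream of fct_revenue
```

This is the kind of traversal a catalog integration performs once per run — no extra instrumentation is needed beyond the artifacts dbt already writes.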
