dbt Models: Complete Guide to Data Transformations
dbt (data build tool) has become the standard framework for analytics engineering — and the dbt model is its fundamental building block. At its core, a dbt model is nothing more than a single SELECT statement saved in a .sql file. dbt takes that SELECT, wraps it in the appropriate DDL, and materializes it in your data warehouse or lakehouse as a table, view, or incremental table. This deceptively simple idea unlocks a full software engineering workflow — version control, testing, documentation, dependency management — on top of plain SQL that analysts already know.
A dbt model is a .sql file containing a single SELECT statement. dbt compiles it into CREATE TABLE/VIEW statements and executes them in dependency order based on ref() and source() calls. Models are organized into staging, intermediate, and mart layers — and every run produces machine-readable metadata (manifest.json) that feeds data governance platforms like Dawiso.
What Is dbt?
dbt is an open-source command-line tool (and a managed cloud service, dbt Cloud) that handles the T in ELT — the transformation of data that has already been loaded into your warehouse. Instead of maintaining complex stored procedures or brittle custom scripts, dbt lets your team write SELECT statements and express dependencies between them. dbt compiles those statements, resolves the dependency graph, and executes transformations in the correct order against your target database (Snowflake, BigQuery, Databricks, Redshift, DuckDB, and many others).
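The day-to-day loop is a handful of CLI commands. A typical sequence, assuming a profile is already configured for the target warehouse:
# verify the connection and project configuration
dbt debug
# compile and execute all models against the target
dbt run
# run the data quality tests declared in .yml files
dbt test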
Key capabilities dbt brings to transformation work:
- Automatic DAG (directed acyclic graph) resolution from ref() and source() calls
- Built-in testing framework for data quality assertions
- Documentation generation with interactive lineage graphs
- Jinja templating for DRY SQL (macros, conditionals, loops)
- A growing ecosystem of community packages (dbt_utils, dbt_expectations, AutomateDV)
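The Jinja templating mentioned above is worth a concrete look: a macro is just parameterized SQL that dbt expands at compile time. A minimal sketch (the macro name and file path are illustrative, not from a real package):
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)
{% endmacro %}

-- used in any model:
select {{ cents_to_dollars('amount_cents') }} as amount
from {{ source('raw', 'orders') }}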
Because all transformation logic lives in .sql and .yml files, dbt slots naturally into Git — enabling code review, branching, CI/CD, and a full audit trail of every change.
What Is a dbt Model?
A dbt model is a file with a .sql extension containing exactly one SELECT statement. The file name becomes the model name, which becomes the name of the object dbt creates in your warehouse.
-- models/staging/stg_orders.sql
select
order_id,
customer_id,
status,
cast(created_at as timestamp) as created_at,
amount_cents / 100.0 as amount
from {{ source('raw', 'orders') }}
When you run dbt run, dbt compiles this into a CREATE TABLE AS or CREATE VIEW AS statement (depending on materialization) and executes it. The double-curly-brace syntax is Jinja templating — it lets dbt inject correct schema names, evaluate macros, and resolve references at compile time.
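For example, with the default view materialization and a target schema of dbt_dev (both the database and schema names here are illustrative), the compiled statement would look roughly like:
create view analytics.dbt_dev.stg_orders as (
    select
        order_id,
        customer_id,
        status,
        cast(created_at as timestamp) as created_at,
        amount_cents / 100.0 as amount
    from analytics.raw_data.orders
);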
Models can optionally start with a Jinja config block that sets model-level configuration:
-- models/marts/fct_orders.sql
{{
config(
materialized = 'incremental',
unique_key = 'order_id',
on_schema_change = 'sync_all_columns'
)
}}
select
o.order_id,
o.customer_id,
c.country,
o.amount,
o.created_at
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c using (customer_id)
{% if is_incremental() %}
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
The ref() and source() Functions
Two Jinja functions are central to every dbt project: ref() and source().
ref() — Referencing Other Models
{{ ref('model_name') }} tells dbt that the current model depends on another model. At compile time, dbt replaces it with the fully qualified schema and table name for that environment. Two critical things happen:
- Dependency tracking: dbt adds an edge in the DAG, ensuring the referenced model is built first
- Environment portability: the resolved name changes automatically between dev, staging, and production schemas — no hardcoded schema names
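A sketch of the second point, assuming a dev schema named dbt_jane and a production schema named analytics (both hypothetical):
-- model code
select * from {{ ref('stg_orders') }}
-- compiled against a dev target:
select * from dbt_jane.stg_orders
-- compiled against a prod target:
select * from analytics.stg_orders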
source() — Referencing Raw Data
{{ source('source_name', 'table_name') }} references raw tables not produced by dbt. Sources are declared in .yml files:
# models/staging/_sources.yml
version: 2
sources:
  - name: raw
    database: analytics
    schema: raw_data
    tables:
      - name: orders
        description: "Raw orders from the transactional database"
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: customers
        description: "Raw customer records"
Declaring sources unlocks source freshness checks (dbt source freshness), lineage tracking back to raw tables, and documentation of the ingestion layer alongside transformation models.
Model Materializations
A materialization determines how dbt persists a model's SELECT result in the warehouse. dbt ships with four core built-in materializations (newer versions add others, such as materialized views).
View
The lightest option: dbt creates a database view. No data is stored — queries against the view always re-execute the underlying SQL. Views are the default materialization.
- Best for: staging models, simple transformations, models queried infrequently
- Trade-off: query latency equals the cost of re-running the SELECT every time
Table
dbt drops and recreates a physical table on every run. All rows are recomputed from scratch.
- Best for: mart-layer models, medium-sized datasets where a full refresh is acceptable
- Trade-off: build time and compute scale with data volume; no partial updates
Incremental
On the first run, dbt builds the full table. On subsequent runs, it processes only new or changed rows and merges them into the existing table.
{{
config(
materialized = 'incremental',
unique_key = 'event_id',
incremental_strategy = 'merge'
)
}}
select
event_id,
user_id,
event_type,
occurred_at
from {{ source('raw', 'events') }}
{% if is_incremental() %}
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
- Best for: large fact tables, event streams, append-heavy datasets
- Trade-off: more complex logic; incorrect filter predicates can cause missed records or duplicates
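When a filter predicate does go wrong, the standard recovery is a full rebuild. Assuming the events model above is saved as fct_events (the name is illustrative):
# drop and rebuild the incremental table from scratch
dbt run --select fct_events --full-refresh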
Ephemeral
Ephemeral models are not materialized at all. dbt inlines their SQL as a Common Table Expression (CTE) in any downstream model that references them.
- Best for: intermediate calculation logic you want to reuse without storing
- Trade-off: cannot be queried directly; CTE bloat in complex downstream models
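Rather than setting a config block in every file, materializations are usually assigned per folder in dbt_project.yml. A sketch assuming a project named my_project and the layer layout described in the next section:
# dbt_project.yml (project name is illustrative)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table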
Model Organization: Staging, Intermediate, and Marts
The most widely adopted dbt project pattern organizes models into three layers, keeping transformation logic modular, testable, and easy to reason about.
Staging Layer (models/staging/)
One staging model per source table. The job of a staging model is to clean and standardize raw data — rename columns, cast data types, handle nulls, apply light business rules. Staging models reference source() only, never other models. Typically materialized as views.
-- models/staging/stg_customers.sql
select
id as customer_id,
lower(trim(email)) as email,
first_name,
last_name,
cast(created_at as timestamp) as created_at,
is_active
from {{ source('raw', 'customers') }}
Intermediate Layer (models/intermediate/)
Intermediate models join and enrich staging models. They encode business logic shared across multiple marts but not yet ready to be exposed directly, and they reference staging (or other intermediate) models via ref().
-- models/intermediate/int_orders_enriched.sql
select
o.order_id,
o.amount,
o.created_at,
c.customer_id,
c.email,
c.country,
p.product_name,
p.category
from {{ ref('stg_orders') }} as o
join {{ ref('stg_customers') }} as c using (customer_id)
join {{ ref('stg_products') }} as p using (product_id)
Marts Layer (models/marts/)
Marts are the final, analysis-ready outputs organized by business domain (finance, marketing, product). They expose clean, wide tables optimized for BI tools and ad-hoc queries. Typically materialized as tables or incremental models.
-- models/marts/finance/fct_revenue.sql
select
date_trunc('month', created_at) as month,
country,
category,
sum(amount) as revenue,
count(distinct order_id) as order_count
from {{ ref('int_orders_enriched') }}
group by 1, 2, 3
Naming Conventions
Consistent naming is one of the highest-leverage practices in dbt projects. The widely adopted convention uses prefixes to signal a model's layer and type:
- stg_ — staging models (one per source table): stg_orders, stg_customers
- int_ — intermediate models: int_orders_enriched, int_session_attribution
- fct_ — fact tables (transactional, additive metrics): fct_orders, fct_page_views
- dim_ — dimension tables (descriptive attributes): dim_customers, dim_products
- rpt_ — report-layer models for specific BI use cases (optional extra layer)
Documentation with schema.yml
dbt generates a browsable documentation site from .yml descriptor files co-located with models. Each model and each column can have a description, which appears alongside automatically derived lineage graphs.
# models/marts/finance/_finance.yml
version: 2
models:
  - name: fct_revenue
    description: >
      Monthly revenue aggregated by country and product category.
      Primary source of truth for the Finance dashboard.
    columns:
      - name: month
        description: "First day of the calendar month (UTC)."
      - name: revenue
        description: "Sum of order amounts in USD for the period."
        tests:
          - not_null
      - name: order_count
        description: "Distinct count of orders contributing to revenue."
        tests:
          - not_null
See the dbt tests guide for a full walkthrough of configuring quality assertions in schema.yml.
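Generating and browsing the documentation site is two commands:
# build the docs artifacts (catalog.json alongside manifest.json)
dbt docs generate
# serve the interactive site locally
dbt docs serve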
Running Models
The primary commands for working with models:
# Run all models
dbt run
# Run a specific model
dbt run --select stg_orders
# Run a model and all its upstream dependencies
dbt run --select +fct_revenue
# Run a model and all its downstream dependents
dbt run --select fct_revenue+
# Build models + run tests in one command (recommended for CI)
dbt build --select +fct_revenue
dbt build is the recommended command for CI/CD pipelines — it runs models, seeds, snapshots, and tests in dependency order, and if a test fails, its downstream models are skipped.
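A common CI refinement is state-based selection, which builds only models that changed relative to a previous run's artifacts (the artifact path below is whatever your pipeline stores):
# build modified models plus everything downstream of them
dbt build --select state:modified+ --state path/to/prod-artifacts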
Lineage and Governance
One of the most underappreciated aspects of dbt is the machine-readable metadata it generates as a by-product of normal development work. Every dbt run produces a manifest.json (the full compiled graph) and a run_results.json (execution results including timing and row counts). dbt docs generate produces a catalog.json describing every column in every model.
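These artifacts are plain JSON, so they are easy to inspect directly. For instance, listing every node in the compiled graph (assuming jq is installed and the project has been compiled):
# list model, test, seed, and snapshot nodes in the graph
jq '.nodes | keys' target/manifest.json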
These artifacts are the foundation for integration with data catalog and governance platforms. Dawiso ingests dbt manifests to automatically populate its data catalog with model descriptions, column definitions, lineage relationships, and test coverage — turning the documentation your team writes in schema.yml into a governed, searchable enterprise catalog without duplicate effort. Beyond cataloging, dbt metadata enables Dawiso to surface column-level lineage (which source columns flow into which mart columns), data quality coverage (which models have test assertions), and freshness SLAs (when was each model last successfully built).