Change Data Capture (CDC): Complete Guide to Real-Time Data Synchronization
Change data capture (CDC) is a set of software design patterns and technologies that identify and capture changes made to data in a source database — inserts, updates, and deletes — and deliver those changes to downstream systems in near real time. Rather than periodically copying the full contents of a table (as batch ETL does), CDC continuously monitors the source for changes and streams only the changed records, dramatically reducing latency and processing overhead for downstream data pipelines.
In short: change data capture (CDC) tracks every insert, update, and delete in a source database and streams those changes to downstream systems with sub-second to low-minute latency. Log-based CDC, which reads the database's write-ahead log, is the gold standard: it adds minimal load to the source, preserves every change event, and powers real-time data warehouses, microservice sync, and operational analytics.
What Is CDC?
Every operational database constantly receives changes: new orders are inserted, customer addresses are updated, cancelled subscriptions are deleted. In traditional data architectures, these changes are captured periodically by full-table or incremental batch jobs that copy data to a data warehouse or data lake on a fixed schedule. This means the downstream systems are always some hours behind the source — a latency that is increasingly unacceptable for operational analytics, real-time dashboards, event-driven microservices, and ML feature pipelines.
CDC solves this by treating every database change as an event that is captured and streamed immediately. The downstream data warehouse, data lake, or target application receives each insert, update, and delete as it happens, keeping downstream data in near-perfect synchronisation with the source. This enables analytical queries over data that is seconds or minutes old rather than hours or days.
CDC is also the foundation for event-driven architectures where microservices react to database state changes. The outbox pattern, common in distributed systems, uses CDC to reliably publish events to a message bus every time a record is inserted into an "outbox" table, providing at-least-once event delivery without the dual-write problem (consumers achieve effectively-once processing by handling events idempotently). See also: data lineage, ETL/ELT, data pipeline.
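The dual-write risk and the outbox fix can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the `orders` and `outbox` table names and the `place_order` helper are invented for the example, and in a real deployment a log-based CDC connector would tail the outbox table and publish each row to the broker.

```python
import json
import sqlite3

# In-memory database standing in for the service's operational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)"
)

def place_order(order_id, total):
    # The business write and the event write share one transaction,
    # so either both commit or neither does: no dual-write gap.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", json.dumps({"order_id": order_id, "total": total})),
        )

place_order(1, 99.90)
# A CDC connector tailing the log would pick up the outbox insert and
# publish it downstream; here we just read the pending event back.
events = conn.execute("SELECT topic, payload FROM outbox").fetchall()
```

Because the outbox row is written in the same transaction as the business row, the event exists if and only if the order does, which is exactly the guarantee a separate "publish to the bus after commit" call cannot give.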
CDC vs Batch ETL
Batch ETL and CDC represent two fundamentally different approaches to moving data from operational systems to analytical and downstream systems:
- Latency: batch ETL delivers data on a schedule (hourly, daily); CDC delivers data continuously with latency typically measured in seconds to low minutes.
- Data volume per run: full-table batch jobs copy all rows on every run, regardless of what changed; CDC captures only changed rows, dramatically reducing the volume of data transferred for each synchronisation cycle.
- Source load: bulk batch jobs place significant query load on the source database; log-based CDC reads the transaction log without issuing any queries against the source tables, adding near-zero operational overhead.
- Change history: batch ETL typically provides a point-in-time snapshot, losing intermediate states between runs (a row that was inserted and deleted between two batch runs simply disappears); CDC captures every intermediate state as a distinct event in the change stream.
- Deletes: batch ETL struggles to detect deletes — a row that disappears between two full-table snapshots is indistinguishable from a row that simply was not in scope of the incremental query; CDC explicitly captures delete events with the deleted row's key and timestamp.
The tradeoff is complexity: batch ETL is straightforward to implement and debug; CDC requires understanding database transaction logs, managing connector configuration, and handling schema evolution in the change stream. For use cases where daily or hourly latency is acceptable, batch ETL remains simpler. For real-time analytics, event-driven architectures, and data observability pipelines, CDC's advantages are compelling.
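The change-history point above is easy to demonstrate. In this sketch (with invented sample data), a row inserted and deleted between two batch snapshots leaves no trace in a snapshot comparison, while the CDC event log retains both events:

```python
# Two consecutive batch snapshots of a table, keyed by primary key.
snapshot_t0 = {1: "alice"}
snapshot_t1 = {1: "alice"}  # row 2 was inserted AND deleted between runs

# Batch diff: comparing the snapshots shows no change at all.
batch_changes = snapshot_t0.keys() ^ snapshot_t1.keys()

# CDC event log: every committed change is a distinct, ordered event.
cdc_events = [
    {"op": "insert", "key": 2, "value": "bob"},
    {"op": "delete", "key": 2},
]
```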
4 CDC Methods
1. Log-Based CDC
Log-based CDC reads the database's transaction log (the write-ahead log, or WAL, in PostgreSQL; the binary log, or binlog, in MySQL; the redo log in Oracle) to capture every committed change event. This is the gold standard CDC method because it places negligible load on the source database (no queries are issued against the source tables), it captures every change including deletes, and it preserves the exact order and timing of changes. Log-based CDC is the approach used by Debezium, the leading open-source CDC framework, as well as by most enterprise CDC and ELT tools.
2. Trigger-Based CDC
Trigger-based CDC uses database triggers that fire on insert, update, and delete operations and write the change records to a dedicated audit or change table. This approach works on any database that supports triggers and does not require special database permissions to read transaction logs. The disadvantage is significant performance overhead: triggers fire synchronously during each write operation, adding latency to every transaction on the source tables and increasing write amplification.
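A minimal sketch of the pattern, using SQLite triggers driven from Python; the `customers` table and the audit-table layout are illustrative, and production trigger-based CDC would also capture column values and transaction context:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE customers_audit (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    row_id INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Triggers fire synchronously inside each write, which is where the
-- performance overhead of this method comes from.
CREATE TRIGGER trg_customers_ins AFTER INSERT ON customers
BEGIN INSERT INTO customers_audit (op, row_id) VALUES ('I', NEW.id); END;
CREATE TRIGGER trg_customers_del AFTER DELETE ON customers
BEGIN INSERT INTO customers_audit (op, row_id) VALUES ('D', OLD.id); END;
""")

conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("DELETE FROM customers WHERE id = 1")
conn.commit()

# The change table now holds one row per captured operation, in order.
audit = conn.execute(
    "SELECT op, row_id FROM customers_audit ORDER BY change_id"
).fetchall()
```

A downstream job then drains `customers_audit` on a schedule, which is why trigger-based CDC still has higher end-to-end latency than log tailing.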
3. Timestamp-Based (Query-Based) CDC
Timestamp-based CDC queries the source table periodically for rows with a last_modified or updated_at timestamp greater than the timestamp of the last run. This is the simplest CDC method to implement — it requires only a scheduled query — but it has critical limitations: it requires every table to have a maintained timestamp column, it misses deletes entirely (deleted rows have no timestamp to query), and it places periodic query load on the source database.
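The method, and its blind spot for deletes, can be shown with a watermark query against SQLite. Table and column names here are invented, and `updated_at` is modelled as a plain integer for simplicity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "widget", 100), (2, "gadget", 150), (3, "gizmo", 200)],
)

def pull_changes(watermark):
    # One scheduled query per cycle: rows touched since the last run.
    rows = conn.execute(
        "SELECT id, name, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

changes, watermark = pull_changes(100)      # picks up rows 2 and 3
conn.execute("DELETE FROM products WHERE id = 2")
later, watermark = pull_changes(watermark)  # the delete is invisible: nothing to query
```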
4. Diff-Based (Snapshot) CDC
Diff-based CDC takes full snapshots of source tables at regular intervals and computes the difference between consecutive snapshots to identify changes. It reliably captures deletes (a key present in the previous snapshot but absent from the current one is unambiguously a delete) and works on databases that support neither log access nor triggers. The disadvantage is the resource cost of full-table snapshots and the latency imposed by the snapshot interval. Diff-based CDC is typically used as a fallback when log-based methods are not available.
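A sketch of the snapshot-diff computation, assuming each snapshot is held as a dict keyed by primary key (the sample data is invented):

```python
def diff_snapshots(prev, curr):
    """Compare two full-table snapshots (dicts keyed by primary key)
    and classify every change as an insert, update, or delete."""
    inserts = [k for k in curr if k not in prev]
    deletes = [k for k in prev if k not in curr]
    updates = [k for k in curr if k in prev and curr[k] != prev[k]]
    return {"insert": sorted(inserts), "update": sorted(updates), "delete": sorted(deletes)}

yesterday = {1: ("alice", "NY"), 2: ("bob", "LA"), 3: ("carol", "SF")}
today     = {1: ("alice", "NY"), 3: ("carol", "Austin"), 4: ("dave", "SEA")}
changes = diff_snapshots(yesterday, today)
```

Note that, as with batch ETL generally, anything that happened between the two snapshots (row 2 might have been updated several times before being deleted) is collapsed into a single net change.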
Log-Based CDC Deep Dive
Understanding log-based CDC requires understanding how database transaction logs work. Every committed write to a database is recorded in the transaction log before it is applied to the data files. This log is the database's source of truth for recovery: if the database crashes before writing changes to disk, it can replay the log on restart to recover to a consistent state. CDC connectors exploit this log by reading it as a stream of change events, translating each log entry into a structured event that can be consumed by downstream systems.
In PostgreSQL, log-based CDC uses logical replication: the database publishes logical changes (as opposed to physical page changes) to a replication slot, which the CDC connector consumes. The connector maintains a replication slot position that tracks which log entries have been consumed, ensuring that no change is missed even if the connector restarts. In MySQL, the equivalent mechanism is the binary log, which the CDC connector reads using the MySQL replication protocol.
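The slot-position bookkeeping can be illustrated without a real database. In this deliberately simplified model, the "log" is a list of (LSN, event) pairs and the connector persists its last confirmed LSN so that a restart resumes exactly where it left off:

```python
# A stand-in for the transaction log: (LSN, event) pairs in commit order.
wal = [(1, "INSERT a"), (2, "UPDATE a"), (3, "INSERT b"), (4, "DELETE a")]

class Connector:
    """Toy model of replication-slot position tracking: the connector
    only advances its confirmed position after handing an event off."""

    def __init__(self, confirmed_lsn=0):
        self.confirmed_lsn = confirmed_lsn
        self.delivered = []

    def poll(self, log, batch=2):
        for lsn, event in log:
            if lsn <= self.confirmed_lsn:
                continue  # already consumed before the restart
            self.delivered.append(event)
            self.confirmed_lsn = lsn  # acknowledge back to the slot
            batch -= 1
            if batch == 0:
                break

c1 = Connector()
c1.poll(wal)  # consumes LSNs 1-2, then the process "crashes"

# Restart from the persisted position: resumes at LSN 3, so no event
# is lost and none is re-delivered.
c2 = Connector(confirmed_lsn=c1.confirmed_lsn)
c2.poll(wal, batch=10)
```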
Schema changes (DDL events) in the source database are a significant challenge for log-based CDC. When a column is added, renamed, or dropped, the CDC connector must handle events recorded under both the old and new schema. Mature CDC frameworks like Debezium handle schema evolution by maintaining a schema registry that tracks the schema version at the time each event was recorded, enabling consumers to parse historical events correctly even after schema changes.
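A stripped-down illustration of the idea, assuming each event carries the id of the schema it was written under; real registries such as Confluent Schema Registry store full Avro or Protobuf schemas rather than the bare field lists used here:

```python
# Each schema version records the column layout in effect at write time.
schemas = {
    1: ["id", "email"],             # original table schema
    2: ["id", "email", "country"],  # after ALTER TABLE ... ADD COLUMN
}

def decode(event):
    # Look up the schema the event was written under, then pair field
    # names with positional values. Old events stay parseable after DDL.
    fields = schemas[event["schema_id"]]
    return dict(zip(fields, event["values"]))

old_event = {"schema_id": 1, "values": [7, "a@example.com"]}
new_event = {"schema_id": 2, "values": [8, "b@example.com", "CZ"]}
decoded = [decode(old_event), decode(new_event)]
```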
The initial snapshot is the bootstrapping phase of log-based CDC: before streaming changes, the connector takes a consistent read of the source table's current state to provide the consumer with a baseline. After the snapshot is complete, the connector switches to streaming from the transaction log, picking up all changes that occurred after the snapshot began. This snapshot-then-stream pattern ensures that downstream consumers receive a complete, consistent view of the data without gaps.
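The snapshot-then-stream handoff can be modelled in a few lines; the key detail is filtering streamed entries by the snapshot's log position so that nothing is double-applied and nothing is missed (the LSNs and sample rows are invented):

```python
# Source state and its log at the moment the snapshot is taken.
table_at_snapshot = {1: "alice", 2: "bob"}  # consistent read at LSN 10
snapshot_lsn = 10
log = [
    (9,  {"op": "insert", "key": 2, "value": "bob"}),     # already in the snapshot
    (11, {"op": "update", "key": 1, "value": "alicia"}),  # after the snapshot
    (12, {"op": "insert", "key": 3, "value": "carol"}),
]

# Phase 1: the snapshot provides the consumer's baseline state.
state = dict(table_at_snapshot)

# Phase 2: stream only the log entries past the snapshot position.
for lsn, ev in log:
    if lsn <= snapshot_lsn:
        continue  # would double-apply a change the snapshot already reflects
    if ev["op"] in ("insert", "update"):
        state[ev["key"]] = ev["value"]
    elif ev["op"] == "delete":
        state.pop(ev["key"], None)
```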
Tools: Debezium, Fivetran, Airbyte
Debezium
Debezium is the leading open-source CDC platform, sponsored by Red Hat and built on the Apache Kafka ecosystem. It provides connectors for PostgreSQL, MySQL, MongoDB, Oracle, SQL Server, and other databases, all of which use log-based CDC. Debezium connectors run as Kafka Connect plugins, publishing change events as JSON or Avro messages to Kafka topics. Because Debezium is open source and runs within the data team's own infrastructure, it is highly configurable and has no per-row pricing, making it cost-effective for high-volume CDC pipelines.
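For orientation, here is a hand-written event shaped like Debezium's documented change-event envelope, with `before`/`after` row images, an `op` code, and a `source` metadata block; the row values and the `summarise` helper are invented for the example:

```python
# An event shaped like Debezium's change-event envelope. Debezium's op
# codes: c = create, u = update, d = delete, r = snapshot read.
event = {
    "op": "u",
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "source": {"table": "customers", "lsn": 123456},
    "ts_ms": 1700000000000,
}

OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot-read"}

def summarise(evt):
    # Deletes carry only a "before" image; everything else has "after".
    row = evt["after"] if evt["after"] is not None else evt["before"]
    return f'{OP_NAMES[evt["op"]]} on {evt["source"]["table"]} key={row["id"]}'

summary = summarise(event)
```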
Fivetran
Fivetran is a managed ELT platform that includes CDC-based connectors for a wide range of databases and SaaS sources. Fivetran abstracts away the infrastructure complexity of managing CDC connectors: teams configure a source and destination, and Fivetran handles schema detection, initial snapshots, ongoing change streaming, and schema evolution automatically. This simplicity comes at a per-row cost that can be significant for high-volume sources. Fivetran is most commonly used for connecting SaaS data sources (Salesforce, HubSpot, Stripe) to data warehouses; because log-based CDC is not available for such sources, Fivetran uses timestamp-based or API-based capture for them instead.
Airbyte
Airbyte is an open-source ELT platform with a large connector library. Like Fivetran, it provides managed synchronisation from many source types to target data warehouses and lakes. Airbyte's CDC connectors use Debezium under the hood for databases that support log-based CDC, while using polling-based methods for sources that do not. Airbyte's open-source version can be self-hosted for full control and no per-row costs, while Airbyte Cloud provides a managed deployment with simplified operations.
Use Cases
- Real-time data warehousing: streaming changes from operational databases to Snowflake, BigQuery, or Redshift through Kafka, enabling analytical queries on data that is seconds old rather than hours.
- Cache invalidation: when operational database records change, downstream services that cache those records must invalidate their caches. CDC provides a reliable, low-latency trigger for cache invalidation without coupling the writing service to cache management.
- Microservice event streaming: the outbox pattern uses CDC to reliably publish domain events from a microservice's database to a message broker, solving the dual-write problem (the risk of a database write succeeding but the corresponding event publish failing).
- Search index synchronisation: keeping Elasticsearch or OpenSearch indices in sync with a primary database. CDC streams insert, update, and delete events directly to the search index, eliminating the need for periodic full re-indexing.
- Audit and compliance: maintaining a complete, ordered history of every change to sensitive data for regulatory compliance, forensic investigation, or dispute resolution. The CDC change stream becomes the audit log.
- ML feature stores: streaming operational data changes to a feature store in near real time, ensuring that ML models consume up-to-date feature values rather than stale batch-computed features.
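Several of these use cases reduce to the same consumer loop: apply each change event to a downstream store. A minimal sketch for search-index synchronisation, where the "index" is just a dict; the same loop shape serves cache invalidation by evicting entries instead of upserting them:

```python
# Downstream search index, keyed by primary key (a stand-in for
# Elasticsearch or OpenSearch).
index = {}

def apply_event(evt):
    if evt["op"] == "delete":
        index.pop(evt["key"], None)    # remove the document
    else:                              # insert or update
        index[evt["key"]] = evt["doc"] # upsert the document

# An invented CDC stream for a product catalogue.
stream = [
    {"op": "insert", "key": 1, "doc": "red wool scarf"},
    {"op": "insert", "key": 2, "doc": "blue cotton hat"},
    {"op": "update", "key": 1, "doc": "red wool scarf (sale)"},
    {"op": "delete", "key": 2},
]
for evt in stream:
    apply_event(evt)
```

Because deletes arrive as explicit events, the index never accumulates stale documents, which is the failure mode that periodic full re-indexing exists to paper over.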
CDC and Data Lineage
CDC creates a rich lineage trail: every change event is tagged with its source table, the type of operation (insert/update/delete), a timestamp, and (in log-based CDC) the transaction ID and log sequence number. This provenance information is highly valuable for data lineage tracking because it provides not just the current state of data but the complete history of how it arrived at that state.
For downstream systems that consume CDC events — data warehouses, data lakes, feature stores — the lineage chain runs from the original operational system through the CDC connector to Kafka and then to the target. When a data quality incident occurs in the downstream system, lineage allows engineers to trace the problem back to the specific CDC event (or batch of events) that introduced it, using the transaction timestamp and source table information to identify exactly when and where the bad data originated.
CDC also enables a form of data observability at the source level: by monitoring the rate, volume, and schema of CDC events, teams can detect anomalies — a sudden spike in delete events, a column that appears in some events but not others, a gap in event timestamps — that indicate problems in the source system before they manifest as data quality issues downstream.
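A toy version of such a monitor, flagging any window whose delete count jumps well above the trailing average; the threshold factor and the sample counts are illustrative choices, not recommendations:

```python
def delete_spike_windows(per_window_deletes, factor=3.0):
    """Return indices of windows whose delete count exceeds `factor`
    times the average of all preceding windows."""
    flagged = []
    for i, count in enumerate(per_window_deletes):
        history = per_window_deletes[:i]
        if not history:
            continue  # no baseline yet for the first window
        baseline = sum(history) / len(history)
        if baseline > 0 and count > factor * baseline:
            flagged.append(i)
    return flagged

# Deletes per five-minute window: steady traffic, then a sudden purge.
windows = [4, 5, 3, 6, 4, 40]
spikes = delete_spike_windows(windows)
```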
Governance and Dawiso
Data governance for CDC pipelines must address several specific requirements. Schema evolution is the most operationally critical: when source schemas change, CDC events may suddenly contain different fields, causing consumers to fail. A schema registry (such as Confluent Schema Registry with Avro or Protobuf) combined with consumer schema compatibility policies prevents schema changes from breaking downstream consumers without warning.
Sensitive data in CDC streams is a significant governance concern. CDC events often contain PII and other sensitive data fields because they capture the full row content of source tables. When CDC streams are published to Kafka or other message brokers that may be broadly accessible within an organisation, sensitive fields should be masked or encrypted in the event payload — applying the same column-level masking policies that govern the source tables to the event stream.
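A sketch of field-level masking applied to an event payload before publication, assuming a static classification of PII fields; salted hashing is one option (it keeps masked values joinable across tables), and the salt handling here is deliberately simplistic:

```python
import hashlib

# Fields classified as PII for this (invented) source table.
PII_FIELDS = {"email", "phone"}
SALT = b"demo-salt"  # illustrative only; use a managed secret in practice

def mask(value):
    # Deterministic salted hash: raw values never reach the broker,
    # but equal inputs still produce equal masked outputs.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

def mask_event(event):
    return {k: (mask(v) if k in PII_FIELDS else v) for k, v in event.items()}

raw = {"id": 7, "email": "a@example.com", "phone": "555-0100", "plan": "pro"}
safe = mask_event(raw)
```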
Dawiso integrates CDC pipeline metadata into its lineage and governance platform. By connecting to the systems that source CDC data and the systems that consume it, Dawiso builds end-to-end lineage graphs that span the CDC streaming layer: from the original operational database table, through the CDC connector and Kafka topics, to the downstream data warehouse tables and analytical models. This lineage is essential for impact analysis — when a source table schema changes, Dawiso can immediately identify all CDC consumers and downstream analytics that will be affected.
Dawiso also tracks the classification and governance status of data flowing through CDC pipelines. If a source table column is classified as PII, Dawiso ensures that classification propagates through the lineage graph to the CDC-derived downstream tables, so that governance teams can verify that appropriate masking or access controls are in place at every point in the streaming data path. This makes CDC pipelines first-class citizens of the data governance programme rather than opaque infrastructure components that governance teams cannot see or control.