Re-initiating the initial snapshot for a specific table in a source connector


Mukesh Kumar

Sep 7, 2023, 4:51:42 AM
to debezium
Hello All,

Is it possible to trigger the initial snapshot multiple times for a source table? 

Scenario: I have a SQL Server source connector that includes specific tables. For table X, CDC is enabled for only N columns. I disabled and re-enabled CDC on table X and added one more column. How can I capture this (N+1)th column's initial snapshot data into Kafka?

Even if I delete the Kafka topic and update the connector, it does not seem to take the initial snapshot again.

I cannot delete and recreate the connector, as other tables are also part of it.

Regards
Mukesh Kumar

Chris Cranford

Sep 7, 2023, 8:02:45 AM
to debe...@googlegroups.com
Hi,

No, once an initial snapshot has concluded, the connector won't re-create it unless you explicitly remove the connector offsets and the history topic, which makes the connector believe it is a new deployment.  However, your use case is exactly why we introduced incremental snapshots.  These can be initiated as many times as you need, do not require you to remove offsets or history topics, and can be triggered on demand.  In your case, once you've added the new column, you'd send a signal to trigger the incremental snapshot, and the snapshot would run concurrently while changes are streamed from the configured tables.
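As a rough sketch of what that signal looks like: an incremental snapshot is typically triggered by inserting a row into the connector's configured signaling table, with a JSON payload naming the tables to capture. The table and schema names below are hypothetical; adjust them to your deployment.

```python
import json

def execute_snapshot_signal(data_collections, snapshot_type="incremental"):
    """Build the JSON 'data' payload for a Debezium execute-snapshot signal."""
    return json.dumps({"data-collections": data_collections,
                       "type": snapshot_type})

payload = execute_snapshot_signal(["dbo.X"])
# The row inserted into the signaling table would then look like (SQL Server):
#   INSERT INTO dbo.debezium_signal (id, type, data)
#   VALUES ('adhoc-1', 'execute-snapshot', '<payload>');
```

The `id` only needs to be unique per signal; the connector picks the row up from the signaling table and starts the chunk-based snapshot alongside streaming.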

Thanks,
Chris

Oren Elias

Sep 7, 2023, 8:33:42 AM
to debe...@googlegroups.com
Chris,
What do you see as the limitations of incremental snapshots? I am wondering why they are not the standard. Is this performance related?
I also saw that there are some limitations.
We may be interested in bridging gaps, as we are thinking of using incremental snapshots more.
Thanks,
-Oren

Chris Cranford

Sep 7, 2023, 9:38:26 AM
to debe...@googlegroups.com
Hi Oren -

I think it's really a question of what is best for a given use case. 

An initial snapshot is an excellent tool for quickly populating the topics with the existing data: it is extremely fast, with minimal overhead beyond traversing the JDBC result set and generating change events.  However, it must be consistent, which imposes limitations around resumability.  Additionally, if your initial snapshot consists of many millions of rows, it can put a certain amount of pressure on the database, which must retain transaction logs for the duration of the snapshot so that streaming can begin from the point where the consistent snapshot was taken.  For MySQL and Oracle, this means making sure your transaction log retention is large enough; for Oracle, additionally, that the undo retention period is large enough; for PostgreSQL, that the WAL can grow quite large in the interim; and so on.

Incremental snapshots are meant to bridge a number of those initial-snapshot requirements to make the process easier and smoother.  You gain resumability, you avoid the need to adjust your transaction log retention policies, you avoid reconfiguring Oracle's undo retention period, and you minimize WAL growth on PostgreSQL, to name a few.  Incremental snapshots are generally slightly slower than initial snapshots because of how the data is queried, but that is the cost of all the added benefits over their initial snapshot counterpart.

Probably the biggest limitation of incremental snapshots prior to Debezium 2.4 was around key-less tables.  If a table had no primary key, no unique index that provided primary key semantics, and no column that could act as a surrogate key for the chunk-based processing, then you couldn't use incremental snapshots on it.  This was most often an issue for users doing CDC on databases outside their control, such as ERP or similar systems, where changing table structures could violate terms or support agreements.

For such tables prior to Debezium 2.4, users had to run multiple connectors serially: a temporary connector to re-snapshot the key-less table before resuming the original connector.  This was very cumbersome and quite prone to human error, but Debezium 2.4 simplifies it tremendously with the new ad-hoc blocking snapshot feature.  This feature lets you use the same signal mechanism as incremental snapshots to trigger the capture of tables using initial-snapshot-style queries.  The caveat is that streaming is paused temporarily while the snapshot runs, so you incur the same concerns as with initial snapshots or the multiple-connector strategy, but it removes the human error and complexity of the latter.
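As a sketch, the blocking variant reuses the execute-snapshot signal, differing only in the snapshot type carried in the payload. The table name here is hypothetical:

```python
import json

# Same signaling mechanism as incremental snapshots, but the "blocking" type
# asks the connector to re-read the tables with initial-snapshot-style queries
# while streaming is temporarily paused.
blocking_payload = json.dumps({
    "data-collections": ["dbo.keyless_table"],
    "type": "blocking",
})
# Inserted into the signaling table under the signal type 'execute-snapshot',
# just like an incremental snapshot request.
```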

Our goal here is to give users the flexibility to pick the strategy that best fits their needs.  In some cases, initial or blocking snapshots are the ideal choice; in others, incremental snapshots may be the better choice.

Hope that helps.
Chris

George Tsopouridis

Feb 11, 2025, 10:46:01 AM
to debezium

Hello Chris,
thanks a lot for all your support.

I really appreciate the way you explain all the details, and I wanted to ask about a challenge I am facing, in case you have any ideas. I've also described it here: https://issues.redhat.com/browse/DBZ-8565. I need to somehow control the rate of the incremental snapshot so that live updates are always prioritized (the producing rate seems to stay the same regardless of the incremental snapshot chunk size). I would appreciate any ideas around this topic.

Thanks in advance.
George

Chris Cranford

Feb 11, 2025, 11:10:38 AM
to debe...@googlegroups.com
Hi George -

Coming back to a point that Jiri made, I think what you need would require some architectural changes to the overall process for Debezium to handle it natively.

But in the interim, perhaps you can handle this externally by taking advantage of the notification and signal subsystems. For example, an external process could subscribe to notifications, and when an incremental snapshot starts, that process could send pause/resume signals at intervals. The frequency of the pause/resume could be based on a timer, or it could be driven by some other calculation to satisfy your throughput needs.

TL;DR use an external program with notifications & signals to temporarily pause incremental snapshots when throughput drops, and resume when throughput is acceptable or above expectations.
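The control loop described above could be sketched roughly as follows. The lag metric, the 5-second threshold, and the helper names (`snapshot_in_progress`, `current_lag_ms`, `insert_signal`) are all illustrative assumptions, not Debezium APIs:

```python
import json

def decide_action(live_lag_ms, max_lag_ms=5000):
    """Pick a signal type based on how far behind live streaming is.

    Threshold and metric are illustrative; any throughput calculation
    could drive the pause/resume decision.
    """
    return "pause-snapshot" if live_lag_ms > max_lag_ms else "resume-snapshot"

def signal_row(signal_id, signal_type):
    # Row to insert into the signaling table; pause/resume carry no data.
    return (signal_id, signal_type, json.dumps({}))

# Sketch of the external control loop (notification consumption, lag
# measurement, and the actual insert are elided):
# while snapshot_in_progress():           # driven by the notification channel
#     insert_signal(signal_row(next_id(), decide_action(current_lag_ms())))
#     time.sleep(10)
```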

Does that help?
-cc