need some clarifications on snapshot_only vs _recovery

531 views
Skip to first unread message

Jorge Bo

unread,
Mar 14, 2024, 8:40:36 AM3/14/24
to debezium
Hi i'm new to debezium and i'm having a hard time to understand the difference between schema_only and scuema_only_recovery. 

This is what i understand so far based on my own tests.Maybe it is wrong, maybe only parts.

- Schema_only captures only table schemas to history topic, but doesn't capture row data. Also it doesn't stream changes from binlog. So basically it does what its name states.
- Schema_only_recovery seems not to work if there is no binlog position stored . I thinkg that it captures table schemas, row data and streams binlog as event changes.

- Also this 2 modes should be enabled only when we want to capture tables not captured by the initial snapshot or in case of history problems. 

Could somebody help me to understand better how all of these work?

Thanks in advance,
Jorge

 

Chris Cranford

unread,
Mar 14, 2024, 8:51:11 AM3/14/24
to debe...@googlegroups.com
Hi Jorge -

So during a new connector deployment where there are no offsets and no schema history, you can either chose to perform an initial snapshot or a schema-only snapshot.  The prime difference between these two modes is that with an initial, not only do we read the table structures and publish those to the schema history topic, we also read all the historical data (i.e. SELECT * FROM <table>) and publish that to each of your included table's topics; however, with a schema-only snapshot we skip the historical bit and only read the table structures and publish those to the schema history topic.

The schema only recovery mode is as its name implies, a recovery or emergency-fix solution.  Sometimes users may misconfigure the schema history topic without infinite retention, may experience a DDL parser failure, or the topic may become corrupt.  Regardless of the reason, this mode allows the connector to re-read the existing table structures and repopulate the schema history topic based on the current schema metadata in your database.  Because this mode does not change the read position in the transaction logs where it will continue to stream from, there is a hard requirement that the tables you are capturing, there have been no DDL changes to them between the read position in the offsets and the time when you perform the recovery. If this prerequisite can be met, you can use the recovery mode.  If this requirement cannot be met, then you should not attempt to use this recovery mode as it can lead to processing failures for changes that have table structures that do not match the current table structure state. 

In this case where the recovery mode isn't possible, it's recommended that you choose to use an initial snapshot or a schema only snapshot by removing both the offsets and schema history topic.  In the case of a schema only snapshot, you can also consider pairing an ad-hoc snapshot like blocking or incremental with this step so that you can capture any changes to historical data that happened between when the connector was stopped and when you restarted.

Does that help?

Thanks,
Chris
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/2e7d9064-8b77-496c-8452-9340ad3128d5n%40googlegroups.com.

Jorge Bo

unread,
Mar 14, 2024, 9:09:13 AM3/14/24
to debezium
Thanks, Chris.  That was really useful. 
Thanks!
Jorge

Jorge Bo

unread,
Mar 14, 2024, 10:50:16 AM3/14/24
to debezium
Having a detailed look at your response. Does it mean that  schema-only will not repair a wrong history topic, but only pushed DDL ? So it will have no effect if there is already an existing history topic, while schema_only_recovery will repair it. 

Chris Cranford

unread,
Mar 14, 2024, 10:50:59 AM3/14/24
to debe...@googlegroups.com
That's correct, Jorge.

Chris

yuri abaev

unread,
Mar 17, 2024, 7:53:32 AM3/17/24
to debezium
Hi,
Chris can you plz explain your response?

"Because this mode does not change the read position in the transaction logs where it will continue to stream from, there is a hard requirement that the tables you are capturing, there have been no DDL changes to them between the read position in the offsets and the time when you perform the recovery. "

why is there such hard requirement? 
if I understand correctly:  on shcema_only_recovery the read position is not changed, so if there is a DDL change, connector's read position will get to position of the DDL change. why is that a problem?

Chris Cranford

unread,
Mar 17, 2024, 6:34:55 PM3/17/24
to debe...@googlegroups.com
Hi Yuri -

So lets say that our transaction log and read positions are as follows:

    INSERT INTO tableA with 3 columns <-- read position is here
    ALTER TABLE tableA DROP COLUMN column3
    INSERT INTO tableA with 2 columns

If you use "schema_only_recovery", the connector will create the in-memory relational model for "tableA" with 2 columns, because the current database metadata says the table has 2 columns.  When the connector then moves into the streaming phase after the recovery, the streaming phase will fail because the current read position expects that "tableA" has 3 columns but we know it as only having 2.

The database metadata that describes the table structure must be consistent with the read position to use "schema_only_recovery". If this cannot be guaranteed, then you should not use "schema_only_recovery" because you're just going to face potential connector failures when the streaming phase begins.

Hope that helps.
Chris

Jorge Bo

unread,
Apr 2, 2024, 5:38:11 AM4/2/24
to debezium
Hi Chris, that is a great example. Is it possible to have something like this in the documentation? How will this example change if schema_only were used instead ?
Thanks,
Jorge

Reply all
Reply to author
Forward
0 new messages