Questions about snapshot mode

Randy Rodriguez Collado

Feb 26, 2021, 4:50:17 AM

to debezium

Hey guys,

Hope you are doing fine. I have a couple of questions about snapshot modes and the Postgres connector.

I read in the documentation that the initial snapshot mode is not recommended because it could lead to loss of events, and that the exported mode is recommended instead.

What does having a snapshot really help with? As far as I'm aware, the LSN is stored in the Kafka offset topic and is used to recover from connector failures, so where does a snapshot come into play?

What happens if I use the never snapshot mode? Let's say that, for whatever reason, there is an extended downtime for the connector, but a previous LSN is stored in the Kafka offset topic. How long does this downtime have to be to cause a loss of events, for instance because the WAL was purged?

If I originally had the snapshot mode set to never for an extended period of time and decided to change the configuration to exported instead, would that cause reprocessing of old events?

Finally, the descriptions of both modes use the phrase "...based on the point in time when the replication slot was created"; the difference is that never starts streaming from that point on, and exported starts the snapshot from that point on. However, I fail to understand how to know when that point is. Is it the last time the WAL was purged? The very beginning of the creation of the Postgres DB?

If it is of any help, we are using AWS RDS for PostgreSQL to host the DB that Debezium is connected to.

Sorry for having so many questions, but I really would like to understand how it works underneath to make sure that we are doing things right.

Hope to hear from you, and thank you very much in advance.

Cheers,

Randy

Gunnar Morling

Mar 1, 2021, 9:41:39 AM

to debezium

rand...@gmail.com wrote on Friday, February 26, 2021 at 10:50:17 UTC+1:

Hey guys,

Hope you are doing fine. I have a couple of questions about snapshot modes and the Postgres connector.

I read in the documentation that the initial snapshot mode is not recommended because it could lead to loss of events, and that the exported mode is recommended instead.

What does having a snapshot really help with? As far as I'm aware, the LSN is stored in the Kafka offset topic and is used to recover from connector failures, so where does a snapshot come into play?

Snapshotting helps you to get a consistent initial state of the current data into Kafka. If you start the connector for a database which has been running for months or years, you won't have those old WAL segments any longer, i.e. the connector can only stream a subset of the data based on the WAL.

What happens if I use the never snapshot mode? Let's say that, for whatever reason, there is an extended downtime for the connector, but a previous LSN is stored in the Kafka offset topic. How long does this downtime have to be to cause a loss of events, for instance because the WAL was purged?

Connector downtime will never cause event loss, provided you keep the connector's replication slot. Postgres only removes WAL segments once they have been acknowledged by all replication slots. Naturally, if the connector isn't up, no progress is made on its slot, so the retained WAL keeps accumulating on the database server until the connector comes back.
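
To see how much WAL a stalled slot is holding back, one can compare the slot's restart_lsn with the server's current WAL position. Below is a minimal sketch in Python, assuming PostgreSQL 10 or later (where pg_current_wal_lsn() and pg_wal_lsn_diff() are available). The helper names parse_lsn and retained_wal_bytes are made up for illustration, and the LSN values in the usage line are invented.

```python
# Query one could run on the server (PostgreSQL 10+) to get the same number
# directly. It is shown as a string only and is not executed here.
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the server has to keep for a slot stuck at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


if __name__ == "__main__":
    # Invented LSNs: the server has moved on while the slot stayed put.
    print(retained_wal_bytes("0/3000000", "0/1000000"))
```

A number that keeps growing while the connector is down means the database host is spending storage on retained WAL, which is worth monitoring on a managed instance with limited disk.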

If I originally had the snapshot mode set to never for an extended period of time and decided to change the configuration to exported instead, would that cause reprocessing of old events?

Once an initial snapshot has been completed, it will not be done again for the lifetime of this connector. The exception is the "always" mode, which redoes the snapshot on each connector start-up.
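
For reference, the mode is selected through the connector's snapshot.mode property. The following Python sketch only builds a config dictionary; the host name, credentials, server name, and slot name are placeholders, the function name connector_config is hypothetical, and the set of modes is limited to the ones named in this thread (the connector documentation lists the full set).

```python
import json

# Snapshot modes named in this thread; the connector docs list the full set.
MODES_IN_THREAD = {"initial", "exported", "never", "always"}


def connector_config(snapshot_mode: str) -> dict:
    """Build a Postgres connector config with the given snapshot mode."""
    if snapshot_mode not in MODES_IN_THREAD:
        raise ValueError(f"unexpected snapshot mode: {snapshot_mode!r}")
    return {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "mydb.example.com",  # placeholder
        "database.port": "5432",
        "database.user": "debezium",              # placeholder
        "database.password": "change-me",         # placeholder
        "database.dbname": "appdb",               # placeholder
        "database.server.name": "appdb-server",   # placeholder
        "slot.name": "debezium_slot",             # placeholder; keep it stable
        "snapshot.mode": snapshot_mode,
    }


if __name__ == "__main__":
    print(json.dumps(connector_config("exported"), indent=2))
```

Keeping slot.name unchanged across restarts matters here, since the retained WAL is tied to that slot.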

Finally, the descriptions of both modes use the phrase "...based on the point in time when the replication slot was created"; the difference is that never starts streaming from that point on, and exported starts the snapshot from that point on. However, I fail to understand how to know when that point is. Is it the last time the WAL was purged? The very beginning of the creation of the Postgres DB?

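It's essentially an arbitrary point in time: whenever the replication slot gets created, which normally happens when the connector first starts. It has nothing to do with when the WAL was last purged or when the database was created.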
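
The difference between the two modes relative to that point can be illustrated with a toy model. This is a simplified sketch in Python, not how Debezium is implemented: LSNs are plain integers, the Change class and topic_contents function are invented for illustration, and only the never and exported modes are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    lsn: int    # toy WAL position, not a real Postgres LSN
    key: str
    value: str


def topic_contents(changes, slot_created_at, mode):
    """Return the events a consumer would see under this toy model."""
    ordered = sorted(changes, key=lambda c: c.lsn)
    before = [c for c in ordered if c.lsn <= slot_created_at]
    after = [c for c in ordered if c.lsn > slot_created_at]

    events = []
    if mode == "exported":
        # Snapshot: one read event per row, reflecting the table state
        # as of the moment the slot was created.
        state = {}
        for c in before:
            state[c.key] = c.value
        events += [("read", key, value) for key, value in state.items()]
    elif mode != "never":
        raise ValueError(f"mode not modelled: {mode!r}")

    # Streaming: every change made after slot creation, in WAL order.
    events += [("change", c.key, c.value) for c in after]
    return events


if __name__ == "__main__":
    history = [
        Change(1, "a", "1"),
        Change(2, "b", "1"),
        Change(3, "a", "2"),
        Change(4, "c", "1"),
    ]
    print(topic_contents(history, slot_created_at=3, mode="never"))
    print(topic_contents(history, slot_created_at=3, mode="exported"))
```

With never, rows "a" and "b" do not show up in the output until they are modified again; with exported, their state at slot creation is emitted first and streaming continues from there.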

If it is of any help, we are using AWS RDS for PostgreSQL to host the DB that Debezium is connected to.

Sorry for having so many questions, but I really would like to understand how it works underneath to make sure that we are doing things right.

Hope to hear from you, and thank you very much in advance.

Cheers,
Randy

Hth,

--Gunnar
