Lag Starting Incremental Snapshot (Debezium notifications)


Drew von Zweck

Dec 5, 2023, 7:45:38 PM
to debezium
Hello, we have implemented a solution using the Debezium Notification Service and incremental snapshotting to scan schema change topics, parse them, and generate incremental snapshot signals in near real time. We are connected to a MySQL database in read-only mode with GTID enabled, and with the allow.schema.changes flag set to true. The MySQL instance we are connected to has thousands of tables and hundreds of databases.
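For context, the signal/notification side of our connector config looks roughly like this (a sketch; topic names and values are illustrative, property names are from the Debezium documentation):

{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "read.only": "true",
  "signal.enabled.channels": "kafka",
  "signal.kafka.topic": "debezium-signals",
  "signal.kafka.bootstrap.servers": "kafka:9092",
  "notification.enabled.channels": "sink",
  "notification.sink.topic.name": "debezium-notifications"
}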

We have noticed the following behavior...
1. We successfully process and generate a signal that is sent to the signal topic for a set of data collections (up to 1,000 tables at a time; an example record is shown after this list).
2. The Connect worker logs that it has received the incremental snapshot request.
3. A long period of time goes by, anywhere from 10 minutes to 3 hours, during which no notifications are received and no informative logs are emitted (we set the log level to TRACE).
4. Suddenly, after this long delay, we get the STARTED notification, then quickly afterwards the COMPLETED notification.
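For reference, the execute-snapshot record we produce to the signal topic looks roughly like this (the record key is the connector's topic prefix; the table names here are illustrative):

{
  "type": "execute-snapshot",
  "data": {
    "data-collections": ["db1.table1", "db1.table2"],
    "type": "INCREMENTAL"
  }
}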

What we expect...
We expect the STARTED notification to arrive much sooner after the signal is received, and the time to start to be much shorter. Ideally there would even be a notification before STARTED, if there is actually some process occurring that our snapshot monitoring application should be aware of. We tried setting the chunk size to 500,000 (the property is shown below) and it did not reduce the time to STARTED. This only happens in our prod environment, which makes us think it is a scalability issue.
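The chunk size change was just this standard connector property, set to the value we tried:

{
  "incremental.snapshot.chunk.size": "500000"
}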

Is there any insight you can provide about what is happening in this window of time and how we might reduce it? We rely heavily on the notification system to determine the state of incremental snapshots, and an extra notification before STARTED might be very helpful.

jiri.p...@gmail.com

Dec 5, 2023, 11:44:11 PM
to debezium
Hi,

Would it be possible to share one of the TRACE-level logs? Ideally TRACE for io.debezium only.
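On a recent Kafka Connect worker you can usually raise the level for just that package at runtime through the admin REST API, e.g. a PUT to http://<connect-host>:8083/admin/loggers/io.debezium (host and port illustrative) with the body:

{
  "level": "TRACE"
}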

Is the delay affected by the number of data collections? What if you send a signal for only 5?

J.


Drew von Zweck

Dec 6, 2023, 1:36:19 PM
to debezium
I would say there is a noticeable difference in the lag to STARTED between larger and smaller data collection sizes; however, we don't think the solution is to limit the data collection size, as that would just mean many more incremental snapshots are needed.
Here are two examples of the large gap between our application logging a signal sent (which we have confirmed via the signal topic as well as the INCREMENTAL snapshot requested log from the connector) and when the snapshot STARTED notification is sent.
The first has 53 tables and a lag of 12 minutes (this is the fastest we have recorded).
The second has 574 tables and a lag of 1 hour 40 minutes.
Incremental Snapshot Signal Sent
Connector: connector1
Status: SIGNAL_SENT
Snapshot ID: connector1-123423
Tables:
  tablename: DATABASES_AFFECTED: 53, WAITING: 53
Total Tables To Be Snapshotted: 53
Signal Sent: 2023-12-05 23:15:56.219911
Offsets Parsed: 1072989 - 1073909

Incremental Snapshot Started
Connector: connector1
Status: STARTED
Snapshot ID: connector1-123423
Signal Sent: 2023-12-05 23:15:56.219911
Offsets Parsed: 1072989 - 1073909
Started At: 2023-12-05 23:23:11.769543


Incremental Snapshot Signal Sent
Connector: connector-202311291719
Status: SIGNAL_SENT
Snapshot ID: connector-202311291719_12345
Tables:
  t1: DATABASES_AFFECTED: 287, WAITING: 287
  t2: DATABASES_AFFECTED: 287, WAITING: 287
Total Tables To Be Snapshotted: 574
Signal Sent: 2023-12-06 02:47:28.263852
Offsets Parsed: 1074129 - 1074703

Incremental Snapshot Started
Connector: connector-202311291719
Status: STARTED
Snapshot ID: connector-202311291719_12345
Signal Sent: 2023-12-06 02:47:28.263852
Offsets Parsed: 1074129 - 1074703
Started At: 2023-12-06 04:25:05.306015
We do want to note that in some extreme cases (pretty consistently, about every fifth snapshot) the STARTED notification never comes at all; then the next day, when we restart the connector, we suddenly get the COMPLETED notification, but never a STARTED notification.

As far as logs go, this is actually a point of confusion for us: after we see the
Requested 'INCREMENTAL' snapshot of data collections '
log, we don't see much else that gives us insight into what is occurring, until much later we start to see the logging about primary keys and schema validations that appears after the STARTED notification is received.
Here are some of the logs that come after the incremental snapshot request (they seem mostly related to streaming data, but I'll grab the ones that look snapshot related):

[2023-12-06 18:03:14,154] INFO [p2501-debezium-202311291719|task-0] Requested 'INCREMENTAL' snapshot of data collections '[
[2023-12-06 18:04:09,189] INFO [p2501-debezium-202311291719|task-0|offsets] WorkerSourceTask{id=p2501-debezium-202311291719-0} Committing offsets for 23761 acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2023-12-06 18:04:44,862] INFO [p2501-debezium-202311291719|task-0] 75127 records sent during previous 00:04:17.725, last recorded offset of {server=p2501} partition is {transaction_id=null, ts_sec=1701885884, file=binlog.001254, pos=777544516, incremental_snapshot_signal_offset=17, gtids=0fc7fc6b-1f37-11ed-9b8c-0a97a9e751f5:1-25,7565bc6a-1db0-11ed-a4f7-02c9ba76372f:1-677085670,8cb1587b-1da1-11ed-b5d7-0eadadb7cc9b:1-95496254, row=2, server_id=176008077, event=33} (io.debezium.connector.common.BaseSourceTask)
[2023-12-06 18:17:29,809] INFO [p2501-debezium-202311291719|task-0] [Producer clientId=connector-producer-p2501-debezium-202311291719-0] Resetting the last seen epoch of partition topicname-20 to 11 since the associated topicId changed from null to P_04a9RLR8Oe0Sq8p-dXIw (org.apache.kafka.clients.Metadata)
^ we get a lot of logs like this ^
[2023-12-06 18:19:25,614] INFO [p2501-debezium-202311291719|task-0] Incremental snapshot's schema verification passed = true, schema = columns: {
All schemas are then logged and the snapshot completes.
Notice the time that passed between those logs, though; we don't know what happens in that window or how to optimize it.

Drew von Zweck

Dec 6, 2023, 1:46:38 PM
to debezium
Another interesting point is that the debezium_remainingtablecount metric seems to stay stuck at exactly the number of EMPTY tables that were part of the incremental snapshot.

For example, we snapshot 50 tables and 25 are EMPTY; the remaining table count stays at 25.
We snapshot another set of tables where 20 are EMPTY; the remaining table count is now 45, and it stays there even after the snapshots are COMPLETED.

Mario Fiore Vitale

Dec 11, 2023, 3:36:06 AM
to debe...@googlegroups.com
Are you able to see in the logs something like "Reading schema for table"?



--

Mario Fiore Vitale

Senior Software Engineer

Red Hat

Drew von Zweck

Dec 11, 2023, 5:46:59 PM
to debe...@googlegroups.com
Here are two examples, one fast incremental snapshot and one that has some lag. To your point, the "Reading schema" logs occur right as the snapshot starts and finishes, not during the lag time before it starts.

The first is a snapshot taking place on a connector that is only replicating a few hundred tables. We send the signal for 2 tables and they are both empty. Right after the request we see these logs and the snapshot finishes:


[2023-12-11 21:03:33,051] INFO [p1501-debezium-202311202143|task-0] Requested 'INCREMENTAL' snapshot of data collections '[d1.t1,d1.t2]' with additional conditions '[]' and surrogate key 'PK of table will be used' (io.debezium.pipeline.signal.actions.snapshotting.ExecuteSnapshot)
[2023-12-11 21:03:33,222] DEBUG [p1501-debezium-202311202143|task-0] Reading schema for table 'd1.t1' using select statement: 'SELECT `____`, `____`, `____`, `____`, `____`, `____`, `____` FROM `t1`.`lease_auto_renewals` ORDER BY `id` LIMIT 0' (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)
[2023-12-11 21:03:33,223] INFO [p1501-debezium-202311202143|task-0] Incremental snapshot's schema verification passed = true, schema = columns: {
  ____ INT(10, 0) NOT NULL
  ____ INT(10, 0) DEFAULT VALUE NULL
  ____ INT(10, 0) NOT NULL
  ____ VARCHAR(50, 0) NOT NULL
  ____ BIGINT(19, 0) DEFAULT VALUE NULL
  ____ DATETIME(19, 0) NOT NULL
  ____ DATETIME(19, 0) NOT NULL
}
primary key: []
default charset: null
comment: null
attributes: {
}
 (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotContext)
[2023-12-11 21:03:33,224] INFO [p1501-debezium-202311202143|task-0] No maximum key returned by the query, incremental snapshotting of table 't1' finished as it is empty (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)

[2023-12-11 21:03:33,233] DEBUG [p1501-debezium-202311202143|task-0] Reading schema for table 'd1.t2' using select statement: 'SELECT `____`, `____`, `____`, `____`, `____`, `____`, `____` FROM `t2` ORDER BY `id` LIMIT 0' (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)
[2023-12-11 21:03:33,234] INFO [p1501-debezium-202311202143|task-0] Incremental snapshot's schema verification passed = true, schema = columns: {
  ____ INT(10, 0) NOT NULL
  ____ INT(10, 0) DEFAULT VALUE NULL
  ____ INT(10, 0) NOT NULL
  ____ VARCHAR(50, 0) NOT NULL
  ____ BIGINT(19, 0) DEFAULT VALUE NULL
  ____ DATETIME(19, 0) NOT NULL
  ____ DATETIME(19, 0) NOT NULL
}
primary key: []
default charset: null
comment: null
attributes: {
}
 (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotContext)
[2023-12-11 21:03:33,235] INFO [p1501-debezium-202311202143|task-0] No maximum key returned by the query, incremental snapshotting of table 't2' finished as it is empty (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)



The second example is a snapshot of 128 tables on a connector that is capturing around 500,000 tables. You can see here that there is a large time gap between the snapshot request being received and the "Reading schema" log, which comes right as the table starts and finishes being snapshotted. In fact, all 128 tables finish only a second after the first appearance of the "Reading schema" log. This makes me think it is not the reading of the schema that is causing the lag, but some process happening before it. Do you have any ideas what is going on between the Requested 'INCREMENTAL' snapshot log and the actual snapshot starting and finishing? The process seems to have trouble scaling, as it can take anywhere from 10 minutes to several hours. (Notice the time difference between the first and second lines; in between there is just a bunch of org.apache.kafka.clients.Metadata logs.)

[2023-12-11 22:01:23,309] INFO [p1505d-debezium-202311302224|task-0] Requested 'INCREMENTAL' snapshot of data collections '[t1,t2,t3,t4,t5,t5...]'
[2023-12-11 22:14:57,157] DEBUG [p1505d-debezium-202311302224|task-0] Reading schema for table 't1' using select statement: 'SELECT `___`, `___`, `___`, `___`, `___`, `___`, `___` FROM `t1` ORDER BY `id` LIMIT 0' (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)
[2023-12-11 22:14:57,159] INFO [p1505d-debezium-202311302224|task-0] Incremental snapshot's schema verification passed = true, schema = columns: {
  ___ BIGINT(19, 0) NOT NULL
  ___ INT(10, 0) NOT NULL
  ___ INT(10, 0) DEFAULT VALUE NULL
  ___ INT(10, 0) DEFAULT VALUE NULL
  ___ INT(10, 0) DEFAULT VALUE NULL
  ___ DATETIME(19, 0) NOT NULL
  ___ DATETIME(19, 0) NOT NULL
}
primary key: []
default charset: null
comment: null
attributes: {
}
 (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotContext)
[2023-12-11 22:14:57,162] INFO [p1505d-debezium-202311302224|task-0] No maximum key returned by the query, incremental snapshotting of table 't1' finished as it is empty (io.debezium.pipeline.source.snapshot.incremental.AbstractIncrementalSnapshotChangeEventSource)

Mario Fiore Vitale

Dec 12, 2023, 2:32:14 AM
to debe...@googlegroups.com
Hi Drew, 

In that case I think it's better to open an issue; can you please file one and add all the relevant details?

Thanks, 
Mario.

Drew von Zweck

Dec 12, 2023, 2:27:44 PM
to debe...@googlegroups.com