I would say there is a noticeable difference in the lag to STARTED between larger and smaller data-collection sizes. However, we don't think the solution is to limit the data-collection size, since that would just require many more incremental snapshots.
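For context, we trigger each snapshot with the standard Debezium signaling-table insert. A minimal sketch of what we send (the signal table and collection names here are placeholders, not our real ones):

```python
import json

# Placeholder identifiers; the payload shape follows Debezium's
# documented execute-snapshot signal format.
signal_id = "ad-hoc-1"
payload = json.dumps({
    "data-collections": ["mydb.my_table"],  # placeholder collection
    "type": "incremental",
})

# The INSERT issued against the signaling table (placeholder table name):
sql = (
    "INSERT INTO mydb.debezium_signal (id, type, data) "
    f"VALUES ('{signal_id}', 'execute-snapshot', '{payload}')"
)
print(sql)
```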
Here are some examples of the large gap between our application logging that a signal was sent (which we have confirmed via the signal topic as well as the INCREMENTAL SNAPSHOT requested log from the connector) and when the snapshot STARTED notification is sent.
The first has 53 tables and a lag of 12 minutes (the fastest we have recorded); the second has 574 tables and a lag of 1 hour 40 minutes.
Incremental Snapshot Signal Sent
Connector: connector1
Status: SIGNAL_SENT
Snapshot ID: connector1-123423
Tables:
tablename: DATABASES_AFFECTED: 53, WAITING: 53
Total Tables To Be Snapshotted: 53
Signal Sent: 2023-12-05 23:15:56.219911
Offsets Parsed: 1072989 - 1073909
Incremental Snapshot Started
Connector: connector1
Status: STARTED
Snapshot ID: connector1-123423
Signal Sent: 2023-12-05 23:15:56.219911
Offsets Parsed: 1072989 - 1073909
Started At: 2023-12-05 23:23:11.769543
Incremental Snapshot Signal Sent
Connector: connector-202311291719
Status: SIGNAL_SENT
Snapshot ID: connector-202311291719_12345
Tables:
t1: DATABASES_AFFECTED: 287, WAITING: 287
t2: DATABASES_AFFECTED: 287, WAITING: 287
Total Tables To Be Snapshotted: 574
Signal Sent: 2023-12-06 02:47:28.263852
Offsets Parsed: 1074129 - 1074703
Incremental Snapshot Started
Connector: connector-202311291719
Status: STARTED
Snapshot ID: connector-202311291719_12345
Signal Sent: 2023-12-06 02:47:28.263852
Offsets Parsed: 1074129 - 1074703
Started At: 2023-12-06 04:25:05.306015
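To be explicit about how we measure it: the lag we quote is simply the delta between the Signal Sent and Started At timestamps in the notification pair. For the second snapshot above:

```python
from datetime import datetime

# Timestamps copied from the second notification pair above.
fmt = "%Y-%m-%d %H:%M:%S.%f"
signal_sent = datetime.strptime("2023-12-06 02:47:28.263852", fmt)
started_at = datetime.strptime("2023-12-06 04:25:05.306015", fmt)

print(started_at - signal_sent)  # → 1:37:37.042163
```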
We do want to note that in some extreme cases (pretty consistently, roughly every 5th snapshot) the STARTED notification never comes at all. When we restart the connector the next day, we suddenly get the COMPLETED notification, but never a STARTED one.
As far as logs go, this is actually a point of confusion for us. After we see the
Requested 'INCREMENTAL' snapshot of data collections '
log, we don't see much else that gives us insight into what is occurring until, much later, the logging about primary keys and schema validation starts, which only appears after the STARTED notification is received.
Here are some of the logs that come after the INCREMENTAL SNAPSHOT request (they seem mostly related to streaming data, but I'll grab the ones that look snapshot related).
[2023-12-06 18:03:14,154] INFO [p2501-debezium-202311291719|task-0] Requested 'INCREMENTAL' snapshot of data collections '[
[2023-12-06 18:04:09,189] INFO [p2501-debezium-202311291719|task-0|offsets] WorkerSourceTask{id=p2501-debezium-202311291719-0} Committing offsets for 23761 acknowledged messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2023-12-06 18:04:44,862] INFO [p2501-debezium-202311291719|task-0] 75127 records sent during previous 00:04:17.725, last recorded offset of {server=p2501} partition is {transaction_id=null, ts_sec=1701885884, file=binlog.001254, pos=777544516, incremental_snapshot_signal_offset=17, gtids=0fc7fc6b-1f37-11ed-9b8c-0a97a9e751f5:1-25,7565bc6a-1db0-11ed-a4f7-02c9ba76372f:1-677085670,8cb1587b-1da1-11ed-b5d7-0eadadb7cc9b:1-95496254, row=2, server_id=176008077, event=33} (io.debezium.connector.common.BaseSourceTask)
[2023-12-06 18:17:29,809] INFO [p2501-debezium-202311291719|task-0] [Producer clientId=connector-producer-p2501-debezium-202311291719-0] Resetting the last seen epoch of partition topicname-20 to 11 since the associated topicId changed from null to P_04a9RLR8Oe0Sq8p-dXIw (org.apache.kafka.clients.Metadata)
^ we get a lot of logs like this ^
[2023-12-06 18:19:25,614] INFO [p2501-debezium-202311291719|task-0] Incremental snapshot's schema verification passed = true, schema = columns: {
All schemas are then logged and the snapshot completes.
Notice the time that passes between those log lines, though; we don't know what happens during that window or how to optimize it.