Hi all,
We're experiencing an issue with our AWS MSK Connect connector (Debezium) reading from a MySQL RDS instance that hosts multiple large databases (~20 GB/week of data volume). After running successfully for a few hours or days, the connector stops flushing/committing messages to Kafka (0 messages sent), even though no errors or warnings appear in CloudWatch Logs, the connector remains alive and connected to MySQL, the source tables are actively updated, and the MySQL binlog continues to grow.
[Worker-0989ee99c536b29b0] [2026-02-02 20:27:51,011] INFO [mysql|task-0|offsets] WorkerSourceTask{id=mysql-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:507)
[Worker-0989ee99c536b29b0] [2026-02-02 20:27:56,012] INFO [mysql|task-0|offsets] WorkerSourceTask{id=mysql-extractor-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:490)
Our current configuration is:
database_server_id = "5401"
max_batch_size = 128
max_queue_size = 2048
mcu_count = 1
worker_count = 1
connect_timeout_ms = 30000
offset_flush_timeout_ms = 1200000 # 20min
offset_flush_interval_ms = 5000 # 5s
errors_log_enable = true
errors_tolerance = "all"
errors_log_include_messages = true
producer_buffer_memory_bytes = 16777216 # 16MB
snapshot_mode = "schema_only"
snapshot.locking.mode = "none"
tasks.max = "1"
key.converter = "org.apache.kafka.connect.json.JsonConverter"
value.converter = "org.apache.kafka.connect.json.JsonConverter"
transforms = "unwrap,addVersion,underscoreRemover"
transforms.unwrap.type = "io.debezium.transforms.ExtractNewRecordState"
transforms.unwrap.delete.handling.mode = "rewrite"
Destroying and recreating the connector via Terraform resolves the issue temporarily, but it recurs after some time. In addition, when the connector reaches this stale state the Kafka offsets topic stops receiving new messages, which means the offsets stop advancing even though the binlog keeps growing.
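To catch the stall earlier than we did, we sketched a small staleness check over the record timestamps in the offsets topic (shown here against an in-memory list of epoch-millisecond timestamps; in practice they would come from a Kafka consumer reading the connector's offsets topic — the consumer wiring is omitted):

```python
import time

def offsets_are_stale(record_ts_ms, max_age_s=600, now_ms=None):
    """Return True if the newest offset-commit record is older than max_age_s.

    record_ts_ms: Kafka record timestamps (epoch millis) read from the
    connector's offsets topic; an empty list counts as stale.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not record_ts_ms:
        return True
    return (now_ms - max(record_ts_ms)) > max_age_s * 1000

# Example: last offset record was written 30 minutes ago -> stale
now = int(time.time() * 1000)
print(offsets_are_stale([now - 30 * 60 * 1000], max_age_s=600, now_ms=now))  # True
```

We run this on a schedule and alert when it fires, since the connector status alone ("RUNNING") tells us nothing.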
We have also investigated the binlog positions where the connector stopped processing, for the two occurrences of this ghost state we observed. In both cases, the last offset corresponded to an Xid event (transaction commit), and the transactions involved INSERT operations across multiple tables. In case 1, one of the tables was in the connector's exclude list. In case 2, all 3 tables were in the include list, with tables 2 and 3 actually being the same table.
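For anyone else trying to correlate a stuck offset with the binlog: the offset value Debezium stores in the offsets topic is JSON with `file` and `pos` fields. A minimal parse looks like this (the payload below is illustrative, not copied from our topic; the exact fields vary by connector version and state):

```python
import json

# Illustrative offset value, shaped like what the Debezium MySQL connector
# writes to the Connect offsets topic (field set is an assumption here).
raw = b'{"ts_sec": 1769980800, "file": "mysql-bin-changelog.041234", "pos": 157, "server_id": 5401}'

offset = json.loads(raw)
binlog_file, binlog_pos = offset["file"], offset["pos"]
print(f"last committed position: {binlog_file} @ {binlog_pos}")
```

The printed file and position can then be fed to `mysqlbinlog --start-position` to inspect the events around the stall, which is how we identified the Xid events above.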
Finally, we attempted to enable DEBUG logging, but found that AWS MSK Connect apparently only supports INFO-level logging.
We're using the Debezium MySQL connector plugin (2.7.1) and Kafka Connect (2.7.1).
I'd really appreciate any insights into potential root causes or recommended troubleshooting steps.
Thank you in advance,
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/debezium/4c1c20d8-8bd7-4a12-92f1-9f8efc5195e8n%40googlegroups.com.
Hi Chris,
Thanks for the suggestion. We enabled the Debezium heartbeat as suggested, with the following configurations:
heartbeat.interval.ms = 300000 # 5min
topic.heartbeat.prefix = "heartbeat"
No heartbeat action query was configured. After this change, the connector ran normally for around 2 days, emitting both CDC events and heartbeat messages. On Friday, late in the afternoon, it entered the same stale state: CDC events stopped, heartbeat messages stopped at the same time, and no new events appeared in CloudWatch or Kafka.
Meanwhile, the connector and task remain "RUNNING", the logs keep showing "Committing offsets / flushing 0 outstanding messages", MySQL RDS still shows an active "Binlog Dump – Sending to client" thread, tables continue to be updated, and the binlog keeps growing. We are now checking the RDS logs, but I wanted to ask first: does the fact that the heartbeat stops together with CDC already point to a known class of issues, or otherwise narrow down what the problem might be?
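If we re-enable heartbeats, we may also try a query-based heartbeat so that each interval forces an actual row through the binlog. A sketch in the style of our config (the `debezium_heartbeat` table name is ours; the table would need to exist in the source database and be in the include list):

```
heartbeat.interval.ms = 300000 # 5min
heartbeat.action.query = "INSERT INTO debezium_heartbeat (id, ts) VALUES (1, NOW()) ON DUPLICATE KEY UPDATE ts = NOW()"
```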
Thanks again for the help.
Hi Chris,
We investigated the issue on our side. Looking at the Aurora RDS logs around the time the connector stopped flushing, we only saw aborted connections and the binlog backlog growing. There were no errors or events indicating a crash, timeout, or lock that could explain the connector halting, and the database continued receiving updates normally. Debezium's heartbeat and offsets stopped well before any of these Aurora events, so I believe the connector may indeed have entered some kind of internal blocking state, as you suggested.
Regarding your suggestion of a thread dump: since we're using AWS MSK Connect (managed infrastructure), we don't have access to the JVM process, so I'm afraid it's not possible for us to provide one. Are there any alternative ways to diagnose this kind of blocking state in MSK Connect, given its logging limitations?
Thanks,