We set this to 60000 ms (1 minute) temporarily, and the lag cleared up after a while. However, we've since reverted it to a higher value because our environment has legitimate long-running transactions (e.g., migrations) that could be abandoned prematurely. So this was more of a workaround than a fix.
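For clarity, the temporary setting we applied was just:

```json
{
  "log.mining.transaction.retention.ms": "60000"
}
```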
I'm actually not 100% sure why the retention-based abandon didn't kick in automatically during the stuck state — the connector was running continuously and the ghost transaction appeared to have already disappeared from v$transaction, yet the offset SCN stayed frozen. Any insight into why the abandon logic might not trigger in this case would be appreciated.
2. On the strange transaction pattern — logs attached
Looking at our logs (15,770 lines total), the abandoned transactions all share a suspicious characteristic: every single one has 0 events, meaning Debezium sees them as open but has never captured any DML inside them.
The key problem: the freeze starts at line 14756:
When we immediately query v$transaction after this WARN, we get 0 rows. The transaction is already gone from Oracle's perspective, but Debezium's in-memory cache still holds it as open and pins the offset SCN. The connector then stays frozen from line 14756 (22:07) all the way until line 15159 (01:22) — over 3 hours — with only this repeating:
Questions:
3. On metrics tracking
We're planning to set up SCN lag monitoring. Key metrics we're thinking of tracking:
Are there specific JMX MBean paths or v$ views you'd recommend? We're on AWS RDS Oracle so access to some internal views may be limited.
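For reference, here is the MBean pattern we believe exposes the Oracle streaming metrics, based on our reading of the Debezium docs (the `server` value should be our `topic.prefix`; attribute names may vary by connector version):

```
debezium.oracle:type=connector-metrics,context=streaming,server=<topic.prefix>
```

The attributes we were planning to poll from it are OffsetScn, CurrentScn, NumberOfActiveTransactions, and MilliSecondsBehindSource. Does that set look right, or are there better indicators for this failure mode?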
4. On frequency and operational impact
This is happening more than twice a week and causing real business impact. We're now building a heartbeat-based lag monitor (using DEBEZIUM_HEARTBEAT table + __source_ts_ms) to detect when the offset freezes and trigger an automatic connector restart via the Kafka Connect REST API.
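As a sketch of that watchdog (assumptions: the Connect REST URL and connector name below are placeholders, and the latest `__source_ts_ms` would be read from the DEBEZIUM_HEARTBEAT table or heartbeat topic by code not shown here):

```python
import urllib.request

# Placeholders, not our real values:
CONNECT_URL = "http://localhost:8083"    # Kafka Connect REST endpoint
CONNECTOR_NAME = "oracle-source"         # hypothetical connector name
FREEZE_THRESHOLD_MS = 15 * 60 * 1000     # act after 15 min of no progress

def heartbeat_lag_ms(latest_source_ts_ms: int, now_ms: int) -> int:
    """Lag between the newest heartbeat's __source_ts_ms and wall-clock time."""
    return max(0, now_ms - latest_source_ts_ms)

def offset_frozen(latest_source_ts_ms: int, now_ms: int,
                  threshold_ms: int = FREEZE_THRESHOLD_MS) -> bool:
    """True when the heartbeat timestamp has not advanced within the threshold."""
    return heartbeat_lag_ms(latest_source_ts_ms, now_ms) > threshold_ms

def restart_connector() -> int:
    """Restart via POST /connectors/{name}/restart (standard Connect REST API)."""
    req = urllib.request.Request(
        f"{CONNECT_URL}/connectors/{CONNECTOR_NAME}/restart", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The idea is to run `offset_frozen()` on a schedule and call `restart_connector()` only when it trips; we would of course add alerting and a cool-down so it cannot restart-loop.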
However, we'd prefer to prevent the issue rather than just react to it. Given that all abandoned transactions consistently show 0 events, we suspect these are Oracle-internal transactions or system-level operations that Debezium picks up from the redo log but never resolves.
Is there a way to filter out or exclude these 0-event transactions before they pin the offset SCN and trigger the repeating "Offset SCN ### has not changed in 25 mining session iterations" warning? And more broadly, could this issue be related to the MSK or Kafka Connect worker server specifications? Any recommendations on a prevention strategy would be greatly appreciated. We're happy to provide additional logs or configuration details if that would help.
Best regards,
Hi again,
Thank you so much for the detailed explanation. Your insights were incredibly helpful in clearing up some of our misconceptions. I have a few follow-up questions regarding your points to ensure we fix the root cause properly.
1. On log.mining.transaction.retention.ms and the 0-event transactions

Thanks to your explanation, I realized I had misunderstood how the retention.ms option works. I originally thought that if a transaction started at 4:01 PM and was left improperly open, setting retention to 1 hour would force Debezium to abandon it one hour later (at 5:01 PM). However, you clarified that it is based on the actual duration the transaction has been executing inside Oracle.
This makes my situation even more confusing. The problematic transactions we experienced were not heavy, long-running queries. They were abnormal "ghost" transactions with 0 events (no DMLs) that were improperly left open. If the retention option relies strictly on execution duration, why weren't these 0-event transactions cleaned up once they crossed the retention threshold, instead of pinning the offset SCN for over 7 hours?
2. Fundamental solution for the "SCN Stuck" issue and Fetch Sizing

The reason we previously resorted to restarting the connector or manipulating the retention.ms value was that whenever we forced an abandon after hours of "SCN stuck" WARN logs, the pipeline would immediately unblock and flush all the pending messages. Because of this, we assumed "skipping the stuck transaction" was the proper fix.
If blindly restarting the connector isn't the answer, what is the fundamental root cause and solution for the offset SCN freezing due to these lingering transactions? You mentioned adjusting log.mining.batch.size.* and fetch sizes. Looking at our current configuration, we have both database.statement.fetchSize and log.mining.batch.size.max set to 2000. Is this value too conservative for a high-volume environment, causing the connector to digest pending transactions too slowly? What values would you recommend we scale these up to?
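For concreteness, this is the kind of scaling we had in mind (values purely illustrative; if we read the docs correctly, recent Debezium versions default the mining batch size to min 1000 / default 20000 / max 100000, which would put our current 2000 cap far below even the default):

```json
{
  "log.mining.batch.size.min": "1000",
  "log.mining.batch.size.default": "20000",
  "log.mining.batch.size.max": "100000",
  "database.statement.fetchSize": "10000"
}
```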
3. Best practices for Redo Log Sizing

I completely agree with your assessment regarding the redo log capacity being insufficient. (Fortunately, we have over 250GB of free storage, so scaling up the logs is not an issue at all.)
Currently, we are running with 6 groups of 1GB each, which we had already increased from the default setup. Due to the nature of our business, our DB workload is relatively quiet with steady I/U/D traffic most of the time. However, during specific periods such as financial month-end/month-start closings, we experience massive spikes of batch operations and bulk inserts. It seems these bursty spikes overwhelmed our 1GB log groups and caused the Checkpoint Starvation.
In a CDC environment with a "bursty" workload like ours, what is the general best practice for sizing the redo log files and determining the optimal number of log groups?
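To make the question concrete, this is roughly how we were planning to measure switch frequency ourselves (assuming v$log_history is readable on RDS; the 15-20 minute figure is a rule of thumb we've seen cited, not a number from your reply):

```sql
-- Count redo log switches per hour over the last day.
-- Rule of thumb we've seen: size logs so peak switching stays
-- around one switch every 15-20 minutes (3-4 per hour).
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY hour;
```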
Per your request, here is our current connector configuration (with sensitive info masked). We also already have database.query.timeout.ms set to 0 as you suggested.