Hi Chris,
> Is this with AWS MSK or is Kafka Connect running locally or on-prem?
We are running Kafka Connect clusters in our on-prem K8S environment.
> Can you also share what connectors you are running on the cluster that has this stuck behavior?
It's the Debezium MongoDB connector, but we've also experienced similar behavior with the Vitess connector in the past.
> Please be sure to include the version of Kafka Connect & the Debezium connectors.
Kafka Connect: 6.2.1-ccs
Debezium MongoDB Connector: 1.9.4
> As for the warning message, that is normal if a connector hasn't successfully started or a shutdown has been requested while an offset commit call was also requested.
I can confirm there was no restart or shutdown request during that time, but we only see these log lines on the workers whose API is stuck. It seems to be worse than just the API being stuck: the worker is unable to commit offsets as well.
[2022-08-09 17:54:02,394] WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask)
[2022-08-09 17:54:02,394] INFO WorkerSourceTask{id=xxxxxx} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2022-08-09 17:53:02,394] WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask)
[2022-08-09 17:53:02,394] INFO WorkerSourceTask{id=yyyyyy} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
Here are the symptoms we see on the problematic workers in more detail:
- Sometimes, curl to localhost:28082/connectors hangs entirely.
- At other times, curl to localhost:28082/connectors works fine but curl to localhost:28082/connectors/CONNECTOR_NAME hangs. Here CONNECTOR_NAME can be anything, even the name of a connector that doesn't exist.
The only fix we have found so far is to restart the worker. That means we would need to implement a K8S probe that health-checks the API and restarts the pod if it is stuck.
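
For reference, here is a minimal sketch of the check such a probe could run. It is only an illustration, not something we have deployed. It requests both paths from the curl tests above with a short timeout and treats a hang or a refused connection as unhealthy. The 5-second timeout, the placeholder connector name `__liveness_probe__`, the function names, and the choice to treat 5xx responses as unhealthy are all assumptions of mine.

```python
import http.client
import urllib.error
import urllib.request

BASE_URL = "http://localhost:28082"  # worker REST port used in the curl tests
TIMEOUT_SECONDS = 5                  # assumed threshold, tune for your cluster
# The second path uses a made-up connector name: a 404 still proves the
# API answered, which is all this check cares about.
PATHS = ["/connectors", "/connectors/__liveness_probe__"]


def endpoint_responds(url, timeout):
    """Return True if the server sends a non-5xx HTTP response in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return True
    except urllib.error.HTTPError as err:
        # The worker answered; 4xx (e.g. unknown connector) counts as alive.
        # Treating 5xx as unhealthy is a conservative assumption.
        return err.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, socket timeout (the hang case), broken response.
        return False


def worker_is_responsive(base_url=BASE_URL, paths=PATHS, timeout=TIMEOUT_SECONDS):
    """Check every path; one hung endpoint marks the worker unhealthy."""
    return all(endpoint_responds(base_url + path, timeout) for path in paths)
```

Run as a script from an exec liveness probe, this would end with `sys.exit(0 if worker_is_responsive() else 1)`, so that a non-zero exit makes Kubernetes restart the pod. Checking both paths matters because, as described above, /connectors can answer while /connectors/CONNECTOR_NAME hangs.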