Kafka Connect API stuck


Shichao An

Aug 8, 2022, 3:19:09 PM
to debezium
Hi,

Recently, the Kafka Connect REST API has been getting stuck more and more frequently. We can usually identify the stuck worker by running `curl -s localhost:28082/connectors` against it. Sometimes, if the Connect cluster is small (e.g. 2 workers), all workers get stuck, which makes the API unusable, and we have to restart every worker in the cluster.

The log doesn't have anything suspicious other than this line:

WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask)

We have a service that monitors and manages the Kafka Connect cluster. Because of the stuck API, this service was unable to manage connectors.

Have you seen this issue as well? I wonder if this is some fundamental issue or bug with Kafka Connect, or maybe Debezium.

Chris Cranford

Aug 9, 2022, 7:45:14 AM
to debe...@googlegroups.com, Shichao An
Hi Shichao -

Is this with AWS MSK, or is Kafka Connect running locally or on-prem? Can you also share which connectors you are running on the cluster that shows this stuck behavior? Please be sure to include the versions of Kafka Connect and the Debezium connectors. As for the warning message, it is normal if a connector hasn't successfully started, or if a shutdown was requested while an offset commit call was also requested. In this case, it just alerts you that the connector isn't in a proper state for offset commits to occur.

Chris

Shichao An

Aug 9, 2022, 2:15:25 PM
to debezium

Hi Chris,

> Is this with AWS MSK or is Kafka Connect running locally or on-prem?
We are running Kafka Connect clusters in our on-prem K8S environment.

> Can you also share what connectors you are running on the cluster that has this stuck behavior?
It's the Debezium MongoDB connector, but we've also experienced similar behavior with the Vitess connector in the past.

> Please be sure to include the version of Kafka Connect & the Debezium connectors.
Kafka Connect: 6.2.1-ccs
Debezium MongoDB Connector: 1.9.4

> As for the warning message, that is normal if a connector hasn't successfully started or a shutdown has been requested while an offset commit call was also requested.
I can confirm there was no restart or shutdown request during that time, but we only see these log lines on the worker whose API is stuck. It seems to be worse than just the API being stuck: the worker is unable to commit offsets as well.

[2022-08-09 17:54:02,394] WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask)
[2022-08-09 17:54:02,394] INFO WorkerSourceTask{id=xxxxxx} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2022-08-09 17:53:02,394] WARN Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart (io.debezium.connector.common.BaseSourceTask)
[2022-08-09 17:53:02,394] INFO WorkerSourceTask{id=yyyyyy} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)


Here are some detailed symptoms seen on problematic workers:
  • Sometimes, curl to localhost:28082/connectors hangs entirely.
  • At other times, curl to localhost:28082/connectors works fine but localhost:28082/connectors/CONNECTOR_NAME hangs. Here CONNECTOR_NAME can be anything, even the name of a connector that doesn't exist.

The only fix we found so far is to restart the worker. That means we need to implement a K8S probe to health check the API and restart the pod if it's stuck.
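
A minimal sketch of such a health check, assuming the probe runs inside the worker pod and that a request which gets no answer within a few seconds means the API is stuck. The function names, the 5-second timeout, and the placeholder connector name `__probe_nonexistent__` are illustrative assumptions, not part of Kafka Connect or Debezium. It checks both endpoints because, per the symptoms above, the per-connector endpoint can hang while the list endpoint still answers.

```python
import urllib.error
import urllib.request


def api_responds(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers with any HTTP status in time.

    An error status such as 404 still counts as "responding": the symptom
    described in this thread is a request that never returns.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # The server answered (e.g. 404 for an unknown connector).
        return True
    except OSError:
        # Connection refused, other URL errors, and socket timeouts.
        return False


def worker_is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    # Probe the list endpoint and one per-connector endpoint. The connector
    # name is a hypothetical placeholder that is not expected to exist.
    probe_paths = ("/connectors", "/connectors/__probe_nonexistent__")
    return all(api_responds(base_url + path, timeout_s) for path in probe_paths)


def main(argv: list) -> int:
    # Exit code 0 means healthy, 1 means the API did not answer in time.
    base_url = argv[1] if len(argv) > 1 else "http://localhost:28082"
    return 0 if worker_is_healthy(base_url) else 1
```

A script that ends with `sys.exit(main(sys.argv))` could then be wired into a liveness probe as an exec command, so the pod is restarted when the check keeps failing. Since a restart interrupts running tasks, the probe's failure threshold and timeout would need tuning for the cluster in question.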

Shichao An

Aug 15, 2022, 10:08:37 PM
to debezium
Hi Chris,

I checked the code where this log is originally emitted: "Couldn't commit processed log positions with the source database due to a concurrent connector shutdown or restart"

Since this line was logged continuously while the issue was happening, is it fair to say the connector task is somehow deadlocked?
It seems poll() is still working, since we also see the "flushing 0 outstanding messages" log.

Chris Cranford

Aug 16, 2022, 8:14:53 AM
to debe...@googlegroups.com, Shichao An
Hi Shichao,

We had a similar discussion on Zulip last week, and after reviewing the thread dump provided there, it appears that this Kafka bug [1] may be the cause. I would suggest that you upgrade to a version of the Kafka platform that contains Kafka 3.0.0 or later.

Hope that helps.
Chris

[1]: https://issues.apache.org/jira/browse/KAFKA-7421

Shichao An

Aug 17, 2022, 7:03:33 PM
to debezium
Hi Chris,

Thank you for the info on the bug. It's really helpful!
We will upgrade the Kafka Connect version.

Pranav Kulkarni

Sep 11, 2023, 7:31:30 AM
to debezium
Hi Chris,

I am having the same issue. I have been using Debezium for 2 years now, and this started all of a sudden today. We are using Kafka 3.0.0 (MSK).

Chris Cranford

Sep 11, 2023, 12:54:28 PM
to debe...@googlegroups.com
Hi -

Are you able to obtain a thread dump of the Kafka Connect cluster running on MSK? If not, you may need to reach out to AWS support and ask them to provide one. It may require AWS to reach out to the Kafka community if there are still some potential deadlock scenarios. If you can get the thread dump, you're welcome to share it with us too so we can rule out any Debezium influence.

Thanks,
Chris
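
Once a thread dump is available (on a self-managed worker, the JDK's `jstack -l <pid>` is one common way to produce it), a first pass over it can be scripted. The following is a rough sketch, assuming the dump is in the usual HotSpot text format; the function name and the returned dictionary layout are illustrative assumptions. It counts thread states, lists the names of BLOCKED threads, and checks whether the JVM itself reported a Java-level deadlock.

```python
import re
from collections import Counter

# Thread stanzas in a HotSpot dump start with the quoted thread name.
_THREAD_HEADER_RE = re.compile(r'^"([^"]+)"')
_STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def summarize_thread_dump(dump_text: str) -> dict:
    """Summarize thread states found in a HotSpot-style thread dump."""
    states = Counter()
    blocked_threads = []
    current_thread = None
    for line in dump_text.splitlines():
        header = _THREAD_HEADER_RE.match(line)
        if header:
            current_thread = header.group(1)
            continue
        state = _STATE_RE.search(line)
        if state and current_thread is not None:
            states[state.group(1)] += 1
            if state.group(1) == "BLOCKED":
                blocked_threads.append(current_thread)
            # Each stanza carries one state line; wait for the next header.
            current_thread = None
    return {
        "states": dict(states),
        "blocked_threads": blocked_threads,
        # Marker text the JVM prints when its own detector finds a deadlock.
        "deadlock_reported": "Java-level deadlock" in dump_text,
    }
```

The JVM's detector does not catch every kind of hang, so a missing deadlock marker does not rule one out. The list of BLOCKED threads and the full stack traces are still worth sharing for review.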