Limiting the number or the duration of attempts to recover from a retriable error


Sergei Morozov

Dec 14, 2021, 5:58:49 PM
to debezium

My experience operating Debezium connectors for SQL Server shows that, while the connectors treat various exceptions as retriable, there doesn't seem to be a way to limit the number or the duration of restart attempts.

This can lead to a situation where a failing connector keeps trying to restart indefinitely while the CDC data expires on the server.

How to reproduce:
  1. Start the pipeline from the SQL Server connector tutorial.
  2. Stop SQL Server:
    $ docker compose -f docker-compose-sqlserver.yaml stop sqlserver
At this point, the connector will retry starting indefinitely:
  1. An exception occurred in the change event producer. This connector will be restarted.
  2. Awaiting end of restart backoff period after a retriable error
  3. Starting SqlServerConnectorTask with configuration
  4. GO TO 1
During this loop, from the Kafka Connect perspective, the connector is still running:

"id": 0,
"state": "RUNNING",
"worker_id": ""

Here's an excerpt from the conversation in #2572:

> > Given that there's no way to limit the number of retries or the duration of the fail-and-retry loop […]

> I think with the change described above, Kafka Connect's `errors.retry.timeout` setting should apply (not a count of retries, but a maximum time)?

It doesn't seem to be the case. Although KIP-298 mentions "poll() in SourceTask", there doesn't seem to be any logic implemented to handle this case. If WorkerSourceTask#poll catches a retriable exception, it just returns null.

As originally proposed in #2572, should Debezium support some configuration for limiting the retries, or should Kafka Connect do that? In the latter case, how should the connector be configured?

Thank you.

Gunnar Morling

Dec 17, 2021, 11:00:47 AM
to debezium
Hey Sergei,

Thanks for bringing this up. Agreed that it's not desirable for a connector to nominally stay in RUNNING state while actually being stuck in an eternal restart loop internally. Limiting the period / duration of internal restarts should be made configurable in Debezium, with some sensible default of 180 sec or similar. Could you log an issue for that?
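For illustration only, here is a rough sketch of what a duration-limited restart loop could look like. It is not Debezium or Kafka Connect code: the class `BoundedRetry`, the method `callWithDeadline`, and the nested `RetriableException` are hypothetical names made up for this example. The idea is simply to keep retrying after a retriable error, but to give up and rethrow once a configured maximum duration has elapsed, so that the failure becomes visible to the caller.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Debezium or Kafka Connect.
final class BoundedRetry {

    // Hypothetical stand-in for whatever exception type marks an error as retriable.
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    private BoundedRetry() {
    }

    /**
     * Calls the action until it succeeds. After each retriable failure it
     * waits for the backoff period and tries again, unless maxRetryDuration
     * has already elapsed since the first attempt, in which case it gives up.
     */
    static <T> T callWithDeadline(Callable<T> action,
                                  Duration maxRetryDuration,
                                  Duration backoff) throws Exception {
        long start = System.nanoTime();
        while (true) {
            try {
                return action.call();
            } catch (RetriableException e) {
                long elapsed = System.nanoTime() - start;
                if (elapsed >= maxRetryDuration.toNanos()) {
                    // Surface the failure instead of looping forever.
                    throw new IllegalStateException(
                            "Giving up after " + maxRetryDuration + " of retries", e);
                }
                // Wait out the restart backoff period before the next attempt.
                Thread.sleep(backoff.toMillis());
            }
        }
    }
}
```

With the default suggested above, a caller would pass something like `Duration.ofSeconds(180)` as `maxRetryDuration`. Once the exception propagates out of the task, the failure is no longer hidden behind a RUNNING state. A count-based limit could be added in the same loop with an attempt counter.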

