Limiting the number or the duration of attempts to recover from a retriable error

Sergei Morozov

Dec 14, 2021, 5:58:49 PM
to debezium
Hi,

In my experience operating Debezium connectors for SQL Server, the connectors handle various retriable exceptions as such, but there doesn't seem to be a way to limit the number or the duration of restart attempts.

This can lead to a situation where a failing connector keeps trying to restart indefinitely while the CDC data expires on the server.

How to reproduce:
  1. Start the pipeline from the SQL Server connector tutorial.
  2. Stop SQL Server:
    $ docker compose -f docker-compose-sqlserver.yaml stop sqlserver
At this point, the connector will keep trying to restart indefinitely:
  1. An exception occurred in the change event producer. This connector will be restarted.
  2. Awaiting end of restart backoff period after a retriable error
  3. Starting SqlServerConnectorTask with configuration
  4. GO TO 1
During this loop, from the Kafka Connect perspective, the connector is still running:

{
  "id": 0,
  "state": "RUNNING",
  "worker_id": "172.26.0.5:8083"
}

Here's an excerpt from the conversation in #2572:

> > Given that there's no way to limit the number of retries or the duration of the fail-and-retry loop […]

> I think with the change described above, Kafka Connect's `errors.retry.timeout` setting should apply (not a count of retries, but a maximum time)?

It doesn't seem to be the case. Although KIP-298 mentions "poll() in SourceTask", there doesn't seem to be any logic implemented to handle this case. If WorkerSourceTask#poll catches a retriable exception, it just returns null.

As originally proposed in #2572, should Debezium support some configuration for limiting the retries, or should Kafka Connect do that? In the latter case, how should a connector be configured?

Thank you.
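
To make the question concrete, here is a minimal sketch of what a limit on the number of attempts could look like. It is not Debezium or Kafka Connect code: `BoundedRetry`, `RetriableError`, `runWithMaxAttempts` and its parameters are hypothetical names made up for illustration.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Debezium or Kafka Connect.
class BoundedRetry {

    /** Stand-in for an error that is considered worth retrying. */
    static class RetriableError extends RuntimeException {
        RetriableError(String message) {
            super(message);
        }
    }

    /**
     * Calls the task until it succeeds or maxAttempts calls have failed.
     * The last RetriableError is rethrown so the caller can report a failure
     * instead of staying in the loop forever.
     */
    static <T> T runWithMaxAttempts(Supplier<T> task, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return task.get();
            } catch (RetriableError e) {
                if (attempt >= maxAttempts) {
                    throw e; // limit reached: surface the error to the caller
                }
                sleep(backoffMs);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", ie);
        }
    }
}
```

With a limit like this, the error eventually propagates, so the caller could let the task fail and the problem would show up in the connector status instead of being hidden behind RUNNING.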

Gunnar Morling

Dec 17, 2021, 11:00:47 AM
to debezium
Hey Sergei,

Thanks for bringing this up. Agreed that it's not desirable for a connector to nominally stay in RUNNING state while actually being stuck in an eternal restart loop internally. The maximum duration of internal restarts should be made configurable in Debezium, with some sensible default of 180 sec or similar. Could you log an issue for that?

Thanks,

--Gunnar
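
A rough sketch of the duration-based limit suggested above might look as follows. This only illustrates the idea under stated assumptions and is not how Debezium implements or names anything: `DeadlineRetry`, `RetriableError`, `runWithDeadline` and `DEFAULT_MAX_RETRY_DURATION` are hypothetical, and only the 180 second default comes from the thread.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Debezium or Kafka Connect.
class DeadlineRetry {

    /** Default taken from the "180 sec or similar" suggestion in the thread. */
    static final Duration DEFAULT_MAX_RETRY_DURATION = Duration.ofSeconds(180);

    /** Stand-in for an error that is considered worth retrying. */
    static class RetriableError extends RuntimeException {
        RetriableError(String message) {
            super(message);
        }
    }

    /**
     * Retries the task after each RetriableError, waiting backoff between
     * attempts, until it succeeds or the next wait would go past
     * maxRetryDuration (measured from the first attempt). The last error
     * is then rethrown.
     */
    static <T> T runWithDeadline(Supplier<T> task, Duration maxRetryDuration, Duration backoff) {
        long deadline = System.nanoTime() + maxRetryDuration.toNanos();
        while (true) {
            try {
                return task.get();
            } catch (RetriableError e) {
                long remaining = deadline - System.nanoTime();
                if (remaining < backoff.toNanos()) {
                    throw e; // no time left for another backoff: give up
                }
                sleep(backoff);
            }
        }
    }

    private static void sleep(Duration duration) {
        try {
            Thread.sleep(duration.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", ie);
        }
    }
}
```

A time limit rather than a count keeps the behaviour predictable when the backoff between attempts varies, which seems to fit the concern about CDC data expiring on the server.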