Hi,
In my experience operating Debezium connectors for SQL Server, the connectors correctly treat various exceptions as retriable, but there doesn't seem to be a way to limit either the number of restart attempts or their total duration.
As a result, a failing connector can keep trying to restart indefinitely while the CDC data expires on the server.
How to reproduce:
- Start the pipeline from the SQL Server connector tutorial.
- Stop SQL Server:
$ docker compose -f docker-compose-sqlserver.yaml stop sqlserver
At this point, the connector retries starting indefinitely:
1. An exception occurred in the change event producer. This connector will be restarted.
2. Awaiting end of restart backoff period after a retriable error
3. Starting SqlServerConnectorTask with configuration
4. GO TO 1
During this loop, from the Kafka Connect perspective, the connector is still running:
{
  "id": 0,
  "state": "RUNNING"
}
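For reference, here is a minimal sketch of how this status can be read programmatically through the Kafka Connect REST API (`GET /connectors/<name>/status`). The base URL `http://localhost:8083` and the connector name `inventory-connector` are assumptions and may differ in your setup. The point is only that a health check based on the reported state will not notice the restart loop, because the task keeps reporting RUNNING.

```python
import json
from urllib.request import urlopen

# Assumed values; adjust to your environment.
CONNECT_URL = "http://localhost:8083"
CONNECTOR_NAME = "inventory-connector"


def fetch_status(base_url: str, connector: str) -> dict:
    """Fetch the connector status document from the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{connector}/status", timeout=10) as resp:
        return json.load(resp)


def task_states(status: dict) -> dict:
    """Map each task id to the state reported by Kafka Connect."""
    return {task["id"]: task["state"] for task in status.get("tasks", [])}


if __name__ == "__main__":
    states = task_states(fetch_status(CONNECT_URL, CONNECTOR_NAME))
    # During the restart loop described above, this still reports RUNNING.
    print(states)
```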
Here's an excerpt from the conversation in #2572:
> > Given that there's no way to limit the number of retries or the duration of the fail-and-retry loop […]
> I think with the change described above, Kafka Connect's `errors.retry.timeout` setting should apply (not a count of retries, but a maximum time)?
It doesn't seem to be the case. Although KIP-298 mentions "poll() in SourceTask", there doesn't seem to be any logic implemented to handle this case. If WorkerSourceTask#poll catches a retriable exception, it just returns null.
As originally proposed in #2572, should Debezium support some configuration for limiting the retries, or should Kafka Connect do that? In the latter case, how should a connector be configured?
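To make the first option more concrete, here is a rough sketch (in Python, not Debezium code) of what such a limit could look like: a restart budget bounded by both a maximum number of attempts and a maximum total duration. The names `RetriableError`, `RetryBudget`, `max_attempts`, `max_duration_s`, and `run_with_budget` are hypothetical and do not correspond to any existing Debezium or Kafka Connect option.

```python
import time


class RetriableError(Exception):
    """Hypothetical stand-in for a retriable connector exception."""


class RetryBudget:
    """Bounds start attempts by total count and by time since the first failure."""

    def __init__(self, max_attempts: int, max_duration_s: float, clock=time.monotonic):
        self.max_attempts = max_attempts      # total start attempts allowed
        self.max_duration_s = max_duration_s  # seconds allowed since first failure
        self._clock = clock
        self._attempts = 0
        self._first_failure = None

    def record_failure(self) -> bool:
        """Record a failed attempt; return True if another retry is allowed."""
        now = self._clock()
        if self._first_failure is None:
            self._first_failure = now
        self._attempts += 1
        within_count = self._attempts < self.max_attempts
        within_time = (now - self._first_failure) < self.max_duration_s
        return within_count and within_time

    def reset(self) -> None:
        """Forget earlier failures, e.g. after a successful start."""
        self._attempts = 0
        self._first_failure = None


def run_with_budget(start_task, budget: RetryBudget, backoff_s: float = 0.0, sleep=time.sleep):
    """Call start_task, retrying on RetriableError until the budget is exhausted."""
    while True:
        try:
            result = start_task()
            budget.reset()
            return result
        except RetriableError:
            if not budget.record_failure():
                # Budget exhausted: surface the error instead of looping forever,
                # so the caller can mark the task as failed.
                raise
            sleep(backoff_s)
```

With `RetryBudget(max_attempts=3, max_duration_s=60.0)`, a task that always fails is attempted three times and then the last error is re-raised; a task that recovers before the budget runs out simply returns its result.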
Thank you.