Limiting the number or the duration of attempts to recover from a retriable error

Sergei Morozov

Dec 14, 2021, 5:58:49 PM

to debezium

Hi,

My experience of operating Debezium connectors for SQL Server shows that while the connectors handle different retriable exceptions as such, there doesn't seem to be a way to limit the number or the duration of attempts to restart.

This can lead to a situation where a failing connector keeps trying to restart indefinitely while the CDC data expires on the server.

How to reproduce:

Start the pipeline from the SQL Server connector tutorial.
Stop SQL Server:
$ docker compose -f docker-compose-sqlserver.yaml stop sqlserver

At this point, the connector keeps trying to restart indefinitely:

1. An exception occurred in the change event producer. This connector will be restarted.
2. Awaiting end of restart backoff period after a retriable error
3. Starting SqlServerConnectorTask with configuration
4. GO TO 1

During this loop, from the Kafka Connect perspective, the connector is still running:

$ curl -s http://localhost:8083/connectors/inventory-connector/tasks/0/status | jq
{
  "id": 0,
  "state": "RUNNING",
  "worker_id": "172.26.0.5:8083"
}

Here's an excerpt from the conversation in #2572:

> > Given that there's no way to limit the number of retries or the duration of the fail-and-retry loop […]

> I think with the change described above, Kafka Connect's `errors.retry.timeout` setting should apply (not a count of retries, but a maximum time)?

It doesn't seem to be the case. Although KIP-298 mentions "poll() in SourceTask", there doesn't seem to be any logic implemented to handle this case. If WorkerSourceTask#poll catches a retriable exception, it just returns null.

As originally proposed in #2572, should Debezium support some configuration for limiting the retries, or should Kafka Connect do that? In the latter case, how should a connector be configured?

Thank you.

Gunnar Morling

Dec 17, 2021, 11:00:47 AM

to debezium

Hey Sergei,

Thanks for bringing this up. Agreed that it's not desirable for a connector to nominally stay in RUNNING state while actually being stuck in an eternal restart loop internally. Limiting the period / duration of internal restarts should be made configurable in Debezium, with some sensible default of 180 sec or similar. Could you log an issue for that?

Thanks,

--Gunnar
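
To illustrate the kind of limit discussed in this thread, here is a minimal sketch of a time-based restart budget. It is not Debezium or Kafka Connect code: the class `RestartBudget` and its methods are hypothetical names made up for this example, and the 180-second figure is only the default floated above. The idea is to remember when the current streak of retriable errors began and to stop allowing restarts once that streak has lasted longer than the configured duration.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Debezium or Kafka Connect): tracks how long
 * a task has been failing with retriable errors and reports when the
 * configured time budget is exhausted.
 */
final class RestartBudget {
    private final long maxFailingMillis;    // e.g. 180_000 for a 180 sec limit
    private final LongSupplier clockMillis; // injectable clock, so the logic can be tested
    private boolean failing = false;        // true while a failure streak is in progress
    private long firstFailureAt;            // clock reading at the start of the streak

    RestartBudget(Duration maxFailing, LongSupplier clockMillis) {
        this.maxFailingMillis = maxFailing.toMillis();
        this.clockMillis = clockMillis;
    }

    /** Call on every retriable error; returns true while another restart is allowed. */
    boolean onRetriableError() {
        long now = clockMillis.getAsLong();
        if (!failing) {
            failing = true;
            firstFailureAt = now; // a new failure streak starts here
        }
        return now - firstFailureAt < maxFailingMillis;
    }

    /** Call after a successful restart so that the next outage gets a fresh budget. */
    void onSuccess() {
        failing = false;
    }
}
```

A task using something like this would presumably call `onRetriableError()` before each internal restart and, once it returns `false`, throw a non-retriable exception instead of restarting, so that Kafka Connect reports the task as FAILED rather than RUNNING. In production the clock would be `System::currentTimeMillis`; it is injected here only to keep the sketch testable.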
