Replications with unreachable targets are automatically removed

Matthieu RAKOTOJAONA RAINIMANGAVELO

Oct 26, 2023, 8:06:09 AM

to us...@couchdb.apache.org

Hello there,

I noticed that when a replication, continuous or transient, is run and the target host is unreachable, the replication job is deleted. Here are a few example log lines:

[error] 2023-10-18T12:43:37.394891Z cou...@127.0.0.1 <0.582.0> -------- couch_replicator_scheduler : Transient job {"0f63c93e6e24efacede944ce1ed14795","+continuous"} failed, removing. Error: <<"{checkpoint_commit_failure,<<\"instance_start_time on source and target database has changed since last checkpoint.\">>}">>

[error] 2023-10-18T12:43:38.679316Z cou...@127.0.0.1 <0.582.0> -------- couch_replicator_scheduler : Transient job {"ecc1efdf8f86c4ef626f0ba36766ec56","+continuous"} failed, removing. Error: <<"{checkpoint_commit_failure,<<\"instance_start_time on source and target database has changed since last checkpoint.\">>}">>

[error] 2023-10-18T12:43:42.230056Z cou...@127.0.0.1 <0.582.0> -------- couch_replicator_scheduler : Transient job {"a3ede72be05b5aaf0da538843928491a","+continuous"} failed, removing. Error: <<"{checkpoint_commit_failure,<<\"instance_start_time on source and target database has changed since last checkpoint.\">>}">>

[error] 2023-10-18T12:44:37.178885Z cou...@127.0.0.1 <0.582.0> -------- couch_replicator_scheduler : Transient job {"76e126167983ab9e8003853ad5cbcfaa",[]} failed, removing. Error: <<"{http_request_failed,\"GET\",\n \"http://some.anonymized.host:5984/db/\",\n {error,{error,{conn_failed,{error,econnrefused}}}}}">>

[error] 2023-10-18T12:44:43.901342Z cou...@127.0.0.1 <0.582.0> -------- couch_replicator_scheduler : Transient job {"0f63c93e6e24efacede944ce1ed14795","+continuous"} failed, removing. Error: <<"{checkpoint_commit_failure,<<\"Failure on target commit: {'EXIT',\\n {http_request_failed,\\\"POST\\\",\\n \\\"http://some.anonymized.host:5984/db/_ensure_full_commit\\\",\\n {error,{error,{conn_failed,{error,econnrefused}}}}}}\">>}">>

The problem is that in my use case these hosts are expected to be unreachable at times. I want CouchDB to treat this as a transient error and keep going; a human will tell CouchDB when a replication job should actually be removed. Today I need some application code to recreate those replication jobs, and I'd rather not.

Is there a way to make those replications persist?

--
Matthieu Rakotojaona
Research Engineer, Inria <https://www.inria.fr/>
STACK team <https://stack-research-group.gitlabpages.inria.fr/web/>

Nick Vatamaniuc

Oct 26, 2023, 1:40:16 PM

to us...@couchdb.apache.org

That's currently the expected behavior for _replicate (transient)
replication jobs. There is a retries_per_request parameter
https://docs.couchdb.org/en/stable/config/replicator.html#replicator/retries_per_request
to configure retries for the individual HTTP requests the replication
job makes, but if the whole job fails, it will be removed. Transient
jobs are expected to be managed and monitored by some external
application code. However, if you do want the jobs to keep trying
after a failure, consider using regular replication jobs backed by a
document in a `_replicator` database.
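A minimal sketch of both suggestions, using only Python's standard library against CouchDB's HTTP API. It assumes admin credentials with Basic auth, full database URLs for "source" and "target", and the `_node/_local/_config` endpoint for changing retries_per_request at runtime. The helper names, the document ID "db-to-remote" and the credentials in the usage line are hypothetical, chosen for illustration only.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_replication_doc(source_url, target_url, continuous=True):
    """Body of a replication document stored in the _replicator database."""
    # Assumption: source and target are full database URLs, not bare local names.
    return {"source": source_url, "target": target_url, "continuous": continuous}


def _basic_auth_header(username, password):
    # Basic auth header value: base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def make_replication_request(couch_url, doc_id, doc, username, password):
    """Prepare PUT /_replicator/{doc_id}; creating the document creates the job."""
    # Quote the ID so characters such as "/" cannot change the URL path.
    quoted_id = urllib.parse.quote(doc_id, safe="")
    return urllib.request.Request(
        url=f"{couch_url}/_replicator/{quoted_id}",
        data=json.dumps(doc).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": _basic_auth_header(username, password),
        },
    )


def make_retries_config_request(couch_url, retries, username, password):
    """Prepare PUT /_node/_local/_config/replicator/retries_per_request."""
    # Config values are sent as JSON strings, e.g. "10".
    return urllib.request.Request(
        url=f"{couch_url}/_node/_local/_config/replicator/retries_per_request",
        data=json.dumps(str(retries)).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": _basic_auth_header(username, password),
        },
    )


def send(request):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `send(make_replication_request("http://127.0.0.1:5984", "db-to-remote", build_replication_doc("http://127.0.0.1:5984/db", "http://some.anonymized.host:5984/db"), "admin", "secret"))`. Deleting that document is then the explicit "human" step that removes the job. Building requests separately from sending them keeps the sketch testable without a running server.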

Cheers,
-Nick

Matthieu RAKOTOJAONA RAINIMANGAVELO

Oct 27, 2023, 11:25:26 AM

to us...@couchdb.apache.org

Amazing! I knew there was a replicator database, but I didn't see it in the API reference. It's working as expected, thanks!