Replicator lastError property

Scott Ahten

Apr 11, 2016, 6:46:21 PM
to Couchbase Mobile
I'm running into an issue handling replication errors with the latest Sync Gateway and CBL 1.1 or later.

During first login and sync, I'm using replicator change notifications to monitor the replicator's lastError property, among others, to determine whether an initial sync is complete. However, when an error occurs at some point during sync, it's unclear whether sync has resumed error-free or is continually retrying after the previous error. When the app goes offline and then back online, this property seems to be reset, and with 401s we restart replication. But with transient errors, like a few new 502s we're seeing from the new Sync Gateway, replication keeps going and retains the error. This causes our sync logic to assume the initial sync failed.
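For context, our monitoring looks roughly like this (a minimal sketch against the CBL 1.x Objective-C API; treating "idle with no lastError" as "initial sync complete" is our own heuristic, not something the library defines):

#import <CouchbaseLite/CouchbaseLite.h>

- (void) observeInitialSync: (CBLReplication*)repl {
    // Retain the returned token if you need to remove the observer later.
    [[NSNotificationCenter defaultCenter]
        addObserverForName: kCBLReplicationChangeNotification
                    object: repl
                     queue: [NSOperationQueue mainQueue]
                usingBlock: ^(NSNotification* note) {
            if (repl.status == kCBLReplicationIdle) {
                if (repl.lastError == nil) {
                    NSLog(@"Initial sync appears complete.");
                } else {
                    // lastError is still set -- was the error transient and
                    // already retried, or did some revisions actually fail?
                    NSLog(@"Idle, but lastError = %@", repl.lastError);
                }
            }
        }];
}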

While this behavior reflects the property's name, "lastError", it makes it difficult to determine whether sync went idle without an error. If I kill the app, which restarts the replicator, the property is cleared. Is there any way to reset this value once the error has been read and handled, other than a restart?

Thanks, 

- Scott

Scott Ahten

Apr 12, 2016, 10:50:12 AM
to Couchbase Mobile
To clarify, with 401s we display a login prompt, which eventually reconfigures replication, which restarts it. And with most other errors, such as offline transitions, CBL seems to have its own retry mechanism; I was hoping to avoid writing my own to handle this case. However, after further investigation, it appears that the error is a 502 generated by the client, not the server, so it's not clear there is a prescribed way to "retry" this sort of error. Looking at the source for the iOS library...

{kCBLStatusUpstreamError,        502, "Invalid response from remote replication server"},
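Given that, this is roughly how the error surfaces on our end (a sketch; I'm assuming these client-generated statuses are reported with the CBLHTTPErrorDomain domain):

NSError* error = repl.lastError;   // repl is our CBLReplication
if ([error.domain isEqualToString: CBLHTTPErrorDomain] && error.code == 502) {
    // kCBLStatusUpstreamError: generated locally by CBL when it receives an
    // invalid response, not an actual 502 returned by Sync Gateway.
}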

Jens Alfke

Apr 12, 2016, 12:39:27 PM
to mobile-c...@googlegroups.com
On Apr 11, 2016, at 3:46 PM, Scott Ahten <lightand...@gmail.com> wrote:

when an error occurs at some point during sync, it's unclear whether sync has resumed error-free or is continually retrying after the previous error.

Yeah, this is a problem with the API, and I’ve been unsure how to fix it. It’s hard to convey error status when there are lots of requests running in parallel, and not all errors are fatal. I’m open to new ideas!

One possibility is to add an ‘errors’ property that’s an array: every time there’s an error, it’s appended to the end of the array. By remembering the array’s count, you can find out whether any new errors have occurred since a given time.
Alternatively, add a notification that gets posted every time there’s a new error.
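Hypothetically, something like this (none of these members exist today; the names are purely illustrative):

@interface CBLReplication (Proposed)
// Hypothetical: an append-only array of every NSError since replication started.
@property (readonly, nonatomic) NSArray* errors;
@end

// Client side: remember the count, and later check whether it has grown.
NSUInteger seenErrors = repl.errors.count;
// ... after further replication activity ...
if (repl.errors.count > seenErrors) {
    // New errors have occurred since the last check.
}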

This ties in with the related issue of per-document status reporting: if these errors indicated the associated document(s), you could tell which documents had errors.

What these wouldn’t tell you is when a previous error is resolved: for example, if there was an error downloading a revision, but then on the next try the download succeeds, so the error is no longer relevant. (This would actually be difficult for the replicator to keep track of, since it would have to remember the operation that caused every error, and look it up again the next time it performed that operation.)

When the app goes offline and then back online, this property seems to be reset.

Yes. It also resets when a continuous replication finishes its work but has failed to transfer some revisions, and decides to try the failed revisions again.

But with transient errors, like a few new 502s we're seeing from the new Sync Gateway, replication keeps going and retains the error. This causes our sync logic to assume the initial sync failed. 

It does mean it failed to some degree, because not all revisions got transferred.

If I kill the app, which restarts the replicator, the property is cleared. Is there any way to reset this value once the error has been read and handled, other than a restart?

No, the only way for the app to clear it is to restart the replication.

—Jens

Scott Ahten

Apr 12, 2016, 2:12:42 PM
to Couchbase Mobile
Thanks, Jens. 

I got that impression, but wanted to make sure. 

On Tuesday, April 12, 2016 at 12:39:27 PM UTC-4, Jens Alfke wrote:

Yeah, this is a problem with the API, and I’ve been unsure how to fix it. It’s hard to convey error status when there are lots of requests running in parallel, and not all errors are fatal. I’m open to new ideas!
 
One idea would be to create a "stack" of errors backed by a mutable array, where the lastError property returns the last error on the stack. In addition, a method could be used to pull the last error off the stack to process it. The userInfo dictionary for each error could contain the ID of the document that caused the error, as well as the severity of the error, such as whether it was fatal, retryable by the replicator, etc. If I process errors as they occur (and discover none were fatal), the stack would be empty when the replicator goes idle, and my current initial-sync logic would work as planned.
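In code, something like this (entirely hypothetical; neither the method nor the userInfo keys exist in CBL today):

NSError* err;
while ((err = [repl popLastError]) != nil) {                    // hypothetical method
    NSString* docID = err.userInfo[@"CBLErrorDocumentID"];      // hypothetical key
    BOOL fatal = [err.userInfo[@"CBLErrorIsFatal"] boolValue];  // hypothetical key
    // Handle or log each error as it's pulled off the stack. If none were
    // fatal, the stack is empty by the time the replicator goes idle, and
    // the "idle with no errors" check works as intended.
    NSLog(@"doc %@ failed (fatal: %d): %@", docID, fatal, err);
}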
But with transient errors, like a few new 502s we're seeing from the new Sync Gateway, replication keeps going and retains the error. This causes our sync logic to assume the initial sync failed. 

It does mean it failed to some degree, because not all revisions got transferred.

Is this an error that would be retried at some point by CBL, or is it fatal, in that some of the data would not have been synced when the replicator went idle? Including this information as part of the userInfo of the error would be very useful.

Jens Alfke

Apr 12, 2016, 2:28:14 PM
to mobile-c...@googlegroups.com
On Apr 12, 2016, at 11:12 AM, Scott Ahten <lightand...@gmail.com> wrote:

One idea would be to create a "stack" of errors backed by a mutable array, where the lastError property returns the last error on the stack. In addition, a method could be used to pull the last error off the stack to process it.

Basically a FIFO stream of errors. This seems conceptually the same as the notification I proposed, but the notification seems more idiomatic.

The userInfo dictionary for each error could contain the ID of the document that caused the error, as well as the severity of the error, such as whether it was fatal, retryable by the replicator, etc.

Definitely. Although with custom properties like these, it starts to seem that these errors should have a custom domain & codes, with the actual network/HTTP/server errors attached via the NSUnderlyingErrorKey.
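Something along these lines (a sketch; the domain and code below are invented, but NSUnderlyingErrorKey is standard Foundation):

// The raw network/HTTP error, e.g. a dropped connection:
NSError* underlying = [NSError errorWithDomain: NSURLErrorDomain
                                          code: NSURLErrorNetworkConnectionLost
                                      userInfo: nil];
// Wrap it in a replicator-domain error (domain and code are hypothetical):
NSError* replError = [NSError errorWithDomain: @"CBLReplicationErrorDomain"
                                         code: 1  /* e.g. "revision transfer failed" */
                                     userInfo: @{ NSUnderlyingErrorKey: underlying }];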

Is this an error that would be retried at some point by CBL, or is it fatal, in that some of the data would not have been synced when the replicator went idle? Including this information as part of the userInfo of the error would be very useful.

It depends on the type of replication. A one-shot replicator will stop even if some of the documents got errors. A continuous replication will circle back and retry instead of going idle. (This has to do with the goal that a continuous replication is “fire and forget”: it should be able to run forever without intervention.)

Underlying this is some lower-level retry logic. If an individual HTTP request fails for a reason that we judge to be temporary, like a socket disconnection or a 50x status, that request will retry several times before giving up and reporting the error up the chain to the replicator. So a truly intermittent error won’t even be visible to you.

—Jens