Potentially irrecoverable lost updates


anton.m...@safetyculture.io

Dec 4, 2014, 1:52:43 AM
to mobile-c...@googlegroups.com
Hi,

After speaking with Andrew Reslan and looking through Sync Gateway's source code, it seems that when Sync Gateway receives updates from Couchbase's TAP feed out of order, it starts to buffer them, while waiting for the gaps to be filled.

In cases when it buffers more than 10k items, or waits for longer than 5 seconds, it gives up, and puts the items into its cache.

This can happen when Couchbase and Sync Gateway are under load. We can see it happen in our load tests.

The requests to the _changes feed are served from the cache (if possible).

In cases where Sync Gateway gave up waiting for missed updates, any subscriber to the _changes feed that takes the highest sequence in the feed and uses it as its "since" parameter next time will lose updates. Because the client does not know that its "since" parameter is "too high", it will continue to get only later changes as that parameter keeps increasing. Therefore, unless the document whose update carried the skipped sequence is updated again, its update will be lost to these clients.

As far as I know, Couchbase Lite (at least on Android) always takes the highest received sequence and uses it as the "since" parameter in its sync requests.

Therefore lost updates can happen when using Couchbase Lite on Android.
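
To make the failure mode concrete, here is a minimal Go sketch of a client that checkpoints on the highest sequence it has seen. It is a toy model only: the `change` type and the `changesSince` and `pull` functions are hypothetical names, not Sync Gateway or Couchbase Lite APIs, and the cache is just a slice. It assumes sequence 2 is delayed past the give-up point, and shows that the client never asks for it even if it were to appear later.

```go
package main

import "fmt"

// change is a simplified _changes entry: a sequence number and a document ID.
type change struct {
	Seq   int
	DocID string
}

// changesSince returns the cached changes whose sequence is greater than since.
func changesSince(cache []change, since int) []change {
	var out []change
	for _, c := range cache {
		if c.Seq > since {
			out = append(out, c)
		}
	}
	return out
}

// pull simulates one client request: fetch everything after since, record the
// document IDs, and return the highest sequence seen as the new checkpoint.
func pull(cache []change, since int, seen map[string]bool) int {
	for _, c := range changesSince(cache, since) {
		seen[c.DocID] = true
		if c.Seq > since {
			since = c.Seq
		}
	}
	return since
}

func main() {
	seen := map[string]bool{}

	// Sequence 2 is delayed, so only 1 and 3 are in the cache.
	cache := []change{{1, "a"}, {3, "c"}}
	since := pull(cache, 0, seen)

	// Sequence 2 becomes visible afterwards, but the checkpoint is already 3.
	cache = append(cache, change{2, "b"})
	since = pull(cache, since, seen)

	fmt.Println(since, seen["b"]) // prints "3 false"
}
```

The client's checkpoint only ever moves forward, so nothing in this loop can recover document "b" short of restarting with no "since" value.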

From what I can tell, the situation is not very easy to fix, as a sequence can get lost when Sync Gateway acquires a new sequence (using the Incr function) but the update then fails. So it is difficult for Sync Gateway to tell whether updates on the TAP feed are merely delayed and arriving out of order, or are actually lost.
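
A small Go sketch of how such a permanent gap could arise, assuming the sequence is reserved before the write is attempted. The `store` type and its methods are hypothetical stand-ins for illustration, not Sync Gateway code.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy stand-in for the database: a counter plus the writes that succeeded.
type store struct {
	counter int
	written map[int]string
}

// nextSeq mimics an atomic increment: once handed out, a number is never reused.
func (s *store) nextSeq() int {
	s.counter++
	return s.counter
}

// save reserves a sequence first and then attempts the write. If the write
// fails, the reserved sequence is simply never used.
func (s *store) save(doc string, fail bool) (int, error) {
	seq := s.nextSeq()
	if fail {
		return seq, errors.New("write failed")
	}
	s.written[seq] = doc
	return seq, nil
}

// gaps lists the sequences up to the counter that have no document behind them.
func (s *store) gaps() []int {
	var missing []int
	for seq := 1; seq <= s.counter; seq++ {
		if _, ok := s.written[seq]; !ok {
			missing = append(missing, seq)
		}
	}
	return missing
}

func main() {
	s := &store{written: map[int]string{}}
	s.save("a", false)
	s.save("b", true) // sequence 2 is reserved, but the write fails
	s.save("c", false)
	fmt.Println(s.gaps()) // prints "[2]"
}
```

A reader of the feed sees 1 and 3 and has no way of knowing, from the feed alone, whether 2 is late or will never arrive.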

Am I understanding the situation correctly?

If so, would you be able to explain how this can be avoided? Do you have a plan for getting this fixed? If so, I would love to learn what your approach will be.

If I am misunderstanding the situation, could you correct where I am going wrong? How does a client ensure it does not lose updates in cases after Sync Gateway "gave up" waiting for some sequences?

Cheers,
Anton

Jens Alfke

Dec 4, 2014, 1:42:12 PM
to mobile-c...@googlegroups.com
Anton,

Yes, you're describing a worst-case scenario that can sometimes happen under very heavy database-server load. The root of the problem is that the database may not deliver all notifications of document updates (the "TAP" feed) in a timely manner when it's very busy. TAP itself isn't an ordered stream, but Sync Gateway updates do have chronological sequence numbers and need to be processed in order. So if one update goes missing for a long time (it always arrives eventually but it can take minutes) it has to buffer up all numerically-later updates until the missing one arrives.

To keep the gateway from blocking its update notifications indefinitely, there's a timeout, as you said. After a while it will give up on a missing update and proceed without it. When that update does arrive, it has to ignore it because its sequence number is now out of order so there's no way to re-insert it into change feeds that it's already delivered to clients.

The solution to this is the new update-notification system in Couchbase Server 3.0, which is called DCP (Database Change Protocol). It's much better about timely delivery. We unfortunately didn't have time to update Sync Gateway to use this new protocol before Couchbase 3 shipped, but the work is underway now and we plan to have it in a new Gateway release soon.

Until then, one workaround is to provision enough database-server resources that the cluster nodes won't reach those levels of load during actual use. (As I'm sure you've seen, the Gateway logs a warning whenever a sequence gets dropped on the floor, so it's easy to detect the problem.)

Another workaround is to increase those limits ("buffers more than 10k items, or waits for longer than 5 seconds") in the Gateway source code and rebuild it. The downside is that instead of change notifications being lost, you'll instead get greater latency in delivering them, but that may be acceptable depending on your use case. These constants are in the file change_cache.go:

var MaxChannelLogPendingCount = 10000 // Max number of waiting sequences
var MaxChannelLogPendingWaitTime = 5 * time.Second // Max time we'll wait for a missing sequence

(In hindsight, the timeout should probably have been more like 60 seconds.)

—Jens

anton.m...@safetyculture.io

Dec 4, 2014, 4:37:03 PM
to mobile-c...@googlegroups.com
Hi Jens,

Thank you for your clear reply. I am glad to hear that there is a solid plan to improve the situation.

1. Would you be able to shine more light on the technical details about how DCP solves the delivery of out of order and delayed updates?

2. Am I right that sequences can get permanently lost (e.g. if a sequence is acquired with an incr call, but then the update fails)? If so, even when using DCP, how will Sync Gateway be able to differentiate between permanently lost sequences and delays?

3. If we increase the timeout and provision enough hardware resources, but the changes feed with a gap still gets delivered, the only way one could get the client that received that feed to request the missed sequence is to do a sync from the start.

However, there does not seem to be a way (i.e. there is no API) to trigger a sync with no "since" parameter in Couchbase Lite.

As it is very hard to predict the exact load one will see, especially in AWS where sometimes things like disk and network delays spike, it is difficult to ensure that not a single update will be missed. In these rare cases we would love to be able to fix the problem for the customers who have been hit with it. Without the API to do a sync from scratch, from what I can see, we have no way to fix the problem, and have nothing to tell our customers who have hit it. We cannot be in this situation :) We need to be able to guide a customer to a resolution where they have all their data up to date.

4. Thank you for the suggestion to increase the time out in the Sync Gateway.

Would it be possible to do this in your code base? And maybe even accept a parameter from the command line?

As you release new versions of Sync Gateway we would love to be able to deploy them, without having to rebuild our own version each time.

Thanks a lot for your help, Jens! We look forward to using your solution and working with you to improve it.

Cheers,
Anton

J. Chris Anderson

Dec 5, 2014, 10:27:40 AM
to mobile-c...@googlegroups.com


On Thursday, December 4, 2014 1:37:03 PM UTC-8, anton.m...@safetyculture.io wrote:
> Hi Jens,
>
> Thank you for your clear reply. I am glad to hear that there is a solid plan to improve the situation.
>
> 1. Would you be able to shine more light on the technical details about how DCP solves the delivery of out of order and delayed updates?
>
> 2. Am I right that sequences can get permanently lost (e.g. if a sequence is acquired with an incr call, but then the update fails)? If so, even when using DCP, how will Sync Gateway be able to differentiate between permanently lost sequences and delays?
>
> 3. If we increase the timeout and provision enough hardware resources, but the changes feed with a gap still gets delivered, the only way one could get the client that received that feed to request the missed sequence is to do a sync from the start.
>
> However, there does not seem to be a way (i.e. there is no API) to trigger a sync with no "since" parameter in Couchbase Lite.


One thing we've talked about is the possibility of sending changes that show up on TAP during the out of order waiting interval of a missing sequence, with a lower-than-actual sequence number on the wire (use the last stable sequence number, the one before the missing one.)

So this allows low-latency delivery of newer messages, and then for the sequence number to catch up when any waiting changes come through TAP. The cost is potential revs diff chatter on any sync client that disconnects / reconnects during a window where we are waiting for a sequence to come through TAP.

So we can avoid latency issues and instead have the client rollback window remain open at the earlier point, by only ever sending the lower stable sequence number in changes feeds.
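
A rough Go sketch of the idea, assuming the gateway tracks the lowest sequence it is still waiting for. The `wireSeq` function is a hypothetical helper for illustration only, not an existing Sync Gateway function.

```go
package main

import "fmt"

// wireSeq returns the sequence number to report to clients for a change whose
// real sequence is actual. lowestMissing is the lowest sequence still being
// waited for, or 0 if there is no gap.
func wireSeq(actual, lowestMissing int) int {
	if lowestMissing != 0 && actual > lowestMissing {
		return lowestMissing - 1 // the last stable sequence before the gap
	}
	return actual
}

func main() {
	// Sequence 4 is still missing; 5 and 6 have already arrived.
	for _, actual := range []int{3, 5, 6} {
		fmt.Println(actual, "->", wireSeq(actual, 4))
	}
}
```

A client that checkpoints on the reported value stays at 3 until the gap closes, so a reconnect during that window re-requests 4 onward and pays the revs diff cost mentioned above.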

Jens Alfke

Dec 5, 2014, 11:57:55 AM
to mobile-c...@googlegroups.com

On Dec 5, 2014, at 7:27 AM, J. Chris Anderson <jch...@couchbase.com> wrote:

> One thing we've talked about is the possibility of sending changes that show up on TAP during the out of order waiting interval of a missing sequence, with a lower-than-actual sequence number on the wire (use the last stable sequence number, the one before the missing one.)

Yeah, but this has a danger of recreating those infinite pull loops, where the client uses the lower sequence number as its checkpoint and then starts over from there again on its next _changes fetch.

I'm optimistic that using DCP will make this problem mostly go away since it's better about sending timely notifications. But we won't know for sure until we have DCP support implemented and can run performance tests with loaded servers.

—Jens

anton.m...@safetyculture.io

Dec 5, 2014, 6:46:57 PM
to mobile-c...@googlegroups.com
Hi Guys,

Thank you very much for your replies.

Just some background: I work at SafetyCulture, where we are building a mobile solution, the core of which relies on solid replication between mobile apps and our cloud service.

The solution involves creating and exchanging documents. Our customers have to have up to date data. They can wait for the right data (in the worst cases even up to 30 mins, if these cases are very rare of course), but we should never be in a situation where they have missed updates, especially if they cannot recover them easily.

Would you be able to answer the 4 questions I have posted above?

We really need that information to help us build a solid solution. I believe Sync Gateway is very close to providing a perfect solution, but we need to understand how various cases are handled.

Do you have an estimate for when Sync Gateway will use DCP? Has the implementation work already begun?

Thanks again!

Cheers,
Anton

Justin

Dec 17, 2014, 7:55:43 AM
to mobile-c...@googlegroups.com
Jens,

I believe I am currently seeing this issue, but I am not certain. In the logs below, the interval is 5 seconds. Are these the logs you mentioned?

00:42:33.669456 Cache: Received #2 ("_user/bb039d924013d63f9e10ae99d46eeeb1bc59ffc9b2fcb4f6f692bf6af87eefdf")
00:42:33.669688 Cache:   Deferring #2 (1 now waiting for #1...#1)

00:42:39.564527 WARNING: changeCache: Giving up, accepting #2 even though #1 is missing -- db.(*changeCache)._addPendingLogs() at change_cache.go:320
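
For anyone who wants to watch for this warning automatically, here is a small Go sketch that scans log text for it. It assumes the message format is exactly the one shown in the log line above (other Sync Gateway versions may word it differently), and `skippedSequences` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// givingUp matches the warning format quoted in this thread, e.g.
// "Giving up, accepting #2 even though #1 is missing".
var givingUp = regexp.MustCompile(`Giving up, accepting #(\d+) even though #(\d+) is missing`)

// skippedSequences returns the sequence numbers reported as missing in the log text.
func skippedSequences(logText string) []uint64 {
	var out []uint64
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		m := givingUp.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		// m[2] is the sequence that was given up on.
		if n, err := strconv.ParseUint(m[2], 10, 64); err == nil {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	sample := `00:42:33.669688 Cache:   Deferring #2 (1 now waiting for #1...#1)
00:42:39.564527 WARNING: changeCache: Giving up, accepting #2 even though #1 is missing -- db.(*changeCache)._addPendingLogs() at change_cache.go:320`
	fmt.Println(skippedSequences(sample)) // prints "[1]"
}
```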

Prior to seeing this post I had opened an issue on the Android CBL GitHub, assuming it was an issue with CBL.

If this is my problem, no sweat: I'll compile the latest Sync Gateway code and give it a try with the new 60-second timeout.

Thanks in advance,

-Justin

P.S. 

Anton, awesome thread. Thanks a million.

Jens Alfke

Dec 17, 2014, 11:54:54 AM
to mobile-c...@googlegroups.com

> On Dec 17, 2014, at 4:55 AM, Justin <justin...@gmail.com> wrote:
>
> I believe I am currently seeing this issue but I am not certain. In the logs below, the interval is 5 seconds. Are these the logs you mentioned?

Yes. Although it's weird that this would happen to the very first change the gateway creates ("#1"). Was the DB server already under heavy load?

—Jens

Justin

Dec 20, 2014, 6:33:23 AM
to mobile-c...@googlegroups.com
Jens,  

Thanks for the quick response and the confirmation. I thought I had signed up for e-mail notifications on this thread, but I hadn't.

I would need to go back and look, but I didn't see high utilization on the DB server. I was using the SG in a way that wasn't intended. I'm working on an end-to-end encrypted solution and, being myopically focused on the app, I started using Sync Gateway to assign recently joined clients to what would be a predetermined channel for each hashed phone contact, and then having the client distribute a symmetric key to known contacts who currently have accounts so that their profile images could be viewed. So 1000 phone contacts equals 1000 channels created immediately, plus the subsequent key exchange.

Bottom line I was being a jack ass. I need to code the key exchange server this weekend. 

The skipped docs threw me; I wrongly assumed it was CBL for Android.

Thanks again Jens,

-Justin

Justin

Dec 29, 2014, 4:10:13 AM
to mobile-c...@googlegroups.com
Jens,

I got my back-end public key server up. However, I am still seeing the first sequence number being skipped, but I no longer see subsequent sequence numbers being skipped.

This occurs only the first time an account is created through the sync gateway.  I can see that the user I've created appears as a local doc on the CB server with the sequence number of 2.  

However, I cannot see anything with a sequence number of 1. All are local docs. I'm guessing the _sync:syncdata should have a sequence # of 1 assigned to it, or that the local user doc should have the first sequence #. Or, more likely, I'm missing something basic.

I've attached screenshots of the issue. I can replicate it on Windows and Linux, both running CB Server 3.0.1 and SG 1.0.3.

Thanks Jens,

-Justin

P.S. If this is an issue I think it's just cosmetic. Oh and hopefully you had a kick ass Christmas.


On Wednesday, 17 December 2014 23:54:54 UTC+7, Jens Alfke wrote:
CBserver_all_docs.JPG
curl_command.JPG
SG_function_config.JPG
SG_log_with_warning.JPG
sync_doc.JPG
user_doc.JPG

Adam Fraser

Dec 29, 2014, 3:11:27 PM
to mobile-c...@googlegroups.com
Anton,

Here are the specific answers to the four questions you'd posted earlier:


> 1. Would you be able to shine more light on the technical details about how DCP solves the delivery of out of order and delayed updates?

The main difference when using DCP is reduced latency for updates appearing on the stream. TAP waits for disk persistence before sending the update to the stream, which increases the potential for high latency under heavy load. DCP only waits for memory persistence before sending to the stream. In practice this means that there's much less chance that SG will reach the wait timeout and discard the document under DCP, particularly with the timeout change made in https://github.com/couchbase/sync_gateway/issues/517.

> 2. Am I right that sequences can get permanently lost (e.g. if a sequence is acquired with an incr call, but then the update fails)? If so, even when using DCP, how will Sync Gateway be able to differentiate between permanently lost sequences and delays?

This case will need to be taken into consideration when implementing https://github.com/couchbase/sync_gateway/issues/525.

> 3. If we increase the timeout and provision enough hardware resources, but the changes feed with a gap still gets delivered, the only way one could get the client that received that feed to request the missed sequence is to do a sync from the start.
>
> However, there does not seem to be a way (i.e. there is no API) to trigger a sync with no "since" parameter in Couchbase Lite.
>
> As it is very hard to predict the exact load one will see, especially in AWS where sometimes things like disk and network delays spike, it is difficult to ensure that not a single update will be missed. In these rare cases we would love to be able to fix the problem for the customers who have been hit with it. Without the API to do a sync from scratch, from what I can see, we have no way to fix the problem, and have nothing to tell our customers who have hit it. We cannot be in this situation :) We need to be able to guide a customer to a resolution where they have all their data up to date.

This has been covered somewhat by Chris's previous updates on this topic, and will be included in https://github.com/couchbase/sync_gateway/issues/525. This enhancement has been prioritized, and I expect it should be available (at least on a feature branch) in January.

> 4. Thank you for the suggestion to increase the time out in the Sync Gateway.
>
> Would it be possible to do this in your code base? And maybe even accept a parameter from the command line?

This has been increased to 60 sec (https://github.com/couchbase/sync_gateway/issues/517).


Thanks,
Adam

Matt Ingenthron

Dec 29, 2014, 3:29:13 PM
to mobile-c...@googlegroups.com
Hi Adam, all,


On Dec 29, 2014, at 12:11 PM, Adam Fraser <adamc...@gmail.com> wrote:

> Anton,
>
> Here are the specific answers to the four questions you'd posted earlier:
>
> > 1. Would you be able to shine more light on the technical details about how DCP solves the delivery of out of order and delayed updates?
>
> The main difference when using DCP is reduced latency for updates appearing on the stream. TAP waits for disk persistence before sending the update to the stream, which increases the potential for high latency under heavy load. DCP only waits for memory persistence before sending to the stream. In practice this means that there's much less chance that SG will reach the wait timeout and discard the document under DCP, particularly with the timeout change made in https://github.com/couchbase/sync_gateway/issues/517.


TAP doesn’t wait for disk persistence. Updates to views in releases prior to 3.0 wait for disk persistence, but in 3.0 that no longer goes through disk IO.

I suspect the thing you’re thinking of is that if you restart a TAP stream from the beginning, it does have to do a “backfill” which involves getting all of the data on disk along with all of the items in memory.  In TAP, there is guaranteed delivery (for the life of the stream) but no guaranteed order and little ability to pick up at a particular place in time, which is the advantage DCP will bring.





-- 
Matt Ingenthron
Couchbase, Inc.
