Sync Gateway issue with getting changes within one document


dom utom

Apr 6, 2015, 3:31:16 PM
to mobile-c...@googlegroups.com
Hi there,

I've run into a serious replication issue when changing one document from Couchbase Lite iOS several times within a short window (under 3 seconds). I think it's a sync_gateway cache issue. If I make fast changes to different documents, everything works fine. If I do the same with a single document, sync_gateway caches the changes and only starts replication after around 80 seconds, which is not real-time enough for my application. I have the feeling this worked much better in previous versions of sg, maybe before 1.0.3. Thanks again for your help, and all the best. Please find my sg log below.

best
dominik



Document changes within 2 seconds from Couchbase Lite iOS:

21:05:44.272012 HTTP:  #110: POST /whwgroup_sg/_revs_diff  (as PfY7eAcPnUj7Gwnfo9)
21:05:44.275717 HTTP+: #110:     --> 200   (4.5 ms)
21:05:44.317182 HTTP:  #111: POST /whwgroup_sg/_bulk_docs  (as PfY7eAcPnUj7Gwnfo9)
21:05:44.320315 CRUD+: Invoking sync on doc "59238463-B4DD-48CB-A607-D3701DBA794F" rev 36-f203f0c8163c9bdb1fdf9c3d7dc613df
21:05:44.321101 CRUD+: Saving old revision "59238463-B4DD-48CB-A607-D3701DBA794F" / "35-2677e936a89b439fb1173339c6c5d28e" (305 bytes)
21:05:44.321468 CRUD+: Backed up obsolete rev "59238463-B4DD-48CB-A607-D3701DBA794F"/"35-2677e936a89b439fb1173339c6c5d28e"
21:05:44.322515 Cache: SAVING #11297
21:05:44.323521 CRUD: Stored doc "59238463-B4DD-48CB-A607-D3701DBA794F" / "36-f203f0c8163c9bdb1fdf9c3d7dc613df"
21:05:44.323600 HTTP+: #111:     --> 201   (7.2 ms)
21:05:44.741974 HTTP:  #112: POST /whwgroup_sg/_revs_diff  (as PfY7eAcPnUj7Gwnfo9)
21:05:44.745658 HTTP+: #112:     --> 200   (4.5 ms)
21:05:44.826215 HTTP:  #113: POST /whwgroup_sg/_bulk_docs  (as PfY7eAcPnUj7Gwnfo9)
21:05:44.829333 CRUD+: Invoking sync on doc "59238463-B4DD-48CB-A607-D3701DBA794F" rev 37-211b94bddd4dbb44d3918a1eda6d8ccf
21:05:44.829519 CRUD+: Saving old revision "59238463-B4DD-48CB-A607-D3701DBA794F" / "36-f203f0c8163c9bdb1fdf9c3d7dc613df" (304 bytes)
21:05:44.829624 CRUD+: Backed up obsolete rev "59238463-B4DD-48CB-A607-D3701DBA794F"/"36-f203f0c8163c9bdb1fdf9c3d7dc613df"
21:05:44.829780 Cache: SAVING #11298
21:05:44.830011 CRUD: Stored doc "59238463-B4DD-48CB-A607-D3701DBA794F" / "37-211b94bddd4dbb44d3918a1eda6d8ccf"
21:05:44.830028 HTTP+: #113:     --> 201   (9.4 ms)
21:05:45.691936 Cache: Received #11298 after 862ms ("59238463-B4DD-48CB-A607-D3701DBA794F" / "37-211b94bddd4dbb44d3918a1eda6d8ccf")
21:05:45.691950 Cache:   Deferring #11298 (1 now waiting for #11297...#11297)
21:05:45.692385 Shadow: Pushing "59238463-B4DD-48CB-A607-D3701DBA794F", rev "37-211b94bddd4dbb44d3918a1eda6d8ccf"
21:05:46.701115 Shadow+: Pulling "59238463-B4DD-48CB-A607-D3701DBA794F", CAS=7bbe4225486027 ... have UpstreamRev="35-2677e936a89b439fb1173339c6c5d28e", UpstreamCAS=c210233490
21:05:46.701142 Shadow+: Not pulling "59238463-B4DD-48CB-A607-D3701DBA794F", CAS=7bbe4225486027 (echo of rev "37-211b94bddd4dbb44d3918a1eda6d8ccf")
21:05:46.701185 CRUD+: Invoking sync on doc "59238463-B4DD-48CB-A607-D3701DBA794F" rev 37-211b94bddd4dbb44d3918a1eda6d8ccf
21:05:46.701301 CRUD+: Got old revision "59238463-B4DD-48CB-A607-D3701DBA794F" / "36-f203f0c8163c9bdb1fdf9c3d7dc613df" --> 304 bytes
21:05:46.701530 CRUD+: updateDoc("59238463-B4DD-48CB-A607-D3701DBA794F"): Rev "37-211b94bddd4dbb44d3918a1eda6d8ccf" leaves "37-211b94bddd4dbb44d3918a1eda6d8ccf" still current
21:05:46.701674 Cache: SAVING #11299
21:05:46.701863 CRUD: Stored doc "59238463-B4DD-48CB-A607-D3701DBA794F" / "37-211b94bddd4dbb44d3918a1eda6d8ccf"
21:05:47.669161 Cache: Received #11299 after 967ms ("59238463-B4DD-48CB-A607-D3701DBA794F" / "37-211b94bddd4dbb44d3918a1eda6d8ccf")
21:05:47.669171 Cache:   Deferring #11299 (2 now waiting for #11297...#11297)
21:05:49.385009 HTTP:  #114: PUT /whwgroup_sg/_local/a229f2c95375a826f7b04140ea38d1b1ad307a6d  (as PfY7eAcPnUj7Gwnfo9)
21:05:49.385207 HTTP+: #114:     --> 201   (0.4 ms)
21:05:56.928506 HTTP:  #115: GET /
21:05:56.928542 HTTP+: #115:     --> 200   (0.0 ms)




Cached replication after around 80 seconds:

21:06:13.523228 HTTP:  #116: GET /
21:06:13.523259 HTTP+: #116:     --> 200   (0.0 ms)
21:06:21.996337 HTTP:  #117: GET /
21:06:21.996367 HTTP+: #117:     --> 200   (0.0 ms)
21:06:24.303221 HTTP:  #118: GET /
21:06:24.303249 HTTP+: #118:     --> 200   (0.0 ms)
21:06:25.172388 HTTP:  #119: GET /
21:06:25.172433 HTTP+: #119:     --> 200   (0.1 ms)
21:07:02.354625 WARNING: changeCache: Giving up, accepting #11298 even though #11297 is missing -- db.(*changeCache)._addPendingLogs() at change_cache.go:320
21:07:02.354686 Cache:     #11298 ==> channel "whw_news"
21:07:02.354708 Cache:     #11298 ==> channel "*"
21:07:02.354752 Cache:     #11299 ==> channel "whw_news"
21:07:02.354769 Cache:     #11299 ==> channel "*"
21:07:02.354837 Changes+: Notifying that "whwgroup_sg" changed (keys="{*, whw_news}") count=3
21:07:02.355019 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.355030 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.355030 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.355147 Cache: getCachedChanges("whw_news", {0 11296}) --> 1 changes valid from #11286
21:07:02.355085 Cache: getCachedChanges("whw_news", {0 11296}) --> 1 changes valid from #11286
21:07:02.355220 Cache: getCachedChanges("whw_news", {0 11296}) --> 1 changes valid from #11286
21:07:02.358133 Changes+: MultiChangesFeed sending &{Seq:11299 ID:59238463-B4DD-48CB-A607-D3701DBA794F Deleted:false Removed:{} Doc:map[] Changes:[map[rev:37-211b94bddd4dbb44d3918a1eda6d8ccf] map[rev:27-97424609e18e1df04b6d725b9920651d] map[rev:16-e4fa6c34f21b165556e951b08f54b5ed]] branched:true}   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358177 Changes: MultiChangesFeed done   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358272 HTTP+: #099:     --> 200 OK  (0.0 ms)
21:07:02.358367 Changes+: MultiChangesFeed sending &{Seq:11299 ID:59238463-B4DD-48CB-A607-D3701DBA794F Deleted:false Removed:{} Doc:map[] Changes:[map[rev:37-211b94bddd4dbb44d3918a1eda6d8ccf] map[rev:27-97424609e18e1df04b6d725b9920651d] map[rev:16-e4fa6c34f21b165556e951b08f54b5ed]] branched:true}   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358409 Changes: MultiChangesFeed done   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358496 HTTP+: #103:     --> 200 OK  (0.0 ms)
21:07:02.358607 Changes+: MultiChangesFeed sending &{Seq:11299 ID:59238463-B4DD-48CB-A607-D3701DBA794F Deleted:false Removed:{} Doc:map[] Changes:[map[rev:37-211b94bddd4dbb44d3918a1eda6d8ccf] map[rev:27-97424609e18e1df04b6d725b9920651d] map[rev:16-e4fa6c34f21b165556e951b08f54b5ed]] branched:true}   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358659 Changes: MultiChangesFeed done   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.358736 HTTP+: #100:     --> 200 OK  (0.0 ms)
21:07:02.505633 HTTP:  #120: GET /whwgroup_sg/59238463-B4DD-48CB-A607-D3701DBA794F?rev=37-211b94bddd4dbb44d3918a1eda6d8ccf&revs=true&attachments=true  (as PfY7eAcPnUj7Gwnfo9)
21:07:02.505953 HTTP+: #120:     --> 200   (0.5 ms)
21:07:02.510610 HTTP:  #121: POST /whwgroup_sg/_changes  (as PfY7eAcPnUj7Gwnfo9)
21:07:02.510655 Changes: MultiChangesFeed({whw_news}, {Since:11299 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc2105f8840}) ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.510737 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.510755 Changes+: MultiChangesFeed waiting...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.510759 Changes+: Waiting for "whwgroup_sg"'s count to pass 3
21:07:02.785815 HTTP:  #122: POST /whwgroup_sg/_changes  (as PfY7eAcPnUj7Gwnfo9)
21:07:02.785855 Changes: MultiChangesFeed({whw_news}, {Since:11299 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc2102e5c00}) ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.785983 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.785990 Changes+: MultiChangesFeed waiting...   (to PfY7eAcPnUj7Gwnfo9)
21:07:02.785995 Changes+: Waiting for "whwgroup_sg"'s count to pass 3
21:07:04.426567 HTTP:  #123: POST /whwgroup_sg/_changes  (as PfY7eAcPnUj7Gwnfo9)
21:07:04.426645 Changes: MultiChangesFeed({whw_news}, {Since:11299 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc2109d5e40}) ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:04.426839 Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"whw_news":0x1} ...   (to PfY7eAcPnUj7Gwnfo9)
21:07:04.426853 Changes+: MultiChangesFeed waiting...   (to PfY7eAcPnUj7Gwnfo9)
21:07:04.426863 Changes+: Waiting for "whwgroup_sg"'s count to pass 3
21:07:04.434544 HTTP:  #124: GET /whwgroup_sg/59238463-B4DD-48CB-A607-D3701DBA794F?rev=37-211b94bddd4dbb44d3918a1eda6d8ccf&revs=true&attachments=true  (as PfY7eAcPnUj7Gwnfo9)
21:07:04.435124 HTTP+: #124:     --> 200   (0.9 ms)
21:07:07.546798 HTTP:  #125: PUT /whwgroup_sg/_local/70107ccc88c7c12c91393eb729783ea600649dbc  (as PfY7eAcPnUj7Gwnfo9)
21:07:07.547625 HTTP+: #125:     --> 201   (1.6 ms)
21:07:07.699151 HTTP:  #126: PUT /whwgroup_sg/_local/2df45004d9ca2371137005997333d6ffdef07650  (as PfY7eAcPnUj7Gwnfo9)
21:07:07.699904 HTTP+: #126:     --> 201   (1.5 ms)
21:07:09.853778 HTTP:  #127: PUT /whwgroup_sg/_local/21e4877fb354bf25314e04780f5e2f7034389ce0  (as PfY7eAcPnUj7Gwnfo9)
21:07:09.853971 HTTP+: #127:     --> 201   (0.4 ms)
21:07:13.694054 HTTP:  #128: GET /
21:07:13.694158 HTTP+: #128:     --> 200   (0.1 ms)

Adam Fraser

Apr 6, 2015, 5:40:18 PM
to mobile-c...@googlegroups.com
Based on the "WARNING: changeCache: Giving up, accepting #11298 even though #11297 is missing -- db.(*changeCache)._addPendingLogs() at change_cache.go:320" message after the break in your logs, this looks like a case of the initial revision (#11297) not appearing on the TAP feed, and Sync Gateway blocking while it waits for that revision.
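
To illustrate what that buffering looks like, here's a rough sketch (not the actual Sync Gateway change cache code; the type, function names and timeout value are made up for illustration). Sequences that arrive out of order are deferred until the gap in front of them is filled or a maximum wait expires, which is where the "Giving up" warning comes from:

// Rough sketch of out-of-order sequence buffering with a give-up timeout;
// illustrative only, not the real Sync Gateway implementation.
package main

import (
	"fmt"
	"sort"
	"time"
)

type cache struct {
	nextSeq uint64               // next sequence expected in order
	pending map[uint64]time.Time // early sequences and their arrival times
	maxWait time.Duration        // how long to wait for a missing sequence
}

func (c *cache) add(seq uint64, now time.Time) {
	if seq == c.nextSeq {
		c.notify(seq)
		return
	}
	fmt.Printf("Deferring #%d (waiting for #%d)\n", seq, c.nextSeq)
	c.pending[seq] = now
}

// tick is where the "Giving up, accepting #N even though #M is missing"
// behaviour comes from: once maxWait has passed, the gap is skipped.
func (c *cache) tick(now time.Time) {
	seqs := make([]uint64, 0, len(c.pending))
	for seq := range c.pending {
		seqs = append(seqs, seq)
	}
	sort.Slice(seqs, func(i, j int) bool { return seqs[i] < seqs[j] })
	for _, seq := range seqs {
		arrived := c.pending[seq]
		switch {
		case seq == c.nextSeq: // gap already closed, release in order
			delete(c.pending, seq)
			c.notify(seq)
		case now.Sub(arrived) >= c.maxWait:
			fmt.Printf("WARNING: giving up, accepting #%d even though #%d is missing\n", seq, c.nextSeq)
			delete(c.pending, seq)
			c.notify(seq)
		}
	}
}

func (c *cache) notify(seq uint64) {
	fmt.Printf("#%d ==> _changes feed\n", seq)
	c.nextSeq = seq + 1
}

func main() {
	c := &cache{nextSeq: 11297, pending: map[uint64]time.Time{}, maxWait: 60 * time.Second}
	start := time.Now()
	c.add(11298, start) // #11297 never arrives, so #11298 is deferred
	c.add(11299, start)
	c.tick(start.Add(80 * time.Second)) // well past maxWait: both are released
}

In your log, #11298 and #11299 sit in that pending state for roughly 80 seconds because #11297 never arrives.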

What version of Sync Gateway are you running?  If it's 1.0.4, I expect the increased time you're seeing is due to https://github.com/couchbase/sync_gateway/issues/517.

The timeout handling is addressed in master, based on the changes delivered for https://github.com/couchbase/sync_gateway/issues/525 - it no longer blocks waiting for slow sequences.

However, I'd like to better understand the real root cause - i.e. why the initial revision (#11297) isn't showing up on the TAP feed as expected.  A few questions:
1. Is this consistently reproducible for frequent single-doc updates?
2. Can you upload (to a gist) the full logs for the 80s window, to shed some light on what else is happening for the feed in that window?

Thanks,
Adam  

dom utom

Apr 6, 2015, 7:06:57 PM
to mobile-c...@googlegroups.com
Hi Adam,
thanks for your very fast reply and taking care of this.

1. Yes, it's sg 1.0.4, but I guess the issue was in 1.0.3 too. And yes, it's reproducible: if I make changes (updates) to one single document within 3 seconds, I get the WARNING and the delayed replication.
2. I've put the complete log, starting from the sg restart, in this gist: https://gist.github.com/domutom/3748f143f523a6ff95b4
There was no other traffic on Sync Gateway during the logging.

thanks,
dominik

Adam Fraser

Apr 6, 2015, 8:07:42 PM
to mobile-c...@googlegroups.com
Hi Dominik,

Thanks for the gist.  However, in this particular gist (post-restart), it looks to me like you're actually hitting issue #673 (https://github.com/couchbase/sync_gateway/issues/673), where a sequence is allocated on startup for a user defined in the Sync Gateway config (in your log that's sequence #11313). The delay is caused by Sync Gateway waiting for that (non-existent) sequence, not by a sequence related to the 3-second update.
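
For reference, the trigger is simply having a user defined in the Sync Gateway config file - something of this general shape (a minimal illustrative config; the server, bucket, user name and password are placeholders, not taken from your actual setup):

{
  "log": ["CRUD+", "Cache", "Changes+", "HTTP+"],
  "databases": {
    "whwgroup_sg": {
      "server": "http://localhost:8091",
      "bucket": "whwgroup",
      "users": {
        "news_user": { "password": "changeme", "admin_channels": ["whw_news"] }
      }
    }
  }
}

On startup Sync Gateway allocates a sequence for that user; since nothing ever arrives on the feed for that sequence, the change cache sits waiting for it as described above.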

Can you try to reproduce your scenario after that initial Sync Gateway unused sequence has timed out (should be 60 seconds after startup in sync gateway 1.0.4), and pass those logs along?

Thanks,
Adam



dom utom

Apr 6, 2015, 8:50:06 PM
to mobile-c...@googlegroups.com
Thanks Adam, please find a new log in the gist. I restarted sg and updated one doc two times at 02:39:07.447487 (after 120 seconds), and you can see the delayed replication at 02:40:15.671770.
The behavior of sg is always very similar. https://gist.github.com/domutom/3748f143f523a6ff95b4

thanks,
dominik

Adam Fraser

Apr 7, 2015, 12:59:14 PM
to mobile-c...@googlegroups.com
Interesting.  I think what's actually happening here is the same issue as #673 - an unused sequence is being created on SG startup.  However, the 60 second timeout (waiting for #11326) doesn't start until your first update shows up on the TAP feed:

02:39:08.911342 Cache: Received #11328 after 580ms ("59238463-B4DD-48CB-A607-D3701DBA794F" / "51-4a2a2fa5bce8750f1fb5d590ff24d907")
02:39:08.911352 Cache: Deferring #11328 (1 now waiting for #11326...#11327)

You could verify this in a couple of ways:
1. I'd expect this to be unrelated to making multiple updates on the same document.  Even updates on separate documents should hit this issue.
2. You could temporarily remove any user definitions from your config file - that will avoid the unused sequence.
3. You could test using a build of SG master branch (which has the fix for #673)

When you originally hit this issue (the log from the first post on this thread), do you remember whether it was for a Sync Gateway that had been recently started?  Since you were originally reporting this only for multiple updates to a single document, and not different documents, it sounds like there still may be more going on than just the 'unused sequence on startup' bug.  If you're able to reproduce the original issue, I'd really like to see the details.  One way to reproduce might be:
- restart SG
- send a single update (to trigger the timeout waiting for the unused sequence)
- wait for the timeout (60+ seconds)
- attempt multiple updates to a single doc, and see if you're able to reproduce
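
If it helps, here is a rough scripted version of those last two steps (just a sketch: the admin-port URL, database name, document ID and document body are placeholders for whatever your setup uses, and it assumes the default sync function routing on doc.channels and that the doc doesn't exist yet):

// Rough reproduction sketch: after Sync Gateway has been up for 60+ seconds,
// update the same document twice within a couple of seconds and watch the
// change cache log. URL, database name, doc ID and body are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

const docURL = "http://localhost:4985/whwgroup_sg/repro-single-doc"

// putDoc writes one revision of the document and returns the new rev ID.
func putDoc(body map[string]interface{}) string {
	buf, err := json.Marshal(body)
	if err != nil {
		log.Fatal(err)
	}
	req, err := http.NewRequest(http.MethodPut, docURL, bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	var result struct {
		Rev string `json:"rev"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("PUT --> %s (rev %s)\n", resp.Status, result.Rev)
	return result.Rev
}

func main() {
	// First write creates the doc (it must not already exist).
	rev := putDoc(map[string]interface{}{"channels": []string{"whw_news"}, "counter": 1})
	// Second write lands well inside the 3-second window.
	time.Sleep(500 * time.Millisecond)
	putDoc(map[string]interface{}{"channels": []string{"whw_news"}, "counter": 2, "_rev": rev})
}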

Thanks,
Adam

dom utom

Apr 7, 2015, 3:33:01 PM
to mobile-c...@googlegroups.com
Hi Adam,
thanks for your reply. I did some heavy testing with the released 1.0.4 and put the log, named "tuesday", in the gist https://gist.github.com/domutom/3748f143f523a6ff95b4.
I will try to install the master branch now.

Here is the test sequence I ran; please ask for more detailed testing if you need something specific.
 
A
20:33:52.504987-20:33:56.768576  restarting sync gateway

B
20:35:55.650127-20:36:00.788502 one change in one doc --> (no replication!)
20:37:22.568167-20:37:23.179093 delayed replication

C
20:38:40.166944-20:38:57.164195 one change in one doc --> (complete replication)

D
20:40:57.761472-20:41:01.149174 one change in one doc --> (complete replication)

E
20:42:14.114454-20:42:19.220205 two changes to one doc, <3 sec apart --> (no replication)
20:43:22.570530-20:43:22.962715 delayed replication

F
20:43:54.664340-20:44:33.412083 two changes to two docs, <3 sec apart --> (direct replication)

G
20:46:46.210060-20:46:49.833087 one change in one doc --> (complete replication)

H
20:48:02.712216-20:48:05.888083 one change in one doc --> (complete replication)

20:48:54.822455-20:48:57.915750 unrelated activity (not part of the test)

I
20:51:08.231261-20:51:13.347759 two changes to one doc, <3 sec apart --> (no replication)
20:52:22.573744-20:52:22.943901 replication

20:53:56.175043-20:53:58.298725 unrelated activity (not part of the test)

J
20:56:35.579062-20:56:40.480740 two changes to one doc from two devices, <3 sec apart --> (no replication)
20:57:52.576277-20:57:52.989834 replication

20:58:56.626162-20:58:58.680681 unrelated activity (not part of the test)

Adam Fraser

Apr 7, 2015, 4:58:08 PM
to mobile-c...@googlegroups.com
Thanks Dominik - that's very helpful. I'm trying to reproduce here - I'll post an update once I've completed further testing.

Adam

Adam Fraser

Apr 8, 2015, 6:18:22 PM
to mobile-c...@googlegroups.com
I've been able to reproduce - it looks like this is down to deduplication of entries by doc ID on the Couchbase Server feed that Sync Gateway uses to manage the _changes request.  Multiple updates to the same doc are assigned different sequence numbers, and Sync Gateway currently buffers those sequence numbers to ensure data consistency for clients.  When a sequence is deduplicated by the feed, Sync Gateway treats it as a still-pending sequence and blocks any further replication while it waits for that sequence.
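
To make that concrete, here is a toy model of the effect (not the real feed or Sync Gateway code, just the shape of the problem): when a batch is collapsed so that only the latest mutation per doc ID survives, the earlier sequence number never arrives, and the change cache has nothing to fill the gap with until its timeout fires:

// Toy illustration of de-duplication by doc ID leaving a sequence gap.
package main

import "fmt"

type mutation struct {
	seq   uint64
	docID string
}

// dedupeByDocID keeps only the latest mutation per document in a batch, the way
// rapid updates to the same doc can be collapsed before Sync Gateway sees them.
func dedupeByDocID(batch []mutation) []mutation {
	latest := map[string]uint64{}
	for _, m := range batch {
		latest[m.docID] = m.seq // later entries overwrite earlier ones
	}
	var out []mutation
	for _, m := range batch {
		if latest[m.docID] == m.seq {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// The two quick updates to the same doc were assigned #11297 and #11298.
	batch := []mutation{
		{11297, "59238463-B4DD-48CB-A607-D3701DBA794F"},
		{11298, "59238463-B4DD-48CB-A607-D3701DBA794F"},
	}
	for _, m := range dedupeByDocID(batch) {
		fmt.Printf("feed delivers #%d for %s\n", m.seq, m.docID)
	}
	// Only #11298 is delivered; #11297 never shows up, so the change cache
	// treats it as pending and blocks until its timeout expires.
}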

The existing change on the master branch for issue #525 (https://github.com/couchbase/sync_gateway/issues/525) already addresses the latency issues you're seeing - it allows Sync Gateway to stream subsequent changes while waiting for slow-arriving sequences.  However, I've also filed an additional ticket to better handle the scenario where a sequence isn't actually slow, but has been deduplicated by the Couchbase Server feed.  I'm tracking that here: https://github.com/couchbase/sync_gateway/issues/775
