During Offline Compaction RS Secondary replication lag can be coupled

Andrew Gross

May 23, 2012, 3:54:15 PM
to mongod...@googlegroups.com
MongoDB Version: 2.0.4/2.0.5 (machines upgraded after compaction)
Database Type: Sharded 
Total Shards: 4
Replication: Replica Set
RS Members: 3
Hosting: AWS
Server Type: m2.2xlarge

During our biweekly maintenance I noticed something strange while compacting our secondaries.  When I compacted one secondary of a replica set, its replication lag increased, as expected.  However, the replication lag on the other secondary in the same replica set also increased in a similar fashion.


 
The graph is a little confusing, but here are the main points.
  1. Shards WickerMan, FaceOff and TheRock all showed coupled replication lag between secondaries.
  2. The secondary being compacted at the time is the one with the red line on the graph.
  3. Shard ConAir did NOT show the coupled replication lag, although this is hard to see here because MMS has some weirdness with timescales.

After each machine was compacted, it was upgraded from 2.0.4 to 2.0.5 and rebooted. Once the initial compactions were complete and all secondaries had caught up, I compacted the other secondary in each shard.  This round showed similar characteristics to the first: 3 of 4 shards had coupled replication lag.  However, this time ConAir showed the coupled lag and FaceOff did not.

Finally, after the second round of compactions and the upgrades had finished and all machines were in sync with their respective primaries, I stepped down the primaries and ran the same compaction/upgrade cycle on them.

In this final round of compactions only WickerMan showed any significant replication lag coupling as the former primary was compacted.

Not sure what is going on here.  The logs did not show any errors; they only showed a notification that the other member of the replica set was going into the RECOVERING state.  Later I ran a test where I took down 1 secondary of a replica set, but this did not induce any replication lag in the other secondary, so the coupling seems to be triggered only by compaction.

Any ideas?

Andrew 

Kristina Chodorow

May 23, 2012, 4:13:58 PM
to mongodb-user
Probably one secondary was syncing from the other. You can check who
is syncing from whom by running rs.status() on a secondary and looking
at its syncingTo field.
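
As a rough illustration of that check, here is a minimal Python sketch. It assumes you have already collected the rs.status() document from each member into a dict keyed by host. The host names, the sample documents, and the helper names sync_sources and downstream_of are all made up for this example; only the syncingTo field comes from the advice above.

```python
def sync_sources(status_by_host):
    """Map each member to the host named in its top-level syncingTo field.

    status_by_host is assumed to be {host: rs.status() document}.
    Members without a syncingTo field (e.g. the primary) map to None.
    """
    return {host: doc.get("syncingTo") for host, doc in status_by_host.items()}


def downstream_of(host, sources):
    """Return members that sync from host, directly or through a chain."""
    found = []
    frontier = [host]
    while frontier:
        current = frontier.pop()
        for member, source in sources.items():
            if source == current and member not in found:
                found.append(member)
                frontier.append(member)
    return sorted(found)


# Hypothetical status documents for a chained P <- S1 <- S2 topology.
statuses = {
    "p.example:27017": {"myState": 1},
    "s1.example:27017": {"myState": 2, "syncingTo": "p.example:27017"},
    "s2.example:27017": {"myState": 2, "syncingTo": "s1.example:27017"},
}

sources = sync_sources(statuses)
# Members that would stall if s1 stopped serving its oplog:
affected = downstream_of("s1.example:27017", sources)
print(affected)
```

If the secondary you are about to compact has anything downstream of it, that would be consistent with the coupled lag described in the original post.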

Say you have P<-S1<-S2 (P=primary, S1 is a secondary syncing from the
primary, S2 is syncing from S1). There isn't a good way of forcing a
secondary to sync from someone else in 2.0 in general, but if you
start compacting S1 and then restart S2, S2 will recalculate who to
sync from and not choose S1 again (because it's in recovering state).
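
To make the restart workaround concrete, here is a toy Python model of the re-selection step. This is only a sketch under a strong assumption: the real sync source selection weighs more than member state, and the rule used here (take the first other member that is PRIMARY or SECONDARY) is a simplification for illustration. The function choose_sync_source and the member names are hypothetical; the numeric state codes are the usual replica set ones.

```python
PRIMARY, SECONDARY, RECOVERING = 1, 2, 3  # replica set member state codes


def choose_sync_source(me, members):
    """Pick the first member other than `me` that is PRIMARY or SECONDARY.

    Simplified stand-in for the real selection logic: it only models the
    point that a RECOVERING member is not eligible as a sync source.
    """
    for member in members:
        if member["name"] != me and member["state"] in (PRIMARY, SECONDARY):
            return member["name"]
    return None


def members_with_s1_state(s1_state):
    """Build a hypothetical three-member set with S1 in the given state."""
    return [
        {"name": "S1", "state": s1_state},
        {"name": "P", "state": PRIMARY},
        {"name": "S2", "state": SECONDARY},
    ]


# Before compaction S1 is a healthy secondary, so S2 may chain off it.
before = choose_sync_source("S2", members_with_s1_state(SECONDARY))

# S1 is compacting (RECOVERING) and S2 has just been restarted.
after = choose_sync_source("S2", members_with_s1_state(RECOVERING))

print(before, after)
```

The point of the model is just that the choice is re-evaluated at restart, at which time S1 is no longer a candidate.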


Andrew Gross

May 23, 2012, 4:17:46 PM
to mongod...@googlegroups.com
Ok thanks.  Out of curiosity, why would it not automatically try the other members in the replica set for more up-to-date information? That seems to be the behavior if the other secondary is DOWN instead of RECOVERING.

Andrew

Kristina Chodorow

May 23, 2012, 4:41:18 PM
to mongodb-user
Optimally, yes (and 2.2 should be better about this). In 2.0, calling
getMore on a recovering node didn't return an error, so the secondary
at the end of the chain would just keep waiting for more results,
unaware that there was anything the matter. (That has since been
fixed.)
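
The difference between the two behaviors can be sketched with a small Python simulation. Nothing here is MongoDB code: SourceRecovering, make_source, and follow are hypothetical names, and the polling loop is only a stand-in for a secondary repeatedly calling getMore on its sync source.

```python
class SourceRecovering(Exception):
    """Hypothetical error standing in for a getMore failure on a RECOVERING node."""


def make_source(recovering, errors_when_recovering):
    """Return a fake getMore function for a sync source.

    errors_when_recovering=False models the 2.0 behavior described above
    (no error, just no results); True models the fixed behavior.
    """
    def get_more():
        if recovering and errors_when_recovering:
            raise SourceRecovering("sync source is RECOVERING")
        return []  # a recovering source hands out no new operations
    return get_more


def follow(get_more, max_polls):
    """Poll the source up to max_polls times and report how the loop ended."""
    for poll in range(1, max_polls + 1):
        try:
            ops = get_more()
        except SourceRecovering:
            return ("pick new source", poll)
        if ops:
            return ("applied ops", poll)
    return ("still waiting", max_polls)


old_behavior = follow(make_source(True, False), max_polls=5)
new_behavior = follow(make_source(True, True), max_polls=5)
print(old_behavior, new_behavior)
```

With the silent source the follower never learns that anything is wrong and just keeps polling, which matches the lag building up on the secondary at the end of the chain.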


Andrew Gross

May 23, 2012, 4:48:14 PM
to mongod...@googlegroups.com
Ok, thanks for the info

Andrew