Strange increase of Replication Lag

Stephan

Oct 9, 2012, 10:56:33 AM

to mongod...@googlegroups.com

Hi,

I also posted my findings in the topic "Strange behavior of W", as I think this is related.

We have a replica set with 4 nodes (+ one arbiter), all with different priorities; one of them has priority 0.

Today I found that setting W in the write concern to more than 1 resulted in a timeout. I then ran some tests and found that the replication lag was about 1500 secs - for all nodes. I cannot explain why this happened, as there was no heavy writing at that time.

The network was ok all the time (10GB network - not much load on the servers).

During that time, the load on all secondary nodes was about 4.0 - whereas the load on the primary node was ~0.2!

This can also be seen in the MMS (https://mms.10gen.com/chart/bookmark/50743aa2e4b0aa8bcc9b5d16).

My Questions:

- What operations might cause an increase of the replication lag?

- How could that be avoided?

- Is there a setup error, or what kind of error might cause such behavior?

Kind Regards,

Stephan

Oct 9, 2012, 11:05:34 AM

to mongod...@googlegroups.com

One additional thing... this was really strange: the replication lag was growing and growing, only changing in small steps... and then, within about 120 seconds, it went down to 0!

Jenna deBoisblanc

Oct 10, 2012, 2:42:08 PM

to mongod...@googlegroups.com

Hello Stephan,

How are you calculating replication lag: using MMS or a command? Could you tell us the exact time window when this lag grew and suddenly dropped off? How are you measuring load, and at what time did you make these measurements?

As for your questions: there are a handful of reasons why replication lag may increase. I noticed in MMS that there were spikes in write operations around 6 AM on October 9th on the primary, which appear to be correlated with a jump in page faults, lock percentage, and background flush average. High load on the primary increases the likelihood of replication lag on secondaries.

If you can point to specific times in MMS when you experienced these issues, it will be easier to begin debugging the problem.

Stephan

Oct 11, 2012, 2:31:28 AM

to mongod...@googlegroups.com

Hi,

We have quite a bit of monitoring here, all of it posted to graphite.

The replication lag is also shown in MMS - that's where I saw it first.

By load I mean: the number of currently active users (which was very low at that point) and the Linux system load (both posted regularly to graphite).

There were peaks in the replication lag yesterday at about 10:30 am CET and today at 4:00 am CET...

I cannot track it down yet. It might be that some process is going nuts and writing stuff into mongo, as these lags obviously happen more often...

The strange thing was: while the replication lag was higher than 1000 seconds, the load on the primary was close to zero. Almost nothing took place; system load was < 0.2 and active users < 10...

Unfortunately these lags are not the reason for the other problem (setting w to number of nodes causing timeouts).

Kind Regards,

Stephan

Jenna deBoisblanc

Oct 11, 2012, 4:47:43 PM

to mongod...@googlegroups.com

Hello Stephan,

Replication lag is calculated by taking the difference between the timestamp of the last op applied on the primary and the last op applied on the secondary. If there is a long period of time when no operations occur, the replication lag may appear to spike, despite the fact that the servers are healthy (in other words, it might be possible that the spike is a reporting flaw and not a true lag in replication).
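To make the calculation concrete, here is a minimal Python sketch of it. It assumes a status document shaped like a `replSetGetStatus` reply (a `members` list with `name`, `stateStr` and `optimeDate`); the host names and timestamps are made up for illustration, and this is not necessarily how MMS computes its chart. In the sample snapshot the primary was idle for 1500 seconds and then took a single write; the secondary that has not applied that one op yet is reported as 1500 seconds behind.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Return {member name: lag in seconds} for every secondary.

    `status` is assumed to look like a replSetGetStatus reply: a dict with
    a "members" list whose entries carry "name", "stateStr" and, for
    data-bearing nodes, "optimeDate" (time of the last applied op).
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lags = {}
    for member in members:
        if member["stateStr"] != "SECONDARY":
            continue  # skip the primary itself and any arbiter
        delta = primary["optimeDate"] - member["optimeDate"]
        lags[member["name"]] = delta.total_seconds()
    return lags


# Hypothetical snapshot: the primary's only write in 1500 s happened just now.
t0 = datetime(2012, 10, 9, 10, 0, 0)
status = {
    "members": [
        {"name": "node1:27017", "stateStr": "PRIMARY",
         "optimeDate": t0 + timedelta(seconds=1500)},
        # Has not applied the newest op yet: one op behind, not 1500 s of work.
        {"name": "node2:27017", "stateStr": "SECONDARY", "optimeDate": t0},
        # Already caught up.
        {"name": "node3:27017", "stateStr": "SECONDARY",
         "optimeDate": t0 + timedelta(seconds=1500)},
        {"name": "arbiter:27017", "stateStr": "ARBITER"},
    ]
}

lags = replication_lag_seconds(status)
print(lags)
```

Once node2 applies that single op, its reported lag falls straight back to 0, which would look like a sudden drop in the chart.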

It looks like there are load spikes on the primary around the specified time periods, which I can see in MMS: a jump in write operations, locking, and page faults. If these write spikes continue to occur, you may want to consider sharding your data to distribute the load and prevent replication lag.
