Replication questions (internal mechanism)

Simon T.

Jul 5, 2012, 2:54:54 PM

to mongod...@googlegroups.com

Hi guys,

I have two questions concerning the replication mechanism inside a replica set.

My first question is about "real-time replication":

Under load, we notice different replication behavior:

- Under normal load, writes to the primary seem to be replicated to the secondaries almost in real time.
- Under high load, replication seems to happen in bursts: not exactly real time, but every few seconds or so.
- Under extreme load, the servers seem to stop replicating until the load on the primary is over.

Am I right? If so, what are the criteria for choosing the replication strategy? At what point does the system stop replicating? Is that something you can configure?

-----

My second question is about a server that stopped replicating. I did some major operations (removed a few million documents that took up a couple of GB, then re-inserted a few million documents). The secondaries didn't replicate until the primary was no longer under load. When the insertions were over (both secondaries were 2 hours behind the primary), one of the secondaries started to catch up, but not the other. It wasn't until the next morning that I saw the other secondary had caught up, something like 7 hours later. That means that for 7 hours my data was 2 hours old.

So:

- Why did the server wait that long before replicating?
- Is that normal behavior?
- If it's a bug, how can we force it to catch up with the primary? (I've seen nothing like a sync command, other than shutting down the server and restarting it.)

FYI: these are 3 Linux boxes running MongoDB 2.0.4; they have 24 cores, 148 GB of RAM and 2 TB of disk space.

Thank you very much.

Simon

William Z

Jul 5, 2012, 6:28:02 PM

Hi Simon!

To address your first question:

1) MongoDB uses a single replication algorithm, regardless of the load. At no time does MongoDB intentionally throttle the speed of the replication.

2) The replication algorithm is as follows:
- The primary (read/write) member of the replica set writes all data modification commands into the oplog.
- Any secondary that is replicating from the primary runs a query against the primary's oplog. This query is run using a tailable cursor, so the secondary will receive updates from the primary as fast as possible.
- When the secondary receives the oplog instructions, it applies them as fast as it can.
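
The three steps above can be pictured with a small Python sketch. It is a toy model for illustration only, not MongoDB's actual implementation: the `Primary` and `Secondary` classes, the integer timestamps, and the `oplog_after` method (a stand-in for the tailable-cursor query) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary: every write is appended to an in-memory oplog."""
    oplog: list = field(default_factory=list)

    def write(self, key, value):
        # Each entry gets an increasing timestamp (here simply a counter).
        ts = len(self.oplog) + 1
        self.oplog.append({"ts": ts, "key": key, "value": value})

    def oplog_after(self, ts):
        # Stand-in for the tailable-cursor query: entries newer than ts.
        return [entry for entry in self.oplog if entry["ts"] > ts]


@dataclass
class Secondary:
    """Toy secondary: pulls new oplog entries and applies them in order."""
    data: dict = field(default_factory=dict)
    last_applied_ts: int = 0

    def sync_from(self, primary):
        for entry in primary.oplog_after(self.last_applied_ts):
            self.data[entry["key"]] = entry["value"]
            self.last_applied_ts = entry["ts"]


primary = Primary()
secondary = Secondary()
primary.write("a", 1)
primary.write("b", 2)
secondary.sync_from(primary)   # applies entries 1 and 2
primary.write("a", 3)
secondary.sync_from(primary)   # applies only entry 3
```

The point of the sketch is that the secondary only ever asks for "everything after the last entry I applied", so there is a single code path regardless of load; what changes under load is how quickly that read can be served.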

3) Unfortunately, in MongoDB version 2.0.x, the performance of replication under an extreme write load is not as robust as it might be.

The problem is that in that version, MongoDB uses a single reader/writer latch to control all access to the database. The latch allows multiple concurrent readers, but only a single writer at one time. In addition, when the latch is held by a writer, all readers are blocked.

Normally, this isn't a problem. However, when the primary is handling an overload of write operations, the write lock may be held almost continuously, and all readers will block. The problem with replication arises because the secondary is getting its data via a read operation on the oplog, and that read will block as well.
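
To make the latch semantics concrete, here is a minimal, non-blocking Python sketch. It assumes a simplified model in which `try_read` and `try_write` return `False` where a real latch would make the caller wait; the class and method names are hypothetical and not taken from the MongoDB source.

```python
class ReaderWriterLatch:
    """Toy latch: many readers or one writer, never both at once."""

    def __init__(self):
        self.readers = 0
        self.writer_active = False

    def try_read(self):
        # Readers are refused while a writer holds the latch.
        if self.writer_active:
            return False
        self.readers += 1
        return True

    def release_read(self):
        self.readers -= 1

    def try_write(self):
        # A writer needs exclusive access: no other writer, no readers.
        if self.writer_active or self.readers > 0:
            return False
        self.writer_active = True
        return True

    def release_write(self):
        self.writer_active = False


latch = ReaderWriterLatch()
two_readers_ok = latch.try_read() and latch.try_read()
writer_blocked_by_readers = not latch.try_write()
latch.release_read()
latch.release_read()
writer_ok = latch.try_write()
# While the writer holds the latch, the secondary's oplog read is refused.
oplog_read_blocked = not latch.try_read()
```

If writers re-acquire the latch back to back, the last line describes the secondary's situation for the whole duration of the write burst.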

4) The good news is that this problem is largely alleviated in MongoDB 2.2, which is due for production release in the next few months. In version 2.2, the global reader/writer latch is supplemented with an additional set of reader/writer latches -- one for each database. Since the oplog is kept in a separate database from the other collections, this means that heavy write operations on the primary don't impact the ability of the secondaries to read from the oplog as much as they do in version 2.0.x.
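
The difference between the two locking schemes can be sketched as a single Python function. This is a deliberately simplified model: it assumes the oplog lives in the `local` database, uses a hypothetical database name `app`, and ignores the fact that each write on the primary must also append to the oplog, which is one reason the problem is alleviated rather than eliminated.

```python
def oplog_read_blocked(write_locked_dbs, per_database_locks):
    """Return True if a read of the oplog would have to wait.

    write_locked_dbs: set of database names currently write-locked.
    per_database_locks: False models 2.0.x (one global latch),
                        True models 2.2 (one latch per database).
    """
    if not per_database_locks:
        # Global latch: a write anywhere blocks every reader.
        return len(write_locked_dbs) > 0
    # Per-database latches: only a write lock on "local" blocks the oplog read.
    return "local" in write_locked_dbs


heavy_writes = {"app"}  # hypothetical application database under write load
blocked_in_2_0 = oplog_read_blocked(heavy_writes, per_database_locks=False)
blocked_in_2_2 = oplog_read_blocked(heavy_writes, per_database_locks=True)
```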

5) You can read more about replication internals here:
- http://docs.mongodb.org/manual/replication/
- http://docs.mongodb.org/manual/core/replication-internals/

To address your second question:

1) While your problem description doesn't contain enough data to diagnose definitively what happened, it seems likely that your secondaries were starved of data due to the heavy write load on the primary, as described above.

2) Replication lag is normal in the case that your primary is heavily overloaded. Eventually the secondaries should catch up.

3) There are many possible causes of replication lag: some of the most common include an overloaded primary, slow network, and slow disks on the secondary.

4) Some of the possible reasons you could have overloaded your primary include:
- Working set size significantly larger than RAM
- Heavy write load
- Disk slowdown (disk overload, SAN slowdown, controller failure, etc.)

Please let me know if you have further questions.

-William

Grégoire Seux

Jul 6, 2012, 8:17:54 AM

Hello Simon,

This issue might be due to the poor reader/writer lock which is completely unfair. We have reported the bug (https://jira.mongodb.org/browse/SERVER-6004) and patched our cluster to have a fairer lock which has solved most of our issues.

--

Greg


Simon T.

Jul 6, 2012, 9:44:40 AM

Hi William,

For the first part of your answer: it's great news that the replication mechanism is improved in 2.2, thank you very much!

As for the second part:

1) Let's say this issue happens again: what kind of information should I save for you? What do you need? (Our cluster is registered on MMS.)

2) I do understand that replication lag under extreme load is normal, and I'm fine with that. The thing that bugs me is that this server wasn't live; after I did the massive deletes and inserts, the primary wasn't under load at all. I was monitoring mongostat, and there were no inserts/updates/queries at all, the locked percentage was at 0%, etc. Again, for 7 hours the cluster was almost sleeping, and then, out of nowhere, the secondary started to catch up. Why? What was the trigger at that point?

3) Overloaded primary: yes, but once it was no longer overloaded, the secondary didn't catch up either.

Slow network: I highly doubt it; they're all sitting next to each other in the same subnet.

Slow disks: we suspect that our arrays are slower than they should be, but each box has its own RAID 5+0 (onboard controller), so it shouldn't be THAT bad.

4) - The working set fits entirely in RAM; as far as I know, MongoDB doesn't even use all the RAM at its disposal.

- There was a heavy write load that caused this.

- Again, we suspect our disks are slower than they should be, but it's an onboard RAID 5+0 array.

Finally, there is a question that you didn't answer:

If it happens again (the primary is unloaded and the cluster is doing no work at all, but a secondary refuses to catch up and is pretty far behind), what should we do? Is there something we can do to force syncing, or we can only watch and hope no one is doing a query on data that is 2 hours old?

Thank you very much William !

Simon

Simon T.

Jul 6, 2012, 9:55:29 AM

Oh by the way, I forgot to mention:

Before you ask: our oplog is 124 GB and covers about a month of operations, so I don't see how this could be a case of oplogs that don't overlap.

Simon

William Z

Jul 9, 2012, 2:16:40 PM

Hi Simon!

To address your questions in semi-random order:

1) The fact that your oplog is large enough to cover a month of updates is good: that means that you won't ever get into the error state where the secondary refuses to replicate because it's so far behind that it has lost some data. However, the size of the oplog shouldn't be relevant to the problem you're describing.
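
As a rough check of that reasoning, the comparison can be written out in Python. The figures come from this thread (a lag of 2 hours, an oplog covering "about a month"); treating a month as 30 days and the helper name `can_resume_from_oplog` are assumptions made for the example.

```python
from datetime import timedelta


def can_resume_from_oplog(lag, oplog_window):
    """A secondary can keep replicating only if the entries it still
    needs have not yet been overwritten in the primary's oplog."""
    return lag < oplog_window


lag = timedelta(hours=2)
oplog_window = timedelta(days=30)  # "about a month", approximated as 30 days
headroom = oplog_window - lag      # how much further behind it could fall
resumable = can_resume_from_oplog(lag, oplog_window)
```

With these numbers the secondary has more than 29 days of headroom, which is why the oplog size is unlikely to be the cause here.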

2) If this happens again, then the key information that we'd need to see is in MMS. If you PM a link to your MMS page, plus indicate what time the failure occurred, then we'll have enough information to begin the diagnosis.

Without a link to MMS, it would be good to have the output of rs.status(), rs.config(), and db.currentOp() on the slow secondary, at the time the slowdown is occurring. The output of 'mongostat --discover' would be helpful as well. In general, the MMS link is preferable, since it displays lots more information and displays it over time.
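
If you capture the rs.status() output, the lag of each secondary can be computed from it. The Python sketch below assumes the status is available as a dictionary with a `members` list whose entries carry `name`, `stateStr`, and `optimeDate` (for example as returned by PyMongo's `client.admin.command("replSetGetStatus")`); the host names and timestamps in `sample_status` are made up for illustration.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Map each secondary's name to its lag behind the primary, in seconds."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical status document: one healthy secondary, one 2 hours behind.
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2012, 7, 5, 14, 0, 0)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 7, 5, 13, 59, 55)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 7, 5, 12, 0, 0)},
    ]
}

lags = replication_lag_seconds(sample_status)
```

Recording this value every minute or so during an incident would show whether a secondary is stuck (lag growing or flat) or slowly catching up (lag shrinking).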

3) Without the MMS link, I don't have enough information at this point to say what was holding up the secondary, or what made it start catching up. Feel free to PM me with that link, along with the times that you were experiencing the slowdown and the name of the host that was having the problem, and I'll be glad to take a look into it.

4) The list of possible causes I gave was not exhaustive -- it was just a sample of things that could potentially be the problem. Again -- I'd need more information before being able to diagnose this.

5) If this happens again, there are a couple of standard things that you can try:
- You can look at the write lock percentage on the primary. If it's high, that might be the cause of the slow replication, and you'll need to address it.
- You can check whether another process on the secondary's machine is using CPU or IOPS.
- If both primary and secondary are no longer loaded, you can try stopping and restarting the stuck secondary to see if that causes replication to restart.

6) You write: "can only watch and hope no one is doing a query on data that is 2 hours old ?". From this I infer that you have client programs which are doing reads with SlaveOK set. (As I hope you're aware, all client queries will go to the primary unless SlaveOK is set.) Please be aware that you should only be doing reads with SlaveOK if you are OK with the client program getting potentially stale data.

I hope this helps. Please let me know if you have further questions.

-William
