replica set initial sync large dataset, heavy writes

Daniel Doyle

Jan 17, 2017, 10:39:55 AM
to mongodb-user
Hi guys,

We have a situation where we are trying to add replication to our existing deployment and running into some issues. The application is very write intensive. Some technical details:

size of all databases - 1.8TB
oplog size - 50G ("logSizeMB" : 51200)
timeDiffHours - 16.65

I tried standing this up and letting mongo do its initial sync, which took a fair amount of time, roughly 10 days. After it finished it then went from STARTUP2 to RECOVERING state and I periodically get "RS102" errors in the logs:

2017-01-17T03:16:17.639+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2017-01-17T03:16:17.639+0000 I REPL     [rsBackgroundSync] replSet error RS102 too stale to catch up
2017-01-17T03:16:17.639+0000 I REPL     [rsBackgroundSync] replSet our last optime : Jan  5 07:15:36 586df298:721
2017-01-17T03:16:17.639+0000 I REPL     [rsBackgroundSync] replSet oldest available is Jan 16 10:39:25 587ca2dd:437
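To see how far behind the member is, the two optimes in the log can be decoded. This is only a sketch: it assumes the part before the colon is the Unix timestamp in hexadecimal seconds (the dates printed in the log agree with that reading) and it ignores the counter after the colon. The helper name `optime_seconds` is made up for illustration.

```python
from datetime import datetime, timezone


def optime_seconds(optime: str) -> int:
    """Return the seconds part of an optime printed as '<hex seconds>:<counter>'.

    Assumption: the first field is Unix seconds in hexadecimal.
    The counter after the colon is ignored.
    """
    seconds_hex, _counter = optime.split(":")
    return int(seconds_hex, 16)


# Values copied from the log lines above.
last_applied = optime_seconds("586df298:721")
oldest_available = optime_seconds("587ca2dd:437")

# Gap between what this member has applied and the oldest entry
# still present in the sync source's oplog.
gap_hours = (oldest_available - last_applied) / 3600

print(datetime.fromtimestamp(last_applied, tz=timezone.utc))
print(datetime.fromtimestamp(oldest_available, tz=timezone.utc))
print(f"gap: {gap_hours:.1f} hours")
```

The gap works out to roughly 267 hours (about 11 days), far more than the 16.65-hour oplog window, which is why the member reports itself as too stale.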

If I were a betting man, I would imagine that there is a problem where if initial sync doesn't happen faster than oplog's $timeDiffHours, it will never succeed. Is this accurate? It isn't something that is explicitly called out anywhere I could find in the docs and seems like a pretty important detail, so this has me doubting myself a bit.

If that is true, is there any other way to do this? Using some back-of-the-napkin math, the "seed secondary" approach of manually copying the files would result in about 3:45h of downtime, assuming we could run the 1G NIC at line rate. We have no recent backups to seed from. Part of this effort is to get into a situation where we can have more backups, but I don't think that is particularly relevant, because if things ever got out of sync and the backup was outside the oplog window we would be right here again.
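The copy-time estimate can be reproduced with a few lines. This is a rough sketch, not a measurement: the function name `transfer_hours` and the `efficiency` parameter are made up for illustration, and the result depends on whether "1.8TB" means decimal or binary terabytes and on how close to line rate the link really gets.

```python
def transfer_hours(data_bytes: float,
                   link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link.

    efficiency is the assumed fraction of line rate actually achieved
    (1.0 means the link is fully saturated with payload).
    """
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# 1.8 TB (decimal) over a 1 Gbit/s NIC at full line rate.
ideal = transfer_hours(1.8e12, 1e9)

# Same copy, assuming only 90% of line rate is usable payload.
realistic = transfer_hours(1.8e12, 1e9, efficiency=0.9)

print(f"ideal: {ideal:.2f} h, at 90% efficiency: {realistic:.2f} h")
```

With decimal units and a fully saturated link this gives about 4 hours, in the same range as the figure quoted above.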

Alternatively, is the correct option here to size the oplog to something much larger for initial sync and just wait through it? With 50G giving 16h, it seems that reaching a safety margin of, say, 12 days would require roughly a 1TB oplog (if I did the math correctly). And if my assumptions are correct, this is still a function of total size, so while 1TB might work now, a year from now it might take 2TB if we had to do this again, since the total data size is growing.
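The oplog sizing above can be checked by scaling linearly. This sketch assumes the write rate stays constant, so the window grows in proportion to the oplog size; the function name `oplog_gb_for_window` is made up for illustration.

```python
def oplog_gb_for_window(current_gb: float,
                        current_window_hours: float,
                        target_window_hours: float) -> float:
    """Oplog size needed to cover target_window_hours.

    Assumption: writes arrive at a constant rate, so the oplog window
    scales linearly with the oplog size.
    """
    gb_per_hour = current_gb / current_window_hours
    return gb_per_hour * target_window_hours


# 50 GB currently covers 16.65 hours; target a 12-day window.
needed_gb = oplog_gb_for_window(50, 16.65, 12 * 24)

print(f"oplog needed for 12 days: about {needed_gb:.0f} GB")
```

That comes to a little under 900 GB, consistent with the "roughly 1TB" estimate.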

Any thoughts and suggestions would be appreciated. 

Thanks.

Rhys Campbell

Jan 17, 2017, 11:42:02 AM
to mongodb-user
You're correct in your assumptions about the oplog. What version are you running? There have been improvements in the initial sync feature so a simple upgrade may help.

How about considering sharding? It might be feasible to split up your data so that the initial sync completes and you can then add replicas. You could even drain shards after the fact to re-consolidate.

Attila Tozser

Jan 17, 2017, 12:05:45 PM
to mongod...@googlegroups.com
You can try tricks to copy over the data if you can take a consistent snapshot, with LVM for example. Copy the content of the snapshot with rsync (or something a bit more parallel), and if the copy fits in the window the secondary will sync the delta from the oplog.

This can work iteratively, even without a snapshot in the first iterations. The important part is described here: http://superuser.com/questions/847850/behavior-of-rsync-with-file-thats-still-being-written (using rsync's --inplace option). In the last round, though, you need a consistent state from the snapshot, and that round should fit within the 16.xx-hour window.

I would also increase the oplog size and consider sharding.


Rhys Campbell

Jan 24, 2017, 8:57:27 AM
to mongodb-user
Improvements in 3.4 would help...

"Changed in version 3.4: The replication oplog window no longer needs to cover the time needed to restore a replica set member via initial sync as the oplog records are pulled during the data copy. However, the member being restored must have enough disk space in the local database to temporarily store these oplog records for the duration of this data copy stage."
