I would like to report a strange behaviour here, because I was able to reproduce it consistently, on MongoDB 3.0.4 with MMAPv1. One server never manages to join a replica set where it should initial sync almost 1TB of data. The oplog size is logically 50GB and covers more or less 40 hours of data. Here is what happens consistently, what I think to understand and what in my eyes is wrong:
i.e. H6 = start time + 6 hours
| In the meanwhile many socket errors....
H31 : build index done. oplog sync 1 of 3
H38 : initial sync building indexes (WTF ?! Without any other messages, oplog sync 3 of 3 never comes... But maybe at that point, the oplog is "over")
| This recreation of indexes (already successfully completed before) never finishes, it is much slower than the previous one, the disk is swapping a lot, mongo is piling connections (several thousands in the logfile after 1 day) and the queued reads (server Status)
Then I also tried to restart the mongod server in the middle, this is also quite funny:
H0 : restart. Build index not finished on one collection
H13: initial sync drop all databases (LOL, if mongo planned to do that, it was very useful to build the previous indexes !!!)
In the first case, if there is a problem with the oplog that might be too old, it definitely needs some proper logs explaining what happens (even more since the rest is VERY verbose). But maybe there was some other issue here??? I'm looking forward to hear from you.
In the second case, I don't know if it had any chance to succeed, probably not. But even in that case, mongo should realize first the state in which it is and if it plans to drop everything, avoid to rebuild the indexes first.
As I said, after several iterations I realized the same thing happen again and again and decided to wait to raise oplog time by adding a new shard or to wait to migrate to WiredTiger. Else no chance for now, it is condemned to fail...
Here is some raw logs, you will see from the timestamp that I have selected logs from the following sequence: I had the top scenario, I have restarted mongo and had the bottom scenario, at the end of the finished index build start the logs, and then again the top scenario
2015-07-06T23:53:46.342+0200 I - [initandlisten] Index: (2/3) BTree Bottom Up Progress: 277723400/279451059 99%
2015-07-06T23:53:55.493+0200 I INDEX [initandlisten] done building bottom layer, going to commit
2015-07-06T23:53:55.493+0200 I INDEX [initandlisten] build index done. scanned 279451059 total records. 46909 secs
2015-07-06T23:54:02.063+0200 I NETWORK [initandlisten] waiting for connections on port 27017
2015-07-06T23:54:02.080+0200 I NETWORK [initandlisten] connection accepted from internal.ip.18:34915 #1 (1 connection now open)
2015-07-06T23:54:02.369+0200 I REPL [ReplicationExecutor] New replica set config in use: { _id: "rs1", version: 8, members: [ { _id: 0, host: "HOST1:27017", arbiterOnly: false, buildIndexes:
true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "HOST2:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDela
y: 0, votes: 1 }, { _id: 3, host: "HOSTSEC:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.1, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true,
heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-07-06T23:54:02.370+0200 I REPL [ReplicationExecutor] This node is HOSTSEC:27017 in the config
2015-07-06T23:54:02.370+0200 I REPL [ReplicationExecutor] transition to STARTUP2
2015-07-06T23:54:02.578+0200 I NETWORK [initandlisten] connection accepted from internal.ip.16:54724 #2 (2 connections now open)
2015-07-06T23:54:02.626+0200 I NETWORK [initandlisten] connection accepted from internal.ip.65:33608 #3 (3 connections now open)
2015-07-06T23:54:02.825+0200 I REPL [ReplicationExecutor] Starting replication applier threads
2015-07-06T23:54:02.826+0200 I COMMAND [conn1] command admin.$cmd command: isMaster { isMaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:356 locks:{} 495ms
2015-07-06T23:54:02.826+0200 I COMMAND [conn2] command admin.$cmd command: isMaster { isMaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:356 locks:{} 247ms
2015-07-06T23:54:02.828+0200 I COMMAND [conn3] command admin.$cmd command: isMaster { isMaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:356 locks:{} 200ms
2015-07-06T23:54:02.829+0200 I REPL [ReplicationExecutor] Member HOST1:27017 is now in state SECONDARY
2015-07-06T23:54:02.829+0200 I REPL [ReplicationExecutor] Member HOST2:27017 is now in state PRIMARY
2015-07-06T23:54:02.919+0200 I NETWORK [initandlisten] connection accepted from internal.ip.24:35911 #4 (4 connections now open)
2015-07-06T23:54:03.079+0200 I REPL [rsSync] initial sync pending
2015-07-06T23:54:03.174+0200 I NETWORK [initandlisten] connection accepted from internal.ip.12:60560 #5 (5 connections now open)
2015-07-06T23:54:03.465+0200 I REPL [rsSync] no valid sync sources found in current replset to do an initial sync
2015-07-06T23:54:03.933+0200 I NETWORK [initandlisten] connection accepted from internal.ip.23:36835 #6 (6 connections now open)
2015-07-06T23:54:04.285+0200 I NETWORK [initandlisten] connection accepted from internal.ip.17:45091 #7 (7 connections now open)
2015-07-06T23:54:04.465+0200 I REPL [rsSync] initial sync pending
2015-07-06T23:54:04.466+0200 I REPL [rsSync] no valid sync sources found in current replset to do an initial sync
2015-07-06T23:54:05.233+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:53087 #8 (8 connections now open)
2015-07-06T23:54:07.197+0200 I NETWORK [initandlisten] connection accepted from internal.ip.20:44611 #9 (9 connections now open)
2015-07-06T23:54:09.624+0200 I NETWORK [initandlisten] connection accepted from internal.ip.25:53462 #10 (10 connections now open)
2015-07-06T23:54:10.355+0200 I NETWORK [initandlisten] connection accepted from internal.ip.27:51038 #11 (11 connections now open)
2015-07-06T23:54:10.355+0200 I NETWORK [initandlisten] connection accepted from internal.ip.19:51669 #12 (12 connections now open)
2015-07-06T23:54:10.572+0200 I NETWORK [initandlisten] connection accepted from internal.ip.11:48601 #13 (13 connections now open)
2015-07-06T23:54:11.447+0200 I REPL [rsSync] initial sync pending
2015-07-06T23:54:11.447+0200 I COMMAND [conn8] command db.$cmd command: dbStats { dbStats: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:315 locks:{ Global: { acquireCount: { r: 2 } }, MM
APV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { R: 1 } }, Metadata: { acquireCount: { R: 1 } } } 6196ms
2015-07-06T23:54:11.448+0200 I REPL [ReplicationExecutor] syncing from: HOST2:27017
2015-07-06T23:54:11.457+0200 I REPL [rsSync] initial sync drop all databases
[LOL]
2015-07-06T23:54:11.458+0200 I STORAGE [rsSync] dropAllDatabasesExceptLocal 3
2015-07-06T23:54:11.578+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:53089 #14 (14 connections now open)
2015-07-06T23:54:11.825+0200 I NETWORK [conn8] end connection
127.0.0.1:53087 (13 connections now open)
2015-07-06T23:54:12.102+0200 I JOURNAL [rsSync] journalCleanup...
2015-07-06T23:54:12.102+0200 I JOURNAL [rsSync] removeJournalFiles
2015-07-06T23:54:13.318+0200 I JOURNAL [rsSync] journalCleanup...
2015-07-06T23:54:13.318+0200 I JOURNAL [rsSync] removeJournalFiles
2015-07-06T23:54:23.641+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:53090 #15 (14 connections now open)
[...] (restarts from scratch, top scenario)
2015-07-08T11:53:15.574+0200 I INDEX [rsSync] build index done. scanned 11070 total records. 0 secs
2015-07-08T11:53:15.595+0200 I REPL [rsSync] initial sync data copy, starting syncup
2015-07-08T11:53:15.963+0200 I REPL [rsSync] oplog sync 1 of 3
2015-07-08T11:53:16.613+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:51367 #23853 (13 connections now open)
2015-07-08T11:53:16.617+0200 I NETWORK [conn23853] end connection
127.0.0.1:51367 (12 connections now open)
2015-07-08T11:53:16.756+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:51368 #23854 (13 connections now open)
2015-07-08T11:53:16.786+0200 I NETWORK [conn23854] end connection
127.0.0.1:51368 (12 connections now open)
2015-07-08T11:53:16.796+0200 I REPL [ReplicationExecutor] syncing from: HOST2:27017
2015-07-08T11:53:16.799+0200 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to HOST2:27017
[...]
2015-07-08T11:53:22.565+0200 I REPL [repl writer worker 13] replication failed to apply update: {JSON}
2015-07-08T11:53:22.575+0200 I REPL [repl writer worker 13] replication info adding missing object
2015-07-08T11:53:22.582+0200 I REPL [repl writer worker 13] replication missing object not found on source. presumably deleted later in oplog
2015-07-08T11:53:22.583+0200 I REPL [repl writer worker 13] replication o2: { _id: ObjectId('UUID') }
2015-07-08T11:53:22.583+0200 I REPL [repl writer worker 13] replication o firstfield: $set
[...]
2015-07-08T16:13:30.136+0200 I COMMAND [repl writer worker 14] CMD: drop db.tmp.mrs.reports_1436348079_6814
2015-07-08T16:13:30.261+0200 I COMMAND [repl writer worker 14] CMD: drop db.tmp.mrs.reports_1436348079_6815
[...]
2015-07-08T18:01:54.124+0200 I COMMAND [conn26678] command db.$cmd command: collStats { collstats: "coll1" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:420 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 10266409 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 2858ms
2015-07-08T18:01:54.127+0200 I NETWORK [conn26678] end connection
127.0.0.1:58499 (12 connections now open)
[WTF!?]
2015-07-08T18:01:54.287+0200 I REPL [rsSync] initial sync building indexes
2015-07-08T18:01:54.297+0200 I REPL [rsSync] initial sync cloning indexes for : admin
2015-07-08T18:01:54.377+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:58500 #26680 (13 connections now open)
2015-07-08T18:01:54.379+0200 I NETWORK [conn26680] end connection
127.0.0.1:58500 (12 connections now open)
2015-07-08T18:01:54.446+0200 I REPL [rsSync] initial sync cloning indexes for : db
2015-07-08T18:01:54.462+0200 I STORAGE [rsSync] copying indexes for: { name: "coll1", options: {} }
2015-07-08T18:01:54.616+0200 I NETWORK [initandlisten] connection accepted from
127.0.0.1:58503 #26681 (13 connections now open)
2015-07-08T18:01:54.672+0200 I INDEX [rsSync] build index on: db.coll1 properties: { v: 1, key: { KEY }, name: "#_1_uuid_1", ns: "db.coll1" }
2015-07-08T18:01:54.672+0200 I INDEX [rsSync] building index using bulk method