I was doing maintenance on a 3-node MongoDB 3.4.7 replica set. As per the instructions, I did maintenance one node at a time, starting with the secondaries. After bringing each node back up, I made sure it had caught up before moving on to the next one. When it was time to update the current primary, I verified that both secondaries were caught up to within a second and happily issued `rs.stepDown()` on the primary, assuming it would trigger an instant re-election and things would go smoothly. Well, they did not. As later log analysis showed, after the old primary stepped down and became a secondary, all 3 nodes sat in the SECONDARY state for 10 seconds before an election was started. Naturally, this had a dismal effect on the live system using this replica set.
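For context, this is roughly what I ran from the mongo shell on the primary. It is a sketch reconstructed from memory rather than an exact transcript; `rs.printSlaveReplicationInfo()` is just the standard 3.4 helper for eyeballing lag, and the defaults noted in the comments are the documented ones:

```javascript
// On the primary, just before the step-down (a sketch, not an exact transcript).

// 1. Confirm both secondaries have essentially no replication lag.
rs.printSlaveReplicationInfo()   // both members reported lag within 1 second

// 2. Step down. With no arguments this is equivalent to rs.stepDown(60, 10):
//    the node will not seek re-election for 60 seconds, and it waits up to
//    10 seconds for a secondary to catch up before actually stepping down.
rs.stepDown()
```

The merged log from all three nodes around that moment, ordered by timestamp: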
2017-12-08 11:31:09.574 prod-mongo-empires1 I REPL ReplicationExecutor Member prod-mongo-empires2:27017 is now in state SECONDARY
2017-12-08 11:31:10.010 prod-mongo-empires3 I REPL ReplicationExecutor Member prod-mongo-empires2:27017 is now in state SECONDARY
^^^^^^^^^^^^^^^^^ here all 3 nodes have already become SECONDARY, yet...
2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL rsBackgroundSync could not find member to sync from
2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL replication-2 Choosing new sync source because our current sync source, prod-mongo-empires2:27017, has an OpTime ({ ts: Timestamp 1512725466000|194, t: 14 }) which is not ahead of ours ({ ts: Timestamp 1512725466000|194, t: 14 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)
2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL replication-2 Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: prod-mongo-empires2:27017, OpTime { ts: Timestamp 1512725466000|194, t: 14 }, its sync source index:-1
2017-12-08 11:31:11.329 prod-mongo-empires3 W REPL rsBackgroundSync Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source prod-mongo-empires2:27017 (config version: 17; last applied optime: { ts: Timestamp 1512725466000|194, t: 14 }; sync source index: -1; primary index: -1) is no longer valid
2017-12-08 11:31:11.330 prod-mongo-empires1 I REPL replication-7 Choosing new sync source because our current sync source, prod-mongo-empires2:27017, has an OpTime ({ ts: Timestamp 1512725466000|194, t: 14 }) which is not ahead of ours ({ ts: Timestamp 1512725466000|194, t: 14 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)
2017-12-08 11:31:11.331 prod-mongo-empires1 I REPL replication-7 Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: prod-mongo-empires2:27017, OpTime { ts: Timestamp 1512725466000|194, t: 14 }, its sync source index:-1
2017-12-08 11:31:11.331 prod-mongo-empires1 W REPL rsBackgroundSync Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source prod-mongo-empires2:27017 (config version: 17; last applied optime: { ts: Timestamp 1512725466000|194, t: 14 }; sync source index: -1; primary index: -1) is no longer valid
2017-12-08 11:31:11.331 prod-mongo-empires1 I REPL rsBackgroundSync could not find member to sync from
2017-12-08 11:31:16.329 prod-mongo-empires3 I REPL SyncSourceFeedback SyncSourceFeedback error sending update to prod-mongo-empires2:27017: InvalidSyncSource: Sync source was cleared. Was prod-mongo-empires2:27017
2017-12-08 11:31:16.330 prod-mongo-empires1 I REPL SyncSourceFeedback SyncSourceFeedback error sending update to prod-mongo-empires2:27017: InvalidSyncSource: Sync source was cleared. Was prod-mongo-empires2:27017
2017-12-08 11:31:17.936 prod-mongo-empires1 I REPL ReplicationExecutor Starting an election, since we've seen no PRIMARY in the past 10000ms
^^^^^^^^^^^^^^^^^ ONLY HERE DOES THE RE-ELECTION START!!!!!
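For what it's worth, the 10000ms in that last message matches `settings.electionTimeoutMillis`, which defaults to 10000 ms under `protocolVersion: 1`, and that default appears to be what this set is running with. It can be checked from the shell like this:

```javascript
// Quick check of the election-related settings on this replica set:
var cfg = rs.conf();
printjson(cfg.settings);   // electionTimeoutMillis should show 10000 (the pv1 default)
```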
The rest of the process seemed to go OK, but I'm posting it here just in case:
2017-12-08 11:31:17.936 prod-mongo-empires1 I REPL ReplicationExecutor conducting a dry run election to see if we could be elected
2017-12-08 11:31:17.938 prod-mongo-empires1 I REPL ReplicationExecutor VoteRequester(term 14 dry run) received a yes vote from prod-mongo-empires3:27017; response message: { term: 14, voteGranted: true, reason: "", ok: 1.0 }
2017-12-08 11:31:17.938 prod-mongo-empires1 I REPL ReplicationExecutor dry election run succeeded, running for election
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor VoteRequester(term 15) received a yes vote from prod-mongo-empires3:27017; response message: { term: 15, voteGranted: true, reason: "", ok: 1.0 }
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor election succeeded, assuming primary role in term 15
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor transition to PRIMARY
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor Entering primary catch-up mode.
2017-12-08 11:31:17.941 prod-mongo-empires1 I ASIO NetworkInterfaceASIO-Replication-0 Connecting to prod-mongo-empires2:27017
2017-12-08 11:31:17.944 prod-mongo-empires1 I ASIO NetworkInterfaceASIO-Replication-0 Successfully connected to prod-mongo-empires2:27017, took 3ms (1 connections now open to prod-mongo-empires2:27017)
2017-12-08 11:31:17.945 prod-mongo-empires1 I REPL ReplicationExecutor Caught up to the latest optime known via heartbeats after becoming primary.
2017-12-08 11:31:17.945 prod-mongo-empires1 I REPL ReplicationExecutor Exited primary catch-up mode.
2017-12-08 11:31:18.329 prod-mongo-empires2 I REPL ReplicationExecutor Member prod-mongo-empires1:27017 is now in state PRIMARY
2017-12-08 11:31:19.328 prod-mongo-empires1 I REPL rsSync transition to primary complete; database writes are now permitted
2017-12-08 11:31:21.332 prod-mongo-empires3 I REPL ReplicationExecutor Member prod-mongo-empires1:27017 is now in state PRIMARY