rs.stepDown() did not trigger immediate re-election after becoming secondary


Dmitry Mikhaylov

Dec 13, 2017, 4:15:04 PM
to mongodb-user
Hi!

I was doing maintenance on a 3-node MongoDB 3.4.7 replica set. As per the instructions, I did the maintenance one node at a time, starting with the secondaries. After bringing each node back up, I made sure it had caught up before proceeding to the next one. When the time came to update the current primary, I verified that both secondaries were caught up to within a second and happily issued the `rs.stepDown()` command on the primary, assuming it would trigger an instant re-election and things would go smoothly. Well, they did not. As later log analysis showed, after the current primary stepped down and became a secondary, all 3 nodes waited in the SECONDARY state for 10 seconds before starting a re-election. Of course this had dismal effects on the live system that was using this replica set.
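For reference, the whole procedure boiled down to something like this in the 3.4 shell (the lag check via rs.printSlaveReplicationInfo() is just one way to confirm the secondaries are caught up, shown here for illustration):

// after restarting each secondary, confirm from the shell that it has caught up:
rs.printSlaveReplicationInfo()   // both secondaries should report ~0-1 s behind the primary

// then, once both secondaries are current, on the primary:
rs.stepDown()                    // I expected an immediate re-election at this point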

Here are snippets from my mongodb logs:
2017-12-08 11:31:06.319 prod-mongo-empires2 I COMMAND conn4860 Attempting to step down in response to replSetStepDown command
2017-12-08 11:31:08.326 prod-mongo-empires2 I REPL conn4860 transition to SECONDARY
2017-12-08 11:31:08.336 prod-mongo-empires2 I - conn4775 AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
... the whole shebang of those assertions and "end connection" messages

2017-12-08 11:31:09.574 prod-mongo-empires1 I REPL ReplicationExecutor Member prod-mongo-empires2:27017 is now in state SECONDARY
2017-12-08 11:31:10.010 prod-mongo-empires3 I REPL ReplicationExecutor Member prod-mongo-empires2:27017 is now in state SECONDARY
^^^^^^^^^^^^^^^^^ here all 3 nodes already become SECONDARY, yet...

2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL rsBackgroundSync could not find member to sync from
2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL replication-2 Choosing new sync source because our current sync source, prod-mongo-empires2:27017, has an OpTime ({ ts: Timestamp 1512725466000|194, t: 14 }) which is not ahead of ours ({ ts: Timestamp 1512725466000|194, t: 14 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)
2017-12-08 11:31:11.329 prod-mongo-empires3 I REPL replication-2 Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: prod-mongo-empires2:27017, OpTime { ts: Timestamp 1512725466000|194, t: 14 }, its sync source index:-1
2017-12-08 11:31:11.329 prod-mongo-empires3 W REPL rsBackgroundSync Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source prod-mongo-empires2:27017 (config version: 17; last applied optime: { ts: Timestamp 1512725466000|194, t: 14 }; sync source index: -1; primary index: -1) is no longer valid
2017-12-08 11:31:11.330 prod-mongo-empires1 I REPL replication-7 Choosing new sync source because our current sync source, prod-mongo-empires2:27017, has an OpTime ({ ts: Timestamp 1512725466000|194, t: 14 }) which is not ahead of ours ({ ts: Timestamp 1512725466000|194, t: 14 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)
2017-12-08 11:31:11.331 prod-mongo-empires1 I REPL replication-7 Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: prod-mongo-empires2:27017, OpTime { ts: Timestamp 1512725466000|194, t: 14 }, its sync source index:-1
2017-12-08 11:31:11.331 prod-mongo-empires1 W REPL rsBackgroundSync Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source prod-mongo-empires2:27017 (config version: 17; last applied optime: { ts: Timestamp 1512725466000|194, t: 14 }; sync source index: -1; primary index: -1) is no longer valid
2017-12-08 11:31:11.331 prod-mongo-empires1 I REPL rsBackgroundSync could not find member to sync from
2017-12-08 11:31:16.329 prod-mongo-empires3 I REPL SyncSourceFeedback SyncSourceFeedback error sending update to prod-mongo-empires2:27017: InvalidSyncSource: Sync source was cleared. Was prod-mongo-empires2:27017
2017-12-08 11:31:16.330 prod-mongo-empires1 I REPL SyncSourceFeedback SyncSourceFeedback error sending update to prod-mongo-empires2:27017: InvalidSyncSource: Sync source was cleared. Was prod-mongo-empires2:27017
2017-12-08 11:31:17.936 prod-mongo-empires1 I REPL ReplicationExecutor Starting an election, since we've seen no PRIMARY in the past 10000ms
^^^^^^^^^^^^^^^^^ ONLY HERE WE START RE-ELECTION!!!!!


The rest of the process seemed to go OK, but I post it here just in case:

2017-12-08 11:31:17.936 prod-mongo-empires1 I REPL ReplicationExecutor conducting a dry run election to see if we could be elected
2017-12-08 11:31:17.938 prod-mongo-empires1 I REPL ReplicationExecutor VoteRequester(term 14 dry run) received a yes vote from prod-mongo-empires3:27017; response message: { term: 14, voteGranted: true, reason: "", ok: 1.0 }
2017-12-08 11:31:17.938 prod-mongo-empires1 I REPL ReplicationExecutor dry election run succeeded, running for election
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor VoteRequester(term 15) received a yes vote from prod-mongo-empires3:27017; response message: { term: 15, voteGranted: true, reason: "", ok: 1.0 }
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor election succeeded, assuming primary role in term 15
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor transition to PRIMARY
2017-12-08 11:31:17.941 prod-mongo-empires1 I REPL ReplicationExecutor Entering primary catch-up mode.
2017-12-08 11:31:17.941 prod-mongo-empires1 I ASIO NetworkInterfaceASIO-Replication-0 Connecting to prod-mongo-empires2:27017
2017-12-08 11:31:17.944 prod-mongo-empires1 I ASIO NetworkInterfaceASIO-Replication-0 Successfully connected to prod-mongo-empires2:27017, took 3ms (1 connections now open to prod-mongo-empires2:27017)
2017-12-08 11:31:17.945 prod-mongo-empires1 I REPL ReplicationExecutor Caught up to the latest optime known via heartbeats after becoming primary.
2017-12-08 11:31:17.945 prod-mongo-empires1 I REPL ReplicationExecutor Exited primary catch-up mode.
2017-12-08 11:31:18.329 prod-mongo-empires2 I REPL ReplicationExecutor Member prod-mongo-empires1:27017 is now in state PRIMARY
2017-12-08 11:31:19.328 prod-mongo-empires1 I REPL rsSync transition to primary complete; database writes are now permitted
2017-12-08 11:31:21.332 prod-mongo-empires3 I REPL ReplicationExecutor Member prod-mongo-empires1:27017 is now in state PRIMARY


The whole idea of having 3 nodes sitting there in the SECONDARY state and looking at each other with blank faces for 10 seconds is so ridiculous that I assume something went wrong. Or is it me who is wrong?

BR,
Dmitry

Kevin Adistambha

Dec 14, 2017, 8:20:17 PM
to mongodb-user

Hi Dmitry

This is the correct behaviour, since I don’t believe stepDown is special with regard to the replication election protocol. The command just steps down the primary. Following the election protocol, the rest of the set will then wait until electionTimeoutMillis (10 seconds by default) expires before electing a new primary. Depending on your use case, you could change this setting to a lower value during your maintenance window (see Replica Set Configuration) to achieve your desired behaviour. Please note that lowering the electionTimeoutMillis value (even during a maintenance window) will make the replica set more sensitive to transient network issues.
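For example, something along these lines in the mongo shell (the 2000 ms figure is purely illustrative; pick a value that makes sense for your network):

// run on the current primary before the maintenance window
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 2000    // illustrative value; the default is 10000
rs.reconfig(cfg)

// ... perform the rolling maintenance and rs.stepDown() ...

// afterwards, on whichever node is now primary, restore the default
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 10000
rs.reconfig(cfg)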

There is a feature request to make stepDown special and immediately call an election (SERVER-11086). Please watch/upvote the ticket for updates. You can also comment on the ticket describing your use case to better illustrate your needs.

Best regards
Kevin

Dmitry Mikhaylov

Jan 8, 2018, 3:41:24 AM
to mongodb-user
Hi, Kevin!

Thank you for the reply, and sorry for the delay. I've upvoted the issue you mentioned, and also added a suggestion there that this could be fixed simply by using a smaller timeout for the situation when all known members of the replica set are online and in the SECONDARY state. I'd be glad to provide more context if that can help raise the priority of the issue.

BR,
Dmitry

Weishan Ang

Jan 9, 2018, 4:43:30 AM
to mongodb-user

This blog article might be useful for you. I have used the steps in our production environment with less than a second of interruption.

Dmitry Mikhaylov

Jan 9, 2018, 5:03:37 AM
to mongodb-user
On Tuesday, January 9, 2018 at 11:43:30 AM UTC+2, Weishan Ang wrote:

This blog article might be useful for you. I have used the steps in our production environment with less than a second of interruption.


Thanks a lot!

BR,
Dmitry