Hi Marcus
This message typically means that an election took place in the replica set which the current primary was not aware of. This is usually caused by network issues between the nodes in the replica set.
Usually you will see a line similar to this in the mongod logs of the other nodes in the set (i.e. not the primary):
Starting an election, since we've seen no PRIMARY in the past 10000ms
From the Replica set heartbeat documentation: replica set members send heartbeats (pings) to each other every two seconds. If a heartbeat does not return within 10 seconds, the other members mark the delinquent member as inaccessible. It is possible that the rest of the replica set did not receive a heartbeat from the primary within 10 seconds, so an election was called. Please see Replica set elections for more information about MongoDB’s election procedures.
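If it helps, you can confirm the timing values that drive this behaviour directly from the replica set configuration. Below is a minimal mongo shell sketch; the settings fields are the standard ones for protocolVersion 1 replica sets, so verify them against your own rs.conf() output:

// Inspect the heartbeat/election timing in the current replica set config.
var cfg = rs.conf();
printjson(cfg.settings);
print("heartbeat interval (ms): " + cfg.settings.heartbeatIntervalMillis);
print("election timeout (ms): " + cfg.settings.electionTimeoutMillis);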
If you’re still having issues with this, could you post:
- the output of rs.status()
- the logs of the mongod process
Best regards
Kevin
{ "set" : "nfxClickConv", "date" : ISODate("2017-12-13T14:19:12.355+0100"), "myState" : NumberInt(1), "term" : NumberLong(39), "heartbeatIntervalMillis" : NumberLong(2000), "optimes" : { "lastCommittedOpTime" : { "ts" : Timestamp(1513171152, 4), "t" : NumberLong(39) }, "appliedOpTime" : { "ts" : Timestamp(1513171152, 4), "t" : NumberLong(39) }, "durableOpTime" : { "ts" : Timestamp(1513171152, 4), "t" : NumberLong(39) } }, "members" : [ { "_id" : NumberInt(0), "name" : "mongoE1", "health" : NumberInt(1), "state" : NumberInt(2), "stateStr" : "SECONDARY", "uptime" : NumberInt(793219), "optime" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDurable" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), "lastHeartbeat" : ISODate("2017-12-13T14:19:11.565+0100"), "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.465+0100"), "pingMs" : NumberLong(0), "syncingTo" : "mongoE3", "configVersion" : NumberInt(7) }, { "_id" : NumberInt(1), "name" : "mongoE2", "health" : NumberInt(1), "state" : NumberInt(1), "stateStr" : "PRIMARY", "uptime" : NumberInt(793289), "optime" : { "ts" : Timestamp(1513171152, 4), "t" : NumberLong(39) }, "optimeDate" : ISODate("2017-12-13T14:19:12.000+0100"), "electionTime" : Timestamp(1513163422, 2), "electionDate" : ISODate("2017-12-13T12:10:22.000+0100"), "configVersion" : NumberInt(7), "self" : true }, { "_id" : NumberInt(2), "name" : "mongoE3", "health" : NumberInt(1), "state" : NumberInt(2), "stateStr" : "SECONDARY", "uptime" : NumberInt(793219), "optime" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDurable" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), "lastHeartbeat" : ISODate("2017-12-13T14:19:11.568+0100"), "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.486+0100"), "pingMs" : NumberLong(0), "syncingTo" : "mongoE2", "configVersion" : NumberInt(7) }, { "_id" : NumberInt(3), "name" : "mongoE4", "health" : NumberInt(1), "state" : NumberInt(2), "stateStr" : "SECONDARY", "uptime" : NumberInt(431025), "optime" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDurable" : { "ts" : Timestamp(1513171151, 3), "t" : NumberLong(39) }, "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), "lastHeartbeat" : ISODate("2017-12-13T14:19:11.568+0100"), "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.503+0100"), "pingMs" : NumberLong(0), "syncingTo" : "mongoE3", "configVersion" : NumberInt(7) }, { "_id" : NumberInt(4), "name" : "mongoE5", "health" : NumberInt(1), "state" : NumberInt(7), "stateStr" : "ARBITER", "uptime" : NumberInt(431018), "lastHeartbeat" : ISODate("2017-12-13T14:19:11.551+0100"), "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:09.077+0100"), "pingMs" : NumberLong(0), "configVersion" : NumberInt(7) } ], "ok" : NumberInt(1)}
Hi Marcus
If you have ruled out network connectivity issues, another possibility is a priority takeover. This is noted in the log of the soon-to-be primary with a line similar to:
2017-12-14T08:25:04.021+1100 I REPL [ReplicationExecutor] Scheduling priority takeover at 2017-12-14T08:25:14.698+1100
You can check the priority of each node in the replica set config using rs.conf().
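For example, a small mongo shell sketch that lists each member’s priority (field names as in the standard rs.conf() output):

// Print each member's host and election priority.
rs.conf().members.forEach(function (m) {
    print(m.host + "  priority: " + m.priority);
});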
Otherwise, there might be a hint of why the election was called in the mongod logs of the other nodes around the election time. Look for entries before the line [ReplicationExecutor] transition to PRIMARY in the new primary’s log; there should be a series of messages explaining the process leading up to the election.
If you’re running on VMs, you might want to check whether your deployment follows the recommendations outlined in MongoDB on Virtual Environments in the Production Notes.
Best regards
Kevin
shard016:PRIMARY> rs.status(){ "set" : "shard016", "date" : ISODate("2018-02-14T16:43:50.963Z"), "myState" : 1, "term" : NumberLong(26), "heartbeatIntervalMillis" : NumberLong(2000), "members" : [ { "_id" : 0, "name" : "mongodb-server-16:395111", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 54608, "optime" : { "ts" : Timestamp(1518606835, 4198), "t" : NumberLong(26) }, "optimeDate" : ISODate("2018-02-14T11:13:55Z"), "electionTime" : Timestamp(1518572051, 1), "electionDate" : ISODate("2018-02-14T01:34:11Z"), "configVersion" : 209354, "self" : true }, { "_id" : 1, "name" : "mongodb-server-02:395112", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 54599, "optime" : { "ts" : Timestamp(1518606835, 4198), "t" : NumberLong(26) }, "optimeDate" : ISODate("2018-02-14T11:13:55Z"), "lastHeartbeat" : ISODate("2018-02-14T16:43:49.912Z"), "lastHeartbeatRecv" : ISODate("2018-02-14T16:43:49.965Z"), "pingMs" : NumberLong(0), "syncingTo" : "mongodb-server-16:395111", "configVersion" : 209354 }, { "_id" : 2, "name" : "mongodb-server-21:395113", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 54589, "optime" : { "ts" : Timestamp(1518606835, 4198), "t" : NumberLong(26) }, "optimeDate" : ISODate("2018-02-14T11:13:55Z"), "lastHeartbeat" : ISODate("2018-02-14T16:43:49.912Z"), "lastHeartbeatRecv" : ISODate("2018-02-14T16:43:49.912Z"), "pingMs" : NumberLong(0), "syncingTo" : "mongodb-server-02:395112", "configVersion" : 209354 } ], "ok" : 1}shard016:PRIMARY>
2018-02-13T18:15:44.343-0500 I INDEX [repl writer worker 4] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-02-13T18:15:44.358-0500 I INDEX [repl writer worker 4] build index done. scanned 7767 total records. 0 secs
2018-02-13T19:10:38.433-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-server-16:39511
2018-02-13T19:10:38.488-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-server-16:39511, took 55ms (2 connections now open to mongodb-server-16:39511)
2018-02-13T19:11:38.488-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-server-16:39511 because the pool meets constraints; 1 connections to that host remain open
2018-02-13T19:14:31.419-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-server-16:39511
2018-02-13T19:14:31.428-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-server-16:39511, took 9ms (2 connections now open to mongodb-server-16:39511)
2018-02-13T19:15:31.428-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-server-16:39511 because the pool meets constraints; 1 connections to that host remain open
2018-02-13T19:20:50.516-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host mongodb-server-16:39511 due to bad connection status; 0 connections to that host remain open
2018-02-13T19:20:50.516-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Operation timed out
2018-02-13T19:20:51.179-0500 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-02-13T19:20:51.179-0500 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-02-13T19:20:51.179-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.179-0500 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2018-02-13T19:20:51.180-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.181-0500 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 24
2018-02-13T19:20:51.181-0500 I REPL [ReplicationExecutor] transition to PRIMARY
2018-02-13T19:20:51.181-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Ending connection to host mongodb-server-16:39511 due to bad connection status; 0 connections to that host remain open
2018-02-13T19:20:51.181-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.183-0500 I REPL [rsSync] transition to primary complete; database writes are now permitted
2018-02-13T19:21:01.181-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:11.179-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.180-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.181-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.181-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:13.182-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:16.108-0500 I NETWORK [SyncSourceFeedback] Socket recv() timeout 50.15.16.49:39511
2018-02-13T19:21:16.108-0500 I NETWORK [SyncSourceFeedback] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [50.15.16.49:39511]
2018-02-13T19:21:16.108-0500 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host 'mongodb-server-16:39511'
2018-02-13T19:21:16.108-0500 I REPL [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host 'mongodb-server-16:39511' , will retry
2018-02-13T19:21:25.183-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:31.181-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:31.181-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:37.183-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:44.463-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongodb-server-16:39511, took 13282ms (1 connections now open to mongodb-server-16:39511)
2018-02-13T19:21:44.463-0500 I REPL [ReplicationExecutor] Member mongodb-server-16:39511 is now in state PRIMARY
2018-02-13T19:21:46.463-0500 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to mongodb-server-16:39511
2018-02-13T19:21:46.463-0500 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-02-13T19:21:46.463-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:46.473-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongodb-server-16:39511, took 10ms (1 connections now open to mongodb-server-16:39511)
2018-02-13T19:21:46.473-0500 I REPL [ReplicationExecutor] Member mongodb-server-16:39511 is now in state SECONDARY
2018-02-13T19:13:19.383-0500 I COMMAND [conn847] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 114ms
2018-02-13T19:13:46.198-0500 I COMMAND [conn1377] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 108ms
2018-02-13T19:16:39.685-0500 I COMMAND [conn2608] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { durableOpTime: { ts: Timestamp 1518540109000|2, t: 22 }, appliedOpTime: { ts: Timestamp 1518540109000|2, t: 22 }, memberId: 0, cfgver: 209354 }, { durableOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, appliedOpTime: { ts: Timestamp 1518567399000|1, t: 23 }, memberId: 1, cfgver: 209354 }, { durableOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, appliedOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, memberId: 2, cfgver: 209354 } ], $replData: { term: 23, lastOpCommitted: { ts: Timestamp 1518567397000|1, t: 23 }, lastOpVisible: { ts: Timestamp 0|0, t: -1 }, configVersion: 209354, replicaSetId: ObjectId('5a6f923d586ad43df82efe4c'), primaryIndex: 0, syncSourceIndex: 0 } } keyUpdates:0 writeConflicts:0 numYields:0 reslen:134 locks:{} protocol:op_command 495ms
2018-02-13T19:18:09.429-0500 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 1330, after globalLock: 1330, after locks: 1330, after network: 1330, after opcounters: 1330, after opcountersRepl: 1330, after repl: 1330, after sharding: 1330, after storageEngine: 1330, after tcmalloc: 1330, after wiredTiger: 1330, at end: 1340 }
2018-02-13T19:18:09.453-0500 I COMMAND [conn847] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 1130ms
2018-02-13T19:18:49.660-0500 I COMMAND [conn1329] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 105ms
2018-02-13T19:19:00.246-0500 I COMMAND [conn1035] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 123ms
.........
2018-02-13T19:21:44.390-0500 I COMMAND [conn3928] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 29430ms
2018-02-13T19:21:44.390-0500 I COMMAND [conn3422] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 28112ms
2018-02-13T19:21:44.393-0500 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL [ReplicationExecutor] Member mongodb-server-02:39512 is now in state PRIMARY
2018-02-13T19:21:44.398-0500 I COMMAND [conn3474] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 42980ms
2018-02-13T19:21:44.398-0500 I COMMAND [conn3787] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 56567ms
2018-02-13T19:21:44.398-0500 I COMMAND [conn3896] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 49237ms
shard016:PRIMARY> rs.conf()
{
    "_id" : "shard016",
    "version" : 209354,
    "protocolVersion" : NumberLong(1),
    "members" : [
        { "_id" : 0, "host" : "mongodb-server-16:39511", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 5, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 },
        { "_id" : 1, "host" : "mongodb-server-02:39512", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 3, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 },
        { "_id" : 2, "host" : "mongodb-server-21:39513", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : true, "priority" : 0, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
        "replicaSetId" : ObjectId("5a6f923d586ad43df82efe4c")
    }
}
Hi Darshan
> I see the same behaviour in my sharded cluster running MongoDB 3.2.18 on RHEL 7.1
It is very likely that the cause of your issue is different from Marcus’s, although superficially they look similar. Please open a new thread next time so that each thread stays on a single topic; this avoids the confusion of two different conversations in one thread.
I believe this part of the log you posted shows the cause of your issue:
2018-02-13T19:20:50.516-0500 I REPL [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Operation timed out
2018-02-13T19:20:51.179-0500 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
So the secondary lost contact with the primary, tried to ping it, and saw no reply within 10 seconds (a relatively long time in computing terms). It then assumed the primary was down and ran an election to keep the replica set writable.
However, this log from the primary:
2018-02-13T19:21:44.394-0500 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL [ReplicationExecutor] Member mongodb-server-02:39512 is now in state PRIMARY
says that it was surprised by this election event and immediately stepped down to avoid having two primaries in the set. For lack of a better term, this looks like a “one-way partition”, where the secondary cannot see the primary but the primary can see the secondary. Either way, the secondary believed the replica set had no primary and therefore ran an election, as described in Replica Set High Availability.
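One way to test that hypothesis, as a rough sketch only (substitute your own host names, and authenticate first if access control is enabled), is to open a connection in each direction from the mongo shell and run a ping:

// From the secondary, check whether the primary is reachable.
var primary = new Mongo("mongodb-server-16:39511");   // host name taken from your rs.conf()
printjson(primary.getDB("admin").runCommand({ ping: 1 }));
// Then repeat from the primary towards the secondary, e.g.
// new Mongo("mongodb-server-02:39512"), to confirm traffic flows both ways.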
To solve this, please ensure that you have a reliable network connection between members of your replica set.
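If brief network hiccups are unavoidable in your environment, one further option (a sketch only, applicable to protocolVersion 1 replica sets like yours) is to raise the election timeout so that a short loss of heartbeats does not immediately trigger a failover. The 20000 ms value below is purely illustrative; weigh it against how long you can tolerate the set running without a primary:

// Raise electionTimeoutMillis from the default 10000 ms; run this on the current primary.
var cfg = rs.conf();
cfg.settings.electionTimeoutMillis = 20000;   // illustrative value, not a recommendation
rs.reconfig(cfg);

This does not fix the underlying network problem, so I would still treat the connectivity between the members as the main thing to investigate.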
Best regards
Kevin