ReplicaSet member suddenly steps down from Master


inte...@googlemail.com

Dec 8, 2017, 5:32:07 AM
to mongodb-user
Hi,
we have a new replica set consisting of three nodes running MongoDB version 3.4 on Ubuntu 16.04.

Sometimes the primary suddenly steps down without any apparent reason. The following message appears in the log:

I REPL     [conn23637] stepping down from primary, because a new term has begun: 24

I searched and could not find any useful information about this message. Does anybody have any idea what causes it?

Regards, Marcus

Kevin Adistambha

Dec 12, 2017, 11:32:05 PM
to mongodb-user

Hi Marcus

This message is typically the result of an election happening in the set that the current primary was not aware of, which usually points to network issues between the nodes in the replica set.

Usually you will see a line similar to this in the mongod logs of the other nodes in the set (i.e. not the primary):

Starting an election, since we’ve seen no PRIMARY in the past 10000ms

From Replica set heartbeat: Replica set members send heartbeats (pings) to each other every two seconds. If a heartbeat does not return within 10 seconds, the other members mark the delinquent member as inaccessible. It is possible that the rest of the replica set did not receive the primary’s heartbeat after 10 seconds, so an election is called. Please see Replica set elections for more information about MongoDB’s election procedures.
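
To check which thresholds apply to your set, the relevant settings are visible in the replica set configuration. Here is a minimal mongo shell sketch (the field names are the standard replica set settings; the values in the comments are the defaults, not values read from your deployment):

    // Run against any member of the set in the mongo shell.
    var cfg = rs.conf();
    printjson({
        heartbeatIntervalMillis: cfg.settings.heartbeatIntervalMillis,  // default: 2000
        heartbeatTimeoutSecs:    cfg.settings.heartbeatTimeoutSecs,     // default: 10
        electionTimeoutMillis:   cfg.settings.electionTimeoutMillis     // default: 10000 (protocolVersion 1)
    });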

If you’re still having issues with this, could you post:

  • Your exact MongoDB version (e.g. 3.4.10, not just 3.4)
  • Your deployment topology
  • Output of rs.status() (see the shell sketch after this list)
  • Suspicious log lines that mention any kind of failure in any mongod process
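
In case it saves a round trip, a minimal mongo shell sketch for gathering the first three items (nothing here is specific to your deployment):

    db.version()   // exact server version, e.g. "3.4.10"
    rs.status()    // current state of every member as this node sees it
    rs.conf()      // configured topology, including priorities and timeouts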

Best regards
Kevin

inte...@googlemail.com

Dec 13, 2017, 8:38:29 AM
to mongodb-user
Hi Kevin,
thanks for your feedback.

First of all, a bit about our environment.

There are 4 data-bearing nodes plus one arbiter. We have a virtualized environment; each MongoDB node has its own VM. Each server runs Ubuntu 16.04.3 and MongoDB 3.4.10.
I've added our rs.status() output at the end of the post.

Network issues were also our first idea, but I'm not completely sure:
- we checked the network traffic on this VLAN and could not identify any traffic peaks that would explain the step-down or the primary becoming inaccessible.
- in order to eliminate a DNS issue, we added the hostnames to /etc/hosts.
- we could not find anything suspicious in the mongodb logs around the "... stepping down from primary, ..." message. Everything looks normal; there are connections from Cloud Manager as well as from our app. And then, out of nowhere:
2017-12-13T12:10:11.099+0100 I REPL     [conn164727] stepping down from primary, because a new term has begun: 38
2017-12-13T12:10:11.099+0100 I REPL     [replExecDBWorker-0] transition to SECONDARY

We see these step-downs around once a day, and they do not occur at the same time of day.

Here is our rs.status() output:

{
    "set" : "nfxClickConv", 
    "date" : ISODate("2017-12-13T14:19:12.355+0100"), 
    "myState" : NumberInt(1), 
    "term" : NumberLong(39), 
    "heartbeatIntervalMillis" : NumberLong(2000), 
    "optimes" : {
        "lastCommittedOpTime" : {
            "ts" : Timestamp(1513171152, 4), 
            "t" : NumberLong(39)
        }, 
        "appliedOpTime" : {
            "ts" : Timestamp(1513171152, 4), 
            "t" : NumberLong(39)
        }, 
        "durableOpTime" : {
            "ts" : Timestamp(1513171152, 4), 
            "t" : NumberLong(39)
        }
    }, 
    "members" : [
        {
            "_id" : NumberInt(0), 
            "name" : "mongoE1", 
            "health" : NumberInt(1), 
            "state" : NumberInt(2), 
            "stateStr" : "SECONDARY", 
            "uptime" : NumberInt(793219), 
            "optime" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDurable" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "lastHeartbeat" : ISODate("2017-12-13T14:19:11.565+0100"), 
            "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.465+0100"), 
            "pingMs" : NumberLong(0), 
            "syncingTo" : "mongoE3", 
            "configVersion" : NumberInt(7)
        }, 
        {
            "_id" : NumberInt(1), 
            "name" : "mongoE2", 
            "health" : NumberInt(1), 
            "state" : NumberInt(1), 
            "stateStr" : "PRIMARY", 
            "uptime" : NumberInt(793289), 
            "optime" : {
                "ts" : Timestamp(1513171152, 4), 
                "t" : NumberLong(39)
            }, 
            "optimeDate" : ISODate("2017-12-13T14:19:12.000+0100"), 
            "electionTime" : Timestamp(1513163422, 2), 
            "electionDate" : ISODate("2017-12-13T12:10:22.000+0100"), 
            "configVersion" : NumberInt(7), 
            "self" : true
        }, 
        {
            "_id" : NumberInt(2), 
            "name" : "mongoE3", 
            "health" : NumberInt(1), 
            "state" : NumberInt(2), 
            "stateStr" : "SECONDARY", 
            "uptime" : NumberInt(793219), 
            "optime" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDurable" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "lastHeartbeat" : ISODate("2017-12-13T14:19:11.568+0100"), 
            "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.486+0100"), 
            "pingMs" : NumberLong(0), 
            "syncingTo" : "mongoE2", 
            "configVersion" : NumberInt(7)
        }, 
        {
            "_id" : NumberInt(3), 
            "name" : "mongoE4", 
            "health" : NumberInt(1), 
            "state" : NumberInt(2), 
            "stateStr" : "SECONDARY", 
            "uptime" : NumberInt(431025), 
            "optime" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDurable" : {
                "ts" : Timestamp(1513171151, 3), 
                "t" : NumberLong(39)
            }, 
            "optimeDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "optimeDurableDate" : ISODate("2017-12-13T14:19:11.000+0100"), 
            "lastHeartbeat" : ISODate("2017-12-13T14:19:11.568+0100"), 
            "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:10.503+0100"), 
            "pingMs" : NumberLong(0), 
            "syncingTo" : "mongoE3", 
            "configVersion" : NumberInt(7)
        }, 
        {
            "_id" : NumberInt(4), 
            "name" : "mongoE5", 
            "health" : NumberInt(1), 
            "state" : NumberInt(7), 
            "stateStr" : "ARBITER", 
            "uptime" : NumberInt(431018), 
            "lastHeartbeat" : ISODate("2017-12-13T14:19:11.551+0100"), 
            "lastHeartbeatRecv" : ISODate("2017-12-13T14:19:09.077+0100"), 
            "pingMs" : NumberLong(0), 
            "configVersion" : NumberInt(7)
        }
    ], 
    "ok" : NumberInt(1)
}

Please let me know if you need more information.
Best regards, 
Marcus

Kevin Adistambha

Dec 13, 2017, 5:20:04 PM
to mongodb-user

Hi Marcus

If you ruled out network connectivity issues, another possibility is a priority takeover. This is noted in the log of the to-be primary with a line similar to:

2017-12-14T08:25:04.021+1100 I REPL [ReplicationExecutor] Scheduling priority takeover at 2017-12-14T08:25:14.698+1100

You can check the priority of the nodes in the replica set config using rs.conf().
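
For example, a minimal shell sketch that lists each member's priority (it only assumes you are connected to a member of the set):

    rs.conf().members.forEach(function (m) {
        print(m.host + "  priority=" + m.priority);
    });

A member configured with a higher priority than the current primary will call a priority takeover once it has caught up, which would explain a step-down even without any network problem.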

Otherwise, there might be a hint of why the election was called in the other mongod logs around the election time. Look for entries before the line [ReplicationExecutor] transition to PRIMARY in the new primary's log. There should be a series of messages explaining the process leading up to the election.

If you're running on VMs, you might want to check whether your deployment follows the recommendations outlined in MongoDB on Virtual Environments in the Production Notes.

Best regards
Kevin

Darshan Shah

Feb 14, 2018, 12:01:30 PM
to mongodb-user
Hi Kevin,

I see the same behaviour in my sharded cluster running MongoDB 3.2.18 on RHEL 7.1.
The cluster has 30 shards, each shard being a 3-member replica set.
All the required info is attached below.

Thanks!

Here is the output of rs.status() after the cluster was restarted to fix this issue:

shard016:PRIMARY> rs.status()
{
        "set" : "shard016",
        "date" : ISODate("2018-02-14T16:43:50.963Z"),
        "myState" : 1,
        "term" : NumberLong(26),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "mongodb-server-16:395111",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 54608,
                        "optime" : {
                                "ts" : Timestamp(1518606835, 4198),
                                "t" : NumberLong(26)
                        },
                        "optimeDate" : ISODate("2018-02-14T11:13:55Z"),
                        "electionTime" : Timestamp(1518572051, 1),
                        "electionDate" : ISODate("2018-02-14T01:34:11Z"),
                        "configVersion" : 209354,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "mongodb-server-02:395112",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 54599,
                        "optime" : {
                                "ts" : Timestamp(1518606835, 4198),
                                "t" : NumberLong(26)
                        },
                        "optimeDate" : ISODate("2018-02-14T11:13:55Z"),
                        "lastHeartbeat" : ISODate("2018-02-14T16:43:49.912Z"),
                        "lastHeartbeatRecv" : ISODate("2018-02-14T16:43:49.965Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "mongodb-server-16:395111",
                        "configVersion" : 209354
                },
                {
                        "_id" : 2,
                        "name" : "mongodb-server-21:395113",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 54589,
                        "optime" : {
                                "ts" : Timestamp(1518606835, 4198),
                                "t" : NumberLong(26)
                        },
                        "optimeDate" : ISODate("2018-02-14T11:13:55Z"),
                        "lastHeartbeat" : ISODate("2018-02-14T16:43:49.912Z"),
                        "lastHeartbeatRecv" : ISODate("2018-02-14T16:43:49.912Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "mongodb-server-02:395112",
                        "configVersion" : 209354
                }
        ],
        "ok" : 1
}
shard016:PRIMARY>


Here is the log from the Secondary that transitioned to Primary:

2018-02-13T18:15:44.343-0500 I INDEX    [repl writer worker 4]   building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-02-13T18:15:44.358-0500 I INDEX    [repl writer worker 4] build index done.  scanned 7767 total records. 0 secs
2018-02-13T19:10:38.433-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-server-16:39511
2018-02-13T19:10:38.488-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-server-16:39511, took 55ms (2 connections now open to mongodb-server-16:39511)
2018-02-13T19:11:38.488-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-server-16:39511 because the pool meets constraints; 1 connections to that host remain open
2018-02-13T19:14:31.419-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-server-16:39511
2018-02-13T19:14:31.428-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-server-16:39511, took 9ms (2 connections now open to mongodb-server-16:39511)
2018-02-13T19:15:31.428-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-server-16:39511 because the pool meets constraints; 1 connections to that host remain open
2018-02-13T19:20:50.516-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host mongodb-server-16:39511 due to bad connection status; 0 connections to that host remain open
2018-02-13T19:20:50.516-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Operation timed out
2018-02-13T19:20:51.179-0500 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-02-13T19:20:51.179-0500 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-02-13T19:20:51.179-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.179-0500 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2018-02-13T19:20:51.180-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.181-0500 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 24
2018-02-13T19:20:51.181-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
2018-02-13T19:20:51.181-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending connection to host mongodb-server-16:39511 due to bad connection status; 0 connections to that host remain open
2018-02-13T19:20:51.181-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:20:51.183-0500 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2018-02-13T19:21:01.181-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:11.179-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.180-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.181-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:11.181-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:13.182-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:16.108-0500 I NETWORK  [SyncSourceFeedback] Socket recv() timeout  50.15.16.49:39511
2018-02-13T19:21:16.108-0500 I NETWORK  [SyncSourceFeedback] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [50.15.16.49:39511]
2018-02-13T19:21:16.108-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host 'mongodb-server-16:39511'
2018-02-13T19:21:16.108-0500 I REPL     [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host 'mongodb-server-16:39511' , will retry
2018-02-13T19:21:25.183-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:31.181-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-server-16:39511 - ExceededTimeLimit: Operation timed out
2018-02-13T19:21:31.181-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:37.183-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-02-13T19:21:44.463-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongodb-server-16:39511, took 13282ms (1 connections now open to mongodb-server-16:39511)
2018-02-13T19:21:44.463-0500 I REPL     [ReplicationExecutor] Member mongodb-server-16:39511 is now in state PRIMARY
2018-02-13T19:21:46.463-0500 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to mongodb-server-16:39511
2018-02-13T19:21:46.463-0500 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-02-13T19:21:46.463-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-server-16:39511
2018-02-13T19:21:46.473-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongodb-server-16:39511, took 10ms (1 connections now open to mongodb-server-16:39511)
2018-02-13T19:21:46.473-0500 I REPL     [ReplicationExecutor] Member mongodb-server-16:39511 is now in state SECONDARY



Here is the log from the Primary that stepped down to Secondary:

2018-02-13T19:13:19.383-0500 I COMMAND  [conn847] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 114ms
2018-02-13T19:13:46.198-0500 I COMMAND  [conn1377] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 108ms
2018-02-13T19:16:39.685-0500 I COMMAND  [conn2608] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { durableOpTime: { ts: Timestamp 1518540109000|2, t: 22 }, appliedOpTime: { ts: Timestamp 1518540109000|2, t: 22 }, memberId: 0, cfgver: 209354 }, { durableOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, appliedOpTime: { ts: Timestamp 1518567399000|1, t: 23 }, memberId: 1, cfgver: 209354 }, { durableOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, appliedOpTime: { ts: Timestamp 1518567397000|1, t: 23 }, memberId: 2, cfgver: 209354 } ], $replData: { term: 23, lastOpCommitted: { ts: Timestamp 1518567397000|1, t: 23 }, lastOpVisible: { ts: Timestamp 0|0, t: -1 }, configVersion: 209354, replicaSetId: ObjectId('5a6f923d586ad43df82efe4c'), primaryIndex: 0, syncSourceIndex: 0 } } keyUpdates:0 writeConflicts:0 numYields:0 reslen:134 locks:{} protocol:op_command 495ms
2018-02-13T19:18:09.429-0500 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 1330, after globalLock: 1330, after locks: 1330, after network: 1330, after opcounters: 1330, after opcountersRepl: 1330, after repl: 1330, after sharding: 1330, after storageEngine: 1330, after tcmalloc: 1330, after wiredTiger: 1330, at end: 1340 }
2018-02-13T19:18:09.453-0500 I COMMAND  [conn847] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 1130ms
2018-02-13T19:18:49.660-0500 I COMMAND  [conn1329] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 105ms
2018-02-13T19:19:00.246-0500 I COMMAND  [conn1035] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1035 locks:{ Global: { acquireCount: { r: 26 } }, Database: { acquireCount: { r: 13 } } } protocol:op_command 123ms
...
...
...
2018-02-13T19:21:44.390-0500 I COMMAND  [conn3928] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 29430ms
2018-02-13T19:21:44.390-0500 I COMMAND  [conn3422] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 28112ms
2018-02-13T19:21:44.393-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL     [ReplicationExecutor] Member mongodb-server-02:39512 is now in state PRIMARY
2018-02-13T19:21:44.398-0500 I COMMAND  [conn3474] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 42980ms
2018-02-13T19:21:44.398-0500 I COMMAND  [conn3787] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 56567ms
2018-02-13T19:21:44.398-0500 I COMMAND  [conn3896] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:447 locks:{} protocol:op_query 49237ms

Darshan Shah

Feb 14, 2018, 12:08:19 PM
to mongodb-user
Since the previous post became too long, I split it into two posts.

Here is the output of rs.conf():


shard016:PRIMARY> rs.conf()
{
        "_id" : "shard016",
        "version" : 209354,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 0,
                        "host" : "mongodb-server-16:39511",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 5,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "mongodb-server-02:39512",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 3,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "mongodb-server-21:39513",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : true,
                        "priority" : 0,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("5a6f923d586ad43df82efe4c")
        }
}



Please review and let me know how to solve this intermittent problem.

Thanks!

Kevin Adistambha

Feb 21, 2018, 12:13:59 AM
to mongodb-user

Hi Darshan

I see the same behaviour in my sharded cluster running MongoDB 3.2.18 on RHEL 7.1.

It is very likely that the cause of your issue is different from Marcus's, although superficially they look similar. Please open a new thread next time so that each thread stays on a single topic; it prevents confusion from two different conversations being mixed together.

I believe this part of the log you posted shows the cause of your issue:

2018-02-13T19:20:50.516-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-server-16:39511; ExceededTimeLimit: Operation timed out
2018-02-13T19:20:51.179-0500 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms

So the secondary lost contact with the primary, tried to ping it, and did not see a reply within 10 seconds (which is a relatively long time in computing terms). It then assumed the primary was down and ran an election to keep the replica set writable.

However, this log from the primary:

2018-02-13T19:21:44.394-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 24
2018-02-13T19:21:44.394-0500 I REPL     [ReplicationExecutor] Member mongodb-server-02:39512 is now in state PRIMARY

shows that it was surprised by this election event and immediately stepped down to prevent two primaries in the set. For lack of a better term, this looks like a “one-way partition” (where the secondary cannot see the primary, but the primary can see the secondary). Either way, the secondary believed the replica set was without a primary and thus ran an election, as per the replica set high availability design.

To solve this, please ensure that you have a reliable network connection between members of your replica set.
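
While you investigate the network, a rough way to spot this kind of one-way problem from the database side is to check how long ago each member last heard from its peers. This is only a monitoring sketch; the 10-second threshold simply mirrors the election timeout discussed above and is an assumption, not a tuned value:

    // Run on each member in turn; flags peers we have not heard from recently.
    var now = new Date();
    rs.status().members.forEach(function (m) {
        if (m.self) return;                     // skip the member we are connected to
        var ageMs = now - m.lastHeartbeatRecv;  // ms since a heartbeat was last received from this peer
        print(m.name + "  lastHeartbeatRecv " + ageMs + " ms ago" + (ageMs > 10000 ? "  <-- stale" : ""));
    });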

Best regards
Kevin
