Hi,
We are running MongoDB v2.0.4 as a replica set with 3 nodes.
Recently we ran into replication lag, which we tracked down to the two replica set slaves having each other as their syncingTo target after a transient network failure at the master.
Is this a known (or already fixed) issue, or should I file a Jira bug for it?
Thanks.
Relevant log entries (obtained with `fgrep "[rs" mongodb.log`; hostnames and replica set names have been changed to protect the innocent):
_NODE_F_ (master before and after the network flap):
Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_D_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_D_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_F_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_D_:27017 is now in state DOWN
Wed Jun 20 18:09:09 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:09 [rsHealthPoll] replSet info _NODE_E_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_E_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_F_:27017" }
Wed Jun 20 18:09:09 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state DOWN
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_D_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_D_:27017 is now in state SECONDARY
Wed Jun 20 18:09:11 [rsHealthPoll] replSet member _NODE_E_:27017 is up
Wed Jun 20 18:09:11 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state RECOVERING
Wed Jun 20 18:09:13 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY
_NODE_E_ (temporary master during _NODE_F_ outage):
Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_F_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_F_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_E_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state DOWN
Wed Jun 20 18:09:08 [rsMgr] replSet info electSelf 0
Wed Jun 20 18:09:08 [rsMgr] replSet PRIMARY
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state PRIMARY
Wed Jun 20 18:09:11 [rsSync] replSet syncing to: _NODE_D_:27017
Wed Jun 20 18:09:12 [rsSync] replSet SECONDARY
_NODE_D_ (forever slave):
Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_F_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_F_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_D_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state DOWN
Wed Jun 20 18:09:08 [rsMgr] not electing self, _NODE_E_:27017 would veto
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state PRIMARY
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state PRIMARY
Wed Jun 20 18:09:10 [rsMgr] replSet info two primaries (transiently)
Wed Jun 20 18:09:10 [rsSync] replSet syncing to: _NODE_E_:27017
Wed Jun 20 18:09:12 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY
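For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that pulls the last "syncing to" target out of each node's log and reports whether the targets form a loop. It assumes the `[rsSync] replSet syncing to: HOST` line format shown above; the helper names (`last_sync_target`, `find_sync_cycle`) are made up for illustration and are not part of MongoDB. Note that the log only shows the target chosen at that moment, so a node may have switched targets later without it being visible in a short excerpt.

```python
import re

# Matches lines such as:
#   Wed Jun 20 18:09:11 [rsSync] replSet syncing to: _NODE_D_:27017
SYNC_RE = re.compile(r"\[rsSync\] replSet syncing to: (\S+)")


def last_sync_target(log_lines):
    """Return the most recent 'syncing to' host found in a node's log, or None."""
    target = None
    for line in log_lines:
        match = SYNC_RE.search(line)
        if match:
            target = match.group(1)
    return target


def find_sync_cycle(sync_targets):
    """Given {node: sync target or None}, return the nodes forming a loop, or []."""
    for start in sync_targets:
        seen = []
        node = start
        # Follow the chain of sync targets until it ends or revisits a node.
        while node is not None and node not in seen:
            seen.append(node)
            node = sync_targets.get(node)
        if node is not None:
            return seen[seen.index(node):]
    return []


# One representative line per node, copied from the excerpts above.
logs = {
    "_NODE_F_:27017": [
        "Wed Jun 20 18:09:13 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY",
    ],
    "_NODE_E_:27017": [
        "Wed Jun 20 18:09:11 [rsSync] replSet syncing to: _NODE_D_:27017",
    ],
    "_NODE_D_:27017": [
        "Wed Jun 20 18:09:10 [rsSync] replSet syncing to: _NODE_E_:27017",
    ],
}

targets = {node: last_sync_target(lines) for node, lines in logs.items()}
cycle = find_sync_cycle(targets)
print("sync loop:", cycle if cycle else "none found")
```

On the excerpts above this reports _NODE_E_ and _NODE_D_ as syncing from each other, with neither of them pulling from _NODE_F_.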