Re: Slaves syncingTo each other

Kristina Chodorow

Jun 29, 2012, 4:29:24 PM

to mongod...@googlegroups.com

There's one known case where it can happen (https://jira.mongodb.org/browse/SERVER-5258), which looks like what you hit. Node E should have closed all sockets when it became a secondary, which would have forced Node D to find someone else to sync to.

On Monday, June 25, 2012 9:25:06 AM UTC-4, Alexey Preobrazhensky wrote:

Hi,

We are using mongodb v2.0.4 in a replica set with 3 nodes.
Recently we encountered replication lag, which we tracked down to the two slaves in the replica set syncing to each other after a transient network failure at the master.

Is that a known (or already fixed) issue, or should I file a Jira bug for it?

Thanks.

Relevant log entries (obtained with `fgrep "[rs" mongodb.log`; hostnames and replica set names have been changed to protect the innocent):

_NODE_F_ (master before and after the network flap):

Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_D_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_D_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_F_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_D_:27017 is now in state DOWN
Wed Jun 20 18:09:09 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:09 [rsHealthPoll] replSet info _NODE_E_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_E_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_F_:27017" }
Wed Jun 20 18:09:09 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state DOWN
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_D_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_D_:27017 is now in state SECONDARY
Wed Jun 20 18:09:11 [rsHealthPoll] replSet member _NODE_E_:27017 is up
Wed Jun 20 18:09:11 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state RECOVERING
Wed Jun 20 18:09:13 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY

_NODE_E_ (temporary master during _NODE_F_ outage):

Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_F_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_F_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_E_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state DOWN
Wed Jun 20 18:09:08 [rsMgr] replSet info electSelf 0
Wed Jun 20 18:09:08 [rsMgr] replSet PRIMARY
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state PRIMARY
Wed Jun 20 18:09:11 [rsSync] replSet syncing to: _NODE_D_:27017
Wed Jun 20 18:09:12 [rsSync] replSet SECONDARY

_NODE_D_ (forever slave):

Wed Jun 20 18:09:08 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jun 20 18:09:08 [rsHealthPoll] replSet info _NODE_F_:27017 is down (or slow to respond): DBClientBase::findN: transport error: _NODE_F_:27017 query: { replSetHeartbeat: "_RS_", v: 3, pv: 1, checkEmpty: false, from: "_NODE_D_:27017" }
Wed Jun 20 18:09:08 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state DOWN
Wed Jun 20 18:09:08 [rsMgr] not electing self, _NODE_E_:27017 would veto
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state PRIMARY
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is up
Wed Jun 20 18:09:10 [rsHealthPoll] replSet member _NODE_F_:27017 is now in state PRIMARY
Wed Jun 20 18:09:10 [rsMgr] replSet info two primaries (transiently)
Wed Jun 20 18:09:10 [rsSync] replSet syncing to: _NODE_E_:27017
Wed Jun 20 18:09:12 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY
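
The cycle in these logs (_NODE_E_ syncing to _NODE_D_ while _NODE_D_ syncs to _NODE_E_) can be spotted mechanically. Below is a minimal Python sketch, assuming the log text has been collected per node as above. The function names `last_sync_targets` and `find_sync_cycles` and the input dict layout are hypothetical, and the regex only matches the `[rsSync] replSet syncing to: host:port` line format shown in this thread; other server versions may log this differently.

```python
import re

# Matches the 2.0-style rsSync line seen in the logs above and
# captures the host name without the port.
SYNC_RE = re.compile(r"\[rsSync\] replSet syncing to: (\S+?):\d+")


def last_sync_targets(logs):
    """Map each node to the last host it reported syncing to.

    `logs` is a dict of node name -> that node's log text
    (hypothetical layout, chosen for this sketch).
    Nodes with no rsSync line are left out of the result.
    """
    targets = {}
    for node, text in logs.items():
        for line in text.splitlines():
            match = SYNC_RE.search(line)
            if match:
                targets[node] = match.group(1)
    return targets


def find_sync_cycles(targets):
    """Return sorted pairs of nodes that are syncing to each other."""
    pairs = set()
    for node, target in targets.items():
        if node != target and targets.get(target) == node:
            pairs.add(tuple(sorted((node, target))))
    return sorted(pairs)


# Log lines copied from the excerpts above.
logs = {
    "_NODE_E_": "Wed Jun 20 18:09:11 [rsSync] replSet syncing to: _NODE_D_:27017",
    "_NODE_D_": "Wed Jun 20 18:09:10 [rsSync] replSet syncing to: _NODE_E_:27017",
    "_NODE_F_": "Wed Jun 20 18:09:13 [rsHealthPoll] replSet member _NODE_E_:27017 is now in state SECONDARY",
}

print(find_sync_cycles(last_sync_targets(logs)))
# prints [('_NODE_D_', '_NODE_E_')]
```

This only detects two-node loops, which is all a 3-node set with one master can produce; longer chains would need a proper graph walk.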
