Hi, I have a 3-member replica set with 1 primary, 1 secondary and 1 arbiter. I had an incident where all the members became secondary and none would get re-elected as primary. Here are the logs from each:
10.0.64.10 (primary):
2014-12-12T02:43:55.067+0000 [conn1413096] end connection 10.0.64.12:58483 (512 connections now open)
2014-12-12T02:43:55.067+0000 [initandlisten] connection accepted from 10.0.64.12:58485 #1413098 (513 connections now open)
2014-12-12T02:44:01.068+0000 [conn1413097] end connection 10.0.64.11:35195 (512 connections now open)
2014-12-12T02:44:01.069+0000 [initandlisten] connection accepted from 10.0.64.11:35197 #1413099 (513 connections now open)
2014-12-12T02:44:14.070+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:19.071+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:22.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:24.072+0000 [rsHealthPoll] replset info 10.0.64.12:27017 just heartbeated us, but our heartbeat failed: , not changing state
2014-12-12T02:44:25.073+0000 [conn1413098] end connection 10.0.64.12:58485 (512 connections now open)
2014-12-12T02:44:27.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:31.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:31.075+0000 [conn1413099] end connection 10.0.64.11:35197 (511 connections now open)
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond):
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:35.873+0000 [initandlisten] connection accepted from 10.0.64.9:43513 #1413100 (512 connections now open)
2014-12-12T02:44:35.878+0000 [conn1413100] authenticate db: admin { authenticate: 1, nonce: "xxx", user: "loguetr", key: "xxx" }
2014-12-12T02:44:36.073+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:38.626+0000 [conn1413100] end connection 10.0.64.9:43513 (511 connections now open)
2014-12-12T02:44:39.074+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond):
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.074+0000 [rsMgr] can't see a majority of the set, relinquishing primary
2014-12-12T02:44:41.074+0000 [rsMgr] replSet relinquishing primary state
2014-12-12T02:44:41.074+0000 [rsMgr] replSet SECONDARY
2014-12-12T02:44:41.074+0000 [rsMgr] replSet closing client sockets after relinquishing primary
10.0.64.11 (secondary):
2014-12-12T02:44:28.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:33.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:36.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond):
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 10.0.64.10:27017
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet syncing to: 10.0.64.10:27017
2014-12-12T02:44:43.145+0000 [initandlisten] connection accepted from 10.0.0.11:40772 #56196 (7 connections now open)
2014-12-12T02:44:45.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond):
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:46.079+0000 [rsMgr] replSet can't see a majority, will not try to elect self
10.0.64.12 (arbiter):
2014-12-12T02:44:20.075+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:23.077+0000 [conn55973] end connection 10.0.64.11:50146 (6 connections now open)
2014-12-12T02:44:25.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond):
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:30.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:35.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:37.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond):
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:40.080+0000 [rsMgr] replSet can't see a majority, will not try to elect self
The environment runs on AWS EC2 instances, which are the db appliances offered by MongoDB.
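
The "can't see a majority" lines in the logs above are consistent with the usual replica set rule that a member can only be (or stay) primary while it can see a strict majority of the voting members, counting itself. Below is a minimal sketch of that arithmetic, not MongoDB's actual election code. It assumes every member has exactly one vote (the replica set config is not shown in this post), and the function names are made up for illustration.

```python
def majority_needed(voting_members: int) -> int:
    """Smallest number of votes that is a strict majority of the set."""
    return voting_members // 2 + 1


def can_be_primary(visible_members: int, voting_members: int) -> bool:
    """visible_members counts the member itself plus every peer it can reach."""
    return visible_members >= majority_needed(voting_members)


# Assumed layout from the post: primary + secondary + arbiter, one vote each.
total = 3

print(majority_needed(total))    # votes required in a 3-member set
print(can_be_primary(1, total))  # a member that can only see itself
print(can_be_primary(2, total))  # a member that can still reach one peer
```

In the logs, each of the three members marks both of its peers as DOWN, so each one sees only itself (1 of 3). Under this rule none of them could hold or win the primary role until at least two members could reach each other again.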