All members of my 3-member replica set became secondary

Dickson Wong

Dec 12, 2014, 1:17:59 PM
to mongod...@googlegroups.com
Hi, I have a 3-member replica set with 1 primary, 1 secondary and 1 arbiter. I had an incident where all of the members became secondary and no primary was re-elected. Here are the logs for each:

Primary db1

2014-12-12T02:43:55.067+0000 [conn1413096] end connection 10.0.64.12:58483 (512 connections now open)
2014-12-12T02:43:55.067+0000 [initandlisten] connection accepted from 10.0.64.12:58485 #1413098 (513 connections now open)
2014-12-12T02:44:01.068+0000 [conn1413097] end connection 10.0.64.11:35195 (512 connections now open)
2014-12-12T02:44:01.069+0000 [initandlisten] connection accepted from 10.0.64.11:35197 #1413099 (513 connections now open)
2014-12-12T02:44:14.070+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:19.071+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:22.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:24.072+0000 [rsHealthPoll] replset info 10.0.64.12:27017 just heartbeated us, but our heartbeat failed: , not changing state
2014-12-12T02:44:25.073+0000 [conn1413098] end connection 10.0.64.12:58485 (512 connections now open)
2014-12-12T02:44:27.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:31.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:31.075+0000 [conn1413099] end connection 10.0.64.11:35197 (511 connections now open)
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond): 
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:35.873+0000 [initandlisten] connection accepted from 10.0.64.9:43513 #1413100 (512 connections now open)
2014-12-12T02:44:35.878+0000 [conn1413100]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "loguetr", key: "xxx" }
2014-12-12T02:44:36.073+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:38.626+0000 [conn1413100] end connection 10.0.64.9:43513 (511 connections now open)
2014-12-12T02:44:39.074+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond): 
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.074+0000 [rsMgr] can't see a majority of the set, relinquishing primary
2014-12-12T02:44:41.074+0000 [rsMgr] replSet relinquishing primary state
2014-12-12T02:44:41.074+0000 [rsMgr] replSet SECONDARY
2014-12-12T02:44:41.074+0000 [rsMgr] replSet closing client sockets after relinquishing primary

Secondary db2

2014-12-12T02:44:28.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:33.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:36.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond): 
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 10.0.64.10:27017
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet syncing to: 10.0.64.10:27017
2014-12-12T02:44:43.145+0000 [initandlisten] connection accepted from 10.0.0.11:40772 #56196 (7 connections now open)
2014-12-12T02:44:45.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond): 
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:46.079+0000 [rsMgr] replSet can't see a majority, will not try to elect self

Arbiter db3

2014-12-12T02:44:20.075+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:23.077+0000 [conn55973] end connection 10.0.64.11:50146 (6 connections now open)
2014-12-12T02:44:25.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond): 
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:30.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:35.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:37.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond): 
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:40.080+0000 [rsMgr] replSet can't see a majority, will not try to elect self

The environment runs on AWS EC2 instances, which are the DB appliances offered by MongoDB.

Stephen Steneker

Dec 15, 2014, 8:17:49 PM
to mongod...@googlegroups.com
On Saturday, 13 December 2014 05:17:59 UTC+11, Dickson Wong wrote:
Hi, I have a 3-member replica set with 1 primary, 1 secondary and 1 arbiter. I had an incident where all of the members became secondary and no primary was re-elected.

Hi Dickson,

For a primary to be elected, a strict majority of the nodes in the replica set need to be able to see each other.

In your case of a three-node replica set, that means at least 2 of the 3 nodes must be available.
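
As a rough illustration of the majority arithmetic (a sketch only; the function names are made up for this example and are not part of MongoDB), assuming every member has one vote:

```python
def votes_needed(voting_members: int) -> int:
    """Smallest member count that is a strict majority (more than half)."""
    return voting_members // 2 + 1


def can_see_majority(voting_members: int, visible: int) -> bool:
    """`visible` counts the member itself plus every voting member it can reach."""
    return visible >= votes_needed(voting_members)


if __name__ == "__main__":
    # Three-node set: two members that can see each other are enough.
    print(votes_needed(3))
    print(can_see_majority(3, 2))
    # A fully isolated member only sees itself, which is not a majority.
    print(can_see_majority(3, 1))
```

In the incident above each member could only see itself, so `visible` was 1 on all three nodes and none of them could hold or win the primary role.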

Based on the logs you have provided, it looks like there was some sort of network event or interruption that left the members unable to communicate with each other.

The current primary couldn't communicate with any other nodes, so it stepped down to secondary:
  2014-12-12T02:44:41.074+0000 [rsMgr] can't see a majority of the set, relinquishing primary

The secondary couldn't see a majority of nodes, so stayed a secondary:
  2014-12-12T02:44:46.079+0000 [rsMgr] replSet can't see a majority, will not try to elect self

The arbiter also wasn't able to connect to the other nodes.
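
To compare the three logs side by side, it can help to pull out only the lines where a member is marked DOWN. The following is a minimal sketch, assuming the log lines have exactly the shape shown in this thread; the regular expression, the function name and the `SAMPLE` constant are illustrative, and the sample lines are copied from the db1 log above:

```python
import re

# Matches health-poll lines such as:
#   <timestamp> [rsHealthPoll] replSet member <host:port> is now in state <STATE>
STATE_RE = re.compile(
    r"^(?P<ts>\S+) \[rsHealthPoll\] replSet member (?P<member>\S+) "
    r"is now in state (?P<state>\w+)"
)


def state_changes(log_text):
    """Return (timestamp, member, state) tuples in the order they appear."""
    events = []
    for line in log_text.splitlines():
        match = STATE_RE.match(line.strip())
        if match:
            events.append(
                (match.group("ts"), match.group("member"), match.group("state"))
            )
    return events


# Lines copied from the db1 log in the original post.
SAMPLE = """\
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:34.000+0000 [other] unrelated line that should be ignored
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.074+0000 [rsMgr] can't see a majority of the set, relinquishing primary
"""

if __name__ == "__main__":
    for ts, member, state in state_changes(SAMPLE):
        print(ts, member, state)
```

The second line of `SAMPLE` is a made-up filler line, included only to show that non-matching lines are skipped. Run against each of the three logs, this shows every member marking both of its peers DOWN within roughly the same 15-second window, which is consistent with a network-level interruption rather than a single failed node.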

Did you make any changes to the firewall or networking configuration, or did the problem resolve itself?

Regards,
Stephen
