Guys,
I wanted your thoughts / opinions on what might have happened in my local replication cluster.
BACKGROUND:
I am running my mongo cluster [current stable release, 3.2.6] on CentOS 6 VMs. The application was reading from and writing to the cluster, with replication happening almost immediately. I haven't changed anything from the default replication configuration. It's a 2-node cluster (n0 PRIMARY & n1 SECONDARY) and the data / read-write volume is insignificant. As it's my local cluster, I am the only user / client. In other words, the application was able to read and write seamlessly, all reads and writes went through the PRIMARY node, and data was replicated without any issues.
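For reference, the set was originally created along these lines (a sketch from memory, not a transcript; hostnames are my local ones):

// on n0, in the mongo shell, right after the first start with --replSet
rs.initiate()
rs.add("n1:27017")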
The cluster has been running for a couple of weeks and I hadn't seen any issue until this morning. Something changed: I noticed that replication is NO longer happening, and the reason is that the replicaSetIds no longer match. The logs had:
"New and old configurations differ in replica set ID; old was 572945ca042e24d5daeaaecf, and new is 572c026a4739cf7c1f78992d"
Here is the command I use to start the mongo daemon in replication mode:
mongod --dbpath /usr/local/data --replSet xxx-cluster --fork --logpath /var/log/cluster.log
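(Side note: as I understand it, the same options could equally go into a YAML config file passed with --config; a rough equivalent of the command above would be something like the following, though I actually start it with the command line as shown:)

storage:
  dbPath: /usr/local/data
systemLog:
  destination: file
  path: /var/log/cluster.log
processManagement:
  fork: true
replication:
  replSetName: xxx-cluster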
The question is: how did the replicaSetId change? I thought the ID is set during rs.initiate(), and trying to run rs.initiate() again does not succeed. In my case, however, it looks like the initiate happened AGAIN (on a cluster where replication was already running). The mongodb instance was restarted on one particular node (n0), and it's on that node that the initiate happened (please see the log excerpt below).
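(For what it's worth, the id each node is currently carrying can also be read directly in the mongo shell rather than from the logs; this is just how one would check it, assuming the default port:)

// run on each node and compare; with protocolVersion 1 the id lives
// under settings in the replica set config
rs.conf().settings.replicaSetId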
Here is what I see in the mongodb log excerpt from today (13th May):
2016-05-13T10:37:32.700-0400 I REPL [ReplicationExecutor] New replica set config in use: { _id: "xxx-cluster", version: 101930, protocolVersion: 1, members: [ { _id: 0, host: "n0:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('572945ca042e24d5daeaaecf') } }
The following log statements are from the other node (n1), from when n0 & n1 were originally added to the same replication cluster (on 9th May). Notice the replicaSetId then: n1 still carries it, but it no longer matches n0's, which is why the heartbeat failed and replication stopped.
2016-05-09T22:51:51.919-0400 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-05-09T22:51:51.921-0400 I NETWORK [initandlisten] waiting for connections on port 27017
2016-05-09T22:51:51.927-0400 I REPL [ReplicationExecutor] New replica set config in use: { _id: "xxx-cluster", version: 2, protocolVersion: 1, members: [ { _id: 0, host: "n0:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "n1:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('572c026a4739cf7c1f78992d') } }
2016-05-09T22:51:51.927-0400 I REPL [ReplicationExecutor] This node is n1:27017 in the config
2016-05-09T22:51:51.927-0400 I REPL [ReplicationExecutor] transition to STARTUP2
2016-05-09T22:51:51.927-0400 I REPL [ReplicationExecutor] Starting replication applier threads
2016-05-09T22:51:51.928-0400 I REPL [ReplicationExecutor] transition to RECOVERING
2016-05-09T22:51:51.935-0400 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to n0:27017
2016-05-09T22:51:51.937-0400 I REPL [ReplicationExecutor] Member n0:27017 is now in state SECONDARY
2016-05-09T22:51:51.937-0400 I REPL [ReplicationExecutor] transition to SECONDARY
2016-05-09T22:51:52.040-0400 I NETWORK [initandlisten] connection accepted from 192.168.90.10:46654 #1 (1 connection now open)
2016-05-09T22:51:53.968-0400 I NETWORK [initandlisten] connection accepted from 127.0.0.1:45105 #2 (2 connections now open)
2016-05-09T22:51:56.939-0400 I REPL [ReplicationExecutor] Member n0:27017 is now in state PRIMARY
2016-05-09T22:51:56.967-0400 I NETWORK [initandlisten] connection accepted from 127.0.0.1:45106 #3 (3 connections now open)
2016-05-09T22:51:57.937-0400 I REPL [ReplicationExecutor] syncing from: n0:27017
2016-05-09T22:51:57.943-0400 I REPL [SyncSourceFeedback] setting syncSourceFeedback to n0:27017
2016-05-09T22:51:57.949-0400 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to n0:27017
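(Another way to see the same breakage, instead of reading the logs, would be rs.status() on n0; the members array shows each member's state and a lastHeartbeatMessage explaining why the heartbeat to n1 is failing. I mainly went by the logs, so this is just a pointer:)

// on n0, mongo shell
rs.status()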
After 10:37 AM today, I had to drop the local db on both nodes and set up replication again from scratch to get back to the previous state. Basically: start the instance in non-replication (standalone) mode, drop the local db, then start in replication mode again, initiate, and add the nodes to the cluster, roughly as sketched below.
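Roughly, the recovery steps were (paths and names are the same as above; not a verbatim transcript):

# 1. stop mongod on each node, then start it WITHOUT --replSet (standalone)
mongod --dbpath /usr/local/data --shutdown
mongod --dbpath /usr/local/data --fork --logpath /var/log/cluster.log
# 2. drop the local db on each node
mongo --eval 'db.getSiblingDB("local").dropDatabase()'
# 3. stop again and restart WITH --replSet on both nodes
mongod --dbpath /usr/local/data --shutdown
mongod --dbpath /usr/local/data --replSet xxx-cluster --fork --logpath /var/log/cluster.log
# 4. on n0 only: initiate a fresh set and add n1 back
mongo --eval 'rs.initiate()'
mongo --eval 'rs.add("n1:27017")'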
QUESTIONS:
- Why did n0 initialize the replica set (on 13th May, when it was already part of the replication that had been running since 9th May)? I thought a mongo replica set can only be initialized via rs.initiate() (maybe I'm missing something here). Moreover, on a cluster where replication is already initialized, trying to run rs.initiate() has always failed for me with the error "already initialized". How did n0 succeed (on May 13th, when it was rebooted), and what triggered the initialize operation? All I did was shut down the instance and start it again using the replication-mode command above, and then I noticed the log excerpt shown above.
- How should one recover from a situation like this if it happens again? What I did was drop the local db with mongod running in normal (a.k.a. standalone) mode, then start mongodb in replication mode again, run rs.initiate(), and add n1 back via rs.add() as I did before, effectively setting up a new replica set configuration. Is there a better way?
Thanks,
Ganesh