Hi!
We have a few MongoDB clusters in production, and something strange happened yesterday while I was upgrading our mongod binaries from 2.0.1 to 2.0.4.
Our setup
- 3 config servers
- 1 SECONDARY
- 1 PRIMARY
- 1 ARBITER
- The PRIMARY, SECONDARY and ARBITER above form one replica set, which is the only shard in this cluster
- A bunch of mongos
Upgrade process
We do not just stop mongod and replace the binary, because of how our servers are managed in the AWS cloud. Instead, we replace the whole server and switch the AWS Elastic IP over to the new node.
1. rs.add("<non-elastic-dns-name>:27017")
2. Wait until the new node is synced
3. rs.remove("<non-elastic-dns-name>:27017")
4. Stop the mongod process on the node we are replacing, elastic-dns-name (otherwise it takes quite a long time until the other nodes notice the change of server in the background)
5. Change the new node's DNS name from non-elastic-dns-name to elastic-dns-name
6. Restart all mongos routers, as we have seen that the non-elastic-dns-name sticks in some routers' view of the world
The new node is now part of the cluster, running the new mongod binary.
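For step 2, a rough sketch of how "synced" could be checked from the output of rs.status(). The helper name isMemberSynced and the sample document are made up for illustration; the only fields assumed from rs.status() are members[].name and members[].state, where state 2 means SECONDARY.

```javascript
// Hypothetical helper (not part of the mongo shell): returns true when the
// named member is listed in an rs.status()-style document as SECONDARY.
function isMemberSynced(status, memberName) {
  var members = status.members || [];
  for (var i = 0; i < members.length; i++) {
    if (members[i].name === memberName) {
      // State 2 is SECONDARY; any other state is treated as "not ready yet".
      return members[i].state === 2;
    }
  }
  // Member not listed at all, so it cannot be synced.
  return false;
}

// Illustrative sample, trimmed to the fields used above (not real output).
var sampleStatus = {
  set: "primarySet1",
  members: [
    { name: "elastic-dns-name1:27017", state: 1, stateStr: "PRIMARY" },
    { name: "non-elastic-dns-name:27017", state: 3, stateStr: "RECOVERING" }
  ]
};

var newNodeReady = isMemberSynced(sampleStatus, "non-elastic-dns-name:27017");
```

In the shell this would be called as isMemberSynced(rs.status(), "<non-elastic-dns-name>:27017") and repeated until it returns true.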
The problem
The problem this time was that when I started a new mongos router after completing step 6 above, it tried to connect to the mongod at non-elastic-dns-name. The reason was that the config servers were distributing a new shard configuration that contained only non-elastic-dns-name.
So, what was previously in the config servers:
> use config
> db.shards.find()
{ "_id" : "primarySet1", "host" : "primarySet1/elastic-dns-name1:27017,elastic-dns-name2:27017" }
Had become:
> use config
> db.shards.find()
{ "_id" : "primarySet1", "host" : "primarySet1/non-elastic-dns-name:27017" }
We have not seen this behavior before. The solution for us was to run a db.shards.update() and put the old configuration back.
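To make the fix concrete, here is a minimal sketch of the check and the update. The helper names (parseShardHost, unexpectedMembers, buildShardFix) are made up for illustration, and the use of $set is an assumption about the shape of the update. The sketch also assumes the host value always has the form "<setName>/<host1>,<host2>" as in the documents shown above.

```javascript
// Split a config.shards "host" value of the form "<setName>/<host1>,<host2>".
function parseShardHost(host) {
  var slash = host.indexOf("/");
  return {
    setName: host.substring(0, slash),
    members: host.substring(slash + 1).split(",")
  };
}

// Return the listed members that are not in the expected list.
function unexpectedMembers(host, expected) {
  var members = parseShardHost(host).members;
  var bad = [];
  for (var i = 0; i < members.length; i++) {
    if (expected.indexOf(members[i]) === -1) {
      bad.push(members[i]);
    }
  }
  return bad;
}

// Build the query and update documents that restore the expected host string.
function buildShardFix(shardId, setName, expected) {
  return {
    query: { _id: shardId },
    update: { $set: { host: setName + "/" + expected.join(",") } }
  };
}

var expected = ["elastic-dns-name1:27017", "elastic-dns-name2:27017"];
var stale = unexpectedMembers("primarySet1/non-elastic-dns-name:27017", expected);
var fix = buildShardFix("primarySet1", "primarySet1", expected);

// In the mongo shell, connected to a mongos, after "use config":
//   db.shards.update(fix.query, fix.update)
```

Here stale lists the host names that should not be in the shard entry, and fix.update carries the original host string from before the upgrade.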
Is this expected behavior, are we doing something stupid in how we do this upgrade, or are we running into a bug?
Best regards
Sebastian Dahlgren