Production MongoDB routers fail to start

Stefka Dimitrova

Feb 29, 2016, 1:44:59 PM
to mongodb-user

Hello,
I currently support a production MongoDB sharded cluster at Gracenote, configured as follows:

  • 3 shards, each with a primary and 2 secondaries
  • 3 config servers
  • 3 router (mongos) services
  • all of them running MongoDB version 2.4.6
  • all of them in the same data center

Today our Operations group performed a rolling restart of all servers as part of emergency Linux patching. All servers and services came back up, except the router services on two servers, which report the errors shown below. The third router service, however, started fine. All related applications were reconfigured to use only the running router and not the ones that are currently down.


How can I bring the other two router services up?


Here is the log file for the router service on dms-cr2-mongo-1-3:

service mongodb_router start

Thu Feb 25 22:16:45.991 [mongosMain] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Thu Feb 25 22:16:45.991 [mongosMain] options: { config: "/etc/mongodb_router.conf", configdb: "dms-cr2-mongo1-3.globix-sc.gracenote.com:27019,dms-cr2-mongo2-3.globix-sc.gracenote.com:27019,dms-cr2-mongo3-3.globix-sc.gracenote.com:27019", keyFile: "/data/mongokey/mongokey", logpath: "/data/router/log/mongodb.log", port: 27017 }
Thu Feb 25 22:16:46.260 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 22:16:46.472 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 22:16:46.685 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 22:16:46.896 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 22:16:46.896 [mongosMain] ERROR: could not verify that config servers are in sync :: caused by :: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
chunks: "3e093ea58d367d48df9955c4c5a83da7"    chunks: "ba2046891c22ceb79dfc1dbceaa37b2a"
databases: "00debde3ffe02a65dbe4bcbaeacb1fd4"    databases: "00debde3ffe02a65dbe4bcbaeacb1fd4"
Thu Feb 25 22:16:46.896 [mongosMain] configServer connection startup check failed


And for dms-cr2-mongo-2-3:

service mongodb_router start

Thu Feb 25 19:25:48.430 [mongosMain] warning:  couldn't check dbhash on config server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [dms-cr2-mongo2-3.globix-sc.gracenote.com:27019] mongos connectionpool error: couldn't connect to server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019
Thu Feb 25 19:25:48.496 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 19:25:48.565 [mongosMain] warning:  couldn't check dbhash on config server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [dms-cr2-mongo2-3.globix-sc.gracenote.com:27019] mongos connectionpool error: couldn't connect to server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019
Thu Feb 25 19:25:48.631 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 19:25:48.700 [mongosMain] warning:  couldn't check dbhash on config server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [dms-cr2-mongo2-3.globix-sc.gracenote.com:27019] mongos connectionpool error: couldn't connect to server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019
Thu Feb 25 19:25:48.767 [mongosMain] warning: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
Thu Feb 25 19:25:48.767 [mongosMain] ERROR: could not verify that config servers are in sync :: caused by :: config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 differ
chunks: "8f415e0cf812368e29512078431ca607"    chunks: "420016fbdd35884e2ffac54fd6bf3f95"
databases: "00debde3ffe02a65dbe4bcbaeacb1fd4"    databases: "00debde3ffe02a65dbe4bcbaeacb1fd4"
Thu Feb 25 19:25:48.767 [mongosMain] configServer connection startup check failed

Kevin Adistambha

Mar 13, 2016, 7:57:34 PM
to mongodb-user

Hi Stefka,

Apologies for the delay in following up. Were you able to resolve the issue with your config servers?

Thu Feb 25 19:25:48.565 [mongosMain] warning: couldn’t check dbhash on config server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [dms-cr2-mongo2-3.globix-sc.gracenote.com:27019] mongos connectionpool error: couldn’t connect to server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019

What does this message mean?

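According to the logs, the config servers dms-cr2-mongo1-3.globix-sc.gracenote.com:27019 and dms-cr2-mongo3-3.globix-sc.gracenote.com:27019 do not hold the same metadata: the hashes of their chunks collections differ, while the hashes of their databases collections match. mongos verifies that the config servers are in sync when it starts, and because that check fails, both routers stop with "configServer connection startup check failed". Additionally, the router on dms-cr2-mongo-2-3 cannot connect to the config server dms-cr2-mongo2-3.globix-sc.gracenote.com:27019, so there also appears to be a connectivity issue in your cluster.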
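When several routers fail, it can help to pull these two facts (unreachable config servers, and pairs of config servers reported as different) out of each mongos log mechanically. The following is a minimal sketch in JavaScript that also runs in the mongo shell; the function name `summarizeMongosLog` is hypothetical, and the two message patterns it matches are taken from the 2.4.6 log excerpts above, so other versions may word them differently:

```javascript
// Minimal sketch (hypothetical helper): extract unreachable config servers and
// pairs of config servers reported as out of sync from mongos log text.
// The patterns match the mongos 2.4.6 messages quoted in this thread.
function summarizeMongosLog(logText) {
  var unreachable = {};
  var differing = {};
  logText.split("\n").forEach(function (line) {
    // "... couldn't connect to server host:port" (straight or curly apostrophe)
    var conn = line.match(/couldn.t connect to server ([^\s\]]+)/);
    if (conn) {
      unreachable[conn[1]] = true;
    }
    // "... config servers hostA:port and hostB:port differ"
    var diff = line.match(/config servers (\S+) and (\S+) differ/);
    if (diff) {
      differing[diff[1] + " vs " + diff[2]] = true;
    }
  });
  // Object keys act as sets, so repeated log lines are reported only once
  return {
    unreachable: Object.keys(unreachable),
    differingPairs: Object.keys(differing)
  };
}
```

Given the log text from dms-cr2-mongo-2-3 above, this should report dms-cr2-mongo2-3.globix-sc.gracenote.com:27019 as unreachable and the mongo1-3/mongo3-3 pair as differing.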

Things to check

1. Connectivity

To check connectivity, run mongo --host CONFIG:PORT --eval 'db.version()' from each server running mongos against each of your config servers, and confirm that every connection succeeds and returns the expected MongoDB version (2.4.6). Echoing the server version is a simple way to confirm that the connection was successful.
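The same check can be scripted so that each router host tests all three config servers in one pass. This is a minimal sketch, assuming the mongo shell's `new Mongo(host)` constructor, which throws when it cannot connect; the function name `checkConfigServers` and its injected `connect` parameter are hypothetical and exist only so the logic can be exercised without a live cluster:

```javascript
// Minimal sketch (hypothetical helper): try each config server in turn and
// record either its reported version or the error encountered.
// `connect(host)` must return a connection object; in the mongo shell,
// pass function (host) { return new Mongo(host); }.
function checkConfigServers(hosts, connect) {
  return hosts.map(function (host) {
    try {
      var conn = connect(host);
      // buildInfo reports the server version string
      var info = conn.getDB("admin").runCommand({ buildInfo: 1 });
      if (!info.ok) {
        // The command reached the server but failed (e.g. not authorized)
        return { host: host, ok: false, error: info.errmsg };
      }
      return { host: host, ok: true, version: info.version };
    } catch (e) {
      // new Mongo(host) throws when the server cannot be reached
      return { host: host, ok: false, error: String(e) };
    }
  });
}

// Example call, using the config server list from the router options:
// printjson(checkConfigServers(
//   ["dms-cr2-mongo1-3.globix-sc.gracenote.com:27019",
//    "dms-cr2-mongo2-3.globix-sc.gracenote.com:27019",
//    "dms-cr2-mongo3-3.globix-sc.gracenote.com:27019"],
//   function (host) { return new Mongo(host); }));
```

To use it, uncomment the example call, save the file on a router host (the file name is up to you), and run it with mongo --nodb so the shell does not try to connect anywhere first.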

2. Checksum the documents on each config server

Log in to each config server, run the internal dbHash command against the config database, and check whether the hash values of the collections are the same. For example:

mongo --quiet --host CONFIG:PORT --eval 'printjson(db.getSiblingDB("config").runCommand({dbhash:1}))' > dbhash-CONFIG.txt

An example output of the dbHash command is as follows:

> db.runCommand("dbHash")
{
  "numCollections": 13,
  "host": "localhost:28001",
  "collections": {
    "changelog": "1aa40510e4e5c4bbc8a1d1bec9dcca1d",
    "chunks": "66b48f49690b34cfaa3475aa2f3a2f25",
    "collections": "dfa3a37c9efccd91a5d4d98e4fff2d35",
    ...
  },
  ...
}

In particular, for MongoDB version 2.4.x, please compare the dbHash values of these collections:

  • chunks
  • databases
  • shards
  • settings
  • version

The dbHash command computes an MD5 hash over the documents of each collection in the database, so comparing the values across config servers shows exactly which collections differ.
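With three config servers, grouping the hosts by hash for each of these collections shows which server is the odd one out. A minimal sketch, assuming the dbhash replies have already been collected into one object keyed by host (for example by running `new Mongo(host).getDB("config").runCommand({ dbhash: 1 })` for each host in the mongo shell); the names `compareDbHashes` and `CONFIG_COLLECTIONS` are hypothetical:

```javascript
// Collections suggested above for comparison on 2.4.x config servers
var CONFIG_COLLECTIONS = ["chunks", "databases", "shards", "settings", "version"];

// Minimal sketch (hypothetical helper): `results` maps host -> dbhash reply.
// For each collection, hosts are grouped by hash; a collection appears in the
// returned report only if the hosts do not all share the same hash.
function compareDbHashes(results) {
  var report = {};
  CONFIG_COLLECTIONS.forEach(function (coll) {
    var groups = {};
    Object.keys(results).forEach(function (host) {
      var collections = results[host].collections || {};
      var hash = collections[coll] || "(missing)";
      if (!groups[hash]) {
        groups[hash] = [];
      }
      groups[hash].push(host);
    });
    if (Object.keys(groups).length > 1) {
      report[coll] = groups;
    }
  });
  return report;
}
```

Two servers agreeing does not by itself prove that their copy is the correct one, so it is still worth comparing the latest metadata changes on each server before changing anything.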

3. Check the latest metadata updates on each config server

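You can also check which config server has the most recent metadata changes by running these commands against the config database on each config server. Because config.changelog is a capped collection, sorting it by { $natural: -1 } returns the newest entries first; sorting config.chunks by lastmod in descending order lists the chunks with the highest chunk versions:

use config
db.changelog.find().sort({ $natural: -1 }).limit(10);
db.chunks.find().sort({ lastmod: -1 }).limit(10);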
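To compare the results across servers, the newest changelog entry from each config server can be ranked by its `time` field. A minimal sketch; `rankByLatestChange` is a hypothetical helper, and it assumes you have collected, for each host, the first document returned by a changelog query sorted with `{ $natural: -1 }`:

```javascript
// Minimal sketch (hypothetical helper): `latest` maps host -> newest
// config.changelog document on that config server, or null if it has none.
// Returns the hosts ordered from the most to the least recent change.
function rankByLatestChange(latest) {
  return Object.keys(latest)
    .map(function (host) {
      var entry = latest[host];
      return {
        host: host,
        time: entry ? new Date(entry.time).getTime() : null,
        what: entry ? entry.what : null
      };
    })
    .sort(function (a, b) {
      // Hosts without any changelog entry sort last
      if (a.time === b.time) { return 0; }
      if (a.time === null) { return 1; }
      if (b.time === null) { return -1; }
      return b.time - a.time;
    });
}

// Collecting one input document in the mongo shell (one call per config server):
// var doc = new Mongo(host).getDB("config").changelog
//   .find().sort({ $natural: -1 }).limit(1).toArray()[0] || null;
```

The host at the top received the most recent logged metadata change; as with the hash comparison, this is a hint about which copy is current, not proof that it is consistent.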

Next steps

Please let us know the results of these checks. With that information, there should be a clearer picture of what is happening with the config servers.

Please remember to perform a full backup of the three config servers before attempting any maintenance operation. If the config servers are lost, the whole cluster can become unrecoverable.

Best regards,
Kevin
