MongoS server not starting and not balancing: "Server's sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset"

Virendra Agarwal

May 2, 2016, 11:08:19 AM5/2/16
to mongodb-user
My mongos servers are not starting; they are logging this error:

    SHARDING [Balancer] caught exception while doing balance: Server's sharding  metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFound: unable to resolve DNS for host confserv_1.xyz.com  

    2016-05-02T17:57:06.612+0530 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "DB2255-2016-05-02T17:57:06.611+0530-5727479aa1051c5fb04fcc49", server: "mongoS1", clientAddr: "", time: new Date(1462192026611), what: "balancer.round", ns: "", details: { executionTimeMillis: 35, errorOccured: true, errmsg: "Server's sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFoun..." } }  

When I connect to the config server using its host name, it works fine.
I tried to restart the mongos server, but it does not come up.

Please help me.

Thanks  
Viren

Virendra Agarwal

May 3, 2016, 1:17:30 AM5/3/16
to mongodb-user

I checked the MongoDB source code and found this error raised here:

    // TODO: remove after v3.4.
    // This is for backwards compatibility with old style initialization through metadata
    // commands/setShardVersion. As well as all assignments to _initializationStatus and
    // _setInitializationState_inlock in this method.
    if (_getInitializationState() == InitializationState::kInitializing) {
        auto waitStatus = _waitForInitialization_inlock(deadline, lk);
        if (!waitStatus.isOK()) {
            return waitStatus;
        }
    }

    if (_getInitializationState() == InitializationState::kError) {
        return {ErrorCodes::ManualInterventionRequired,
                str::stream() << "Server's sharding metadata manager failed to initialize and will "
                                 "remain in this state until the instance is manually reset"
                              << causedBy(_initializationStatus)};
    }  
But it does not say what manual intervention is required.

Thanks  
Viren

Kevin Adistambha

May 8, 2016, 11:41:24 PM5/8/16
to mongodb-user

Hi Viren,

SHARDING [Balancer] caught exception while doing balance: Server’s sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFound: unable to resolve DNS for host confserv_1.xyz.com

I believe the main issue is the inability of the mongos process to connect to the config server confserv_1.xyz.com due to DNS issues. Is this a constant issue, or is it intermittent?

When i connect config server using host name it is working fine.

Did you try to connect to confserv_1.xyz.com from the machine that is hosting the mongos process? Also, how did you determine that the connection between the two machines is fine (i.e. using ping, connecting using the mongo shell, etc.)?
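
If it helps, here are example checks along those lines, run from the mongos machine (the hostname is taken from your error message above):

    # Ask the OS resolver (which typically consults /etc/hosts first, then DNS,
    # per /etc/nsswitch.conf) whether it can resolve the name
    getent hosts confserv_1.xyz.com

    # Query DNS directly, bypassing /etc/hosts and any local resolver cache
    nslookup confserv_1.xyz.com

    # Verify that an actual MongoDB connection can be opened to the config server
    mongo --host confserv_1.xyz.com --port 27017 --eval 'printjson(db.runCommand({ ping: 1 }))'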

I tried to restart the mongos server, but it does not come up.

Are there any error messages in the mongos log that show the reason why it cannot be started?

If you are still having issues, could you please provide the following (example commands for gathering these are sketched after the list):

  • your MongoDB version
  • your deployment topology (i.e. how many config servers, how many mongos, whether all mongos are having this issue, etc.)
  • the output of db.serverCmdLineOpts() from the mongos processes
  • the output of sh.status()
  • any error messages in the logs (mongod and mongos)
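
For reference, most of these can be gathered non-interactively from a shell on the mongos machine, along these lines (mongos1.xyz.com is a placeholder for your actual mongos host):

    # MongoDB version of the mongos
    mongo --host mongos1.xyz.com --port 27017 --eval 'print(db.version())'

    # Command line options the mongos was started with
    mongo --host mongos1.xyz.com --port 27017 --eval 'printjson(db.serverCmdLineOpts())'

    # Sharded cluster status (sh.status() prints its own output)
    mongo --host mongos1.xyz.com --port 27017 --eval 'sh.status()'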

Best regards,
Kevin

Virendra Agarwal

May 9, 2016, 12:28:20 AM5/9/16
to mongodb-user
Hi Kevin,
Thanks for responding on the thread; I really appreciate your kind response.
I believe the main issue is the inability of the mongos process to connect to the config server confserv_1.xyz.com due to DNS issues. Is this a constant issue, or is it intermittent?
This issue was not consistent; sometimes we saw it on a mongos and sometimes on a replica set member.
 
Did you try to connect to confserv_1.xyz.com from the machine that is hosting the mongos process? Also, how did you determine that the connection between the two machines is fine (i.e. using ping, connecting using the mongo shell, etc.)?

Yes, I tried ping, and then I connected to confserv_1.xyz.com from the same machine hosting the mongos server.

Are there any error messages in the mongos log that show the reason why it cannot be started?
The same error was there: after I confirmed the connection was fine, I tried to restart the server, but it gave me the same "host not resolved" error.
One more thing: we also compared the dbhash of all config servers, and they matched.
We restarted the whole cluster and the error was gone. But now we occasionally see the mongod service down on the config servers.
  • your MongoDB version
    • 3.2.3
  • your deployment topology (i.e. how many config servers, how many mongos, whether all mongos are having this issue, etc.)
    • 3 config servers, 4 mongos; yes, all servers showed the same issue.
mongos> db.serverCmdLineOpts();
{
        "argv" : [
                "/opt/mongodb/bin/mongos",
                "--config",
                "/opt/mongodb.conf",
                "--configdb",
                "confserv_1.xyz.com:27017,confserv_2.xyz.com:27017,confserv_3.xyz.com:27017",
                "--maxConns=20000",
                "--logpath=/opt/mongolog/log/mongodb.log",
                "--logappend"
        ],
        "parsed" : {
                "config" : "/opt/mongodb.conf",
                "net" : {
                        "http" : {
                                "enabled" : true
                        },
                        "maxIncomingConnections" : 20000
                },
                "sharding" : {
                        "configDB" : "confserv_1.xyz.com:27017,confserv_2.xyz.com:27017,confserv_3.xyz.com:27017"
                },
                "systemLog" : {
                        "destination" : "file",
                        "logAppend" : true,
                        "path" : "/opt/mongolog/log/mongodb.log"
                }
        },
        "ok" : 1
}
 
  • the output of sh.status()
Attached output. 
  • any error messages in the logs (mongod and mongos):

    [ReplicationExecutor] Error in heartbeat request to secondary-rep2:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
    SHARDING [Balancer] caught exception while doing balance: Server’s sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFound: unable to resolve DNS for host confserv_1.xyz.com



statusOutput.txt

Virendra Agarwal

May 9, 2016, 12:44:15 AM5/9/16
to mongodb-user
Just to add one more thing: we saw this issue again on one of our shard replicas.
The network connection from the primary to the replica was down for some time, and we restored it. But the primary could not connect to the restored secondary.

It always showed as not reachable in rs.status() until I manually stepped it down, restarted the mongo process, and then made it primary again.
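
Roughly, the sequence that cleared it was as follows (the hostname and service name here are placeholders for our actual setup):

    # On the stuck primary: step down so another member takes over.
    # The shell may report a network error as the connection drops; that is expected.
    mongo --host primary-rep1.xyz.com --port 27017 --eval 'rs.stepDown(120)'

    # Restart the mongod process on that machine (service name depends on the install)
    sudo service mongod restart

    # Once it has rejoined and caught up, step down the interim primary so the
    # original member can be elected primary again
    mongo --host secondary-rep2.xyz.com --port 27017 --eval 'rs.stepDown(60)'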

Kevin Adistambha

May 9, 2016, 2:32:38 AM5/9/16
to mongodb-user

Hi Viren,

This issue was not consistent; sometimes we saw it on a mongos and sometimes on a replica set member.

3 config servers, 4 mongos; yes, all servers showed the same issue.

The network connection from the primary to the replica was down for some time, and we restored it. But the primary could not connect to the restored secondary.

We restarted the whole cluster and the error was gone

From what I have seen so far, I believe the underlying issue is network connectivity/configuration within your deployment. My understanding so far is:

• all servers are experiencing the same issue intermittently
• the issue seems to be spread across the whole cluster (i.e. sometimes on mongos and other times on individual mongod)
• network connectivity issues within a replica set
• error messages in the form of “HostNotFound: unable to resolve DNS for host” or “Couldn’t get a connection within the time limit”
• restarting the cluster seems to solve the problem for a while (likely due to the refresh of the DNS cache)

All these signs seem to point to the issue being in your network setup (e.g. DNS setup, network hardware issues, etc.) and not in your MongoDB deployment. The output of sh.status() doesn’t seem to show any notable issue, and:

One more thing: we also compared the dbhash of all config servers, and they matched.

seems to indicate that the cluster is operating normally, the config servers are consistent with each other (which is vital to the operation of a sharded cluster), and cluster balancing seems to operate normally as well.
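
To narrow down the DNS angle, it may be worth comparing name resolution across every machine in the cluster, for example (the nscd step applies only if you run nscd as a local caching daemon):

    # Run on every mongos, mongod, and config server host;
    # the answers should be identical everywhere
    getent hosts confserv_1.xyz.com confserv_2.xyz.com confserv_3.xyz.com

    # If nscd caches lookups on these machines, invalidating its hosts cache
    # (rather than restarting the whole cluster) is a quick way to test the cache theory
    sudo nscd -i hosts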

Is there a pattern to this issue? For example, did you observe these network-related errors happening more often at particular times, under particular load, etc.?

On another note, I would recommend upgrading to the latest release in the 3.2 series, which is currently 3.2.6, for bugfixes and improvements.

Best regards,
Kevin

Virendra Agarwal

May 9, 2016, 2:42:52 AM5/9/16
to mongodb-user
Thanks, Kevin.

Yes, the update is in the plan; most probably we will do it today, as it is a drop-in update.

There is no fixed pattern to the network issue as far as I can tell.
My question is: after the network issue occurred and the servers stopped communicating, once everything came back online, why was it still showing a communication error?
For example, yesterday evening the secondary of one shard was not reachable and there was a heartbeat issue.
But once the secondary was up and reachable, rs.status() gave different results on the primary and the secondary.
The primary was still showing the secondary as not available, but the secondary's replica set status looked fine.
We tried to open a connection from the primary to the secondary and it worked.
The issue was fixed when we stepped down the primary, restarted it, and then made it primary again. (Surprise!)

Similarly, last time, when we restarted the whole cluster, the issue was gone.
I have attached the rs.status() output from both servers here.

Thanks
Virendra Agarwal

Virendra Agarwal

May 9, 2016, 2:43:22 AM5/9/16
to mongodb-user
status.txt

Matthieu Rigal

Jun 28, 2016, 12:34:17 PM6/28/16
to mongodb-user
Hi Virendra,

I just ran into this problem while trying to harden the security configuration. As in your case, I was able to connect to the config servers from all mongos instances.

In my case I was also testing a setup with replica set members in different datacenters, and I hit the problem only after stepping down some primaries.

The instability you may have in your network architecture is probably not responsible for these routing issues, but rather for the stepdowns. Your different mongo machines may have different name resolution configurations, and that is likely the root of the problem.

In the end I noticed that, contrary to what the error message suggests, the issue was happening on some primaries in one datacenter, which were not able to route back to the config server. After fixing the routing problem (eventually via /etc/hosts), no more problems occurred on the MongoDB side.
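
In case it helps someone later, pinning the names with an identical /etc/hosts entry on every machine is one way to take DNS out of the equation (the 10.0.0.x addresses below are placeholders; use your real ones):

    # Append the config server names to /etc/hosts on every mongos/mongod host,
    # keeping the file identical everywhere
    printf '%s\n' \
        '10.0.0.11 confserv_1.xyz.com' \
        '10.0.0.12 confserv_2.xyz.com' \
        '10.0.0.13 confserv_3.xyz.com' | sudo tee -a /etc/hosts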

Have fun fixing :)

Best, Matthieu

Virendra Agarwal

Jul 6, 2016, 2:24:59 AM7/6/16
to mongodb-user
Thanks, Rigal.

Regards
Virendra Agarwal