Hi Viren,
SHARDING [Balancer] caught exception while doing balance: Server’s sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFound: unable to resolve DNS for host confserv_1.xyz.com
I believe the main issue is the inability of the mongos process to connect to the config server confserv_1.xyz.com due to DNS issues. Is this a constant issue, or is it intermittent?
When I connect to the config server using the host name, it works fine.
Did you try to connect to confserv_1.xyz.com from the machine that is hosting the mongos process? Also, how did you determine that the connection between the two machines is fine (e.g. using ping, connecting using the mongo shell, etc.)?
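For reference, a check like the one asked about above could be sketched as follows. It is only a minimal illustration: it assumes Python 3 is available on the machine hosting the mongos, the host names and port are taken from the --configdb value posted later in this thread, and the function names check_host and main are hypothetical.

```python
import socket

# Host names and port taken from the --configdb value in this thread.
CONFIG_SERVERS = [
    ("confserv_1.xyz.com", 27017),
    ("confserv_2.xyz.com", 27017),
    ("confserv_3.xyz.com", 27017),
]


def check_host(host, port, timeout=3.0,
               resolve=socket.getaddrinfo,
               connect=socket.create_connection):
    """Resolve host via DNS, then try a TCP connection.

    Returns a (status, detail) tuple where status is one of
    "dns_failed", "connect_failed" or "ok".
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Corresponds to the HostNotFound error in the mongos log.
        return ("dns_failed", str(exc))
    address = infos[0][4][0]  # first resolved IP address
    try:
        conn = connect((address, port), timeout=timeout)
    except OSError as exc:
        return ("connect_failed", f"{address}: {exc}")
    conn.close()
    return ("ok", address)


def main():
    # Run this on the machine that hosts the mongos process.
    for host, port in CONFIG_SERVERS:
        print(host, check_host(host, port))
```

Calling main() repeatedly (for example from cron) on the mongos host may help distinguish a constant DNS failure from an intermittent one.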
I tried to restart the mongos server, but it is not coming up.
Are there any error messages in the mongos log that show the reason why it cannot be started?
If you are still having issues, could you please provide:
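- details of your deployment (e.g. the number of mongos, whether all mongos are having this issue, etc.)
- the output of db.serverCmdLineOpts() from the mongos processes
- the output of sh.status()
- any error messages in the logs (mongod and mongos)
Best regards,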
Kevin
mongos> db.serverCmdLineOpts();
{
    "argv" : [
        "/opt/mongodb/bin/mongos",
        "--config",
        "/opt/mongodb.conf",
        "--configdb",
        "confserv_1.xyz.com:27017,confserv_2.xyz.com:27017,confserv_3.xyz.com:27017",
        "--maxConns=20000",
        "--logpath=/opt/mongolog/log/mongodb.log",
        "--logappend"
    ],
    "parsed" : {
        "config" : "/opt/mongodb.conf",
        "net" : {
            "http" : {
                "enabled" : true
            },
            "maxIncomingConnections" : 20000
        },
        "sharding" : {
            "configDB" : "confserv_1.xyz.com:27017,confserv_2.xyz.com:27017,confserv_3.xyz.com:27017"
        },
        "systemLog" : {
            "destination" : "file",
            "logAppend" : true,
            "path" : "/opt/mongolog/log/mongodb.log"
        }
    },
    "ok" : 1
}
- the output of sh.status()
- any error messages in the logs (mongod and mongos)
- [ReplicationExecutor] Error in heartbeat request to secondary-rep2:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
Hi Viren,
This issue was not consistent: sometimes we see it on mongos, and sometimes on the replica set.
3 config servers, 4 mongos. Yes, all servers showed the same issue.
The network connection from the primary to the replica was down for some time and we restored it. But the primary could not connect to the restored secondary.
We restarted the whole cluster, and then this error was gone.
From what I have seen so far, I believe the underlying issue is network connectivity/configuration within your deployment. My understanding so far is:
- the issue is intermittent (sometimes on mongos and other times on individual mongod)
All these signs seem to point to an issue in your network setup (e.g. DNS setup, network hardware issues, etc.) and not in your MongoDB deployment. The output of sh.status() doesn’t seem to show any notable issue, and:
One more thing we also confirmed the dbhash of all config servers and it was all fine.
seems to indicate that the cluster is operating normally, the config servers are consistent with each other (which is vital to the operation of a sharded cluster), and cluster balancing seems to operate normally as well.
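As an illustration of the dbhash consistency check mentioned above, here is a rough sketch. It assumes the pymongo driver is installed and that each config server can be reached directly; the function names config_md5 and group_by_md5 are hypothetical, and only the dbHash command and its md5 field come from MongoDB itself.

```python
def config_md5(host, port=27017):
    """Fetch the dbHash md5 of the config database from one config server.

    Assumes pymongo is installed; it is imported lazily so that the
    comparison helper below can be used without it.
    """
    from pymongo import MongoClient

    client = MongoClient(host, port, serverSelectionTimeoutMS=5000)
    try:
        return client["config"].command("dbHash")["md5"]
    finally:
        client.close()


def group_by_md5(md5_by_host):
    """Group host names by their md5 value.

    A single group means all config servers agree; more than one group
    means at least one server holds different config data.
    """
    groups = {}
    for host, md5 in md5_by_host.items():
        groups.setdefault(md5, []).append(host)
    return groups
```

For example, building {host: config_md5(host)} for the three config servers and passing it to group_by_md5 should yield exactly one group when they are consistent.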
Is there a pattern to this issue? For example, did you observe these network-related errors happening more often at particular times, under particular load, etc.?
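One way to look for such a pattern is to count the network-related errors per hour in the logs. The sketch below is only an assumption-laden example: it assumes each log line starts with the default ISO-8601 timestamp, it treats the two error names seen in this thread as the markers to search for, and the function names errors_per_hour and report are hypothetical.

```python
import re
from collections import Counter

# Error names seen in this thread; the list is an assumption and can be extended.
NETWORK_MARKERS = ("HostNotFound", "ExceededTimeLimit")

# Capture the date and hour from a leading ISO-8601 timestamp,
# e.g. "2016-05-02T10" from "2016-05-02T10:15:30.123+0000 ...".
TIMESTAMP_HOUR = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")


def errors_per_hour(lines):
    """Count lines containing a network marker, bucketed by date and hour."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP_HOUR.match(line)
        if match and any(marker in line for marker in NETWORK_MARKERS):
            counts[match.group(1)] += 1
    return counts


def report(path):
    """Print one line per hour that had at least one network-related error."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for hour, count in sorted(errors_per_hour(log).items()):
            print(hour, count)
```

Calling report("/opt/mongolog/log/mongodb.log") (the log path shown in the db.serverCmdLineOpts() output) on each mongos and mongod host would show whether the errors cluster around particular hours.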
On another note, I would recommend upgrading to the latest release in the 3.2 series (currently 3.2.6) for bug fixes and improvements.
Best regards,
Kevin