mongos fails with "Name or service not known"

5,733 views
Skip to first unread message

Anton Volokhov

unread,
Oct 3, 2012, 5:53:56 AM10/3/12
to mongod...@googlegroups.com
Hi there!
I'm using mongodb 2.0.7 with 2 shards, each one is configured as a replica set with 3 instances.
I needed to reconfigure shard name identification, so I removed first shard(wait for the data draining, moved primaries, so on), changed its name and added it with new name.

Connection to the cluster is secured with keyFile and password.
I was just about to remove the second shard to do the same changes with it, when I experienced following exceptions in mongos:

Sun Sep 30 15:23:34 [conn271] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ParallelCursor::_
init @ mongodb-sh1/<...> attempt: 0
Sun Sep 30 15:23:34 [mongosMain] connection accepted from 77.88.18.6:54797 #447
Sun Sep 30 15:23:34 [conn447]  authenticate: { authenticate: 1, user: "rating", nonce: "7642e5d689939a9c", key: "2ee0b88e7572c5fac957830fa727075d" }
Sun Sep 30 15:23:36 [conn230] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Sun Sep 30 15:23:40 [conn380] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity ParallelCursor::_init @ mongodb-sh1/<...< attempt: 0
Sun Sep 30 15:23:41 [mongosMain] connection accepted from 77.88.18.6:54904 #448
Sun Sep 30 15:23:41 [conn448]  authenticate: { authenticate: 1, user: "rating", nonce: "83c1b541f74be683", key: "510ff1afd1d2cf9d41128135fdeccd43" }
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn327] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn327] warning: could not get last error from a shard host1g.load.net :: caused by :: socket exception
Received signal 6
Backtrace: 0x528c54 0x7ff4a8b77af0 0x7ff4a8b77a75 0x7ff4a8b7b5c0 0x7ff4a8b70941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7ff4a9ef89ca 0x7ff4a8c2a70d
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7ff4a8b77af0]
/lib/libc.so.6(gsignal+0x35)[0x7ff4a8b77a75]
/lib/libc.so.6(abort+0x180)[0x7ff4a8b7b5c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7ff4a8b70941]
mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7ff4a9ef89ca]
/lib/libc.so.6(clone+0x6d)[0x7ff4a8c2a70d]
===
 CursorCache at shutdown -  sharded: 1 passthrough: 2206


After I had restarted mongos, exceptions still occurred

***** SERVER RESTARTED *****

Mon Oct  1 14:25:02 [ReplicaSetMonitorWatcher] Socket recv() timeout  95.108.235.171:27017
Mon Oct  1 14:25:02 [ReplicaSetMonitorWatcher] SocketException: remote: 95.108.235.171:27017 error: 9001 socket exception [3] server [95.108.235.171:27017]
Mon Oct  1 14:25:02 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Oct  1 14:25:02 [ReplicaSetMonitorWatcher] trying reconnect to host2f.load.net:27017
Mon Oct  1 14:25:02 [ReplicaSetMonitorWatcher] reconnect host2f.load.net:27017 ok
Mon Oct  1 14:25:42 [conn6] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:26:11 [conn148] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ParallelCursor::_
init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:26:18 [conn41] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: vertis-rating.entity_attr ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:26:49 [conn150] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ClusteredCursor::
_checkCursor @ rset02/<...> attempt: 0

Mon Oct  1 14:26:49 [conn150] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity_attr ClusteredCursor::
_checkCursor @ rset02/<...> attempt: 1
Mon Oct  1 14:26:51 [WriteBackListener-host2f:27017] ChunkManager: time to load chunks for musicgroup.entity: 4ms sequenceNumber: 12 version: 416|1
Mon Oct  1 14:26:51 [conn150] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity_attr ParallelCursor::_
init @ <...> attempt: 2
Mon Oct  1 14:27:14 [conn150] SyncClusterConnection connecting to [conf1g:27018]
Mon Oct  1 14:27:14 [conn150] SyncClusterConnection connecting to [conf1d:27018]
Mon Oct  1 14:27:14 [conn150] SyncClusterConnection connecting to [conf1f:27018]
Mon Oct  1 14:27:16 [conn142] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity_attr ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:27:19 [conn86] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity_attr ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:27:24 [conn93] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ParallelCursor::_i
nit @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:27:46 [ReplicaSetMonitorWatcher] Socket recv() timeout  192.168.1.101:27017
Mon Oct  1 14:27:46 [ReplicaSetMonitorWatcher] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [3] server [192.168.1.101:27017]
Mon Oct  1 14:27:46 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Oct  1 14:27:50 [conn149] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ParallelCursor::_
init @ mongodb-sh1/<...> attempt: 0
Mon Oct  1 14:27:56 [ReplicaSetMonitorWatcher] trying reconnect to host2d.load.net:27017
Mon Oct  1 14:27:56 [ReplicaSetMonitorWatcher] reconnect host2d.load.net:27017 ok

At this moment I realized, that my loglevel is 0. So I changed it to 5 to see, what's happening.
So, I performed db.runCommand("{setParameter:1,logLevel:5}")
And right after this command, this exception disappeared. Even sharded queries are processed properly:

Mon Oct  1 14:28:20 [conn90] Request::process ns: musicgroup.entity msg id:2954917 attempt: 0
Mon Oct  1 14:28:20 [conn90] shard query: musicgroup.entity  { query: { _id: { $in: <...IDS...> } }, $readPreference: { mode: "primary" } }
Mon Oct  1 14:28:20 [conn90]  shard query servers: 2
       server:mongodb-sh1/<...> _extra:{} _orderObject:{}
       server:rset02/<...> _extra:{} _orderObject:{}

Mongos is still OK, but it seems to be a bug here.
Does anybody have some ideas?
Thanks!

Anton Volokhov

unread,
Oct 3, 2012, 10:54:23 AM10/3/12
to mongod...@googlegroups.com
Expirienced one more mongos crush, with logLevel 5 now.

Wed Oct  3 18:28:32 [conn406] Request::process ns: chords.entity msg id:169528 attempt: 0
Wed Oct  3 18:28:32 [conn406] CursorCache::get id: 1058686756268338652
Wed Oct  3 18:28:32 [conn302]  shard query servers: 2

       server:mongodb-sh1/<...> _extra:{} _orderObject:{}
       server:rset02/<...> _extra:{} _orderObject:{}

Wed Oct  3 18:28:32 [conn302] _check : mongodb-sh1/<...>
Wed Oct  3 18:28:32 [conn273] ReplicaSetMonitor::_checkConnection: host1d.load.net:27017 { setName: "mongodb-sh1", ismaster: false, secondary: true, hosts: [ "host1d.load.net:27017", "host1f.load.net:27017", "host1g.load.net:27017" ], primary: "host1f.load.net:27017", me: "host1d.load.net:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
Wed Oct  3 18:28:32 [conn448] Request::process ns: admin.$cmd msg id:207996 attempt: 0
Wed Oct  3 18:28:32 [conn448] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
Wed Oct  3 18:28:32 [conn448] Request::process ns: admin.$cmd msg id:207997 attempt: 0
Wed Oct  3 18:28:32 [conn448] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
Wed Oct  3 18:28:32 [conn273] ReplicaSetMonitor::_checkConnection: host1f.load.net:27017 { setName: "mongodb-sh1", ismaster: true, secondary: false, hosts: [ "host1f.load.net:27017", "host1d.load.net:27017", "host1g.load.net:27017" ], primary: "host1f.load.net:27017", me: "host1f.load.net:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
Wed Oct  3 18:28:32 [conn273] Request::process ns: rating.$cmd msg id:169526 attempt: 0
Wed Oct  3 18:28:32 [conn273] single query: rating.$cmd  { getlasterror: 1, w: 2, wtimeout: 0 }  ntoreturn: -1 options : 0
Wed Oct  3 18:28:32 [conn273] creating new connection to:host1d.load.net:27017
Wed Oct  3 18:28:32 [conn273] getaddrinfo("host1d.load.net") failed: Name or service not known
Wed Oct  3 18:28:32 [conn273] warning: could not get last error from a shard host1d.load.net:27017 :: caused by :: socket exception
Wed Oct  3 18:28:32 [conn393] Request::process ns: chords.entity msg id:169529 attempt: 0
Wed Oct  3 18:28:32 [conn393] CursorCache::get id: 4966267460045049647
Wed Oct  3 18:28:32 [conn446] Request::process ns: admin.$cmd msg id:183399 attempt: 0
Wed Oct  3 18:28:32 [conn446] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
Received signal 6
Backtrace: 0x528c54 0x7fed6cc0faf0 0x7fed6cc0fa75 0x7fed6cc135c0 0x7fed6cc08941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7fed6df909ca 0x7fed6ccc270d
Wed Oct  3 18:28:32 [conn446] Request::process ns: admin.$cmd msg id:183400 attempt: 0
Wed Oct  3 18:28:32 [conn446] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7fed6cc0faf0]
/lib/libc.so.6(gsignal+0x35)[0x7fed6cc0fa75]
/lib/libc.so.6(abort+0x180)[0x7fed6cc135c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7fed6cc08941]

mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7fed6df909ca]
/lib/libc.so.6(clone+0x6d)[0x7fed6ccc270d]
===
 CursorCache at shutdown -  sharded: 2 passthrough: 1900


среда, 3 октября 2012 г., 13:53:56 UTC+4 пользователь Anton Volokhov написал:

Brian McNamara

unread,
Oct 3, 2012, 12:02:16 PM10/3/12
to mongod...@googlegroups.com
Hi Anton,

It looks like there's an issue communicating with one of the nodes (host1d.load.net:27017). Can you resolve the hostname?

Regards,
Brian

Anton Volokhov

unread,
Oct 4, 2012, 7:15:17 AM10/4/12
to mongod...@googlegroups.com
Obviously, yes. It's resolvable, pingable and sshable all the time.

среда, 3 октября 2012 г., 20:02:16 UTC+4 пользователь Brian McNamara написал:

Brian McNamara

unread,
Oct 4, 2012, 10:46:29 AM10/4/12
to mongod...@googlegroups.com
Hi Anton,

There was an exception thrown in the log indicating a failure to resolve host1d.load.net - just wanted to use that as a starting point. 

Wed Oct  3 18:28:32 [conn273] creating new connection to:host1d.load.net:27017
Wed Oct  3 18:28:32 [conn273] getaddrinfo("host1d.load.net") failed: Name or service not known
Wed Oct  3 18:28:32 [conn273] warning: could not get last error from a shard host1d.load.net:27017 :: caused by :: socket exception
Wed Oct  3 18:28:32 [conn393] Request::process ns: chords.entity msg id:169529 attempt: 0
Wed Oct  3 18:28:32 [conn393] CursorCache::get id: 4966267460045049647

Regards,
Brian

Anton Volokhov

unread,
Oct 4, 2012, 11:02:00 AM10/4/12
to mongod...@googlegroups.com
well, found a similar problem with stale config (https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/7Y9Rg-MCYZk)
So, the only question now is why mongos crashed twice.
I'm pretty sure, that all mongod instances were up when mongod fails with exception.
And every time it was new unresolved host(1f and 1g the first time, 1d the second).
These instances are located in different physical machines.

четверг, 4 октября 2012 г., 18:46:29 UTC+4 пользователь Brian McNamara написал:

Sam Helman

unread,
Oct 8, 2012, 3:41:30 PM10/8/12
to mongod...@googlegroups.com
It is possible that an old configuration across the config servers caused the mongos to experience a problem with routing requests.  If the issue comes up in the future, it might be advisable to take the route that was taken in http://stackoverflow.com/questions/8563600/mongos-stale-config-error.
Reply all
Reply to author
Forward
0 new messages