MongoDB cluster unresponsive: server: "error: 9001 socket exception [1] server", mongo shell: "assertion: 13106 nextSafe(): { $err: "socket exception", code: 11002 }"

2,460 views
Skip to first unread message

Jens Braeuer

unread,
Nov 1, 2011, 7:05:21 AM11/1/11
to mongod...@googlegroups.com
Hi everyone,

I run a minimalistic shard setup on AWS. Mongo is 2.0.1, the config
reads as follows
- 1 c1.medium instance for the configserver (port 27117)
- 1 m1.large instance for 2 shards (27217, 27218) and 1 mongos (port 27017)
There is no sharding on collection-level.

The setup is a part of a continuous delivery setup and re-created from
scratch every day.

During these daily cycles Mongo stopped "working" in 2 out of 6 runs
after some time.
This means that MongoDB was configured successful and could be
successfully used by our application (based on Scala/Java).
After a couple of hours MongoDB stops working. The components are still
alive (I check this with Icinga, which monitors the TCP ports).

mongodump gives me this:
> connected to: 127.0.0.1
> all dbs
> assertion: 13106 nextSafe(): { $err: "socket exception", code: 11002 }

The same error is printed, eg. when I use mongo shell and eg. issue
db.printShardingStatus().

The sad thing about this, that I was unable to resolve the issue by
restarting single components.
The only solution that worked to far, was a /complete/ cleanup, which meant:
1. stop config-server, rm -rf databases
2. stop shards, rm -rf databases
3. start configserver, start shards
4. re-add the shards

Obviously this is not the way to go ;-)

This is the log of the configserver on both situation. Also I must admit
I am unsure that this is the root, root cause, but these are the first
errors that show up in the logfile.

> Wed Oct 26 18:05:54 [conn10640] end connection 10.250.99.30:45994
> Wed Oct 26 18:05:55 [mongosMain] connection accepted from 10.250.99.30:46000 #10641
> Wed Oct 26 18:05:55 [conn10641] end connection 10.250.99.30:46000
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] Socket recv() errno:110 Connection timed out 10.59.58.70:27217
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] SocketException: remote: 10.59.58.70:27217 error: 9001 socket exception [1] server [10.59.58.70:27217]
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] DBClientCursor::init call() failed
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] WriteBackListener exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27217 query: { writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] Socket recv() errno:110 Connection timed out 10.59.58.70:27218
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] SocketException: remote: 10.59.58.70:27218 error: 9001 socket exception [1] server [10.59.58.70:27218]
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] DBClientCursor::init call() failed
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] WriteBackListener exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27218 query: { writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> Wed Oct 26 18:06:02 [mongosMain] connection accepted from 10.250.99.30:46122 #10642

> Tue Nov 1 05:18:27 [conn499] end connection 10.56.49.22:35452
> Tue Nov 1 05:18:33 [mongosMain] connection accepted from 10.56.49.22:35458 #500
> Tue Nov 1 05:18:33 [conn500] end connection 10.56.49.22:35458
> Tue Nov 1 05:18:43 [Balancer] Socket recv() errno:113 No route to host 10.59.99.41:27217
> Tue Nov 1 05:18:43 [Balancer] SocketException: remote: 10.59.99.41:27217 error: 9001 socket exception [1] server [10.59.99.41:27217]
> Tue Nov 1 05:18:43 [Balancer] DBClientCursor::init call() failed
> Tue Nov 1 05:18:43 [Balancer] ~ScopedDbConnection: _conn != null
> Tue Nov 1 05:18:43 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error: ip-10-59-99-41.eu-west-1.compute.internal:27217 query: { features: 1 }
> Tue Nov 1 05:19:07 [mongosMain] connection accepted from 10.56.49.22:35513 #501
> Tue Nov 1 05:19:07 [conn501] end connection 10.56.49.22:35513
> Tue Nov 1 05:19:21 [mongosMain] connection accepted from 10.56.49.22:35532 #502

I am happy to provide more information and assist finding the issue.
Also complete logs from shards, configserver and mongos are available.

Any help and suggestions are welcome.
Cheers,
Jens

Nat

unread,
Nov 1, 2011, 7:11:28 AM11/1/11
to mongod...@googlegroups.com
Can you provide each ip address and how yoy configure it? I notice that mongos cannot access one of your box. Did you make sure that each server can see one another?

recv() errno:113 No route to host 10.59.99.41:27217

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Jens Braeuer

unread,
Nov 1, 2011, 9:27:48 AM11/1/11
to mongod...@googlegroups.com
This is my configuration:

configserver: ip-10-56-49-22.eu-west-1.compute.internal
shardserver: ip-10-59-99-151.eu-west-1.compute.internal

It is also possible to reverse this information from the port I guess.

I run collectd to record ping-times between the shard-server and the config-server. The graphs look ok. In addition I
run Icinga, which also seemed happy. So overall the server can connect each other. No firewall/packet filter has been
changed in these setups. So I assume it should have never worked if this would be the root cause.

Configuration reads as follows:

configserver (started with /usr/bin/mongod -f /etc/mongod-configserver.conf --configsvr)
> cpu = true
> dbpath=/var/lib/mongod-configserver
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-configserver/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-configserver/mongod.pid
> port=27117
> smallfiles = true

shardserver (/usr/bin/mongod -f /etc/mongod-shard-27217.conf --shardsvr)
> cpu = true
> dbpath=/storage/mongod-shard-27217
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-shard-27217/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-shard-27217/mongod.pid
> port=27217
> smallfiles = true

shardserver (/usr/bin/mongod -f /etc/mongod-shard-27218.conf --shardsvr)
> cpu = true
> dbpath=/storage/mongod-shard-27218
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-shard-27218/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-shard-27218/mongod.pid
> port=27218
> smallfiles = true

mongos
> configdb=mongo-configserver-host:27117
> logappend=true
> logpath=/var/log/mongos-mongos/mongod.log
> pidfilepath=/var/run/mongos-mongos/mongos.pid
> port=27017

The dns name "mongo-configserver-host" resolves:
> mongo-configserver-host.eu-west-1.compute.internal has address 10.56.49.22

The alias is provided by dnsmasq and configured in /etc/hosts.

I am happy to provide more information.
Cheers,
Jens


Am 01.11.11 12:11, schrieb Nat:


--
Jens Br�uer
Senior Systems Engineer
Dipl. Inf.
NumberFour AG
Sch�nhauser Allee 8
10119 Berlin
Germany
Mobile: +49 175 221 88 34
Phone: +49 30 40505411
Fax: +49 30 40505410
je...@numberfour.eu

numberfour.eu
facebook.com/NumberFour
twitter.com/numberfourag

Nat Luengnaruemitchai

unread,
Nov 1, 2011, 10:51:43 AM11/1/11
to mongod...@googlegroups.com
Where do you have a setting for 10.59.99.41:27217? I can't see it anywhere.

Jens Bräuer

Senior Systems Engineer
Dipl. Inf.
NumberFour AG
Schönhauser Allee 8

Jens Braeuer

unread,
Nov 1, 2011, 1:37:58 PM11/1/11
to mongod...@googlegroups.com
Hi Nat,

thanks for pointing this out. I'll investigate where this IP comes from ;-) Probably should not be there.

Thanks,
Jens

Am 01.11.11 15:51, schrieb Nat Luengnaruemitchai:
> Where do you have a setting for 10.59.99.41:27217 <http://10.59.99.41:27217/>? I can't see it anywhere.

> > recv() errno:113 No route to host 10.59.99.41:27217 <http://10.59.99.41:27217>

> >> Wed Oct 26 18:05:54 [conn10640] end connection 10.250.99.30:45994 <http://10.250.99.30:45994>
> >> Wed Oct 26 18:05:55 [mongosMain] connection accepted from 10.250.99.30:46000 <http://10.250.99.30:46000> #10641
> >> Wed Oct 26 18:05:55 [conn10641] end connection 10.250.99.30:46000 <http://10.250.99.30:46000>


> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] Socket recv() errno:110

> Connection timed out 10.59.58.70:27217 <http://10.59.58.70:27217>


> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] SocketException: remote:

> 10.59.58.70:27217 <http://10.59.58.70:27217> error: 9001 socket exception [1] server [10.59.58.70:27217
> <http://10.59.58.70:27217>]


> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] DBClientCursor::init
> call() failed
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] WriteBackListener
> exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27217 query: {
> writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] Socket recv() errno:110

> Connection timed out 10.59.58.70:27218 <http://10.59.58.70:27218>


> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] SocketException: remote:

> 10.59.58.70:27218 <http://10.59.58.70:27218> error: 9001 socket exception [1] server [10.59.58.70:27218
> <http://10.59.58.70:27218>]


> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] DBClientCursor::init
> call() failed
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] WriteBackListener
> exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27218 query: {
> writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }

> >> Wed Oct 26 18:06:02 [mongosMain] connection accepted from 10.250.99.30:46122 <http://10.250.99.30:46122> #10642
> >
> >> Tue Nov 1 05:18:27 [conn499] end connection 10.56.49.22:35452 <http://10.56.49.22:35452>
> >> Tue Nov 1 05:18:33 [mongosMain] connection accepted from 10.56.49.22:35458 <http://10.56.49.22:35458> #500
> >> Tue Nov 1 05:18:33 [conn500] end connection 10.56.49.22:35458 <http://10.56.49.22:35458>
> >> Tue Nov 1 05:18:43 [Balancer] Socket recv() errno:113 No route to host 10.59.99.41:27217 <http://10.59.99.41:27217>
> >> Tue Nov 1 05:18:43 [Balancer] SocketException: remote: 10.59.99.41:27217 <http://10.59.99.41:27217> error: 9001
> socket exception [1] server [10.59.99.41:27217 <http://10.59.99.41:27217>]


> >> Tue Nov 1 05:18:43 [Balancer] DBClientCursor::init call() failed
> >> Tue Nov 1 05:18:43 [Balancer] ~ScopedDbConnection: _conn != null
> >> Tue Nov 1 05:18:43 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error:
> ip-10-59-99-41.eu-west-1.compute.internal:27217 query: { features: 1 }

> >> Tue Nov 1 05:19:07 [mongosMain] connection accepted from 10.56.49.22:35513 <http://10.56.49.22:35513> #501
> >> Tue Nov 1 05:19:07 [conn501] end connection 10.56.49.22:35513 <http://10.56.49.22:35513>
> >> Tue Nov 1 05:19:21 [mongosMain] connection accepted from 10.56.49.22:35532 <http://10.56.49.22:35532> #502


> >
> > I am happy to provide more information and assist finding the issue.
> > Also complete logs from shards, configserver and mongos are available.
> >
> > Any help and suggestions are welcome.
> > Cheers,
> > Jens
> >
> >
> >
>
>
> --

> Jens Br�uer


> Senior Systems Engineer
> Dipl. Inf.
> NumberFour AG

> Sch�nhauser Allee 8
> 10119 Berlin
> Germany
> Mobile: +49 175 221 88 34 <tel:%2B49%20175%20221%2088%2034>
> Phone: +49 30 40505411 <tel:%2B49%2030%2040505411>
> Fax: +49 30 40505410 <tel:%2B49%2030%2040505410>
> je...@numberfour.eu <mailto:je...@numberfour.eu>
>
> numberfour.eu <http://numberfour.eu>
> facebook.com/NumberFour <http://facebook.com/NumberFour>
> twitter.com/numberfourag <http://twitter.com/numberfourag>


>
> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.

> To post to this group, send email to mongod...@googlegroups.com <mailto:mongod...@googlegroups.com>.


> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com

> <mailto:mongodb-user%2Bunsu...@googlegroups.com>.


> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>
>
> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.


--
Jens Br�uer


Senior Systems Engineer
Dipl. Inf.
NumberFour AG

Sch�nhauser Allee 8

Jens Braeuer

unread,
Nov 4, 2011, 7:51:38 PM11/4/11
to mongod...@googlegroups.com, nat....@gmail.com
Hi Nat,

thanks again for the hint. I turned out that the automated setup involved a "mongorestore". So far I have been un-aware,
that this also included the config-db, which configures the shard-server and thus bring the "old" shard-servers into the
new setup.

Cheers,
Jens

Am 01.11.11 18:37, schrieb Jens Braeuer:

Reply all
Reply to author
Forward
0 new messages