I run a minimalistic shard setup on AWS. Mongo is 2.0.1, the config
reads as follows
- 1 c1.medium instance for the configserver (port 27117)
- 1 m1.large instance for 2 shards (27217, 27218) and 1 mongos (port 27017)
There is no sharding on collection-level.
The setup is a part of a continuous delivery setup and re-created from
scratch every day.
During these daily cycles Mongo stopped "working" in 2 out of 6 runs
after some time.
This means that MongoDB was configured successful and could be
successfully used by our application (based on Scala/Java).
After a couple of hours MongoDB stops working. The components are still
alive (I check this with Icinga, which monitors the TCP ports).
mongodump gives me this:
> connected to: 127.0.0.1
> all dbs
> assertion: 13106 nextSafe(): { $err: "socket exception", code: 11002 }
The same error is printed, eg. when I use mongo shell and eg. issue
db.printShardingStatus().
The sad thing about this, that I was unable to resolve the issue by
restarting single components.
The only solution that worked to far, was a /complete/ cleanup, which meant:
1. stop config-server, rm -rf databases
2. stop shards, rm -rf databases
3. start configserver, start shards
4. re-add the shards
Obviously this is not the way to go ;-)
This is the log of the configserver on both situation. Also I must admit
I am unsure that this is the root, root cause, but these are the first
errors that show up in the logfile.
> Wed Oct 26 18:05:54 [conn10640] end connection 10.250.99.30:45994
> Wed Oct 26 18:05:55 [mongosMain] connection accepted from 10.250.99.30:46000 #10641
> Wed Oct 26 18:05:55 [conn10641] end connection 10.250.99.30:46000
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] Socket recv() errno:110 Connection timed out 10.59.58.70:27217
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] SocketException: remote: 10.59.58.70:27217 error: 9001 socket exception [1] server [10.59.58.70:27217]
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] DBClientCursor::init call() failed
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] WriteBackListener exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27217 query: { writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] Socket recv() errno:110 Connection timed out 10.59.58.70:27218
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] SocketException: remote: 10.59.58.70:27218 error: 9001 socket exception [1] server [10.59.58.70:27218]
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] DBClientCursor::init call() failed
> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] WriteBackListener exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27218 query: { writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> Wed Oct 26 18:06:02 [mongosMain] connection accepted from 10.250.99.30:46122 #10642
> Tue Nov 1 05:18:27 [conn499] end connection 10.56.49.22:35452
> Tue Nov 1 05:18:33 [mongosMain] connection accepted from 10.56.49.22:35458 #500
> Tue Nov 1 05:18:33 [conn500] end connection 10.56.49.22:35458
> Tue Nov 1 05:18:43 [Balancer] Socket recv() errno:113 No route to host 10.59.99.41:27217
> Tue Nov 1 05:18:43 [Balancer] SocketException: remote: 10.59.99.41:27217 error: 9001 socket exception [1] server [10.59.99.41:27217]
> Tue Nov 1 05:18:43 [Balancer] DBClientCursor::init call() failed
> Tue Nov 1 05:18:43 [Balancer] ~ScopedDbConnection: _conn != null
> Tue Nov 1 05:18:43 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error: ip-10-59-99-41.eu-west-1.compute.internal:27217 query: { features: 1 }
> Tue Nov 1 05:19:07 [mongosMain] connection accepted from 10.56.49.22:35513 #501
> Tue Nov 1 05:19:07 [conn501] end connection 10.56.49.22:35513
> Tue Nov 1 05:19:21 [mongosMain] connection accepted from 10.56.49.22:35532 #502
I am happy to provide more information and assist finding the issue.
Also complete logs from shards, configserver and mongos are available.
Any help and suggestions are welcome.
Cheers,
Jens
configserver: ip-10-56-49-22.eu-west-1.compute.internal
shardserver: ip-10-59-99-151.eu-west-1.compute.internal
It is also possible to reverse this information from the port I guess.
I run collectd to record ping-times between the shard-server and the config-server. The graphs look ok. In addition I
run Icinga, which also seemed happy. So overall the server can connect each other. No firewall/packet filter has been
changed in these setups. So I assume it should have never worked if this would be the root cause.
Configuration reads as follows:
configserver (started with /usr/bin/mongod -f /etc/mongod-configserver.conf --configsvr)
> cpu = true
> dbpath=/var/lib/mongod-configserver
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-configserver/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-configserver/mongod.pid
> port=27117
> smallfiles = true
shardserver (/usr/bin/mongod -f /etc/mongod-shard-27217.conf --shardsvr)
> cpu = true
> dbpath=/storage/mongod-shard-27217
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-shard-27217/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-shard-27217/mongod.pid
> port=27217
> smallfiles = true
shardserver (/usr/bin/mongod -f /etc/mongod-shard-27218.conf --shardsvr)
> cpu = true
> dbpath=/storage/mongod-shard-27218
> fork = true
> journal = true
> logappend=true
> logpath=/var/log/mongod-shard-27218/mongod.log
> noprealloc = true
> nssize = 2
> pidfilepath=/var/run/mongod-shard-27218/mongod.pid
> port=27218
> smallfiles = true
mongos
> configdb=mongo-configserver-host:27117
> logappend=true
> logpath=/var/log/mongos-mongos/mongod.log
> pidfilepath=/var/run/mongos-mongos/mongos.pid
> port=27017
The dns name "mongo-configserver-host" resolves:
> mongo-configserver-host.eu-west-1.compute.internal has address 10.56.49.22
The alias is provided by dnsmasq and configured in /etc/hosts.
I am happy to provide more information.
Cheers,
Jens
Am 01.11.11 12:11, schrieb Nat:
--
Jens Br�uer
Senior Systems Engineer
Dipl. Inf.
NumberFour AG
Sch�nhauser Allee 8
10119 Berlin
Germany
Mobile: +49 175 221 88 34
Phone: +49 30 40505411
Fax: +49 30 40505410
je...@numberfour.eu
numberfour.eu
facebook.com/NumberFour
twitter.com/numberfourag
Jens Bräuer
Senior Systems Engineer
Dipl. Inf.
NumberFour AG
Schönhauser Allee 8
10119 Berlin
Germany
Mobile: +49 175 221 88 34
Phone: +49 30 40505411
Fax: +49 30 40505410
je...@numberfour.eu
numberfour.eu
facebook.com/NumberFour
twitter.com/numberfourag
thanks for pointing this out. I'll investigate where this IP comes from ;-) Probably should not be there.
Thanks,
Jens
Am 01.11.11 15:51, schrieb Nat Luengnaruemitchai:
> Where do you have a setting for 10.59.99.41:27217 <http://10.59.99.41:27217/>? I can't see it anywhere.
> > recv() errno:113 No route to host 10.59.99.41:27217 <http://10.59.99.41:27217>
> >> Wed Oct 26 18:05:54 [conn10640] end connection 10.250.99.30:45994 <http://10.250.99.30:45994>
> >> Wed Oct 26 18:05:55 [mongosMain] connection accepted from 10.250.99.30:46000 <http://10.250.99.30:46000> #10641
> >> Wed Oct 26 18:05:55 [conn10641] end connection 10.250.99.30:46000 <http://10.250.99.30:46000>
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] Socket recv() errno:110
> Connection timed out 10.59.58.70:27217 <http://10.59.58.70:27217>
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] SocketException: remote:
> 10.59.58.70:27217 <http://10.59.58.70:27217> error: 9001 socket exception [1] server [10.59.58.70:27217
> <http://10.59.58.70:27217>]
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] DBClientCursor::init
> call() failed
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27217] WriteBackListener
> exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27217 query: {
> writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] Socket recv() errno:110
> Connection timed out 10.59.58.70:27218 <http://10.59.58.70:27218>
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] SocketException: remote:
> 10.59.58.70:27218 <http://10.59.58.70:27218> error: 9001 socket exception [1] server [10.59.58.70:27218
> <http://10.59.58.70:27218>]
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] DBClientCursor::init
> call() failed
> >> Wed Oct 26 18:05:57 [WriteBackListener-ip-10-59-58-70.eu-west-1.compute.internal:27218] WriteBackListener
> exception : DBClientBase::findN: transport error: ip-10-59-58-70.eu-west-1.compute.internal:27218 query: {
> writebacklisten: ObjectId('4ea61e7c7ff28104e1121726') }
> >> Wed Oct 26 18:06:02 [mongosMain] connection accepted from 10.250.99.30:46122 <http://10.250.99.30:46122> #10642
> >
> >> Tue Nov 1 05:18:27 [conn499] end connection 10.56.49.22:35452 <http://10.56.49.22:35452>
> >> Tue Nov 1 05:18:33 [mongosMain] connection accepted from 10.56.49.22:35458 <http://10.56.49.22:35458> #500
> >> Tue Nov 1 05:18:33 [conn500] end connection 10.56.49.22:35458 <http://10.56.49.22:35458>
> >> Tue Nov 1 05:18:43 [Balancer] Socket recv() errno:113 No route to host 10.59.99.41:27217 <http://10.59.99.41:27217>
> >> Tue Nov 1 05:18:43 [Balancer] SocketException: remote: 10.59.99.41:27217 <http://10.59.99.41:27217> error: 9001
> socket exception [1] server [10.59.99.41:27217 <http://10.59.99.41:27217>]
> >> Tue Nov 1 05:18:43 [Balancer] DBClientCursor::init call() failed
> >> Tue Nov 1 05:18:43 [Balancer] ~ScopedDbConnection: _conn != null
> >> Tue Nov 1 05:18:43 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error:
> ip-10-59-99-41.eu-west-1.compute.internal:27217 query: { features: 1 }
> >> Tue Nov 1 05:19:07 [mongosMain] connection accepted from 10.56.49.22:35513 <http://10.56.49.22:35513> #501
> >> Tue Nov 1 05:19:07 [conn501] end connection 10.56.49.22:35513 <http://10.56.49.22:35513>
> >> Tue Nov 1 05:19:21 [mongosMain] connection accepted from 10.56.49.22:35532 <http://10.56.49.22:35532> #502
> >
> > I am happy to provide more information and assist finding the issue.
> > Also complete logs from shards, configserver and mongos are available.
> >
> > Any help and suggestions are welcome.
> > Cheers,
> > Jens
> >
> >
> >
>
>
> --
> Jens Br�uer
> Senior Systems Engineer
> Dipl. Inf.
> NumberFour AG
> Sch�nhauser Allee 8
> 10119 Berlin
> Germany
> Mobile: +49 175 221 88 34 <tel:%2B49%20175%20221%2088%2034>
> Phone: +49 30 40505411 <tel:%2B49%2030%2040505411>
> Fax: +49 30 40505410 <tel:%2B49%2030%2040505410>
> je...@numberfour.eu <mailto:je...@numberfour.eu>
>
> numberfour.eu <http://numberfour.eu>
> facebook.com/NumberFour <http://facebook.com/NumberFour>
> twitter.com/numberfourag <http://twitter.com/numberfourag>
>
> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com <mailto:mongod...@googlegroups.com>.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com
> <mailto:mongodb-user%2Bunsu...@googlegroups.com>.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>
>
> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
--
Jens Br�uer
Senior Systems Engineer
Dipl. Inf.
NumberFour AG
Sch�nhauser Allee 8
thanks again for the hint. I turned out that the automated setup involved a "mongorestore". So far I have been un-aware,
that this also included the config-db, which configures the shard-server and thus bring the "old" shard-servers into the
new setup.
Cheers,
Jens
Am 01.11.11 18:37, schrieb Jens Braeuer: