We've been adding a lot of replica nodes to our 1.8.2 cluster as phase
one of a cluster move. The replica sets themselves are part of a
sharded cluster with three configdbs.
Adding nodes causes some (but not all) of the mongos processes to fail:
[mpliveappapi01] run: echo show dbs | bin/mongo
[mpliveappapi01] out: MongoDB shell version: 1.8.2
[mpliveappapi01] out: connecting to: test
[mpliveappapi01] out: > show dbs
[mpliveappapi01] out: Fri Jul 8 21:23:36 uncaught exception: listDatabases failed:{
[mpliveappapi01] out: "assertion" : "DBClientBase::findOne: transport error: [ip of newly added node]:27017 query: { listDatabases: 1 }",
[mpliveappapi01] out: "assertionCode" : 10276,
[mpliveappapi01] out: "errmsg" : "db assertion failure",
[mpliveappapi01] out: "ok" : 0
[mpliveappapi01] out: }
[mpliveappapi01] out: > bye
All queries on these failed mongos processes return similar errors, and
they never reconnect or recover. Bouncing the failed mongos processes
fixes the problem (until the next time a replica set is changed). The
mongos logs contain no unusual text, but there is a large run (62K) of
null bytes near the beginning of the file. The end of the log file
usually reads:
Fri Jul 8 04:02:15 [mongosMain] connection accepted from 127.0.0.1:8816 #31
Fri Jul 8 04:02:15 [WriteBackListener] WriteBackListener exception : socket exception
Fri Jul 8 04:02:16 [conn31] MessagingPort recv() errno:104 Connection reset by peer [ip of the master of the replica set that just got changed]:27017
Fri Jul 8 04:02:16 [conn31] SocketException: remote: error: 9001 socket exception [1]
Fri Jul 8 04:02:16 [conn31] DBClientCursor::init call() failed
Fri Jul 8 04:02:17 [conn31] end connection 127.0.0.1:8816
Fri Jul 8 04:02:23 [WriteBackListener] WriteBackListener exception : socket exception
Fri Jul 8 04:02:26 [mongosMain] dbexit: received signal 2 rc:0
received signal 2
The connections from localhost here are from me running the "show dbs"
command above.
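For anyone wanting to check their own logs for the same symptom, here is a rough sketch of how to count the null bytes near the start of a mongos log. The path and the demo file are placeholders, not the actual log from this report:

```shell
#!/bin/sh
# Sketch: count NUL bytes in the first 64 KiB of a mongos log.
# "demo-mongos.log" is a placeholder; point it at your real log file.
log=demo-mongos.log

# Build a demo file containing a run of NUL bytes like the ones observed:
printf 'normal log line\n' > "$log"
head -c 1024 /dev/zero >> "$log"
printf 'more log text\n' >> "$log"

# Keep only NUL bytes (-d delete, -c complement) and count them:
head -c 65536 "$log" | tr -dc '\0' | wc -c
```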
On my latest replica set addition, however, some of the mongos
processes failed as usual, but one of them hung and did not respond to
"kill -2 [pid]". Inspecting the log file revealed the usual null bytes
plus:
Backtrace: 0x52f8f5 0x7fb6d2e6eaf0 0x5523de 0x557ec5 0x50454b 0x505e04 0x6a50a0 0x7fb6d39729ca 0x7fb6d2f2170d
prod/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x52f8f5]
/lib/libc.so.6(+0x33af0)[0x7fb6d2e6eaf0]
prod/bin/mongos(_ZN5mongo17ReplicaSetMonitor8checkAllEv+0x23e)[0x5523de]
prod/bin/mongos(_ZN5mongo24ReplicaSetMonitorWatcher3runEv+0x55)[0x557ec5]
prod/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b)[0x50454b]
prod/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74)[0x505e04]
prod/bin/mongos(thread_proxy+0x80)[0x6a50a0]
/lib/libpthread.so.0(+0x69ca)[0x7fb6d39729ca]
/lib/libc.so.6(clone+0x6d)[0x7fb6d2f2170d]
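For readability, the mangled C++ frames can be demangled with c++filt (part of GNU binutils, assumed installed here). The two interesting frames point at the replica set monitor:

```shell
#!/bin/sh
# Sketch: demangle the two key frames from the backtrace with c++filt.
printf '%s\n%s\n' \
  _ZN5mongo17ReplicaSetMonitor8checkAllEv \
  _ZN5mongo24ReplicaSetMonitorWatcher3runEv | c++filt
# mongo::ReplicaSetMonitor::checkAll()
# mongo::ReplicaSetMonitorWatcher::run()
```

So the hang appears to be inside mongos's replica-set-monitoring background thread while it rechecks the set members.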