Lots of error: 9001 socket exceptions

Daniel

Sep 17, 2012, 8:20:38 AM

to mongod...@googlegroups.com

Hi,

I am seeing lots of "error: 9001 socket exception [3] server" exceptions. All the servers are reachable.

I have already restarted the config servers and the mongos servers as well.

The main problem is that the mongos instances die frequently, and I don't know why.

Maybe this is related to a network problem I had: our hosting facility broke their DNS server, so no domain name could be resolved for a couple of hours. Could this destroy the whole MongoDB cluster?

I also see warnings like:

warning: distributed lock pinger 'sever1:27019,server2:27019,server3:27019/mongosserver1:27017:1347879109:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 10276 DBClientBase::findN: transport error: server1:27019 ns: admin.$cmd query: { fsync: 1 } server1:27019:{}

Thanks & regards

Daniel

Sep 17, 2012, 9:14:36 AM

to mongod...@googlegroups.com

Update: This is the error which causes the mongos to crash:

Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] DBClientCursor::init call() failed
Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] WriteBackListener exception : DBClientBase::findN: transport error: shard1:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('50407de3a9156801d6dd726a') }
Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] SyncClusterConnection connecting to [config1:27019]
Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] SyncClusterConnection connecting to [config2:27019]
Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] SyncClusterConnection connecting to [config3:27019]
Mon Sep 17 15:02:53 [WriteBackListener-shard1:27018] Socket recv() errno:104 Connection reset by peer shard1-IP:27018
Mon Sep 17 15:02:53 [WriteBackListener-shard1:27018] SocketException: remote: shard1-IP:27018 error: 9001 socket exception [1] server [shard1-IP:27018]
Mon Sep 17 15:02:53 [WriteBackListener-shard1:27018] DBClientCursor::init call() failed
Mon Sep 17 15:02:53 [WriteBackListener-shard1:27018] WriteBackListener exception : DBClientBase::findN: transport error: shard1:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('50407de3a9156801d6dd726a') }
Mon Sep 17 15:02:54 [mongosMain] connection accepted from client1:47748 #1847 (7 connections now open)
Mon Sep 17 15:02:54 [conn1847] got not master for: shard1:27018
Received signal 11
Backtrace: 0x8386d5 0x7f21638c34f0 0x7f215f6b66c0
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/x86_64-linux-gnu/libc.so.6(+0x324f0)[0x7f21638c34f0]
[0x7f215f6b66c0]

I am using MongoDB 2.2 and, for the client, version 2.9 of the Java driver.

All servers are reachable and all firewall settings are set correctly.

Thanks.

Mark Hillick

Sep 18, 2012, 6:03:28 AM

to mongod...@googlegroups.com

Hi Daniel,

Did these errors only start after the DNS problem? Is the DNS problem now resolved, and are these socket exceptions still happening?

Do you have authentication enabled on the shard cluster?

Were there any issues with the config server?

Would you be able to provide a larger snippet of the logs?

Thanks

Mark

Daniel

Sep 18, 2012, 6:50:15 AM

to mongod...@googlegroups.com

I think they started after the DNS problem, but it may just have been a coincidence. They are still occurring, and it is a real problem because it prevents me from going into production with my system. Someone else had the same problem, and there is a Jira issue for it: https://jira.mongodb.org/browse/SERVER-7029. I have added some level 2 log snippets in the comments.

The cluster runs without authentication. It's protected by firewalls, but they are correctly configured. When I don't add data, the error doesn't occur and the system runs fine. As soon as I start using the system, the problem comes up.

Daniel

Mark Hillick

Sep 18, 2012, 8:56:51 AM

to mongod...@googlegroups.com

Hi Daniel,

I saw that SERVER ticket when I first saw your posting but wanted to learn more about your environment.

If the MongoS is crashing during or after a replica set (RS) failover, then the SERVER ticket is the best way to move this forward.

Thanks

Mark

Daniel

Sep 18, 2012, 9:04:45 AM

to mongod...@googlegroups.com

Mark,

How can I find out whether the crash happens before or after a failover? Or is there no need for further debugging on my side?

I don't really know where to start fixing this, because I am not sure whether this is a MongoDB core problem.

Thanks.

Mark Hillick

Sep 18, 2012, 9:11:30 AM

to mongod...@googlegroups.com

Hi Daniel,

Can you check your mongod logs for the first occurrence of the exception?

egrep "error: 9001 socket exception" mongo_log_file

should return something. Compare the time of the first occurrence with the time of the DNS issue.
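For logs too large to eyeball, the same check can be scripted. The following is a minimal Python sketch, not an official tool. It assumes that log lines start with a timestamp such as `Mon Sep 17 15:02:53` (as in the snippets earlier in this thread) and that the caller supplies the year, because the log omits it. The function name `first_socket_exception` is purely illustrative.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log snippets in this thread,
# e.g. "Mon Sep 17 15:02:53". The year is not logged, so the caller supplies it.
TIMESTAMP_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ")
NEEDLE = "error: 9001 socket exception"


def first_socket_exception(lines, year):
    """Return (timestamp, line) for the first matching log line, or None."""
    for line in lines:
        if NEEDLE not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match is None:
            continue  # matching text but no parsable timestamp; skip it
        stamp = " ".join(match.group(1).split())  # collapse any padded spacing
        when = datetime.strptime(f"{year} {stamp}", "%Y %a %b %d %H:%M:%S")
        return when, line.rstrip("\n")
    return None


if __name__ == "__main__":
    sample = [
        "Mon Sep 17 15:02:52 [WriteBackListener-shard1:27018] DBClientCursor::init call() failed\n",
        "Mon Sep 17 15:02:53 [WriteBackListener-shard1:27018] SocketException: remote: "
        "shard1-IP:27018 error: 9001 socket exception [1] server [shard1-IP:27018]\n",
    ]
    print(first_socket_exception(sample, 2012))
```

To run it against a real file, pass an open file object instead of the sample list, since the function only needs an iterable of lines. The resulting timestamp can then be compared with the time of the DNS outage.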

An exception can happen for a number of reasons, though, such as network connection problems, firewall issues, hitting a ulimit threshold, running out of threads, etc. However, you have not indicated that any of these events have occurred.
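One of the causes listed above, hitting a ulimit threshold, is quick to rule out. As a hedged sketch (Unix only, using Python's standard `resource` module; the helper names `open_file_limits` and `describe` are illustrative), this prints the open-file limits of the current process. Note that it reports the limits for the user and shell running the script, which may differ from those of the running mongod or mongos process.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

For the result to be meaningful, run it as the same user and from the same kind of session that starts the database processes.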

In terms of the SERVER ticket, it is assigned to an engineer who now owns the problem and will be creating a fix.

Did you make any changes prior to the crashes beginning? If so, you may want to consider backing them out (in a safe manner).

Thanks

Mark

Daniel

Sep 18, 2012, 9:29:26 AM

to mongod...@googlegroups.com

OK, this is the first occurrence I can find. It is not from when the DNS problem happened, because I no longer have the logs for that date. This one is from when I started importing some data:

Mon Sep 17 17:03:46 [WriteBackListener-member3_rs2:27018] SocketException: remote: member3_rs2IP:27018 error: 9001 socket exception9001 socket exception [0] server [member3_rs2IP:27018]

Mon Sep 17 17:03:46 [WriteBackListener-member3_rs2:27018] DBClientCursor::init call() failed

Mon Sep 17 17:03:46 [WriteBackListener-member3_rs2:27018] User Assertion: 10276:DBClientBase::findN: transport error: member3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }

Mon Sep 17 17:03:46 [WriteBackListener-member3_rs2:27018] WriteBackListener exception : DBClientBase::findN: transport error: member3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }

Mon Sep 17 17:03:46 [Balancer] rs_wa_2 has more chunks me:623 best: rs_wa_1:623

Mon Sep 17 17:03:46 [Balancer] rs_wa_3 has more chunks me:627 best: rs_wa_1:623

Mon Sep 17 17:03:46 [Balancer] collection : database.documents

Mon Sep 17 17:03:46 [Balancer] donor : rs_wa_3 chunks on 627

Mon Sep 17 17:03:46 [Balancer] receiver : rs_wa_1 chunks on 623

Mon Sep 17 17:03:46 [Balancer] threshold : 8

Mon Sep 17 17:03:46 [Balancer] rs_wa_2 has more chunks me:1 best: rs_wa_1:0

Mon Sep 17 17:03:46 [Balancer] rs_wa_3 has more chunks me:0 best: rs_wa_1:0

Mon Sep 17 17:03:46 [Balancer] collection : webanalyzer.documents2

Mon Sep 17 17:03:46 [Balancer] donor : rs_wa_2 chunks on 1

Mon Sep 17 17:03:46 [Balancer] receiver : rs_wa_1 chunks on 0

Mon Sep 17 17:03:46 [Balancer] threshold : 2

Mon Sep 17 17:03:46 [Balancer] no need to move any chunk

Mon Sep 17 17:03:46 [Balancer] *** end of balancing round

Mon Sep 17 17:03:47 [ReplicaSetMonitorWatcher] checking replica set: rs_wa_1

Daniel

Sep 18, 2012, 9:31:59 AM

to mongod...@googlegroups.com

> An exception can happen for a number of reasons, though, such as network connection problems, firewall issues, hitting a ulimit threshold, running out of threads, etc. However, you have not indicated that any of these events have occurred.

OK, but shouldn't there be some failover when, for example, a network problem appears? The crashing of the mongos is really weird. My database is fairly big, and I am a little concerned about this happening in a production system. There is no fast way to recover all the data.

Daniel

Mark Hillick

Sep 21, 2012, 10:28:40 AM

to mongod...@googlegroups.com

Hi Daniel,

Apologies for the delay replying.

The MongoS is stateless. Even if there were stateful failover, I don't think you could 100% guarantee not losing data.

You could potentially put a load balancer in front of multiple MongoS instances, but as they're stateless, you would still have to restart the import.
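The point about stateless routers can be sketched in a few lines of Python. This is an illustration only, not driver code: `run_import` is a hypothetical callable standing in for whatever performs the import through one mongos address, and the host names in the example are made up. Because no state is carried over between MongoS instances, the sketch restarts the whole import on the next host rather than resuming it.

```python
def import_with_failover(hosts, run_import):
    """Try the full import against each mongos in turn.

    `run_import(host)` is a hypothetical callable that runs the complete
    import through one mongos and raises ConnectionError if that mongos
    becomes unreachable. Nothing is resumed: each attempt starts over.
    """
    errors = []
    for host in hosts:
        try:
            return host, run_import(host)
        except ConnectionError as exc:
            errors.append((host, exc))  # remember the failure, try the next host
    raise ConnectionError(f"import failed on all hosts: {errors}")


if __name__ == "__main__":
    def fake_import(host):
        # Stand-in for a real import: the first (made-up) host is "down".
        if host == "mongos1:27017":
            raise ConnectionError("connection reset by peer")
        return "import finished"

    print(import_with_failover(["mongos1:27017", "mongos2:27017"], fake_import))
```

Since every attempt starts from the beginning, the import itself would need to tolerate being re-run, for example by skipping documents that already exist.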

The MongoS should not be dying on an import. Your particular error has been fixed in 2.2.1 as per https://jira.mongodb.org/browse/SERVER-7061.