can't create new thread, closing connections

张耀星

Sep 28, 2012, 2:39:03 PM

to mongod...@googlegroups.com

Hello everyone,

While using a master/slave cluster to serve my web application, I get the error "can't create new thread, closing connection". I dug around a lot but didn't find a solution. If anyone has run into the same issue, could you please shed some light on it?

Here are the details of my issue:

I'm using C# driver v1.1.0.4184 to connect to a slave instance running MongoDB 2.0.7 (the master instance runs 2.0.2), with the options maxPoolSize=300;slaveOk=true.

When the web server is online, I can see from the log that the number of connections increases very rapidly. When it reaches the max user processes limit (ulimit -u = 1024), I begin to see the error:

Fri Sep 28 06:37:21 [initandlisten] connection accepted from xx.xx.xx.xx:64034 #1073 (1014 connections now open)

Fri Sep 28 06:37:21 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable

Fri Sep 28 06:37:21 [initandlisten] can't create new thread, closing connection
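
To watch how close mongod is getting to the limit, the open-connection count can be pulled out of "connection accepted" lines like the one above. The following is only a minimal sketch: the function names are made up for illustration, and the headroom figure assumes one thread per connection, which ignores mongod's own background threads.

```python
import re

# Matches the "(N connections now open)" suffix that appears on
# "connection accepted" lines in the log excerpt quoted in this post.
OPEN_RE = re.compile(r"\((\d+) connections? now open\)")


def open_connections(line):
    """Return the open-connection count from a log line, or None if absent."""
    match = OPEN_RE.search(line)
    return int(match.group(1)) if match else None


def headroom(line, nproc_limit):
    """Threads left before the per-user limit is hit.

    Assumption: one thread per connection. mongod also runs background
    threads, so the real headroom is somewhat smaller.
    """
    count = open_connections(line)
    return None if count is None else nproc_limit - count


sample = ("Fri Sep 28 06:37:21 [initandlisten] connection accepted from "
          "xx.xx.xx.xx:64034 #1073 (1014 connections now open)")
print(open_connections(sample))  # 1014
print(headroom(sample, 1024))    # 10
```

With the default limit of 1024, the sample line leaves only about 10 threads of headroom, which is consistent with pthread_create failing on the very next log line.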

After some internet searching, I decided to increase "ulimit -u" to 20480. I then noticed another message in the log. I can't find the exact line now, but it said something like MongoDB is releasing unused connections. When this message appears, the number of connections drops very quickly. However, new connections are created even more quickly, so after a longer period I end up with the same error again.

Then I tried stopping master/slave replication and making both servers masters. This time everything works fine, and the number of connections always stays below 250.

Here's what I think:

Judging from the log, I think the C# driver doesn't know that it has already created enough connections in the connection pool, so it keeps creating new ones. The old ones are never used again, and after a while MongoDB decides they are no longer in use and releases them. That's why I see the second message.

What I haven't figured out is why this issue only happens with the slave instance. Does the C# driver use MongoDB to store its current connection pool size? If so, it could never write that number to a slave instance, and I guess that would explain why it keeps creating new connections: it can't tell how many have already been created.

Thank you for reading my long post. I really hope someone can help me figure out a solution.

Robert Stam

Sep 28, 2012, 11:15:27 PM

to mongod...@googlegroups.com

Version 1.1 of the C# driver is very old. You may be encountering this issue:

https://jira.mongodb.org/browse/CSHARP-302

Can you try a newer version of the driver?

--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

张耀星

Sep 29, 2012, 12:36:24 PM

to mongod...@googlegroups.com

Thanks for the tip. We did consider using a newer driver, but it seems there have been too many incompatible changes. We still need some time to review our code before we can use it.

Is there any other workaround? I really need to use the slave instance to reduce the pressure on the master soon.

On Saturday, September 29, 2012 at 11:15:39 AM UTC+8, Robert Stam wrote:

Robert Stam

Sep 29, 2012, 12:46:36 PM

to mongod...@googlegroups.com

It's hard to suggest workarounds for a version of the driver that is over a year old.

One thing you could try is to open a separate direct connection to the secondaries for queries that you want to send to the secondaries.

That may or may not solve this issue though, since CSHARP-302 was more about how connections get closed (and how they built up when they weren't being closed fast enough) when errors occur than about whether queries are being sent to secondaries.

张耀星

Sep 29, 2012, 1:15:42 PM

to mongod...@googlegroups.com

Yeah, I totally understand it's hard to suggest anything. I just need to make it work until the new driver passes our tests.

And sorry, I don't quite get your suggestion. What do you mean by a "separate direct connection"? Since my site is unstable right now anyway, I'm willing to give it a shot.

On Sunday, September 30, 2012 at 12:46:46 AM UTC+8, Robert Stam wrote:

Robert Stam

Sep 29, 2012, 2:37:54 PM

to mongod...@googlegroups.com

When you say master/slave I assume you mean a replica set?

A connection to a replica set lists the members of the replica set on the connection string:

mongodb://host1,host2,host3/?safe=true

When you connect to a replica set the driver knows about all the members and routes queries to the primary (unless slaveOk is true).

A direct connection to just one member of the replica set would have just that one host on the connection string:

mongodb://host2/?safe=true

A direct connection doesn't know about the other members of the replica set so all queries (slaveOk or not) would be routed to this one member.

You would create a new MongoServerInstance for each connection string you use.

Keep in mind though that any one of the hosts could be the primary, so host2 could be either a primary or a secondary.

Once again though, if the problem is CSHARP-302 (fixed over a year ago) then your only solution will be to upgrade to a newer version of the driver.
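
The distinction between the two connection string forms can be sketched in a few lines of Python. This is only an illustration of the rule described in this post (several hosts means a replica set connection, one host means a direct connection); the helper names are hypothetical and this is not how the C# driver itself parses connection strings.

```python
import re
from urllib.parse import urlsplit


def parse_mongo_uri(uri):
    """Split a mongodb:// URI into its host list and an option dict."""
    parts = urlsplit(uri)
    if parts.scheme != "mongodb":
        raise ValueError("not a mongodb:// URI")
    netloc = parts.netloc.rsplit("@", 1)[-1]  # drop credentials, if any
    hosts = [host for host in netloc.split(",") if host]
    options = {}
    # Options may be separated by "&" or ";" (the thread uses both styles).
    for pair in re.split(r"[;&]", parts.query):
        if "=" in pair:
            key, value = pair.split("=", 1)
            options[key] = value
    return hosts, options


def is_direct(uri):
    """True when the URI names exactly one host (the rule described above)."""
    hosts, _ = parse_mongo_uri(uri)
    return len(hosts) == 1


print(is_direct("mongodb://host1,host2,host3/?safe=true"))  # False
print(is_direct("mongodb://host2/?safe=true"))              # True
```

A check like this only tells you which kind of connection string you have; as noted above, it cannot tell you whether the single host is currently a primary or a secondary.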

张耀星

Sep 29, 2012, 9:04:30 PM

to mongod...@googlegroups.com

I see. Sorry I didn't make it clear enough. I'm already using a direct connection to the secondary. The whole issue happens when I'm connecting to a secondary directly.

Then I guess my only solution now is to upgrade the driver.

Well, thanks anyway.

On Sunday, September 30, 2012 at 2:38:05 AM UTC+8, Robert Stam wrote:

张耀星

Sep 29, 2012, 9:55:54 PM

to mongod...@googlegroups.com

One more thing: I also find a lot of this exception in our log:

Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Do you think it's also caused by the bug you mentioned above? To me it smells like the server is too busy to respond when the connection storm happens.

On Sunday, September 30, 2012 at 2:38:05 AM UTC+8, Robert Stam wrote:

Robert Stam

Sep 29, 2012, 10:02:45 PM

to mongod...@googlegroups.com

Yes, if the server is closing sockets the client would get this error.

张耀星

Sep 30, 2012, 7:12:27 AM

to mongod...@googlegroups.com

Great, then it looks like upgrading the driver would resolve everything. All I have to do is get my team to upgrade the driver as soon as possible.

Thanks a lot for your help.

On Sunday, September 30, 2012 at 10:02:56 AM UTC+8, Robert Stam wrote:

张耀星

Oct 1, 2012, 10:41:28 AM

to mongod...@googlegroups.com

We've upgraded the driver to the latest version, and it has been running in production for several hours. There seem to be almost no exceptions thrown anymore, and the server load has stayed low so far.

Traffic will reach its peak in 4 hours; we'll see whether the driver resolves the issue completely.

There's one more thing, and I don't know whether it's related to the driver. Now I see a lot of log lines like these:

Mon Oct 1 09:17:20 [initandlisten] connection accepted from 10.xx.xx.xx:57566 #1557

Mon Oct 1 09:17:20 [conn1557] end connection 10.xx.xx.xx:57566

It seems a connection is created and released within a very short time. I'm not sure if that's expected behavior. To me it looks more like someone isn't using the driver correctly, maybe calling Disconnect or something similar. What do you think?

On Sunday, September 30, 2012 at 7:12:27 PM UTC+8, 张耀星 wrote:

Robert Stam

Oct 1, 2012, 10:44:56 AM

to mongod...@googlegroups.com

Are you seeing these once every 10 seconds?

The driver pings each server once every 10 seconds to check whether it is still up and what state it is in. It uses a new connection each time for this.

张耀星

Oct 1, 2012, 3:32:49 PM

to mongod...@googlegroups.com

Yes, you're right, it's every 10 seconds:

Mon Oct 1 14:13:36 [initandlisten] connection accepted from 10.80.xx.xx:61287 #483

Mon Oct 1 14:13:36 [conn483] end connection 10.80.xx.xx:61287

Mon Oct 1 14:13:43 [initandlisten] connection accepted from 10.4.xx.xx:55519 #484

Mon Oct 1 14:13:43 [conn484] end connection 10.4.xx.xx:55519

Mon Oct 1 14:13:47 [initandlisten] connection accepted from 10.80.xx.xx:61296 #485

Mon Oct 1 14:13:47 [conn485] end connection 10.80.xx.xx:61296

Mon Oct 1 14:13:53 [initandlisten] connection accepted from 10.4.xx.xx:56289 #486

Mon Oct 1 14:13:53 [conn486] end connection 10.4.xx.xx:56289

Mon Oct 1 14:13:56 [initandlisten] connection accepted from 10.80.xx.xx:61315 #487

Mon Oct 1 14:13:56 [conn487] end connection 10.80.xx.xx:61315
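
One way to confirm the 10-second cadence is to group the "connection accepted" lines by client address and look at the gaps between them. The sketch below assumes the log format shown above; the function name and the `year` parameter are illustrative (these log lines carry no year, so one has to be supplied).

```python
import re
from collections import defaultdict
from datetime import datetime

# Captures the timestamp and the client address of "connection accepted" lines.
ACCEPT_RE = re.compile(
    r"^\w{3} (\w{3} +\d+ \d\d:\d\d:\d\d) \[initandlisten\] "
    r"connection accepted from ([^:\s]+):\d+")


def accept_intervals(lines, year=2012):
    """Map each client address to the gaps (in seconds) between its connects."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in lines:
        match = ACCEPT_RE.match(line)
        if not match:
            continue  # "end connection" and other lines are ignored
        stamp = datetime.strptime(f"{year} {match.group(1)}",
                                  "%Y %b %d %H:%M:%S")
        host = match.group(2)
        if host in last_seen:
            gaps[host].append(int((stamp - last_seen[host]).total_seconds()))
        last_seen[host] = stamp
    return dict(gaps)


log = [
    "Mon Oct 1 14:13:36 [initandlisten] connection accepted from 10.80.xx.xx:61287 #483",
    "Mon Oct 1 14:13:36 [conn483] end connection 10.80.xx.xx:61287",
    "Mon Oct 1 14:13:43 [initandlisten] connection accepted from 10.4.xx.xx:55519 #484",
    "Mon Oct 1 14:13:47 [initandlisten] connection accepted from 10.80.xx.xx:61296 #485",
    "Mon Oct 1 14:13:53 [initandlisten] connection accepted from 10.4.xx.xx:56289 #486",
    "Mon Oct 1 14:13:56 [initandlisten] connection accepted from 10.80.xx.xx:61315 #487",
]
print(accept_intervals(log))
# {'10.80.xx.xx': [11, 9], '10.4.xx.xx': [10]}
```

Each client reconnects roughly every 10 seconds, which matches the driver's periodic server check rather than application code calling Disconnect.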

But the bad news is that the connection problem happened again, though slightly differently from last time. Now we get a lot of the following error before the server goes down:

Unable to connect to server 10.51.xx.xx:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.51.xx.xx:27017

I'm a bit confused. Usually this exception means a private network issue, but after restarting the app pool everything recovered. Do you think anything else could cause this? By the way, I didn't see the "an existing connection was forcibly closed by the remote host mongodb" exception this time. Both MongoDB and IIS are working fine, and CPU load is low too.

It has only happened once since upgrading the driver, much less frequently than before, so I guess it may be a different issue.

On Monday, October 1, 2012 at 10:45:05 PM UTC+8, Robert Stam wrote:

craiggwilson

Oct 1, 2012, 4:07:18 PM

to mongod...@googlegroups.com

Can you be very specific about your server setup and what your connection strings look like? I'd like a complete picture of this instead of trying to put it together from 10 different messages.

Thanks :)

张耀星

Oct 1, 2012, 11:29:25 PM

to mongod...@googlegroups.com

Sure. Let me try to put everything together.

Our site was recently migrated from a SQL Server/IIS application; the new site runs on MongoDB/IIS. Only a few days ago did we switch all the traffic to the new site. Before that it handled 50% of the traffic, about 4 million PV per day, and it was doing well. Mongo was running on an 8-core/12G RAM server, with 2 IIS servers connected to it.

Before switching, to make sure everything went as expected, I added 2 more IIS servers and a new MongoDB instance running on a 16-core/12G RAM server. I was planning to have a master/slave cluster serve the 4 web servers. The master node runs MongoDB 2.0.2 and the slave node 2.0.7. Both were installed from the official source with yum install, without any special configuration.

Since the slave has better hardware, I wanted the original server to handle writes only and have all the IIS servers connect to the slave. The connection string was simply a basic one plus "maxPoolSize=300;slaveOk=true". However, I got the error described in my first post: the driver kept creating new connections without releasing them. When the number of connections reached ulimit -u, the server went down. There were a lot of error messages:

Fri Sep 28 06:37:21 [initandlisten] connection accepted from xx.xx.xx.xx:64034 #1073 (1014 connections now open)

Fri Sep 28 06:37:21 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable

Fri Sep 28 06:37:21 [initandlisten] can't create new thread, closing connection

and from IIS side there are two kinds of exceptions:

an existing connection was forcibly closed by the remote host mongodb

Because the server was scheduled to go online soon, I had no choice but to set both MongoDB instances to master mode. Data was synced by a Windows service we had created earlier. Each instance was connected to 2 IIS servers. Generally it worked, but not very well. I had to restart both IIS and MongoDB every few hours, especially during peak hours; otherwise the CPU of both IIS and MongoDB went up to 100% at random. I couldn't find anything abnormal in the MongoDB logs, but a lot of exceptions were thrown on the IIS side:

Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

I thought it was the high load that made the server unable to respond. So, following Robert's suggestion, we upgraded the C# driver from 1.1 to 1.6 to stop the connection storm. He was right: the main issue was resolved, and MongoDB CPU never goes up to 100% anymore (even when IIS was at 100%).

Now the 2nd exception has disappeared, and most of the time there are no exceptions at all. But last night around 3:00 AM GMT+8, both IIS servers connected to the new MongoDB went down (the IIS servers connected to the old MongoDB were fine). I didn't find anything useful in the MongoDB log, but in IIS we got a lot of:

Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

At that time, both IIS servers had very high CPU usage (while the load on MongoDB was very low). So I restarted the app pool, and it recovered at once.

So that's all. It's the first night after upgrading to the new driver, and I haven't found anything else yet. I didn't try to make them master/slave again. Maybe in a few days I'll try master/slave again, since it's a holiday here now and nobody's in the office. What do you think? Is there anything else I need to provide?

On Tuesday, October 2, 2012 at 4:07:18 AM UTC+8, craiggwilson wrote:

张耀星

Oct 1, 2012, 11:58:41 PM

to mongod...@googlegroups.com

Sorry, I made a mistake. The exception that happened last night was not the one in my last post. It was:

Unable to connect to server 10.xx.xx.xx:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

They look so similar to each other.

On Tuesday, October 2, 2012 at 4:07:18 AM UTC+8, craiggwilson wrote:

craiggwilson

Oct 2, 2012, 8:50:47 AM

to mongod...@googlegroups.com

Thanks, this is very helpful. What I see is this: Everything is running well and then, at some point, you start getting TCP/IP error messages ("Unable to read data from..."). That is the .NET framework throwing the exception, not the MongoDB layer. The usual cause of this error is some type of network issue where the driver can't communicate with MongoDB anymore. It's possible we have a bug somewhere that, after some period of time, it stops behaving properly. This is the first we've heard of this and this scenario will be extremely hard to narrow down. Hence, I'd like to walk through some other possible issues as well.

1) Can you provide the connection strings you are using?

2) Can you provide some code for how you are

a) setting up your app - are you using an IoC container? How are you creating your MongoServer instances?

b) querying and writing to the database - How are you creating MongoDatabase and MongoCollection?

c) terminating your "sessions" - are you calling Disconnect?

Craig

张耀星

Oct 2, 2012, 1:25:31 PM

to mongod...@googlegroups.com

Sorry, I don't have the code at hand right now. It's in my office.

But so far it hasn't happened again, not even once. I'd like to watch for one more night to see if it recurs. Maybe it really is just a network issue. We have been getting a lot of maintenance notices from our IDC recently. Let's see how it does tonight. If it's still happening, I'll go to the office tomorrow to get the code.

Thanks a lot for your help.

On Tuesday, October 2, 2012 at 8:50:47 PM UTC+8, craiggwilson wrote:

张耀星

Oct 3, 2012, 9:23:02 AM

to mongod...@googlegroups.com

Bad news. Last night the MongoDB CPU rose to 100% again (but IIS still works fine, just slower). After a service restart, it recovered. Is there any known issue with MongoDB 2.0.7?

Here's the part I don't understand. We now have 2 Mongo servers, with 2 IIS servers connected to each. What confuses me is that the old MongoDB server, with weaker hardware, has a higher CPU usage percentage but is doing fine: I see no errors in its log and never need to restart it. Meanwhile the new server, with better hardware, which is supposed to perform much better, keeps causing problems. The only difference between them is that the old one runs Mongo 2.0.2 and the new one runs 2.0.7.

Another thing I noticed is that in htop, 90% of the CPU bar is red. I think that means the time is being spent in Linux kernel threads, right? The disk on the new server is RAID 10; could it mean something is wrong with the RAID disks?

On Wednesday, October 3, 2012 at 1:25:31 AM UTC+8, 张耀星 wrote:

张耀星

Oct 24, 2012, 11:55:05 PM

to mongod...@googlegroups.com

OK, we finally found the reason. I'll skip the details and write down the cause in case someone runs into the same issue.

There were generally 2 reasons for our problem.

The first is what Robert mentioned above: our driver was so old that it could lead to a connection storm.

The second issue has nothing to do with the C# driver. It's a problem with the combination of Linux and MongoDB versions.

Our engineer installed the wrong version, CentOS 6.3 (we asked for 6.0). When we ran MongoDB 2.0.2/2.0.7 on CentOS 6.3, it caused high CPU consumption. We can reproduce the issue by putting a heavy load on MongoDB: CPU usage begins to rise very quickly, but most of it is taken by Linux kernel processes. Once CPU reaches a high level, it never comes down again even if we remove all load (or it takes a very long time to come down). Now we've tried MongoDB 2.2.0 on CentOS 6.0, and everything works fine again. CPU consumption is much lower and almost no time is spent in kernel processes.

In conclusion, DON'T run MongoDB 2.0.x on CentOS 6.3.
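
For anyone who wants to check for the same symptom, the share of CPU time spent in the kernel can be estimated from two samples of the aggregate "cpu" line in /proc/stat. This is a rough sketch, not the method we used: the function names are made up, and counting system + irq + softirq as "kernel time" is an assumption about what the red portion of the htop bar represents.

```python
def cpu_fields(stat_line):
    """Parse the aggregate "cpu" line of /proc/stat into integer ticks."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def kernel_share(before, after):
    """Fraction of elapsed CPU ticks spent in kernel mode between two samples.

    Field order per proc(5): user nice system idle iowait irq softirq steal ...
    Assumption: "kernel time" here means system + irq + softirq.
    """
    first, second = cpu_fields(before), cpu_fields(after)
    delta = [b - a for a, b in zip(first, second)]
    if len(delta) < 7:
        raise ValueError("expected at least 7 counters on the cpu line")
    total = sum(delta[:8])  # guest ticks are already counted in user/nice
    if total <= 0:
        return 0.0
    return (delta[2] + delta[5] + delta[6]) / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate cpu line (Linux only); call twice, a few seconds apart."""
    with open(path) as handle:
        return handle.readline()


# Synthetic samples: 450 of 540 elapsed ticks were spent in system mode.
before = "cpu  100 0 50 800 10 0 0 0 0 0"
after = "cpu  150 0 500 840 10 0 0 0 0 0"
print(round(kernel_share(before, after), 3))  # 0.833
```

A value that stays high after the load has been removed would match the behavior described above.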

On Wednesday, October 3, 2012 at 9:23:02 PM UTC+8, 张耀星 wrote:
