We have 2 types of services which connect to mongo. One which is read-only, and one which performs read/write queries.
Each service has a connection pool of up to 200 connections.
We currently run 2x of the read-only service and 4x of the read/write service. Typical number connections to our master is 550-800.
We never have issues in normal running. When restarting one of the read-only services, there is never an issue either.
However, when restarting one of the read/write servers, the master completely locks up. The amount of time the db locks up for is proportional to the amount of time the connected service has been running. From a couple of seconds, to the longest we've seen, 5 minutes, after the connected service has been running for a couple of days.
During the lock up 1x bin/mongod runs at 100% cpu of a single core. Our master runs on an EC2 High-Memory Double Extra Large Instance. The other 3 cores are dead idle. All disk IO is also dead idle during the lock up. It's not possible to query the db during this state to get running jobs etc.
The read/write service doesn't running any special start up or shutdown code. It just starts, takes requests, performs queries, replies and then on stop.. stops. While it is running performance is great. It's almost identical to the read only service which is able to restart fine. The only differences I can think of is it writes.. perhaps freeing the service's db connections after being used for a large number of writes could cause a lock up? The other difference is we use the master db to generate object ids as a way to get ids that are guaranteed to be incrementing across all our services. So a few million of these: _id = db.eval('ObjectId()') are issued from the service after a day or so of run time. Is it possible these evals could be setting up some type of connection state, which when the connections are closed could cause a lockup?
There's really no clue in the db logs. They just go quiet during the log up, except for a couple of sparse connections accepted and ended, and then normal activity comes back in a flurry after the lockup.
During the last lockup there was 4 minutes of sparse:
2012-04-27_21:49:07.91410 Fri Apr 27 21:49:07 [initandlisten] connection accepted from
10.194.227.159:37157 #9291798
2012-04-27_21:49:11.90823 Fri Apr 27 21:49:11 [initandlisten] connection accepted from
10.194.227.159:37177 #9291799
2012-04-27_21:49:11.91429 Fri Apr 27 21:49:11 [initandlisten] connection accepted from
10.194.227.159:37178 #9291800
2012-04-27_21:49:11.91454 Fri Apr 27 21:49:11 [initandlisten] connection accepted from
10.194.227.159:37179 #9291801
2012-04-27_21:49:11.92265 Fri Apr 27 21:49:11 [initandlisten] connection accepted from
10.194.227.159:37180 #9291802
And then:
2012-04-27_21:49:22.20714 Fri Apr 27 21:49:22 [conn7447185] command chalk-pad-g27.$cmd command: { $eval: CodeWScope( ObjectId(), {}), args: {} } ntoreturn:1 reslen:69 229568ms
And then about another minute of lock up, then things were back at full steam.
Are master is running on: mongodb-linux-x86_64-2.0.1
The issue happens whether the service connects directly or via a mongos.
Any clues for possible causes or fixes would be greatly appreciated.
Andy.