Lock up after restarting service which has been connected for some time

Showing 1-10 of 10 messages
Lock up after restarting service which has been connected for some time Andy Gayton 4/27/12 4:00 PM
We have 2 types of services which connect to mongo.  One which is read-only, and one which performs read/write queries.

Each service has a connection pool of up to 200 connections.

We currently run 2x of the read-only service and 4x of the read/write service.  Typical number connections to our master is 550-800.

We never have issues in normal running.  When restarting one of the read-only services, there is never an issue either.

However, when restarting one of the read/write servers, the master completely locks up.  The amount of time the db locks up for is proportional to the amount of time the connected service has been running.  From a couple of seconds, to the longest we've seen, 5 minutes, after the connected service has been running for a couple of days.

During the lock up 1x bin/mongod runs at 100% cpu of a single core.  Our master runs on an EC2 High-Memory Double Extra Large Instance.  The other 3 cores are dead idle.  All disk IO is also dead idle during the lock up.  It's not possible to query the db during this state to get running jobs etc.

The read/write service doesn't running any special start up or shutdown code.  It just starts, takes requests, performs queries, replies and then on stop.. stops.  While it is running performance is great.  It's almost identical to the read only service which is able to restart fine.  The only differences I can think of is it writes.. perhaps freeing the service's db connections after being used for a large number of writes could cause a lock up?  The other difference is we use the master db to generate object ids as a way to get ids that are guaranteed to be incrementing across all our services.  So a few million of these: _id = db.eval('ObjectId()') are issued from the service after a day or so of run time.  Is it possible these evals could be setting up some type of connection state, which when the connections are closed could cause a lockup?

There's really no clue in the db logs.  They just go quiet during the log up, except for a couple of sparse connections accepted and ended, and then normal activity comes back in a flurry after the lockup.

During the last lockup there was 4 minutes of sparse:

2012-04-27_21:49:07.91410 Fri Apr 27 21:49:07 [initandlisten] connection accepted from 10.194.227.159:37157 #9291798
2012-04-27_21:49:11.66835 Fri Apr 27 21:49:11 [conn9291776] end connection 10.194.227.159:37009
2012-04-27_21:49:11.66854 Fri Apr 27 21:49:11 [conn9291778] end connection 10.194.227.159:37011
2012-04-27_21:49:11.90823 Fri Apr 27 21:49:11 [initandlisten] connection accepted from 10.194.227.159:37177 #9291799
2012-04-27_21:49:11.91429 Fri Apr 27 21:49:11 [initandlisten] connection accepted from 10.194.227.159:37178 #9291800
2012-04-27_21:49:11.91454 Fri Apr 27 21:49:11 [initandlisten] connection accepted from 10.194.227.159:37179 #9291801
2012-04-27_21:49:11.92265 Fri Apr 27 21:49:11 [initandlisten] connection accepted from 10.194.227.159:37180 #9291802
2012-04-27_21:49:13.68824 Fri Apr 27 21:49:13 [conn9291780] end connection 10.194.227.159:37023

And then:

2012-04-27_21:49:22.20714 Fri Apr 27 21:49:22 [conn7447185] command chalk-pad-g27.$cmd command: { $eval: CodeWScope( ObjectId(), {}), args: {} } ntoreturn:1 reslen:69 229568ms

And then about another minute of lock up, then things were back at full steam.

Are master is running on: mongodb-linux-x86_64-2.0.1

The issue happens whether the service connects directly or via a mongos.

Any clues for possible causes or fixes would be greatly appreciated.

Andy.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Scott Hernandez 4/27/12 4:03 PM
What are you running with db.eval()? Are you saying db.currentOp()
doesn't return, or hangs?

Can you supply mongostat numbers when this happens, from before and after?

Please post the stats to gist/pastebin/etc.
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mongodb-user/-/OvORlGZIBZIJ.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Andy Gayton 4/27/12 4:46 PM
I'm running: 'ObjectId()' with db.eval()

Just retried then, db.currentOp() is actually returning.  I had it running every second during the lockup, logging to separate files, so lots of info there if more would be helpful.  There were only 2 jobs which were "waitingForLock" : false, I've included those jobs + mongostat at the start and end of the lock in this gist:

Re: [mongodb-user] Lock up after restarting service which has been connected for some time Scott Hernandez 4/27/12 5:00 PM
Why are you creating an objectId on the server like that?

Normally ObjectIDs are created by the driver on the client machine for
many reasons. If you plan to continue with this process (for some at
yet unknown reason) please at least specify the nolock option so you
don't block writes.
http://www.mongodb.org/display/DOCS/Server-side+Code+Execution#Server-sideCodeExecution-Using%7B%7Bdb.eval%28%29%7D%7D

At what time in that mongostat output was the system "locked up"?
> https://groups.google.com/d/msg/mongodb-user/-/y-fE75d-z08J.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Andy Gayton 4/27/12 5:32 PM
Thanks Scott,

As I mentioned in the original post:

'we use the master db to generate object ids as a way to get ids that are guaranteed to be incrementing across all our services.'

Really appreciate your help.  I don't want to get into semantics re phrasing, not sure how else to describe the system while it's in this state.  So making this eval call, without the nolock option is likely the cause of the flatlined 0 inserts, 0 queries, 0 updates, 0 deletes, 0 disk IO for several minutes?  Not sure if it's significant that this only happens when the connecting server is stopping?  While the connecting service is running everything runs smoothly.

I just gisted mongostat as the system when into, and out of this state.

Here's a little more of the middle:


We can definitely add the nolock option.  If this is something that's really bad to be putting on the master, we can also run a python rpc service, that hands out ObjectIds.  The nice part of using the master for this is we got auto-fail over for this service if the master went down, which we'd need to build into an rpc service we create ourselves.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Scott Hernandez 4/27/12 6:08 PM
There def. could be a better way to do this. Can you explain where you
use these ids and when you need them to be created?

What version are you running? From that last mongostat, when there are
lots of queued writes and reads, what does db.currentOp() show?

Yes, please add the nolock option and see if it goes away.

If you really need to create counters you can do so without using
javascript and still keep the counters on the server as described
here: http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field
> https://groups.google.com/d/msg/mongodb-user/-/QBqCUBppL1wJ.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Andy Gayton 4/27/12 8:03 PM
We provide a versioned data store that synchronizes our user's data instantly across many devices.  To do this clients listen to a stream of change objects, ordered by their change id, and apply incoming changes to their local data using operational transforms.  We use ObjectId's for the change id, as they have a bunch of nice qualities, in particular, a timestamp.  If a set of changes is processed close to the same time, each change processed by a different process, it's possible that order of the change ids won't match the change history if each server process generates the ObjectId locally, due to small time differences between the server processes.

It's sounds like we're really abusing what eval is meant for here.  I'm going to move generating the ObjectIds out of the master mongo node to an rpc process and see if that addresses the problem.

We're running on: mongodb-linux-x86_64-2.0.1

For the db.currentOp() during the thick of the inactivity period on the db, there's a large number of jobs waiting for lock, and the two jobs I gisted before, that aren't waiting for lock:

ubuntu@ip-10-218-95-168:~$ grep '"waitingForLock" : true,'  /tmp/currentOp-20:02.txt  | wc -l
281
ubuntu@ip-10-218-95-168:~$ grep '"waitingForLock" : false,'  /tmp/currentOp-20:02.txt  | wc -l
2

The two jobs that aren't waiting for lock:


Not sure it'd be helpful to see an entire dump of the db.currentOp?
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Scott Hernandez 4/27/12 8:45 PM
Nope, the facf that the eval is taking over 106 seconds is your
blocker. Using the findAndModify to generate your ids will be much
faster. If you just need a sequence id generated by the server on each
insert to order the changes then you can also use another trick which
doesn't require a round-trip. But the findAndModify is really the
easiest.
> https://groups.google.com/d/msg/mongodb-user/-/nQselJ3O2AkJ.
Re: [mongodb-user] Lock up after restarting service which has been connected for some time Andy Gayton 4/27/12 9:23 PM
Roger re using findAndModify for auto-incrementing int ids.

Something seems awry that ObjectId() would take 106s, or in the original post's case, 229s - *only* around the same time a service releases a bunch of connections.

2012-04-27_21:49:22.20714 Fri Apr 27 21:49:22 [conn7447185] command chalk-pad-g27.$cmd command: { $eval: CodeWScope( ObjectId(), {}), args: {} } ntoreturn:1 reslen:69 229568ms

Re: [mongodb-user] Lock up after restarting service which has been connected for some time Scott Hernandez 4/28/12 11:54 AM
This is most likely a javascript leak, or bug, which may be fixed in
newer versions... hard to say.

Do you use the same connection to call that db.eval every time from your apps?
> https://groups.google.com/d/msg/mongodb-user/-/pPP7SOPvuO4J.