Google Groups

Re: [mongodb-user] Lock up after restarting service which has been connected for some time


Andy Gayton Apr 27, 2012 4:46 PM
Posted in group: mongodb-user
I'm running: 'ObjectId()' with db.eval()

Just retried then, db.currentOp() is actually returning.  I had it running every second during the lockup, logging to separate files, so lots of info there if more would be helpful.  There were only 2 jobs which were "waitingForLock" : false, I've included those jobs + mongostat at the start and end of the lock in this gist:


On Friday, April 27, 2012 4:03:58 PM UTC-7, Scott Hernandez wrote:
What are you running with db.eval()? Are you saying db.currentOp()
doesn't return, or hangs?

Can you supply mongostat numbers when this happens, from before and after?

Please post the stats to gist/pastebin/etc.

On Fri, Apr 27, 2012 at 4:00 PM, Andy Gayton <an...@thecablelounge.com> wrote:
> We have 2 types of services which connect to mongo.  One which is read-only,
> and one which performs read/write queries.
>
> Each service has a connection pool of up to 200 connections.
>
> We currently run 2x of the read-only service and 4x of the read/write
> service.  Typical number connections to our master is 550-800.
>
> We never have issues in normal running.  When restarting one of the
> read-only services, there is never an issue either.
>
> However, when restarting one of the read/write servers, the master
> completely locks up.  The amount of time the db locks up for is proportional
> to the amount of time the connected service has been running.  From a couple
> of seconds, to the longest we've seen, 5 minutes, after the connected
> service has been running for a couple of days.
>
> During the lock up 1x bin/mongod runs at 100% cpu of a single core.  Our
> master runs on an EC2 High-Memory Double Extra Large Instance.  The other 3
> cores are dead idle.  All disk IO is also dead idle during the lock up.
>  It's not possible to query the db during this state to get running jobs
> etc.
>
> The read/write service doesn't running any special start up or shutdown
> code.  It just starts, takes requests, performs queries, replies and then on
> stop.. stops.  While it is running performance is great.  It's almost
> identical to the read only service which is able to restart fine.  The only
> differences I can think of is it writes.. perhaps freeing the service's db
> connections after being used for a large number of writes could cause a lock
> up?  The other difference is we use the master db to generate object ids as
> a way to get ids that are guaranteed to be incrementing across all our
> services.  So a few million of these: _id = db.eval('ObjectId()') are issued
> from the service after a day or so of run time.  Is it possible these evals
> could be setting up some type of connection state, which when the
> connections are closed could cause a lockup?
>
> There's really no clue in the db logs.  They just go quiet during the log
> up, except for a couple of sparse connections accepted and ended, and then
> normal activity comes back in a flurry after the lockup.
>
> During the last lockup there was 4 minutes of sparse:
>
> 2012-04-27_21:49:07.91410 Fri Apr 27 21:49:07 [initandlisten] connection
> accepted from 10.194.227.159:37157 #9291798
> 2012-04-27_21:49:11.66835 Fri Apr 27 21:49:11 [conn9291776] end connection
> 10.194.227.159:37009
> 2012-04-27_21:49:11.66854 Fri Apr 27 21:49:11 [conn9291778] end connection
> 10.194.227.159:37011
> 2012-04-27_21:49:11.90823 Fri Apr 27 21:49:11 [initandlisten] connection
> accepted from 10.194.227.159:37177 #9291799
> 2012-04-27_21:49:11.91429 Fri Apr 27 21:49:11 [initandlisten] connection
> accepted from 10.194.227.159:37178 #9291800
> 2012-04-27_21:49:11.91454 Fri Apr 27 21:49:11 [initandlisten] connection
> accepted from 10.194.227.159:37179 #9291801
> 2012-04-27_21:49:11.92265 Fri Apr 27 21:49:11 [initandlisten] connection
> accepted from 10.194.227.159:37180 #9291802
> 2012-04-27_21:49:13.68824 Fri Apr 27 21:49:13 [conn9291780] end connection
> 10.194.227.159:37023
>
> And then:
>
> 2012-04-27_21:49:22.20714 Fri Apr 27 21:49:22 [conn7447185] command
> chalk-pad-g27.$cmd command: { $eval: CodeWScope( ObjectId(), {}), args: {} }
> ntoreturn:1 reslen:69 229568ms
>
> And then about another minute of lock up, then things were back at full
> steam.
>
> Are master is running on: mongodb-linux-x86_64-2.0.1
>
> The issue happens whether the service connects directly or via a mongos.
>
> Any clues for possible causes or fixes would be greatly appreciated.
>
> Andy.
>
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mongodb-user/-/OvORlGZIBZIJ.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.