Mongod seems hung


RobV23

Nov 18, 2010, 11:50:01
to mongodb-user
Hi all, hoping maybe someone has a suggestion.

Our mongod seems hung, at least partially.

When a client connects, the connection is established, but commands
simply hang. For example, the mongo shell connects, but a show dbs
hangs. The log shows the connection, but no subsequent activity.

When I try to connect through the web interface (port 28017), I get a
"timeout getting readlock" error.

However, I can send a SIGUSR1 and the log rolls over, so it's doing
something. When I run ps, I see the state bouncing back and forth
between poll and ?.

I'm running 1.6.3 (x86_64).

Any thoughts on what else I can look at or do? I'd rather not bounce
the DB if possible (although it's not doing anything at the moment).
Prior to this, it behaved normally. The last activity was creating an
index on an M/R temp collection.


Eliot Horowitz

Nov 18, 2010, 11:50:42
to mongod...@googlegroups.com
Can you run db.currentOp() from the shell?
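For example, a minimal sketch that just counts the operations marked active:

// count how many in-progress operations are marked active
var active = 0;
db.currentOp().inprog.forEach(function (op) { if (op.active) active++; });
print("active ops: " + active);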


RobV23

Nov 18, 2010, 13:59:34
to mongodb-user
Hi Eliot,
Thanks for the reply.

I can run db.currentOp() and get a long list of operations.
Does this imply I may not need to restart mongod?

There are 298 operations. Of those, 12 are {"active" : true}.
Here are a few samples:
{
    "opid" : 37902836,
    "active" : true,
    "lockType" : "read",
    "waitingForLock" : true,
    "secs_running" : 9687,
    "op" : "query",
    "ns" : "?",
    "query" : {
        "listDatabases" : 1
    },
    "client" : "127.0.0.1:35918",
    "desc" : "conn"
},

{
    "opid" : 37902768,
    "active" : true,
    "lockType" : "read",
    "waitingForLock" : false,
    "secs_running" : 9835,
    "op" : "query",
    "ns" : "control_DB.CustomControllerConfigurations",
    "query" : {
        "$msg" : "query not recording (too large)"
    },
    "client" : "10.51.240.201:49947",
    "desc" : "conn"
},
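
If it helps, here is a small sketch that tags each active op (the
RUNNING/WAITING split is just my reading of the waitingForLock field
shown above):

// RUNNING ops are executing under the lock; WAITING ops are queued behind them
db.currentOp().inprog.forEach(function (op) {
    if (!op.active) return;
    var tag = op.waitingForLock ? "WAITING" : "RUNNING";
    print(tag + "  opid=" + op.opid + "  ns=" + op.ns + "  secs=" + op.secs_running);
});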





FYI:
> db.serverStatus()
{
    "version" : "1.6.3",
    "uptime" : 1182642,
    "uptimeEstimate" : 1171918,
    "localTime" : "Thu Nov 18 2010 13:39:40 GMT-0500 (EST)",
    "globalLock" : {
        "totalTime" : 1182641898687,
        "lockTime" : 261570843,
        "ratio" : 0.00022117501780581492,
        "currentQueue" : {
            "total" : 298,
            "readers" : 53,
            "writers" : 245
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 1227,
        "virtual" : 12730,
        "supported" : true,
        "mapped" : 9951
    },
    "connections" : {
        "current" : 313,
        "available" : 506
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 43844432,
        "page_faults" : 1516
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 534800,
            "hits" : 534800,
            "misses" : 0,
            "resets" : 0,
            "missRatio" : 0
        }
    },
    "backgroundFlushing" : {
        "flushes" : 19710,
        "total_ms" : 34659,
        "average_ms" : 1.758447488584475,
        "last_ms" : 1,
        "last_finished" : "Thu Nov 18 2010 13:39:00 GMT-0500 (EST)"
    },
    "cursros" : {
        "totalOpen" : 1,
        "clientCursors_size" : 1,
        "timedOut" : 0
    },
    "opcounters" : {
        "insert" : 4841785,
        "query" : 32359041,
        "update" : 425016,
        "delete" : 89562,
        "getmore" : 12172,
        "command" : 175539
    },
    "asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 7,
        "rollovers" : 0
    },
    "ok" : 1
}

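The currentQueue numbers above (total 298, writers 245) look like the
key symptom. A small sketch to watch just that field from another shell:

// print the global-lock queue every few seconds; Ctrl-C to stop
while (true) {
    var q = db.serverStatus().globalLock.currentQueue;
    print(new Date() + "  total=" + q.total + "  readers=" + q.readers + "  writers=" + q.writers);
    sleep(5000);   // sleep(ms) is a mongo shell helper
}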
Eliot Horowitz

Nov 18, 2010, 14:04:03
to mongod...@googlegroups.com
Try killing opid 37902768 with db.killOp().

RobV23

Nov 18, 2010, 14:36:37
to mongodb-user
> db.killOp(37902768)
{ "info" : "attempting to kill op" }

After running that, nothing seems to have changed. The status of the
op (as shown in db.currentOp()) is the same, and the system is still
unresponsive to db commands.

RobV23

Nov 18, 2010, 16:49:35
to mongodb-user
Another tidbit of info: I tried kill -2 and kill -15, but the process
ignored both signals. It did catch a kill -11 (SIGSEGV) and displayed
the stack trace as expected.

Once I restarted mongod, everything came up fine again.

Thanks for the tip on killing operations. While it didn't help here,
it could be really useful in the future.



RobV23

Dec 7, 2010, 14:29:50
to mongodb-user
Further updates: our server has hung several times this week and seems
to be getting more and more prone to hanging. One more observation is
that every hang shows this sequence in the log. (Of course, the
specific numbers are different, but the sequence of 'drop temp MR
table' followed by 'build index' seems consistent.)

Any more thoughts?


Tue Dec 7 19:06:15 [conn975] done for 0 records 0secs
Tue Dec 7 19:06:15 [conn975] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71369_inc
Tue Dec 7 19:06:15 [conn975] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71369
Tue Dec 7 19:06:15 [conn979] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370
Tue Dec 7 19:06:15 [conn979] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370_inc
Tue Dec 7 19:06:15 [conn979] building new index on { 0: 1 } for TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370_inc
Tue Dec 7 19:06:15 [conn979] done for 26 records 0secs
Tue Dec 7 19:06:15 [conn979] building new index on { _id: 1 } for TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370
Tue Dec 7 19:06:15 [conn979] done for 0 records 0secs
Tue Dec 7 19:06:15 [conn979] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370_inc
Tue Dec 7 19:06:15 [conn979] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71370
Tue Dec 7 19:06:15 [conn979] info DFM::findAll(): extent 0:5d00 was empty, skipping ahead TEST_ENTITY_DB.system.indexes
Tue Dec 7 19:06:15 [conn976] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71371
Tue Dec 7 19:06:15 [conn976] CMD: drop TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71371_inc
Tue Dec 7 19:06:15 [conn976] building new index on { 0: 1 } for TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71371_inc
Tue Dec 7 19:06:15 [conn976] done for 28 records 0secs
Tue Dec 7 19:06:15 [conn976] building new index on { _id: 1 } for TEST_ENTITY_DB.tmp.mr.mapreduce_1291748775_71371
Tue Dec 7 19:06:15 [conn976] done for 0 records 0secs
Tue Dec 7 19:06:17 [snapshotthread] cpu: elapsed:4000 writelock: 3%
Tue Dec 7 19:06:21 [snapshotthread] cpu: elapsed:4000 writelock: 0%



Eliot Horowitz

Dec 7, 2010, 18:25:52
to mongod...@googlegroups.com
How many collections do you have?
Can you try db.stats()?
Can you try deleting temp collections as soon as you're done with them?
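
For the second and third points, a minimal shell sketch (mr_results is
a hypothetical output collection name):

// database-level stats: collection count, object count, data/index sizes
printjson(db.stats());

// read an M/R result collection and drop it as soon as it has been consumed
db.mr_results.find().forEach(printjson);
db.mr_results.drop();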

RobV23

Dec 8, 2010, 10:07:16
to mongodb-user
Hi Eliot,

We have about 20 collections.

I did run db.serverStatus(), but didn't try db.stats(). Next time it
hangs, I'll be sure to try it.

We actually do delete the temp collections as soon as we are done with
them. On that note, do you know what is building the indexes during the
map/reduce? In the log, I see something building two indexes on the M/R
temp collections, but my code doesn't have any ensureIndex() calls.

Thanks,

Eliot Horowitz

Dec 8, 2010, 10:44:50
to mongod...@googlegroups.com
The map/reduce code itself is doing it.

Out of curiosity, how large are your m/r result sets?
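
For context, a bare-bones sketch of the kind of call involved (the
events collection and the map/reduce functions are hypothetical); the
tmp.mr.* collections and the { 0: 1 } and { _id: 1 } indexes seen in
the log are created by the server during this call, not by anything in
the client code:

// hypothetical map/reduce counting documents per "type" field
var res = db.events.mapReduce(
    function () { emit(this.type, 1); },                   // map
    function (key, values) { return Array.sum(values); }   // reduce
);
printjson(res);                          // res.result names the result collection
var out = db.getCollection(res.result);
out.find().forEach(printjson);
out.drop();                              // drop the result collection once consumed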

RobV23

Dec 8, 2010, 11:25:18
to mongodb-user
The result set is quite small, maybe a few hundred docs.

Interesting that M/R is doing the indexing. Do you think the indexing
is actually being done after the temp collection is dropped, or are
the events just being logged out of sequence?

iragsdale

Dec 8, 2010, 14:11:32
to mongodb-user
We have also seen a problem with 2 simultaneous map-reduce jobs
locking the server - I would avoid running more than one at a time if
possible:

https://jira.mongodb.org/browse/CS-216

This was happening regularly on our production server until we figured
it out and made sure we only run one M/R at a time. I don't think
anybody at 10gen has been able to reproduce it though.

- Ian

RobV23

Dec 8, 2010, 15:39:55
to mongodb-user
Interesting observation, Ian.

We are currently experiencing several lock-ups per day and can nearly
reproduce it at will.

I'm not sure how we could limit our queries to one M/R at a time, and
I'm not sure that would be practical for our application.
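
The only approach I can think of would be a lock document along these
lines, sketched in shell syntax (the mr_lock collection and its fields
are made up, and it assumes every M/R caller cooperates):

// one-time setup: a single lock document
db.mr_lock.save({ _id: "mr", locked: false });

// before each map/reduce: atomically claim the lock if it is free
db.mr_lock.update({ _id: "mr", locked: false }, { $set: { locked: true } });
if (db.getLastErrorObj().n === 1) {
    // we hold the lock; run the map/reduce here ...
    db.mr_lock.update({ _id: "mr" }, { $set: { locked: false } });   // release
} else {
    // another M/R is in flight; retry later
}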

BTW: Eliot, I tried to run a db.stats() during a lock-up and it did
not return. db.serverStatus() returned OK.

thanks

Eliot Horowitz

Dec 8, 2010, 17:27:34
to mongod...@googlegroups.com
During a lock-up, can you send the output of db.serverStatus() and db.currentOp()?
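
One way to capture both in one shot while it is hung, assuming the
shell can still connect (host, port, and file name are placeholders):

mongo localhost:27017/admin --eval "printjson(db.serverStatus()); printjson(db.currentOp())" > lockup.txt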
