After a restart, MongoDB eventually uses 100% CPU with a very high lock percentage (almost weekly)

Mark Quezada

Feb 21, 2012, 7:07:51 PM2/21/12
to mongod...@googlegroups.com
I inherited a MongoDB-backed app from another developer. I'm trying to pinpoint an issue where MongoDB usually runs fine but eventually hits 100% CPU with a very high lock percentage (above 90%). At that point it starts timing out on client requests and has to be restarted; it then slowly creeps back up to the same state and has to be restarted again.

As far as I understand (which could be totally wrong), it's a minimal app and dataset compared to what MongoDB can handle, so I'm puzzled as to why this is happening. It's MongoDB 2.0.2 running on a 4 GB Ubuntu 10.04 LTS VM hosted on Rackspace Cloud.

Here are some stats I collected. Unfortunately I just restarted it a few days ago, so these don't show the server in the problem state.

> db.stats()
{
    "db" : "switchboard",
    "collections" : 5,
    "objects" : 89029,
    "avgObjSize" : 38503.869211155914,
    "dataSize" : 3427960972,
    "storageSize" : 3838152704,
    "numExtents" : 27,
    "indexes" : 3,
    "indexSize" : 10972192,
    "fileSize" : 8519680000,
    "nsSizeMB" : 16,
    "ok" : 1
}

I turned on profiling and looked for operations slower than 100ms:

> db.system.profile.find( { millis : { $gt : 100 } } )
{ "ts" : ISODate("2012-02-21T23:40:42.810Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "50b9713f-dc7d-4af4-81f5-4ce965ff908a" }, "updateobj" : { "$push" : { "log" : { "time" : 1329867642, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "moved" : true, "millis" : 108, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-21T23:56:18.385Z"), "op" : "command", "ns" : "switchboard.$cmd", "command" : { "dbstats" : 1, "scale" : null }, "ntoreturn" : 1, "responseLength" : 222, "millis" : 142, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-21T23:56:50.056Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "6e12996f-c3ea-4e89-9788-509e3f99a7fe" }, "updateobj" : { "$push" : { "log" : { "time" : 1329868609, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 369, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T00:01:21.210Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "a9b2b4a4-b106-41fb-bee2-722d4f7f3ffc" }, "updateobj" : { "$push" : { "log" : { "time" : 1329868880, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 180, "client" : "127.0.0.1", "user" : "" }

Running an explain on one of the slow queries, here's what I saw:

> db.calls.find({ "sid" : "4be90902-483e-4d09-b9dd-7b7602888c6c" }).explain()
{
    "cursor" : "BtreeCursor sid_1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "sid" : [
            [
                "4be90902-483e-4d09-b9dd-7b7602888c6c",
                "4be90902-483e-4d09-b9dd-7b7602888c6c"
            ]
        ]
    }
}

Here's db.serverStatus(): http://pastebin.com/VXTjkkR3
Here's a screenshot of the MMS dashboard. (I just installed it so not a huge amount of data in there yet.): http://cl.ly/2s0M38072K3Y342J2U0F

If anyone could help push me in the right direction to track the root cause of this down, it would be greatly appreciated.

Nat

Feb 21, 2012, 7:41:09 PM2/21/12
to mongod...@googlegroups.com
Your problem is likely around the update statements that require the object to be moved. Can you post the part of your mongod log from when CPU usage is high, along with db.calls.stats()?
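
Also, since you already have profiling on, a quick way to see which updates are forcing document moves (just a sketch using the "moved" field that shows up in your profiler output):

> db.system.profile.find({ "op" : "update", "moved" : true }).sort({ ts : -1 })   // updates that forced a document move
> db.system.profile.count({ "moved" : true })                                     // how many moves so far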

Mark Quezada

Feb 21, 2012, 8:13:52 PM2/21/12
to mongod...@googlegroups.com
Almost forgot. Here's a mongostat output sample:

Mark Quezada

Feb 21, 2012, 8:16:51 PM2/21/12
to mongod...@googlegroups.com, nat....@gmail.com
> db.calls.stats()
{
    "ns" : "switchboard.calls",
    "count" : 71159,
    "size" : 3400948956,
    "avgObjSize" : 47793.65865175171,
    "storageSize" : 3799601152,
    "numExtents" : 17,
    "nindexes" : 2,
    "lastExtentSize" : 813023232,
    "paddingFactor" : 1.2799999999179976,
    "flags" : 1,
    "totalIndexSize" : 9778496,
    "indexSizes" : {
        "_id_" : 4153408,
        "sid_1" : 5625088
    },
    "ok" : 1
}


The MongoDB process has crept up to about 25% CPU utilization, but based on what I've seen previously it probably won't hit 100% for a few more days.

Mark Quezada

Feb 22, 2012, 12:35:19 PM2/22/12
to mongod...@googlegroups.com
Here's an updated screenshot of the MMS stats: http://cl.ly/3D10442U1e2z3v3G2a0M

You can see that over the last 24 hours the lock percentage has been steadily increasing, and then it leveled off for a bit. Any ideas?

Scott Hernandez

Feb 22, 2012, 2:32:52 PM2/22/12
to mongod...@googlegroups.com
Can you turn on the db profiler and collect slow operations? I suspect you have unindexed operations which are causing this problem.
http://www.mongodb.org/display/DOCS/Database+Profiler
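
For reference, enabling it from the shell looks roughly like this (level 1 records only operations slower than the threshold; 100ms here is just an example value):

> use switchboard
> db.setProfilingLevel(1, 100)   // profile operations slower than 100ms into system.profile
> db.getProfilingLevel()         // should now return 1
> db.system.profile.find({ millis : { $gt : 100 } }).sort({ ts : -1 })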


Mark Quezada

Feb 22, 2012, 3:27:28 PM2/22/12
to mongod...@googlegroups.com
Here's the profiler info for queries that took more than 25ms:

> db.system.profile.find( { millis : { $gt : 25 } } )
{ "ts" : ISODate("2012-02-22T20:12:58.665Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "835ad946-82e5-4ac9-81c2-fd10f140a1ae" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 31, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:58.725Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "da475ca6-997e-4d70-85d2-b721d040ffba" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 56, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:58.765Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "50b9713f-dc7d-4af4-81f5-4ce965ff908a" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 39, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:58.855Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "a9b2b4a4-b106-41fb-bee2-722d4f7f3ffc" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 26, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:58.909Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "865c2b65-d494-42ac-a441-1747c863d96b" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 37, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:58.938Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "a0bac22c-d277-4280-8a9e-3c07fc47618f" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 28, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.051Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 48, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.090Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.140Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 50, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.181Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 40, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.233Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 52, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.273Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.323Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 49, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.362Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.413Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 50, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.504Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.573Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "3af5617b-7823-46d3-be6c-01b281840ba7" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "exception" : "$ operator made object too large", "exceptionCode" : 12522, "millis" : 68, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.613Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "4be90902-483e-4d09-b9dd-7b7602888c6c" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.666Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "31140c7f-4e18-425a-948b-754325e48271" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 33, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:12:59.705Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "6e12996f-c3ea-4e89-9788-509e3f99a7fe" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941578, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 38, "client" : "127.0.0.1", "user" : "" }
has more

And more than 100ms:

> db.system.profile.find( { millis : { $gt : 100 } } )
{ "ts" : ISODate("2012-02-22T20:16:46.399Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "6e12996f-c3ea-4e89-9788-509e3f99a7fe" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941805, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 577, "client" : "127.0.0.1", "user" : "" }
{ "ts" : ISODate("2012-02-22T20:18:50.290Z"), "op" : "update", "ns" : "switchboard.calls", "query" : { "sid" : "4e974968-c11d-44db-b843-96f1d89c7053" }, "updateobj" : { "$push" : { "log" : { "time" : 1329941928, "type" : "play", "message" : { "file" : "/home/switchboard/audio/ringing.wav" } } } }, "nscanned" : 1, "millis" : 915, "client" : "127.0.0.1", "user" : "" }

Notice "nscanned" is 1 so it looks like it's using an index. 

Here's db.calls.stats() again:

> db.calls.stats()
{
    "ns" : "switchboard.calls",
    "count" : 73036,
    "size" : 3489527572,
    "avgObjSize" : 47778.18571663289,
    "storageSize" : 3799601152,
    "numExtents" : 17,
    "nindexes" : 2,
    "lastExtentSize" : 813023232,
    "paddingFactor" : 1.3899999999157062,
    "flags" : 1,
    "totalIndexSize" : 10023776,
    "indexSizes" : {
        "_id_" : 4259696,
        "sid_1" : 5764080
    },
    "ok" : 1
}

Notice that the padding factor has gone way up since the last time I posted it. We use a "log" on each call to record events that happen to that record, and as far as I can tell it's being saved as a sub-document on the call itself. I'm wondering if that sub-document is growing way too large.
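
As a rough check (the sid below is just one pulled from the profiler output above), I'm going to look at how big an individual call document has gotten:

> var doc = db.calls.findOne({ "sid" : "4be90902-483e-4d09-b9dd-7b7602888c6c" })
> Object.bsonsize(doc)   // total BSON size of the call document, in bytes
> doc.log.length         // number of log entries pushed onto this call so far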

Scott Hernandez

Feb 22, 2012, 3:52:04 PM2/22/12
to mongod...@googlegroups.com
Yes, that is probably the issue. You are growing the docs and that is
causing extra IO/processing. You went from an avg object size of 38K
to 47K between those two stats.
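
Rough arithmetic with the numbers you posted (back-of-the-envelope only):

> 47793 / 38503   // ≈ 1.24, so the average object has grown about 24% between the two sets of stats
> 47793 * 1.39    // ≈ 66K bytes allocated per record once the 1.39 padding factor is applied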


Mark Quezada

Feb 22, 2012, 4:07:58 PM2/22/12
to mongod...@googlegroups.com
Thanks for the help tracking this down. I'm relatively new to mongo. (I inherited this app from the previous developer.) 

After talking with him, it looks like the original thinking was that if it was all just one document, you could pull the whole call info along with its logs at the same time. I guess the problem is that the logs vary greatly in size from call to call.

What would you recommend going forward? From what I've read, one fix would be to manually pad the object before saving the first time. Or would it be better to break out each log entry into its own document? (I'm not even positive that that would actually help.)
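
For the padding idea, something like this is what I had in mind (just a sketch; the sid, field name, and filler size are made up). As I understand it, unsetting the filler afterwards leaves the extra space allocated to the record on disk, so later $push updates have room to grow without moving the document:

> db.calls.insert({
      "sid" : "example-sid",                      // made-up sid, for illustration only
      "log" : [],
      "_padding" : new Array(10241).join("x")     // ~10KB of throw-away filler; size is a guess
  })
> db.calls.update({ "sid" : "example-sid" }, { $unset : { "_padding" : 1 } })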

Scott Hernandez

Feb 22, 2012, 4:10:03 PM2/22/12
to mongod...@googlegroups.com
With logs it is generally better to store them in chunks that are immutable. A chunk can hold a single entry, or a group of entries, if that grouping matches the way you will read or use them.

What is the use case when you read them?


Mark Quezada

Feb 22, 2012, 4:21:41 PM2/22/12
to mongod...@googlegroups.com
For now the logging is really only used for quality assurance, so I wouldn't consider it "critical" data that needs to be included in the call itself. It tracks every event that happens on a phone call (playing different messages, receiving key presses, when people hang up, etc.) so that if something goes wrong, we can view the log to determine on which side the error occurred.

When you say immutable chunks... is that a specific MongoDB feature? A quick Google search didn't turn up much.

Would a better way to do this be to just move those log entries out of the sub-document and into their own log collection, keyed off the call id?

Scott Hernandez

Feb 22, 2012, 6:50:45 PM2/22/12
to mongod...@googlegroups.com
On Wed, Feb 22, 2012 at 4:21 PM, Mark Quezada <ma...@mirthlab.com> wrote:
> For now the logging is really only used for quality assurance, so I wouldn't
> consider it "critical" data that needs to be included in the call itself. It
> tracks every event that happens on a phone call (playing different messages,
> receiving key presses, when people hang up, etc.) so that if something goes
> wrong, we can view the log to determine on which side the error occurred.
>
> When you say immutable chunks... is that a specific MongoDB feature? A quick
> Google search didn't turn up much.

By immutable I simply meant that they don't constantly change;
immutable isn't the correct word.

> Would a better way to do this be to just move those log entries out of the
> sub-document and into their own log collection, keyed off the call id?

Yes, I'd suggest trying that and seeing how it goes. It should change the write behavior a bit.
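
As a rough sketch of what that could look like (the collection name is just an example; the values are taken from your profiler output):

> db.call_logs.insert({                        // one small, write-once document per log event
      "sid" : "4be90902-483e-4d09-b9dd-7b7602888c6c",
      "time" : 1329941578,
      "type" : "play",
      "message" : { "file" : "/home/switchboard/audio/ringing.wav" }
  })
> db.call_logs.ensureIndex({ "sid" : 1 })      // keep reads of a single call's log indexed
> db.call_logs.find({ "sid" : "4be90902-483e-4d09-b9dd-7b7602888c6c" }).sort({ "time" : 1 })

That way each write is a small insert and nothing grows in place.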

