2.6.3 MR $mergeCursors - Disk IO 98%-100%


Andre

Aug 1, 2014, 1:14:16 PM
to mongod...@googlegroups.com
I've been getting 100% disk IO quite a bit recently, and I believe I've narrowed it down to when an MR job crashes not so gracefully, or something of that nature. Either way, we sit at 100% disk IO for a while until I flip primaries or kill the op. I was thinking it was our recent upgrade, but I believe I've seen it in 2.4.9 as well. Anyway, the question is: what does this mean? I'm going to kill it, but I wanted to get more details.

Here is our current op; there's basically one of these on each shard.

query : { "aggregate" : "_stream", "pipeline" : [{ "$mergeCursors" : [{ "host" : "db1shard1b:27017", "id" : NumberLong("4475578602510") }, { "host" : "db1shard2b:27017", "id" : NumberLong("81942629521") }, { "host" : "db1shard3b:27017", "id" : NumberLong("95283487055") }] }, { "$group" : { "_id" : "$$ROOT._id", "count" : { "$sum" : "$$ROOT.count" }, "$doingMerge" : true } }], "allowDiskUse" : true }

These MR jobs run over a queried/filtered set of about 10 million rows from a 500-million+ document data set. The tasks are quite simple; they just format and aggregate, nothing complicated.

Second question: if a job is executed and cancelled before the MR task is complete, does the MR task keep running? If it does, ouch! My final question: is there a way to tie an MR task to a job, so we can kill long-running MRs based on inactive jobs?
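
(As an aside, here is a rough mongo shell sketch of how one might spot and kill these ops by hand. The match on the "_stream" aggregate and the 300-second cutoff are assumptions based on the op pasted above, not anything the server requires.)

// List in-progress ops on this node and kill long-running merge aggregations.
db.currentOp().inprog.forEach(function (op) {
    // Only ops that look like the $mergeCursors aggregate shown above (assumed shape).
    if (op.query && op.query.aggregate === "_stream" && op.secs_running > 300) {
        print("killing opid " + op.opid + " after " + op.secs_running + "s");
        db.killOp(op.opid);
    }
});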

Asya Kamsky

Aug 1, 2014, 5:32:29 PM
to mongodb-user
Andre,

The aggregation framework does NOT use MapReduce (so you can probably disregard my previous comment about unique counts).

You are doing counts/aggregations, and results are coming to this shard from multiple other shards. It needs to merge the results, and it needs to sort them to do that; it's using disk because that exceeds what it's allowed to hold in RAM.

If you provide the original use case (what exactly you need to compute) and the stats on the collection this is running over (including what indexes there are!), there might be a way to minimize the amount of work the merging process has to do to get the final result.
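
(For reference, a minimal shell sketch for gathering that information, assuming the "_stream" collection from the op above is the one in question:)

// Run from a mongos connected to the cluster, against the relevant database.
var coll = db.getCollection("_stream");   // collection name taken from the op above
printjson(coll.stats());                  // document count, data and index sizes (per shard when run via mongos)
printjson(coll.getIndexes());             // every index definition on the collection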

Asya

Andre

Aug 1, 2014, 7:57:40 PM
to mongod...@googlegroups.com
Hi Asya,

So that's why I'm confused: looking at the op that I sent, I assumed it was trying to merge based on those fields, in exactly the scenario you described. I did look at the cluster charts, and each shard was at about 5% memory utilization while this was going on. Each node has 72GB, so it has plenty of RAM and definitely did not exceed its limit, or even come close.

I've been doing process of elimination in prod, since I don't have that size of data in dev, and I believe we are facing two problems.

1. The aggregation framework is worse in 2.6 than it used to be. These particular jobs run every minute and execute about 100 aggregation tasks, and that has not changed in years. When I stop them, the ops disappear after a few minutes and everything is back to normal; start them again and it's 90%+ disk IO.

2. When you forcefully stop an MR job, it doesn't immediately recover and release resources. It could just be one heck of a coincidence that this is because of #1, so #1 needs to be resolved to determine whether it's the main culprit behind all our recent problems.

NOTE: I had allowDiskUse=true at first when upgrading to 2.6; it's now false, and that made no difference at all. Why is this a problem now? I've never seen it before in 2.4, and this thing has been running for ages.

Why is it not using memory? All the shards are just lingering at 2.3% RAM utilization.

I'm unable to give you any specific queries, because these are hundreds of jobs doing different things across multiple collections and databases. Our largest stream db/collection contains an arbitrary number of events and attributes, so we are not able to index those; the only index is on the date range, since that's the only required field. But it used to work.

The aggregations are taking longer; they used to run in under a minute, now it's a few minutes, and it compounds.


Andre

Asya Kamsky

Aug 1, 2014, 8:26:26 PM
to mongodb-user
Interesting. Here are some points of information:

allowDiskUse=true is needed if you are going to sort something bigger than 100MB in memory. This is a somewhat arbitrary limit to keep a single aggregation query from hogging RAM from the rest of the server (sure, you have 72GB, but you still want to keep your indexes and hot data in it).
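
(As a concrete illustration, the option goes in the second argument to aggregate(). The pipeline here is only a placeholder modeled on the merge op earlier in the thread, not the real job, and the filter fields are made up.)

db.getCollection("_stream").aggregate(
    [
        { $match: { "dto.ymd": { $gte: 20140730, $lte: 20140806 } } },   // placeholder filter
        { $group: { _id: "$_id", count: { $sum: "$count" } } }           // mirrors the merge-side group shown earlier
    ],
    { allowDiskUse: true }   // lets sort/group stages spill to disk instead of failing at the 100MB cap
);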

Depending on which version of 2.6.x you are on, there were a couple of bugs specifically with MapReduce and replication. Which version are you currently on?

How are you forcefully stopping the MR jobs?

Asya



Andre

Aug 1, 2014, 9:34:36 PM
to mongod...@googlegroups.com
We are currently on 2.6.3, upgraded last week from 2.4.9.

The majority of the reports that get created end up being tens of thousands of rows, although they start with millions, and they average about 8 MB on disk when exported.

Stopping an MR usually means the job itself crashing, or us killing the process; I'm not sure what happens to the MR task in that scenario.

It's obvious that the aggregation framework will be a better option than MR, but what changed? Why are all these jobs now running slowly or periodically hitting 100% disk IO? We also updated the .NET library so we can use the new 2.6.3 features.

The good news is that it is not crashing the systems, it's just not running efficiently anymore. It only compounds as jobs get added; it feels like our SSDs and memory are useless.

Asya Kamsky

Aug 2, 2014, 5:38:07 AM
to mongodb-user
I wonder if you are running into one of the bugs in the query optimizer. Do you have a test environment where you can try our 2.6 nightly build? 2.6.4 will be out early next week, and it's possible it'll improve things if you're running into this issue.

Asya



Andre

Aug 2, 2014, 4:36:49 PM
to mongod...@googlegroups.com
Yes, I do have a development environment, just not with the same size of data, so I'm not sure I'll be able to reproduce it.

Andre.

Asya Kamsky

Aug 2, 2014, 7:07:02 PM
to mongod...@googlegroups.com
I strongly suspect an index selection bug, especially if you have multiple similar indexes that are being considered for various queries, and especially since you didn't see this problem on 2.4 and are seeing it on 2.6.

Asya

Andre

Aug 3, 2014, 3:29:55 PM
to mongod...@googlegroups.com
Asya,

Well, that would make sense and would explain the 100% disk IO. Anyway, I've been trying a few things; the last thing I tried was changing those queries to read only from secondaries, since we now have zero seconds of lag, and that seems to fix the problem. Disk usage averages about 30% now, with 100+ jobs running for the last 12+ hours.

That being said, any ideas on what I need to do? It seems we just can't read from primaries, and I don't know why secondaries fix the problem, but I'm afraid that if we ever fail over we will run into the same problem again. Or maybe it's just a read preference issue in the .NET library, but then we would have seen that in dev.

Even with that fix, our cluster is just not as calm as it used to be on 2.4; it seems a lot busier and more active, especially on disk IO. It's like it's assuming there is no memory: RAM just sits idle at 5% while the disks are being pounded.
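
(For ad-hoc testing of that theory from the shell, a read preference can be set like this; a sketch only, not the .NET driver syntax, and the filter is hypothetical.)

// Shell-wide default for the current session:
db.getMongo().setReadPref("secondaryPreferred");

// Or per cursor, for a single query:
db.getCollection("_stream").find({ appid: ObjectId("51a6014e240232243cb5ac76") }).readPref("secondary");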

Andre

Asya Kamsky

Aug 4, 2014, 3:18:29 AM
to mongodb-user
Andre,

You are possibly only seeing a random change in the index selected. The nature of the bug I'm thinking of is that it will choose the right index some of the time, but the wrong (partially appropriate but not best performing) index some of the time.

This means that at various times some of your queries may get slow and cause lots of disk IO or CPU load, but it may not be all the time. I strongly recommend upgrading to 2.6.4 as soon as it's released and noting how much that helps. If it helps a lot, as I think it might, then you should probably make sure you return to reading from primaries, as there are other issues with reading from secondaries (other than possible lag).
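
(One way to catch which plan has been cached at a given moment is the 2.6 plan-cache helpers, run directly against the shard's mongod. The query shape below is only an illustrative guess, an appid plus date-range filter on the "_stream" collection.)

var coll = db.getCollection("_stream");

// Query shapes that currently have cached plans on this mongod.
printjson(coll.getPlanCache().listQueryShapes());

// Cached plans (including the winning index) for one particular shape.
printjson(coll.getPlanCache().getPlansByQuery({
    appid: ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd": { $gte: 20140730, $lte: 20140806 }
}));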

Asya

Andre

Aug 4, 2014, 11:21:10 AM
to mongod...@googlegroups.com
Thanks Asya, I'll keep an eye on 2.6.4

Andre

Aug 5, 2014, 11:03:43 AM
to mongod...@googlegroups.com
Asya,


A couple of other things I found interesting on this subject, since the problem is now back: 1. It's always the same shard. 2. It's always the primary, even though secondary read preference is set in the driver.


Andre

Andre

Aug 6, 2014, 4:49:31 AM
to mongod...@googlegroups.com
Asya,

More details: I took a very simple query that is 100% covered by an index. If I run explain through the aggregation framework, I get "InternalError No plan available to provide stats", so that's a new problem. Anyway, I used the same query again, but this time as a find in the shell, to eliminate all other suspects. I believe I found a weird result that may confirm your thoughts.

So I tested this on a 3-shard cluster, and here is what I get:

{
        "clusteredType" : "ParallelSort",
        "shards" : {
                "db1shard1/db1shard1a:27017,db1shard1b:27017" : [
                        {
                                "cursor" : "BtreeCursor appid_1_dto.ymd_1_c.ev.n_1",
                                "isMultiKey" : false,
                                "n" : 434716,
                                "nscannedObjects" : 434716,
                                "nscanned" : 434716,
                                "nscannedObjectsAllPlans" : 435019,
                                "nscannedAllPlans" : 435023,
                                "scanAndOrder" : false,
                                "indexOnly" : false,
                                "nYields" : 3398,
                                "nChunkSkips" : 0,
                                "millis" : 3536,
                                "indexBounds" : {
                                        "appid" : [
                                                [
                                                        ObjectId("51a6014e240232243cb5ac76"),
                                                        ObjectId("51a6014e240232243cb5ac76")
                                                ]
                                        ],
                                        "dto.ymd" : [
                                                [
                                                        20140730,
                                                        20140806
                                                ]
                                        ],
                                        "c.ev.n" : [
                                                [
                                                        {
                                                                "$minElement" : 1
                                                        },
                                                        {
                                                                "$maxElement" : 1
                                                        }
                                                ]
                                        ]
                                },
                                "server" : "db1shard1b:27017",
                                "filterSet" : false
                        }
                ],
                "db1shard2/db1shard2a:27017,db1shard2b:27017" : [
                        {
                                "cursor" : "BtreeCursor appid_1_dto.ymd_1_c.ev.n_1",
                                "isMultiKey" : false,
                                "n" : 435349,
                                "nscannedObjects" : 435349,
                                "nscanned" : 435349,
                                "nscannedObjectsAllPlans" : 435652,
                                "nscannedAllPlans" : 435656,
                                "scanAndOrder" : false,
                                "indexOnly" : false,
                                "nYields" : 3404,
                                "nChunkSkips" : 0,
                                "millis" : 3595,
                                "indexBounds" : {
                                        "appid" : [
                                                [
                                                        ObjectId("51a6014e240232243cb5ac76"),
                                                        ObjectId("51a6014e240232243cb5ac76")
                                                ]
                                        ],
                                        "dto.ymd" : [
                                                [
                                                        20140730,
                                                        20140806
                                                ]
                                        ],
                                        "c.ev.n" : [
                                                [
                                                        {
                                                                "$minElement" : 1
                                                        },
                                                        {
                                                                "$maxElement" : 1
                                                        }
                                                ]
                                        ]
                                },
                                "server" : "db1shard2b:27017",
                                "filterSet" : false
                        }
                ],
                "db1shard3/db1shard3a:27017,db1shard3b:27017" : [
                        {
                                "cursor" : "BtreeCursor appid_1_c.ev.attr._oid_1",
                                "isMultiKey" : false,
                                "n" : 346931,
                                "nscannedObjects" : 11072023,
                                "nscanned" : 11072023,
                                "nscannedObjectsAllPlans" : 11072322,
                                "nscannedAllPlans" : 11072326,
                                "scanAndOrder" : false,
                                "indexOnly" : false,
                                "nYields" : 611011,
                                "nChunkSkips" : 0,
                                "millis" : 668978,
                                "indexBounds" : {
                                        "appid" : [
                                                [
                                                        ObjectId("51a6014e240232243cb5ac76"),
                                                        ObjectId("51a6014e240232243cb5ac76")
                                                ]
                                        ],
                                        "c.ev.attr._oid" : [
                                                [
                                                        {
                                                                "$minElement" : 1
                                                        },
                                                        {
                                                                "$maxElement" : 1
                                                        }
                                                ]
                                        ]
                                },
                                "server" : "db1shard3b:27017",
                                "filterSet" : false
                        }
                ]
        },
        "cursor" : "multiple",
        "n" : 1216996,
        "nChunkSkips" : 0,
        "nYields" : 617813,
        "nscanned" : 11942088,
        "nscannedAllPlans" : 11943005,
        "nscannedObjects" : 11942088,
        "nscannedObjectsAllPlans" : 11942993,
        "millisShardTotal" : 676109,
        "millisShardAvg" : 225369,
        "numQueries" : 3,
        "numShards" : 3,
        "millis" : 668992
}


Do you see what I'm seeing? db1shard3b is using a completely different index. This is a sharded collection, by the way, and I have confirmed that the index does exist on all shards. These queries are usually executed by MR, but find and the aggregation framework have the same problem, so it looks like the bug is in the core of everything. I also downloaded the 2.6.* and 2.7.* builds; neither seemed to solve the problem, but again, my dev setup is different from prod.
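
(One interim workaround to try, sketched below: pin the index with hint() so shard3 cannot fall back to the partial one. The values are taken from the explain above, and the collection name is assumed to be the "_stream" collection.)

db.getCollection("_stream").find({
    appid: ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd": { $gte: 20140730, $lte: 20140806 }
}).hint({ appid: 1, "dto.ymd": 1, "c.ev.n": 1 })   // force appid_1_dto.ymd_1_c.ev.n_1 on every shard
  .explain();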

Any info you can provide will be greatly appreciated; we are stranded with some really long queries.

Andre

Asya Kamsky

Aug 8, 2014, 11:23:31 PM
to mongodb-user
There is definitely a bug fixed for 2.6.4 that would cause the wrong index to get picked. It wouldn't happen all the time, because the good and bad indexes would tie during the race-off and one of them would get picked; if it's the wrong one, you're stuck with a bad plan for a while. I understand the challenge of testing this without an environment whose workload is similar to production. It does seem like the same bug; you can see the description of it in SERVER-14311.
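
(Until 2.6.4 lands, a possible stopgap, sketched here, is to flush the stuck cached plan so the planner re-races the candidate indexes. The collection name and query shape are assumptions based on the explain earlier in the thread.)

var coll = db.getCollection("_stream");   // run against the affected shard's mongod

// Drop every cached plan for the collection...
coll.getPlanCache().clear();

// ...or only the plans for the offending query shape.
coll.getPlanCache().clearPlansByQuery({
    appid: ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd": { $gte: 20140730, $lte: 20140806 }
});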

Asya



Andre

Aug 8, 2014, 11:50:31 PM
to mongod...@googlegroups.com
Yep, you are right. I forced a sort, which forced that index, and my queries now run normally. I'll upgrade as soon as 2.6.4 is available, as I noticed it got pushed out.
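
(Presumably something along these lines, where sorting on the index's leading fields steers the planner to that index; an illustrative sketch rather than the exact production query.)

db.getCollection("_stream").find({
    appid: ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd": { $gte: 20140730, $lte: 20140806 }
}).sort({ appid: 1, "dto.ymd": 1 });   // sort on the prefix of appid_1_dto.ymd_1_c.ev.n_1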