More details: I took a very simple query that is 100% covered by an index. If I run explain through the aggregation framework (aggrF), I get "InternalError No plan available to provide stats", which is a new problem in itself. So I ran the same query again, this time as a find in the shell, to eliminate all other suspects. I believe I found a weird result that may confirm your thoughts.
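For reference, the shell query was shaped roughly like the following. This is my reconstruction from the indexBounds in the output below, not the literal command; the collection name is a placeholder and the exact range operators may differ:

// "events" is a hypothetical collection name
db.events.find({
    "appid" : ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd" : { "$gte" : 20140730, "$lte" : 20140806 }
}).explain()

And here is the explain output it produced: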
{
    "clusteredType" : "ParallelSort",
    "shards" : {
        "db1shard1/db1shard1a:27017,db1shard1b:27017" : [
            {
                "cursor" : "BtreeCursor appid_1_dto.ymd_1_c.ev.n_1",
                "isMultiKey" : false,
                "n" : 434716,
                "nscannedObjects" : 434716,
                "nscanned" : 434716,
                "nscannedObjectsAllPlans" : 435019,
                "nscannedAllPlans" : 435023,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 3398,
                "nChunkSkips" : 0,
                "millis" : 3536,
                "indexBounds" : {
                    "appid" : [
                        [
                            ObjectId("51a6014e240232243cb5ac76"),
                            ObjectId("51a6014e240232243cb5ac76")
                        ]
                    ],
                    "dto.ymd" : [
                        [
                            20140730,
                            20140806
                        ]
                    ],
                    "c.ev.n" : [
                        [
                            {
                                "$minElement" : 1
                            },
                            {
                                "$maxElement" : 1
                            }
                        ]
                    ]
                },
                "server" : "db1shard1b:27017",
                "filterSet" : false
            }
        ],
        "db1shard2/db1shard2a:27017,db1shard2b:27017" : [
            {
                "cursor" : "BtreeCursor appid_1_dto.ymd_1_c.ev.n_1",
                "isMultiKey" : false,
                "n" : 435349,
                "nscannedObjects" : 435349,
                "nscanned" : 435349,
                "nscannedObjectsAllPlans" : 435652,
                "nscannedAllPlans" : 435656,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 3404,
                "nChunkSkips" : 0,
                "millis" : 3595,
                "indexBounds" : {
                    "appid" : [
                        [
                            ObjectId("51a6014e240232243cb5ac76"),
                            ObjectId("51a6014e240232243cb5ac76")
                        ]
                    ],
                    "dto.ymd" : [
                        [
                            20140730,
                            20140806
                        ]
                    ],
                    "c.ev.n" : [
                        [
                            {
                                "$minElement" : 1
                            },
                            {
                                "$maxElement" : 1
                            }
                        ]
                    ]
                },
                "server" : "db1shard2b:27017",
                "filterSet" : false
            }
        ],
        "db1shard3/db1shard3a:27017,db1shard3b:27017" : [
            {
                "cursor" : "BtreeCursor appid_1_c.ev.attr._oid_1",
                "isMultiKey" : false,
                "n" : 346931,
                "nscannedObjects" : 11072023,
                "nscanned" : 11072023,
                "nscannedObjectsAllPlans" : 11072322,
                "nscannedAllPlans" : 11072326,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 611011,
                "nChunkSkips" : 0,
                "millis" : 668978,
                "indexBounds" : {
                    "appid" : [
                        [
                            ObjectId("51a6014e240232243cb5ac76"),
                            ObjectId("51a6014e240232243cb5ac76")
                        ]
                    ],
                    "c.ev.attr._oid" : [
                        [
                            {
                                "$minElement" : 1
                            },
                            {
                                "$maxElement" : 1
                            }
                        ]
                    ]
                },
                "server" : "db1shard3b:27017",
                "filterSet" : false
            }
        ]
    },
    "cursor" : "multiple",
    "n" : 1216996,
    "nChunkSkips" : 0,
    "nYields" : 617813,
    "nscanned" : 11942088,
    "nscannedAllPlans" : 11943005,
    "nscannedObjects" : 11942088,
    "nscannedObjectsAllPlans" : 11942993,
    "millisShardTotal" : 676109,
    "millisShardAvg" : 225369,
    "numQueries" : 3,
    "numShards" : 3,
    "millis" : 668992
}
Do you see what I'm seeing? db1shard3b is using a completely different index. This is a sharded collection, by the way, and I have confirmed that the index does exist on all shards. These queries are usually executed by MapReduce (MR), but find and the aggregation framework show the same problem, so it looks like the bug is in the core query machinery rather than in any one interface. I also downloaded 2.6.* and 2.7.*; neither seemed to solve the problem, but again my dev setup differs from prod.
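As a possible stopgap (I have not verified this in prod, and the collection name below is the same placeholder as above), forcing the intended index from the shell with hint() by index name would look roughly like this:

// pin the plan to the appid/dto.ymd/c.ev.n index so shard3 cannot pick the other one
db.events.find({
    "appid" : ObjectId("51a6014e240232243cb5ac76"),
    "dto.ymd" : { "$gte" : 20140730, "$lte" : 20140806 }
}).hint("appid_1_dto.ymd_1_c.ev.n_1")

That only works for find, though; it doesn't help the MR jobs, which is why I'd still like to understand why the planner on db1shard3b chooses differently.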
Any info you can provide will be greatly appreciated; we are stranded with some really long-running queries.