Multi-key, compound index performance issues


Brian Adkins

Jan 4, 2012, 11:57:09 AM
to mongodb-user
I've noticed a performance regression after upgrading from 1.6.5 to
2.0.2, and I'm having some difficulty pinpointing the precise problem.

I have a 43-million-record database of documents similar to a contact
record. The name field is an array of subdocuments containing keys for
last name (n.l), first name (n.f), last name soundex (n.ls), and first
name soundex (n.fs).

1) After upgrading to 2.0.2, mongod seems to have trouble loading
indexes into RAM and keeping them there. My RAM size is 16 GB and my
total index size is ~7.5 GB. I learned early on that I needed to
preload indexes into RAM to get any reasonable performance. With
1.6.5, I ran a $ne query with count for each of the indexes, but that
was taking so long after upgrading to 2.0.2 that I switched to this:

def self.precache_indexes
  # Walk the s4 index by counting matches for each leading digit
  10.times do |i|
    MongoMgr.extracts_collection.find({ 's4' => /^#{i}/ }).count
  end

  # Walk the dl index the same way
  10.times do |i|
    MongoMgr.extracts_collection.find({ 'dl' => /^#{i}/ }).count
  end

  # Walk the n.l_1_n.f_1 index by counting each leading letter
  ('A'..'Z').each do |c|
    MongoMgr.extracts_collection.find({ 'n.l' => /^#{c}/ }).count
  end
end
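
In the shell, a comparable trick is to force a full scan of each index
with hint(). This is only a sketch: it reads every index page, but it
also touches the document pages, since a scan of a multikey index
can't be covered.

// Hint each index by name and iterate the whole cursor so that
// every page of the index is read into the OS file cache.
['dl_1', 's4_1', 'n.l_1_n.f_1', 'n.ls_1_n.fs_1'].forEach(function (name) {
    db.extracts.find().hint(name).itcount();
});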

I've also noticed an occasional, inexplicable drop in RAM usage (e.g.
from 12 GB down to 2 GB). It then grows back very slowly, and in the
meantime performance is terrible.


2) I've noticed that explain shows only the last name being used
(instead of last name and first name) when querying with both last &
first names. For some common last names, this results in scanning
nearly 100,000 documents. It's *possible* that this was also the case
with 1.6.5, but if not, then it's likely the source of my performance
regression. If it was the same in 1.6.5, then 1.6.5 must be much
better at keeping indexes in RAM, because its performance was
excellent.

Here are my indexes:

db.system.indexes.find({ns : 'nc1.extracts'})
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "nc1.extracts", "name" : "_id_" }
{ "v" : 1, "key" : { "dl" : 1 }, "ns" : "nc1.extracts", "name" : "dl_1" }
{ "v" : 1, "key" : { "s4" : 1 }, "ns" : "nc1.extracts", "name" : "s4_1" }
{ "v" : 1, "key" : { "n.l" : 1, "n.f" : 1 }, "ns" : "nc1.extracts", "name" : "n.l_1_n.f_1" }
{ "v" : 1, "key" : { "n.ls" : 1, "n.fs" : 1 }, "ns" : "nc1.extracts", "name" : "n.ls_1_n.fs_1" }

and here is an explain:

db.extracts.find( { n : { $elemMatch : { l : 'ADKINS', f : 'GEORGE' }}}).explain()
{
    "cursor" : "BtreeCursor n.l_1_n.f_1",
    "nscanned" : 14778,
    "nscannedObjects" : 14778,
    "n" : 0,
    "millis" : 16978,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : {
        "n.l" : [
            [
                "ADKINS",
                "ADKINS"
            ]
        ],
        "n.f" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    }
}

I have two questions:

1) What is the most efficient/effective way to get mongod to load all
indexes into RAM? Waiting for them to be cached as they are used
results in unacceptable performance (multi-minute response times).

2) Given that my documents contain an array of name documents with
last, first, last_soundex, and first_soundex fields, how can I query
so that the full index (last & first) is used instead of only (last)?
In other words, in the explain() above, I need the n.f bound to be
'GEORGE' instead of $minElement/$maxElement.

Thanks,
Brian


Here is some more diagnostic info:


> version()
version: 2.0.2


> db.version()
2.0.2

> db.stats()
{
    "db" : "nc1",
    "collections" : 8,
    "objects" : 43529343,
    "avgObjSize" : 354.8403949032725,
    "dataSize" : 15445969260,
    "storageSize" : 17605328848,
    "numExtents" : 50,
    "indexes" : 18,
    "indexSize" : 7524021232,
    "fileSize" : 29984030720,
    "nsSizeMB" : 16,
    "ok" : 1
}

> db.extracts.stats()
{
    "ns" : "nc1.extracts",
    "count" : 43528842,
    "size" : 15444986452,
    "avgObjSize" : 354.8219006607159,
    "storageSize" : 17602879440,
    "numExtents" : 38,
    "nindexes" : 5,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1.0099999998671518,
    "flags" : 0,
    "totalIndexSize" : 7523898592,
    "indexSizes" : {
        "_id_" : 2657666032,
        "dl_1" : 1015115808,
        "s4_1" : 865871104,
        "n.l_1_n.f_1" : 1457363824,
        "n.ls_1_n.fs_1" : 1527881824
    },
    "ok" : 1
}

$ uname -a
Linux myxyz-P55-USB3 2.6.35-30-generic #61-Ubuntu SMP Tue Oct 11
17:52:57 UTC 2011 x86_64 GNU/Linux

$ free -m
             total       used       free     shared    buffers     cached
Mem:         16073      15988         85          0          3      14279
-/+ buffers/cache:       1705      14367
Swap:        12401          1      12400

Brian Adkins

Jan 4, 2012, 12:29:07 PM
to mongodb-user
On Jan 4, 11:57 am, Brian Adkins <lojicdot...@gmail.com> wrote:
> I've noticed a performance regression after upgrading from 1.6.5 to
> 2.0.2, and I'm having some difficulty pinpointing the precise problem.
> [...]

Here's an example document:

{
    "_id" : "0301995213273CR",
    "cn" : "WAKE",
    "n" : [
        { "l" : "ZILBERSCHATZ", "ls" : 32400, "f" : "GEORGE", "fs" : 32436 },
        { "l" : "SMITH", "ls" : 54400, "f" : "JOHN", "fs" : 89436 },
        { "l" : "JOHNSON", "ls" : 78400, "f" : "JIM", "fs" : 12436 }
    ],
    "a1" : "PO BOX 123",
    "cty" : "RALEIGH",
    "st" : "NC",
    "z5" : "28019",
    "r" : "B",
    "s" : "M",
    "by" : 1954,
    "bm" : 6,
    "bd" : 22,
    "dl" : "2412300",
    "dls" : "NC",
    "cno" : "6163964",
    "ad" : "19950818",
    "ed" : "20120101",
    "cd" : "Y"
}

Brian Adkins

Jan 4, 2012, 1:11:26 PM
to mongodb-user
On Jan 4, 12:29 pm, Brian Adkins <lojicdot...@gmail.com> wrote:
> On Jan 4, 11:57 am, Brian Adkins <lojicdot...@gmail.com> wrote:
>
> > I've noticed a performance regression after upgrading from 1.6.5 to
> > 2.0.2, and I'm having some difficulty pinpointing the precise problem.
> > [...]

Aha! I just verified a difference in index usage with $elemMatch
between 1.8.2 (so I assume 1.6.5 also) and 2.0.2:


Here is 2.0.2 (note first name *not* being used in explain):


> db.system.indexes.find({ns : 'nc1.extracts'})
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "nc1.extracts", "name" : "_id_" }
{ "v" : 1, "key" : { "dl" : 1 }, "ns" : "nc1.extracts", "name" : "dl_1" }
{ "v" : 1, "key" : { "s4" : 1 }, "ns" : "nc1.extracts", "name" : "s4_1" }
{ "v" : 1, "key" : { "n.l" : 1, "n.f" : 1 }, "ns" : "nc1.extracts", "name" : "n.l_1_n.f_1" }
{ "v" : 1, "key" : { "n.ls" : 1, "n.fs" : 1 }, "ns" : "nc1.extracts", "name" : "n.ls_1_n.fs_1" }

> db.extracts.find( { n : { $elemMatch : { l : 'STANLEYZ', f : 'GEORGE' }}} ).explain()
{
    "cursor" : "BtreeCursor n.l_1_n.f_1",
    "nscanned" : 0,
    "nscannedObjects" : 0,
    "n" : 0,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : {
        "n.l" : [
            [
                "STANLEYZ",
                "STANLEYZ"
            ]
        ],
        "n.f" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    }
}


And here is 1.8.2:



> db.system.indexes.find({ns : 'nc1.extracts'})
{ "name" : "_id_", "ns" : "nc1.extracts", "key" : { "_id" : 1 }, "v" : 0 }
{ "name" : "dl_1", "ns" : "nc1.extracts", "key" : { "dl" : 1 }, "v" : 0 }
{ "name" : "s4_1", "ns" : "nc1.extracts", "key" : { "s4" : 1 }, "v" : 0 }
{ "name" : "n.l_1_n.f_1", "ns" : "nc1.extracts", "key" : { "n.l" : 1, "n.f" : 1 }, "v" : 0 }
{ "name" : "n.ls_1_n.fs_1", "ns" : "nc1.extracts", "key" : { "n.ls" : 1, "n.fs" : 1 }, "v" : 0 }


> db.extracts.find( { n : { $elemMatch : { l : 'STANLEYZ', f : 'GEORGE' }}} ).explain()
{
    "cursor" : "BtreeCursor n.l_1_n.f_1",
    "nscanned" : 0,
    "nscannedObjects" : 0,
    "n" : 0,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : {
        "n.l" : [
            [
                "STANLEYZ",
                "STANLEYZ"
            ]
        ],
        "n.f" : [
            [
                "GEORGE",
                "GEORGE"
            ]
        ]
    }
}


No wonder my queries are dog slow on 2.0.2! Wow.

Any ideas when/how this got broken?

Thanks,
Brian

Brian Adkins

Jan 4, 2012, 1:44:12 PM
to mongodb-user
On Jan 4, 1:11 pm, Brian Adkins <lojicdot...@gmail.com> wrote:
> On Jan 4, 12:29 pm, Brian Adkins <lojicdot...@gmail.com> wrote:
>
> > On Jan 4, 11:57 am, Brian Adkins <lojicdot...@gmail.com> wrote:
>
> > > I've noticed a performance regression after upgrading from 1.6.5 to
> > > 2.0.2, and I'm having some difficulty pinpointing the precise problem.
> > > [...]
>
> Aha! I just verified a difference in index usage with $elemMatch
> between 1.8.2 (so I assume 1.6.5 also) and 2.0.2:
> [...]

This appears to be a known problem with little interest :(

https://jira.mongodb.org/browse/SERVER-3104

I guess I'll try reverting to 1.8.4.

Kyle Banker

Jan 4, 2012, 1:48:44 PM
to mongodb-user
Journaling is enabled by default on 2.0.2. Did you intentionally
disable it? If not, this might partly explain the issues you're
seeing.
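
A quick way to check, assuming the 2.0-era shell: db.serverStatus()
only reports a "dur" section when journaling is active, and journaling
can be turned off at startup with mongod --nojournal.

// Present (with journaling stats) only when journaling is enabled;
// undefined when mongod was started with --nojournal.
db.serverStatus().dur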

As for the difference in $elemMatch, will check on that. Stay tuned...

Brian Adkins

Jan 4, 2012, 1:54:13 PM
to mongodb-user
On Jan 4, 1:48 pm, Kyle Banker <kyleban...@gmail.com> wrote:
> Journaling is enabled by default on 2.0.2. Did you intentionally
> disable it? If not, this might partly explain the issues you're
> seeing.
>
> As for the difference in $elemMatch, will check on that. Stay tuned...

I did not disable journaling since the reason for moving to 2.0.2 was
to get journaling :)

Are you saying that journaling may result in funky memory issues?

Regarding $elemMatch, I think I found the problem (see my fourth post,
parallel to your reply); now all I need is a solution :) Since I was
on 1.6.5, reverting from 2.0.2 to 1.8.4 isn't a big problem, but I
would like better durability.

Thanks,
Brian

aaron

Jan 4, 2012, 2:11:47 PM
to mongodb-user
Hi Brian,

The $elemMatch issue you identified was part of a larger change - it
was a known tradeoff and did not "get broken." Our earlier behavior
could return incorrect results in some cases and these were fixed in
the ticket <https://jira.mongodb.org/browse/SERVER-958>. One
consequence of the fix is that certain $elemMatch queries on compound
multikey indexes can be less performant than they were previously, and
some additional work is required to get that performance back. At the
time we were doing this work, we filed the ticket you identified
<https://jira.mongodb.org/browse/SERVER-3104> to optimize this
elemMatch case. If you are interested in seeing SERVER-3104
implemented I'd encourage you to vote and/or comment on the ticket.

Brian Adkins

Jan 4, 2012, 2:41:14 PM
to mongodb-user
On Jan 4, 2:11 pm, aaron <aaron%10gen....@gtempaccount.com> wrote:
> Hi Brian,
>
> The $elemMatch issue you identified was part of a larger change - it
> was a known tradeoff and did not "get broken."  Our earlier behavior
> could return incorrect results in some cases and these were fixed in
> the ticket <https://jira.mongodb.org/browse/SERVER-958>.  One
> consequence of the fix is that certain $elemMatch queries on compound
> multikey indexes can be less performant than they were previously, and
> some additional work is required to get that performance back.  At the
> time we were doing this work, we filed the ticket you identified
> <https://jira.mongodb.org/browse/SERVER-3104> to optimize this
> elemMatch case.  If you are interested in seeing SERVER-3104
> implemented I'd encourage you to vote and/or comment on the ticket.

Thanks for the feedback Aaron.

Just to clarify, by "broken", I simply meant that my queries that were
accurate and fast before are now slow enough to be utterly unusable
and will require me to revert to 1.8.4. However, it's my bad for not
putting the new version through a more rigorous test cycle with my
app.

If you must call it a "tradeoff" instead of a "bug", at the very
least, I think there should be a gigantic warning somewhere in the
docs pertaining to upgrading to v2 that any queries of this sort may
have drastically reduced performance - as in several orders of
magnitude - as a result of the tradeoff. I can only assume that the
reason this hasn't been fixed already is because the type of query
that suffers from it is very rare, or the affected databases are
small.

Having vented, I should say that I'm certainly thankful that MongoDB
exists and is free, and that I appreciate the contributions the
developers have made.

Brian

Brian Adkins

Jan 4, 2012, 9:27:51 PM
to mongodb-user
As a follow-up, I reverted MongoDB to 1.8.4, reindexed, and the
performance is simply outstanding. I understand how the different
behavior of $elemMatch with multikey indexes affected those specific
queries, but I also ran a half dozen other queries, and they were all
*significantly* faster than on 2.0.2, even though my index size was
about 12% smaller with 2.0.2.

RAM usage is also much better, with mongod's resident memory staying
high and no unexplained drops.

I don't have time to do a more detailed analysis, but as far as I can
tell, there was not a single query that ran faster with 2.0.2 than
with 1.8.4; in fact, some (besides the $elemMatch problem) were two
orders of magnitude slower with 2.0.2.

I'll wait until the $elemMatch issue is solved and then try to set up
a parallel test to compare the two again.

Brian

aaron

Jan 5, 2012, 2:48:11 AM
to mongodb-user
Hi Brian,

We can look at tests you have performed if you let us know what the
tests are. Please keep in mind that there are some other types of
queries on multikey indexes that may behave and perform differently
due to the change for <https://jira.mongodb.org/browse/SERVER-958>.

Brian Adkins

Jan 5, 2012, 11:03:14 AM
to mongodb-user
On Jan 5, 2:48 am, aaron <aaron%10gen....@gtempaccount.com> wrote:
> Hi Brian,
>
> We can look at tests you have performed if you let us know what the
> tests are.  Please keep in mind that there are some other types of
> queries on multikey indexes that may behave and perform differently
> due to the change for <https://jira.mongodb.org/browse/SERVER-958>.

Aaron, it appeared that *any* query involving the multikey index was
slow enough to be unusable. My array elements are documents such as
the following:

{
    "_id" : "0301995213273CR",
    "n" : [
        { "l" : "ZILBERSCHATZ", "ls" : 32400, "f" : "GEORGE", "fs" : 32436 },
        { "l" : "SMITH", "ls" : 54400, "f" : "JOHN", "fs" : 89436 },
        { "l" : "JOHNSON", "ls" : 78400, "f" : "JIM", "fs" : 12436 }
    ],
    ...
}

For example:

db.extracts.find({ 'n.l' : /^A/ }).count()

That takes a few seconds on 1.8.4 and more than five minutes on 2.0.2,
with *tons* of disk I/O. Is it possible that mongod is just "giving
up" on my particular situation and doing table scans of the 43 million
records? I think the nearly constant disk reading is a clue.

Besides using documents in the array elements, the only other thing I
can think of that might be slightly unique to my setup is that I use
my own natural key for the _id to save space, since the key is
guaranteed to be unique. During the day, this collection is 100%
read-only; it's only updated once each night.

Also, I have two indexes: one pulls out two fields from the name doc
(l, f) and the other pulls out the other two fields (ls, fs). If I had
more time, I'd try splitting the n field into n and ns with two fields
each, so that each index would cover its entire subdocument (see the
sketch below). That's just a hunch on my part, but if it still only
uses the first of two fields, performance will be terrible.
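
Something like this hypothetical shape (the ns field name is just a
placeholder):

// Hypothetical restructuring: each indexed subdocument carries
// exactly the fields its compound index uses.
{
    "_id" : "0301995213273CR",
    "n"  : [ { "l" : "ZILBERSCHATZ", "f" : "GEORGE" },
             { "l" : "SMITH", "f" : "JOHN" } ],
    "ns" : [ { "ls" : 32400, "fs" : 32436 },
             { "ls" : 54400, "fs" : 89436 } ]
}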

I'll try to post back when I have time to do more analysis, but it
sounds like you're already aware of the problem regardless.

Brian

aaron

Jan 5, 2012, 2:33:30 PM
to mongodb-user
Hi Brian,

I think in that case we are not doing key matching of the documents
being counted, again because of SERVER-958 changes. To confirm this,
you can check the performance difference between

db.extracts.find({ 'n.l' : /^A/ }).itcount()

runs on the two different mongo versions.

You can also check in general whether an index or a full scan is being
performed using db.extracts.find({ 'n.l' : /^A/ }).explain( true )

Here is the ticket for performance of this type of query, if I have
guessed the issue correctly: <https://jira.mongodb.org/browse/SERVER-3103>

damirv

Aug 30, 2012, 2:25:36 PM
to mongod...@googlegroups.com
With the help of Jenny, I found out how you can get correct index bounds for these kinds of queries under MongoDB 2.2+.
Check my post (with an example):
https://groups.google.com/d/msg/mongodb-user/86wRYNVPmIE/BUIVjOkL33YJ

Can you try your queries with this 'hack' (using $gte and $lte)? I would like to know how the performance compares to MongoDB 1.8.
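
For reference, here is roughly what that workaround looks like. This is a sketch based on my reading of that post; it replaces each equality inside $elemMatch with an equivalent closed $gte/$lte range so the planner constrains both index fields:

// Equality expressed as a closed range on each field; under 2.2 the
// planner then produces tight index bounds on both n.l and n.f.
db.extracts.find({
    n: { $elemMatch: {
        l: { $gte: 'ADKINS', $lte: 'ADKINS' },
        f: { $gte: 'GEORGE', $lte: 'GEORGE' }
    }}
}).explain()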

Thanks.