Compound indexes for date range queries


John Myers

Oct 17, 2016, 6:54:19 PM
to mongodb-user
Hi,

I currently have a very large collection with millions of documents, where one field is the time of an event (in epoch seconds). All queries against this collection will at least have this field present. There are some other fields that can be included, depending on the application making the query.

Example schema:
{
  start: 1476744427,
  sender: "Bob",
  receiver: "Jane",
  amount_sent: 45,
  amount_received: 100
}

Let's say I want to support the following queries:

{$or: [{sender: "Bob"}, {receiver: "Bob"}], amount_sent: 30, start: {$gte: 5678, $lte: 12345}}
...
{start: {$gte: 5678, $lte: 12345}, amount_sent: 88}

If I need to sort, it will be by most recent, so, descending.

For the last one, it seems like I should just have an index on {start: -1, amount_sent: 1} (or should amount_sent come first?). My thought is that if I use start as the first field in the compound index, other queries can be supported where fields besides amount_sent are searched on: the index narrows down the start times, and the remaining fields are filtered by scanning. For instance, a query against "start" and "amount_received" can still benefit from the index on start, but then has to scan for the amount_received values.

As for supporting the $or statement, I'm not sure. Per the docs, each clause of the $or should have its own index, so do I need two here?
{start: -1, sender: 1} AND {start: -1, receiver: 1}

Thanks for any advice!

Kevin Adistambha

Nov 1, 2016, 11:21:11 PM
to mongodb-user

Hi John

It’s difficult to say for sure which indexes are optimal without knowing the cardinality of your data. However, I made an educated guess about your data and populated a test collection with 50,000 randomly generated documents to try out your queries.

I tried with the arguably simpler second query you posted:

{start: {$gte: 5678, $lte: 12345}, amount_sent: 88}

For this query, you need an index containing the fields start and amount_sent. You suggested {start: -1, amount_sent: 1}. However, this is not optimal, as we can see in the explain() output. Note the use of hint() to force MongoDB to use a particular index:

> db.test.explain('executionStats').find({amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({start:-1, amount_sent:1})
    ...
    "nReturned": 253,
    "executionTimeMillis": 38,
    "totalKeysExamined": 24947,
    "totalDocsExamined": 253,
    ...

From the explain('executionStats') output, MongoDB needs to scan ~25,000 keys to return 253 documents matching the query.

A better index is {amount_sent: 1, start: 1}. This index is structured by exact values of amount_sent first; under each value of amount_sent, sorted values of start are stored. Here’s an illustration of the index structure:

amount_sent:  1      |  2      |  3
start:        1 2 3  |  1 2 3  |  1 2 3

Since you need an exact value of amount_sent and a range of start (and possibly sort using start), you can verify from the illustration above that the index facilitates the query perfectly.
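To make the difference concrete, here is a small self-contained JavaScript sketch (plain Node, no MongoDB required; the data, field names, and helper are made up for illustration) that models each candidate index as a sorted list of keys and counts how many keys a range scan has to walk:

```javascript
// Model two candidate compound indexes as sorted key lists and compare how
// many keys a range scan must walk for: amount_sent == 3 AND 2 <= start <= 4.
// (Illustrative sketch only; a real B-tree seeks, but key order is the point.)
const docs = [];
for (let amount = 1; amount <= 5; amount++)
  for (let start = 1; start <= 5; start++) docs.push({ amount, start });

const hit = (k) => k.amount === 3 && k.start >= 2 && k.start <= 4;

// A range scan over the index walks every key between the first and last match.
function scanStats(sortedKeys) {
  const idx = sortedKeys
    .map((k, i) => (hit(k) ? i : -1))
    .filter((i) => i >= 0);
  return { returned: idx.length, examined: idx[idx.length - 1] - idx[0] + 1 };
}

// {amount_sent: 1, start: 1}: equality field leads, matches are contiguous.
const byAmountStart = [...docs].sort(
  (a, b) => a.amount - b.amount || a.start - b.start
);
console.log(scanStats(byAmountStart)); // { returned: 3, examined: 3 }

// {start: -1, amount_sent: 1}: range field leads, matches are scattered.
const byStartDesc = [...docs].sort(
  (a, b) => b.start - a.start || a.amount - b.amount
);
console.log(scanStats(byStartDesc)); // { returned: 3, examined: 11 }
```

With the equality field first, every examined key is a returned key; with the range field first, the matching keys are scattered across the whole start range, which is exactly the ~25,000-keys-for-253-documents pattern seen above.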

The explain() result of the query also shows its efficiency:

> db.test.explain('executionStats').find({amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({amount_sent:1, start:1})
    ...
    "nReturned": 253,
    "executionTimeMillis": 1,
    "totalKeysExamined": 253,
    "totalDocsExamined": 253,
    ...

Now MongoDB examines 253 keys (instead of ~25,000) and returns the same number of documents. One takeaway from this is that an index fits the query if the numbers in nReturned, totalKeysExamined, and totalDocsExamined are the same.
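That rule of thumb is easy to check mechanically. The helper and sample objects below are hypothetical; only the three field names come from real explain('executionStats') output:

```javascript
// Rule of thumb from above: an index fits a query well when the query
// returned as many documents as it examined keys and documents.
function indexFitsQuery(executionStats) {
  const { nReturned, totalKeysExamined, totalDocsExamined } = executionStats;
  return nReturned === totalKeysExamined && nReturned === totalDocsExamined;
}

// Stats from the two examples above:
console.log(indexFitsQuery({ nReturned: 253, totalKeysExamined: 24947, totalDocsExamined: 253 })); // false
console.log(indexFitsQuery({ nReturned: 253, totalKeysExamined: 253, totalDocsExamined: 253 }));   // true
```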

Your first query is more involved:

{$or: [{sender: "Bob"}, {receiver: "Bob"}], amount_sent: 30, start: {$gte: 5678, $lte: 12345}}

Let’s say we have an index {sender:1, amount_sent:1, start:1}:

> db.test.explain('executionStats').find({$or:[{sender:"Oscar"}, {receiver:"Oscar"}], amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({sender:1, amount_sent:1, start:1})
    ...
    "nReturned": 4,
    "executionTimeMillis": 118,
    "totalKeysExamined": 50000,
    "totalDocsExamined": 50000,
    ...
    "stage": "SORT_KEY_GENERATOR",
    ...

This is not great at all. MongoDB looks at 50,000 keys and documents (essentially the whole collection) only to return 4 of them. Also note the stage called “SORT_KEY_GENERATOR”. The presence of this stage means that MongoDB cannot use the index to return results in sorted order, so it performs an in-memory sort instead, which is limited to 32 MB. If your query returns a result set that exceeds 32 MB, the query will fail. See Use Indexes to Sort Query Results.
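If you want to spot this situation programmatically, a sketch like the following can walk an explain() plan tree looking for a blocking sort. The traversal helper and the sample tree are made up for illustration, but the stage / inputStage / inputStages field names match real explain output:

```javascript
// Walk an explain() plan tree (each stage may have `inputStage` or
// `inputStages`) and report whether a blocking sort stage appears,
// meaning the index did not provide the sort order.
function hasBlockingSort(stage) {
  if (!stage) return false;
  if (stage.stage === "SORT" || stage.stage === "SORT_KEY_GENERATOR") return true;
  const children = stage.inputStages || (stage.inputStage ? [stage.inputStage] : []);
  return children.some(hasBlockingSort);
}

// Hypothetical plan shaped like the problematic one above:
const plan = {
  stage: "FETCH",
  inputStage: { stage: "SORT_KEY_GENERATOR", inputStage: { stage: "IXSCAN" } },
};
console.log(hasBlockingSort(plan)); // true
```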

We can improve this by refactoring your query (by unwrapping the $or statement) and creating a couple of indexes to support both terms of the $or query.

The refactored query looks like:

{$or:[
    {sender:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}, 
    {receiver:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}
]}

and the two indexes are {sender:1, amount_sent:1, start:1} and {receiver:1, amount_sent:1, start:1}, to satisfy both terms of the $or query. The explain() result I got is:

> db.test.explain('executionStats').find({$or:[{sender:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}, {receiver:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}]}).sort({start:-1})
    ...
    "nReturned": 4,
    "executionTimeMillis": 1,
    "totalKeysExamined": 4,
    "totalDocsExamined": 4,
    ...
    "stage": "SORT_MERGE",
    ...

The refactored query and the two indexes work well together. MongoDB now returns 4 documents and examines exactly the same number of index keys and documents. Also note the “SORT_MERGE” stage. This is essentially the merge step of a merge sort, which is not bound by the 32 MB limit: it merges the two already-sorted results, one for each term of the $or. Execution time also drops from 118 ms to just 1 ms.
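For intuition, that merge step can be sketched in a few lines of plain JavaScript (illustrative only; the sample documents are made up): each $or branch arrives already sorted descending by start, and the merge just picks the larger head at each step.

```javascript
// Sketch of what a SORT_MERGE stage does: merge two streams that are each
// already sorted descending by `start`, without a blocking in-memory sort.
function mergeDescendingByStart(a, b) {
  const out = [];
  let i = 0, j = 0;
  while (i < a.length || j < b.length) {
    if (j >= b.length || (i < a.length && a[i].start >= b[j].start)) {
      out.push(a[i++]); // branch A has the larger (or only) head
    } else {
      out.push(b[j++]); // branch B has the larger (or only) head
    }
  }
  return out;
}

// Hypothetical results from the sender branch and the receiver branch:
const senderBranch = [{ _id: 1, start: 90 }, { _id: 2, start: 40 }];
const receiverBranch = [{ _id: 3, start: 70 }, { _id: 4, start: 10 }];
console.log(mergeDescendingByStart(senderBranch, receiverBranch).map((d) => d.start));
// [ 90, 70, 40, 10 ]
```

A real SORT_MERGE stage also deduplicates documents matched by both $or branches, which this sketch omits.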

Please note that the analysis I did is mostly an educated guess since I don’t know the exact nature of your data. To optimize your use case, I would encourage you to fully understand the explain() result: https://docs.mongodb.com/v3.2/reference/explain-results/

Best regards,
Kevin
