Can the MongoDB aggregation framework use different indexes for different $match stages of its pipeline?

354 views
Skip to first unread message

Cédric Warny

unread,
Apr 19, 2014, 1:24:33 PM4/19/14
to mongod...@googlegroups.com
Let's say I have a text index and a regular index on a date field. I want to query my data over a date range and with a text search, like so: 

db.col.find({publisheddatetime:{$gte:start, $lt:end}, $text:{$search:"something"}})

Building a compound index on both date and text as follows:

db.col.ensureIndex({publisheddatetime:1,text:"text"})

won't help, since, according to the docs

If the compound text index includes keys preceding the text index key, to perform a $text search, the query predicate must include equality match conditions on the preceding keys.

 What is a good strategy here? I have millions of documents containing fairly large "text" fields (can be as large as whole blog posts). Could I built two separate indexes on the date field and the text field and then use the aggregation pipeline and two different $match? My hope is that MongoDB could use the first date index for the first $match stage and the second text index for the second $match stage. Is that possible? Can the MongoDB aggregation framework use different indexes for different $match stages of its pipeline? 

The aggregation pipeline would look like this:

db.col.aggregate([{$match:{publisheddatetime:{$gte:start,$lt:end}}},{$match:{$text:{$search:"something"}}}])

Andrew Ryder

unread,
May 27, 2014, 12:17:56 AM5/27/14
to mongod...@googlegroups.com
Hi Cédric!

To answer your immediate question, the aggregation pipeline you posted will not work as is because the $text operator can only be used in the first pipeline stage - however, given that two match operations in series is logistically identical to a single compound $match, separating them makes no difference to the result.

You can simply swap the order of the stages and the resulting pipeline will work, if there is a text index available, but it won't use an index for the second stage.

I see the following options available to you:
  • Do the text stage first and accept that the date range portion is not indexed (this might be fast enough if you limit your results).
  • Run an aggregation that pipes the output to an initially empty collection which has the other index and then run the second part of the query against that result-set.
  • Pre-parse your text field into an array of keywords and index that instead. This will remove the need for the $text operator and let you perform a single query (or aggregation).
  • Use an index that puts the date range first and use a simpler regex match instead of the text search.

What you choose to do is very dependent on the data you are manipulating and you'll need to carry out testing to decide what the best approach is for you specific use case.

Kind regards,
Andrew

Reply all
Reply to author
Forward
0 new messages