Hi,
I have a collection of a few hundred million documents that is hash sharded on _id because we do thousands of reads and writes per second on single documents by ID. So far this has scaled well for these targeted reads and writes. Say that documents look like this:
{
  _id: ObjectId,
  a: ...,
  b: ...,
  c: ...,
  ... dozens of other fields ...
}
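For context, the sharding setup is roughly this (a PyMongo sketch; host, database, and collection names are placeholders):

from bson import ObjectId
from pymongo import MongoClient

# Connect through a mongos router.
client = MongoClient("mongodb://mongos-host:27017")
coll = client["mydb"]["mydocs"]

# One-time setup: hash-shard on _id so single-document reads and writes
# by ObjectId are targeted to exactly one shard.
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.mydocs", key={"_id": "hashed"})

# The hot path, thousands of times per second:
doc = coll.find_one({"_id": ObjectId("529ac2500000000000000abc")})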
I have an index on {a:1}, and a couple of times per day I want to look at all the documents in our database with a certain "a" value and query various other fields along with it. It's a different set of field permutations each time, so I can't add dozens of indexes, and multikey indexing was out of the question long ago because $all queries only look up the first field in the query.
Let's say I'm running a query for {"a" : "foo", "x" : "baz"}, that 10 million documents have {"a" : "foo"}, and that 5 million of those also match {"a" : "foo", "x" : "baz"}. I'm trying to pull back the _id of those documents. I've timed this and tweaked things like the batchSize, but the best I'm getting is about 6,500 documents per second. If I baseline it with a covered query that pulls back just the _id for {"a" : "foo"}, I get around 30,000 documents per second. Both of these are fairly slow: at 6,500/sec, pulling back all 5 million matching _ids takes about 12 minutes, and even at 30,000/sec, pulling back all 10 million _ids takes over 5 minutes.
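For reference, this is roughly what I'm running today (a PyMongo sketch; host, database, and collection names are placeholders, and the batch size is just what I've been experimenting with):

from pymongo import MongoClient

coll = MongoClient("mongodb://mongos-host:27017")["mydb"]["mydocs"]

# The real query: walk the {a:1} index, fetch each document to check "x",
# and project out just the _id. This is the ~6,500 docs/sec case.
ids = [d["_id"] for d in
       coll.find({"a": "foo", "x": "baz"}, {"_id": 1}).batch_size(10000)]

# The baseline: same projection with no "x" filter. With an index that
# contains both "a" and "_id" this can be answered from the index alone.
# This is the ~30,000 docs/sec case.
baseline_ids = [d["_id"] for d in
                coll.find({"a": "foo"}, {"_id": 1}).batch_size(10000)]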
This is obviously a scatter-gather query, so the mongos has to merge results from every shard at the end, which means it can only be as fast as the slowest shard. I'm wondering if there is a way to do this in parallel. Even if I can get the shards and mongos to be faster, at the end of the day I still have a single cursor returning all the results in this scenario.
I've been thinking about a "map/reduce"-style approach where I add an index on {a:1, _id:1}. Then I can split up the query by _id range. For example, if the 10 million documents matching {"a":"foo"} represented a year's worth of data, I could run 12 queries, one per month (e.g., December 2013 through January 2014 corresponds to _id between 529ac2500000000000000000 and 52c3a0d00000000000000000). These 12 queries could be run from 12 different servers in parallel, with each mapping its results into a reduce phase; a rough sketch of what I mean is below. I'm wondering if that would be a reasonable way to speed up getting the data back.
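Here is the sketch (it assumes the {a:1, _id:1} index exists and, for simplicity, runs 12 worker processes from one client machine rather than 12 separate servers; names, dates, and pool size are placeholders):

from datetime import datetime
from multiprocessing import Pool

from bson import ObjectId
from pymongo import MongoClient

# Month boundaries for the year of data; placeholders for the real range.
MONTH_STARTS = [datetime(2013, m, 1) for m in range(1, 13)] + [datetime(2014, 1, 1)]
RANGES = list(zip(MONTH_STARTS[:-1], MONTH_STARTS[1:]))

def fetch_ids(bounds):
    start, end = bounds
    # Each worker opens its own connection (MongoClient is not fork-safe).
    coll = MongoClient("mongodb://mongos-host:27017")["mydb"]["mydocs"]
    # ObjectId.from_datetime builds the lowest possible ObjectId for a
    # timestamp, so each query covers one month's worth of _ids.
    query = {
        "a": "foo",
        "x": "baz",
        "_id": {"$gte": ObjectId.from_datetime(start),
                "$lt": ObjectId.from_datetime(end)},
    }
    return [d["_id"] for d in coll.find(query, {"_id": 1}).batch_size(10000)]

if __name__ == "__main__":
    with Pool(12) as pool:
        # "Map": run the 12 range queries in parallel.
        chunks = pool.map(fetch_ids, RANGES)
    # "Reduce": here it's just flattening the per-month _id lists.
    all_ids = [i for chunk in chunks for i in chunk]

The reduce step here is trivial (concatenating the _id lists), but the idea is that each range query is small enough to run quickly and they don't all funnel through one cursor.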
What are your thoughts on how to best approach getting lots of data out of MongoDB quickly? Is splitting the query up into smaller range queries run in parallel acceptable? What would you suggest?
Thanks,
Jon