Hi,
I have a collection of a few hundred million documents that is hash sharded on _id because we do thousands of reads and writes per second on single documents by ID. So far this has scaled well for these targeted reads and writes. Say that documents look like this:
{
  _id: ObjectId,
  a: ...,
  b: ...,
  c: ...,
  ... dozens of other fields ...
}
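For context, the sharding setup is roughly this (a PyMongo sketch; host, database, and collection names are placeholders):

from bson import ObjectId
from pymongo import MongoClient

# Connect through a mongos router.
client = MongoClient("mongodb://mongos-host:27017")
coll = client["mydb"]["mydocs"]

# One-time setup: hash-shard on _id so single-document reads and writes
# by ObjectId are targeted to exactly one shard.
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.mydocs", key={"_id": "hashed"})

# The hot path, thousands of times per second:
doc = coll.find_one({"_id": ObjectId("529ac2500000000000000abc")})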
I have an index on {a:1}, and a couple of times per day I want to look at all the documents in our database with a certain "a" value and query various other fields along with it. It's a different set of field permutations each time, so I can't add dozens of indexes, and multikey indexing was out of the question long ago because $all queries only look up the first field in the query.
Let's say I'm running a query for {"a" : "foo", "x" : "baz"}, that 10 million documents have {"a" : "foo"}, and that 5 million of those also match {"a" : "foo", "x" : "baz"}. I'm trying to pull back the _id of those documents. I've timed this and tweaked things like the batchSize, but the best I'm getting is about 6,500 documents per second. If I baseline it with a covered query that pulls back just the _id for {"a" : "foo"}, I get around 30,000 documents per second. Both of these are fairly slow: at 6,500/sec, pulling back all 5 million matching _ids takes about 12 minutes, and even at 30,000/sec, pulling back all 10 million _ids takes over 5 minutes.
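For reference, this is roughly what I'm running today (a PyMongo sketch; host, database, and collection names are placeholders, and the batch size is just what I've been experimenting with):

from pymongo import MongoClient

coll = MongoClient("mongodb://mongos-host:27017")["mydb"]["mydocs"]

# The real query: walk the {a:1} index, fetch each document to check "x",
# and project out just the _id. This is the ~6,500 docs/sec case.
ids = [d["_id"] for d in
       coll.find({"a": "foo", "x": "baz"}, {"_id": 1}).batch_size(10000)]

# The baseline: same projection with no "x" filter. With an index that
# contains both "a" and "_id" this can be answered from the index alone.
# This is the ~30,000 docs/sec case.
baseline_ids = [d["_id"] for d in
                coll.find({"a": "foo"}, {"_id": 1}).batch_size(10000)]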
This is obviously a scatter-gather query, so the mongos has to merge results from every shard at the end, which means it can only be as fast as the slowest shard. I'm wondering if there is a way to do this in parallel. Even if I can get the shards and mongos to be faster, at the end of the day I still have a single cursor returning all the results in this scenario.
I've been thinking about a "map/reduce"-style approach where I add an index on {a:1, _id:1}. Then I can split up the query by _id range. For example, if the 10 million documents matching {"a":"foo"} represented a year's worth of data, I could run 12 queries, one per month (e.g., December 2013 through January 2014 corresponds to _id between 529ac2500000000000000000 and 52c3a0d00000000000000000). These 12 queries could be run from 12 different servers in parallel, with each mapping its results into a reduce phase; a rough sketch of what I mean is below. I'm wondering if that would be a reasonable way to speed up getting the data back.
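Here is the sketch (it assumes the {a:1, _id:1} index exists and, for simplicity, runs 12 worker processes from one client machine rather than 12 separate servers; names, dates, and pool size are placeholders):

from datetime import datetime
from multiprocessing import Pool

from bson import ObjectId
from pymongo import MongoClient

# Month boundaries for the year of data; placeholders for the real range.
MONTH_STARTS = [datetime(2013, m, 1) for m in range(1, 13)] + [datetime(2014, 1, 1)]
RANGES = list(zip(MONTH_STARTS[:-1], MONTH_STARTS[1:]))

def fetch_ids(bounds):
    start, end = bounds
    # Each worker opens its own connection (MongoClient is not fork-safe).
    coll = MongoClient("mongodb://mongos-host:27017")["mydb"]["mydocs"]
    # ObjectId.from_datetime builds the lowest possible ObjectId for a
    # timestamp, so each query covers one month's worth of _ids.
    query = {
        "a": "foo",
        "x": "baz",
        "_id": {"$gte": ObjectId.from_datetime(start),
                "$lt": ObjectId.from_datetime(end)},
    }
    return [d["_id"] for d in coll.find(query, {"_id": 1}).batch_size(10000)]

if __name__ == "__main__":
    with Pool(12) as pool:
        # "Map": run the 12 range queries in parallel.
        chunks = pool.map(fetch_ids, RANGES)
    # "Reduce": here it's just flattening the per-month _id lists.
    all_ids = [i for chunk in chunks for i in chunk]

The reduce step here is trivial (concatenating the _id lists), but the idea is that each range query is small enough to run quickly and they don't all funnel through one cursor.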
What are your thoughts on how to best approach getting lots of data out of MongoDB quickly? Is splitting the query up into smaller range queries run in parallel acceptable? What would you suggest?
Thanks,
Jon