I wrote about this in a blog post recently.
Basically, if your query gives you a significant reduction in data volume, then the default Mongo-Hadoop connector performs extremely sub-optimally.
There is no particularly easy answer (I played very briefly with a few options and report some simplistic benchmarks in the blog post).
For cases where the query reduction gets you down to only a few hundred thousand docs, I wrote an alternative splitter that uses skip/limit (it's open source and linked from the blog post), though obviously this is something of a "toy" case.
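To illustrate the idea (this is a hypothetical sketch, not the actual splitter's code): you take the post-query document count, carve it into fixed-size ranges, and each Hadoop task turns its (skip, limit) pair into a cursor via `cursor.skip(s).limit(n)`.

```python
def skip_limit_splits(doc_count, split_size):
    """Illustrative only: compute (skip, limit) pairs covering doc_count
    documents, each at most split_size long. A task would then open its own
    cursor with cursor.skip(skip).limit(limit)."""
    splits = []
    skip = 0
    while skip < doc_count:
        limit = min(split_size, doc_count - skip)
        splits.append((skip, limit))
        skip += limit
    return splits
```

The reason this stays a "toy": skip is O(n) on the server side, so the cost of the later splits grows with the skip offset, which is why it only works for modest result sets.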
Having played some more since the blog post was written, I think the best architecture for a Mongo-Hadoop connector is completely different from what 10gen provide.
I would have a separate server (obviously this could be built into a mongos instead) co-located with each mongod (i.e. each shard; not worrying about replicas for the moment). The splitter then sends the query to each of these servers, which generates a single cursor per shard. The Hadoop tasks then connect to the server (i.e. many tasks sharing a single cursor), which doles out data in blocks.
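As a toy model of that design (all names here are mine, assumed for illustration): one server thread drains the shard's single cursor in order and puts fixed-size blocks on a queue, while many task workers pull blocks from it on demand.

```python
import queue
import threading

def shard_server(cursor, block_size, out_q, n_tasks):
    """One cursor per shard: walk it once, in order, handing out blocks."""
    block = []
    for doc in cursor:  # single pass, roughly natural order
        block.append(doc)
        if len(block) == block_size:
            out_q.put(block)
            block = []
    if block:
        out_q.put(block)
    for _ in range(n_tasks):
        out_q.put(None)  # one sentinel per task: no more blocks

def run_tasks(cursor, block_size, n_tasks):
    """Simulate n_tasks Hadoop tasks all consuming from one shard server."""
    out_q = queue.Queue()
    results, lock = [], threading.Lock()

    def worker():
        while True:
            block = out_q.get()
            if block is None:
                return
            with lock:
                results.extend(block)

    tasks = [threading.Thread(target=worker) for _ in range(n_tasks)]
    for t in tasks:
        t.start()
    shard_server(cursor, block_size, out_q, n_tasks)
    for t in tasks:
        t.join()
    return results
```

The key property this models is that the query executes once per shard regardless of how many tasks consume it; the real thing would of course speak over the network rather than an in-process queue.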
This would have a bunch of advantages:
- you are guaranteed to distribute the work well and never query more data than necessary
- you only query once per shard, and you don't have to create compound indexes with the shard key (this is explained further in the blog post) - one of the big bottlenecks for large jobs is that each query gets executed once per task, which can slam the mongod
- you can move through the dataset in (approximately) natural order, which massively improves read performance
It's on my todo list to write this at some point, since I use Hadoop/MongoDB quite a lot and neither the existing implementation nor my current extension works for many of my use cases.
Alex