batch processing over entire sharded collection in the multiple millions of documents

Jesse Sanford

unread,

Aug 28, 2012, 12:14:51 PM8/28/12

to mongod...@googlegroups.com

It would seem to me that it would be smart to use hadoop to do this. However the mongo-hadoop adapter is too young. How are people doing this type of processing? doing this using the mongo js map-reduce system single threaded is not scalable. Should I dump the collection to disk move it to hdfs/load it in hive manually. This seems so archaic given all the support many rdbms systems have for doing this type of thing automatically with projects like sqoop.

Suggestions?

Mike O'Brien

unread,

Aug 28, 2012, 3:35:02 PM8/28/12

to mongod...@googlegroups.com

Hi Jesse,

The mongo-hadoop adapter has a stable 1.0 release and works well, are there specific features you need in it for this use case?

If you have a sharded system, native mongo mapreduce also runs on each shard to take advantage of the extra processing power available. You should benchmark these options to get an idea of what works best for your situation.

Jesse Sanford

unread,

Aug 28, 2012, 4:17:14 PM8/28/12

to mongod...@googlegroups.com

Mike, I should have framed "too new" in my last comment. I apologize for the confusion. Currently it is too new in the sense that there is little community support and a lack of documentation. While I believe that what is there is the start of something really useful it unfortunately would be a difficult to build a production system on top of it right now. As for the processing power. While I know that splitting the js across each shard will help I believe that Hadoop is the correct tool for the job. The data processing pipeline I invision will need to join multiple collections and do some long running possibly cpu intensive calculations. These calculations will evolve over time and will probably become workflow DAGs of sorts. Something like cascading would be an absolute win. Unfortunately there is no input or output taps for mongo that are supported. So I am wondering if there is someone out there using hadoop for this type of task. Are we simply dumping to json and then loading the json into hdfs/hive then starting down the hadoop ecosystem from there? Is that the most common setup? Again the use case implies hadoop. The fact that we are using mongo is set in stone. I need something that will be easy to find info/help about. Common is good.

Reply all

Reply to author

Forward