Hadoop MongoDB connector vs ETL tools (Talend & Pentaho Kettle)

Peter Packer

Jul 14, 2014, 12:12:22 AM

to mongod...@googlegroups.com

Dear all,

I am planning to do extensive data mining in Mahout on data collected in MongoDB.

I am reading http://docs.mongodb.org/ecosystem/use-cases/hadoop/. This doc highlights "moving data first to hdfs then compute". A quick search gave me the following tools:

http://www.severalnines.com/blog/big-data-integration-etl-clickstream-mongodb-hadoop-analytics

http://edpflager.com/?p=1642

http://engineering.foursquare.com/2014/01/28/mongo-on-hadoop/

In the meantime, I found out that I can use the Hadoop MongoDB connector to run Hadoop jobs seamlessly on top of MongoDB without moving gigabytes of data to HDFS.

Could you point out the most popular way to do extensive Hadoop mining on MongoDB data? I really mean in production.

Should I definitely stick with the official Hadoop MongoDB connector?

Thanks very much,

Peter

Will Berkeley

Jul 17, 2014, 11:51:13 AM

to mongod...@googlegroups.com

Hi Peter. The mongo-hadoop connector does let you run your Hadoop jobs directly against MongoDB. I'd recommend starting this way and considering appropriate sync/dump options once there's a need for it.

-Will

TJ Tang

Jul 19, 2014, 8:08:39 PM

to mongod...@googlegroups.com

If I run Hadoop jobs directly against MongoDB, would the Mongo instances serving the data become a bottleneck? Consider a scenario where I might have 20 Hadoop compute nodes, all accessing data from a two-shard Mongo cluster in real time.

On Thursday, July 17, 2014 at 11:51:13 PM UTC+8, Will Berkeley wrote:

Will Berkeley

Jul 21, 2014, 12:44:42 PM

to mongod...@googlegroups.com

You're right that there might be a bottleneck for large enough Hadoop jobs, in which case you should consider dumping and syncing to HDFS instead of using the mongo-hadoop connector. The point is that the mongo-hadoop connector is much easier to use than syncing, so start with the connector and, if you do hit a bottleneck, consider syncing to HDFS to avoid it.

-Will
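
To put rough numbers on the 20-node, two-shard scenario raised above, here is a minimal Python sketch of the read fan-in per shard. The function name `readers_per_shard`, the `tasks_per_node` parameter, and the assumption that reads spread evenly across shards are all illustrative; none of them comes from the thread or from the mongo-hadoop connector, and the result only indicates how much concurrent load to expect, not whether a given cluster can serve it.

```python
import math


def readers_per_shard(hadoop_nodes: int, tasks_per_node: int, shards: int) -> int:
    """Estimate concurrent map tasks reading from each shard.

    Assumes (hypothetically) that every task reads from MongoDB at the
    same time and that reads are spread evenly across the shards.
    """
    if min(hadoop_nodes, tasks_per_node, shards) < 1:
        raise ValueError("all arguments must be positive integers")
    total_readers = hadoop_nodes * tasks_per_node
    # Round up: an uneven split still leaves one shard with the larger share.
    return math.ceil(total_readers / shards)


# Scenario from the thread: 20 compute nodes and 2 shards.
# tasks_per_node is an assumption; the thread does not state it.
print(readers_per_shard(20, 1, 2))  # 10
print(readers_per_shard(20, 4, 2))  # 40
```

If the estimated fan-in is more than the shards can serve alongside their normal workload, that is the point at which syncing to HDFS becomes worth the extra effort.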
