performance of mongodb-hadoop


王相

Aug 23, 2016, 3:52:59 AM
to mongodb-user
Hi, I am using MongoDB to replicate MySQL's data. I had a lot of data tasks running on MySQL before, and now I want to use mongodb-hadoop to process the MongoDB data with Hadoop. As I understand it, traditional Hadoop uses data on HDFS as input and processes it on the datanodes. Because HDFS is a distributed file system and the data is spread across several datanodes, Hadoop can keep good performance. But mongodb-hadoop uses MongoDB's data as input, and if I don't shard, the data will live on only one node, so I am worried about the performance of mongodb-hadoop. Is that right?

Luke Lovett

Aug 23, 2016, 11:07:51 AM
to mongodb-user
The way MongoOutputFormat currently works is to save all data that will be written to MongoDB to a temporary file first. When the reduce task is complete, the MongoOutputCommitter reads these files and executes all the inserts/updates at once. Writing the data in bulk at the end of the reduce saves a lot of time.
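
For reference, a job that writes its reduce output back to MongoDB through MongoOutputFormat can be set up roughly like this. This is just a minimal sketch: the URIs, job name, and key/value classes are placeholders, and the mapper/reducer setup is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class WriteToMongoJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Collection to read from, and collection the committer bulk-writes
        // into once the reduce phase finishes (placeholder URIs).
        conf.set("mongo.input.uri", "mongodb://localhost:27017/test.source");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/test.results");

        Job job = Job.getInstance(conf, "write-to-mongo");
        job.setJarByClass(WriteToMongoJob.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}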

If you think you'll be doing many back-to-back jobs with the same data from MongoDB, you can also look into BSONFileOutputFormat: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files. Instead of writing back to MongoDB, this lets you write ".bson" files to HDFS. Those ".bson" files can then be read as input by the next job, and when the job pipeline is complete, you can use MongoOutputFormat to write the final results back to MongoDB. This strategy is nice because only the first job in the pipeline needs to read from MongoDB instead of each one, and having ordinary files on HDFS helps preserve data locality, since each data node can process the portion of the .bson file that is stored on it directly.
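
As a rough sketch of what such a pipeline could look like (the paths, URIs, and job names here are made up for illustration, and the mapper/reducer setup is again elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;
import com.mongodb.hadoop.BSONFileOutputFormat;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class BsonPipeline {
    public static void main(String[] args) throws Exception {
        // Stage 1: the only job that reads from MongoDB; it writes .bson
        // files onto HDFS instead of back to the database.
        Configuration conf1 = new Configuration();
        conf1.set("mongo.input.uri", "mongodb://localhost:27017/test.source");
        Job stage1 = Job.getInstance(conf1, "mongo-to-bson");
        stage1.setJarByClass(BsonPipeline.class);
        stage1.setInputFormatClass(MongoInputFormat.class);
        stage1.setOutputFormatClass(BSONFileOutputFormat.class);
        FileOutputFormat.setOutputPath(stage1, new Path("/pipeline/stage1"));
        // ... mapper/reducer setup ...
        stage1.waitForCompletion(true);

        // Stage 2 (final stage): reads the .bson files written by stage 1,
        // so the splits stored on each data node are processed locally, and
        // writes the final results back to MongoDB with MongoOutputFormat.
        Configuration conf2 = new Configuration();
        conf2.set("mongo.output.uri", "mongodb://localhost:27017/test.results");
        Job stage2 = Job.getInstance(conf2, "bson-to-mongo");
        stage2.setJarByClass(BsonPipeline.class);
        stage2.setInputFormatClass(BSONFileInputFormat.class);
        FileInputFormat.setInputPaths(stage2, new Path("/pipeline/stage1"));
        stage2.setOutputFormatClass(MongoOutputFormat.class);
        // ... mapper/reducer setup ...
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}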