MongoDB Hadoop integration questions

Gonzalo López

Oct 7, 2016, 10:49:35 PM
to mongodb-user
I'm doing a project for research purposes and I'm trying to design an architecture (I'm not sure whether it is a case for applying MongoDB + Hadoop).

In this project, users will store their data and then they will be able to process it (through some service the application offers).

Users are able to store their data, so they can define their own data format (a channel).
As an example, a user who wants to send coordinates can define a channel "Coord":

Coord: {"X":"float", "Y":"float", "instant":"timestamp"}

There can also be video channels.

I was thinking of storing the channel definitions and the data in MongoDB.
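
To make it concrete, here is roughly what I have in mind, as a sketch with the MongoDB Java driver (the database and collection names are placeholders I made up):

import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.Date;

public final class ChannelStore {
    public static void main(String[] args) {
        MongoDatabase db = new MongoClient("localhost").getDatabase("channels");

        // One document per user-defined channel definition.
        db.getCollection("definitions").insertOne(
                new Document("name", "Coord")
                        .append("fields", new Document("X", "float")
                                .append("Y", "float")
                                .append("instant", "timestamp")));

        // Incoming data points go into a per-channel collection.
        db.getCollection("coord").insertOne(
                new Document("X", 40.4168)
                        .append("Y", -3.7038)
                        .append("instant", new Date()));
    }
}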

As I said at the beginning, users will also be able to process their data through some service; for example, there could be a service that offers face recognition on their video channels.

Here is where most of my doubts start; I'll try to be concrete. Some questions might seem stupid, but as I said, it's for research (I'm learning).

I was thinking of doing the video processing (and all the processing) using MapReduce or Spark.

Where do I store the information?
- Should I keep all the info in both MongoDB and HDFS?
- Should I keep all the info in MongoDB and copy it to HDFS at the moment I want to process it? (I think that might take some time.)
- I've read about the mongo-hadoop connector; I might be wrong, but from what I understood the data doesn't go to HDFS if I use the connector. Is it the same (in performance and so on) to run MapReduce against data in MongoDB (through the connector) as against data in HDFS?

I used the connector to do video processing using the GridFSInputFormat; there are some things I still have to solve here, but my question remains: is it the same, or is it better if I first copy the videos to HDFS?

Any help will be much appreciated; I've spent a lot of time thinking and trying to make a decision.

Wan Bachtiar

Oct 11, 2016, 3:34:07 AM
to mongodb-user

Should I keep all the info in both MongoDB and HDFS?

Hi Gonzalo,

This would be a decision that you have to make as the domain expert of your system. You could store the 'channel' metadata and data in MongoDB, or in a combination of MongoDB and HDFS.

You may find the following resources useful:

Should I keep all the info in MongoDB and copy it to HDFS at the moment I want to process it? (I think that might take some time.)

This depends on the processing use case. For example, using the MongoDB Connector for Spark you could load collection data from MongoDB, process it in Apache Spark, and even store the resulting computation back to MongoDB.
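
For instance, a minimal sketch with the connector's Java API (the database, collection, and field names here are made up for illustration, not a prescribed layout):

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.Document;

public final class CoordJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("CoordJob")
                // Hypothetical URIs: "channels.coord" holds the Coord documents,
                // "channels.coord_out" receives the results.
                .set("spark.mongodb.input.uri", "mongodb://localhost/channels.coord")
                .set("spark.mongodb.output.uri", "mongodb://localhost/channels.coord_out");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Load the collection as an RDD of BSON documents...
        JavaMongoRDD<Document> coords = MongoSpark.load(jsc);

        // ...process it in Spark (here: keep points in the first quadrant)...
        JavaRDD<Document> filtered = coords.filter(
                d -> d.getDouble("X") > 0 && d.getDouble("Y") > 0);

        // ...and write the result straight back to MongoDB.
        MongoSpark.save(filtered);
        jsc.close();
    }
}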

See also:

This would be suitable for document manipulation or aggregation.
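
For aggregation specifically, the connector can push a pipeline down to MongoDB so that only matching documents are shipped to Spark. Continuing the sketch above (same jsc; the field name is again a placeholder):

// The $match stage runs inside MongoDB; only matching documents reach Spark.
JavaMongoRDD<Document> positive = MongoSpark.load(jsc).withPipeline(
        java.util.Collections.singletonList(
                Document.parse("{ \"$match\": { \"X\": { \"$gt\": 0 } } }")));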

from what I understood the data doesn't go to HDFS if I use the connector. Is it the same (in performance and so on) to run MapReduce against data in MongoDB (through the connector) as against data in HDFS? I used the connector to do video processing using the GridFSInputFormat

Although you can store data in both MongoDB and HDFS, they are two different things. HDFS is a distributed file system, and MongoDB is a document-oriented database. It's not a straightforward comparison, as it depends on whether you require the other features/characteristics each of them has.

If your video processing is mostly about processing large binary files, HDFS would probably be more performant, as it's closer to the file-management level. See also When to use GridFS.
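
If you do decide to stage the videos on HDFS before processing, the copy step itself is straightforward. A rough sketch using the Java driver's GridFS API and the Hadoop FileSystem API (the bucket name, addresses, and paths are placeholders):

import com.mongodb.Block;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.model.GridFSFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.OutputStream;

public final class GridFsToHdfs {
    public static void main(String[] args) throws IOException {
        MongoDatabase db = new MongoClient("localhost").getDatabase("channels");
        GridFSBucket videos = GridFSBuckets.create(db, "videos");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem hdfs = FileSystem.get(conf);

        // Stream every GridFS file into an HDFS file of the same name.
        videos.find().forEach((Block<GridFSFile>) file -> {
            try (OutputStream out = hdfs.create(new Path("/videos/" + file.getFilename()))) {
                videos.downloadToStream(file.getObjectId(), out);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        hdfs.close();
    }
}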

As always, you should also run some tests in your specific environment with your own use cases.

it's for research (I'm learning).

I would recommend enrolling in a free online course at MongoDB University to learn more about MongoDB; a new session has just started today, so you can join straight away. The M101 courses in particular cover data modelling/schema topics.

Regards,

Wan.
