How do MongoDB and Hadoop play together in a virtualized cluster?


Koko

Jun 13, 2017, 5:48:18 PM
to mongodb-user

I have a few questions concerning a virtualized cluster consisting of Hadoop and MongoDB.

Some information about my data pipeline:

  • I do not frequently transfer data from HDFS to MongoDB, but I will need to do so occasionally.
  • I use HDFS as the data lake, with some data warehousing capabilities offered by Hive.
  • I plan to use Spark to run analytic tasks on the MongoDB data.
  • I might also use Spark to run some tasks on the HDFS data (not very likely at the moment).
  • I am aware that separating Hadoop and MongoDB onto different virtual nodes might introduce network latency when writing data from HDFS to MongoDB or vice versa.

Questions:

  1. Are there downsides of running the Hadoop ecosystem (HDFS, YARN, Hive, Spark to name a few services) on the same virtualized machine as MongoDB?
  2. Are the virtualized Hadoop and MongoDB nodes supposed to be "always-on", or can the nodes be shut down during longer idle periods? Are there downsides of shutting them down - except the longer restart time for the next analytic tasks?
  3. Spark is somewhat coupled to the Hadoop ecosystem. Which is the preferred way to run the cluster?

    • running Hadoop and MongoDB together on the same virtual nodes

    • running Hadoop and MongoDB separately, but only have Spark on the Hadoop nodes

    • running Hadoop and MongoDB separately, and have Spark on both the Hadoop and MongoDB nodes

I couldn't find sufficient information about how to build a cluster like this. Thanks in advance!

Note: I also asked this question on Stack Overflow, but found that this might be another good, if not better, place to ask.

Wan Bachtiar

Jun 25, 2017, 8:28:36 PM
to mongodb-user

Hi Koko,

Are there downsides of running the Hadoop ecosystem (HDFS, YARN, Hive, Spark to name a few services) on the same virtualized machine as MongoDB?

Be mindful of resource contention when sharing systems, and ensure your workload is suitable for both. See also MongoDB: Hardware Considerations.

Are the virtualized Hadoop and MongoDB nodes supposed to be “always-on”, or can the nodes be shut down during longer idle periods?

This may vary based on the nature of your application use case.

Generally, data processing nodes (Spark workers, Hadoop MapReduce workers) can run in burst mode: start more nodes when there is work and shut them down while idle. Data storage nodes (databases), however, are generally kept on. For example, during the day there may be a constant trickle of inserts into the database, and once at the end of the day an aggregation job runs (lots of database reads).

Are there downsides of shutting them down - except the longer restart time for the next analytic tasks?

Reading from disk is a fairly expensive operation compared to reading from memory. Depending on your analytics tasks, you may end up pulling in lots of the same data again (e.g. user ids) that would otherwise still have been preloaded in memory from a previous run.

Which is the preferred way to run the cluster?

Assuming that you’re referring to HDFS when you say ‘Hadoop’, and that you only intend to use Spark for processing rather than Hadoop MapReduce, I would recommend trying out Spark on the MongoDB nodes to achieve data locality. See MongoDB Spark Connector: FAQ for more information.
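To make the data-locality option concrete, here is a minimal PySpark sketch of reading a collection through the MongoDB Spark Connector. Treat it as a configuration sketch rather than a tested job: the host, database, and collection names (`mongo-host`, `mydb.events`, `mydb.results`) and the `userId` field are hypothetical, it assumes a Spark 2.x deployment with the `mongo-spark-connector` package available, and it needs a running mongod to actually execute.

```python
from pyspark.sql import SparkSession

# Hypothetical URIs -- replace host, database, and collection names
# with those of your own deployment.
spark = (SparkSession.builder
         .appName("mongo-analytics")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongo-host:27017/mydb.events")
         .config("spark.mongodb.output.uri",
                 "mongodb://mongo-host:27017/mydb.results")
         .getOrCreate())

# The connector exposes the collection as a DataFrame; when Spark workers
# are co-located with the MongoDB nodes, partitions can be read locally.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.groupBy("userId").count().show()
```

Submitting with something like `--packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0` pulls in the connector; the connector FAQ covers how its partitioners interact with shard and replica-set locality.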

You may also find the following useful:

Regards,
Wan.
