I have a few questions concerning a virtualized Cluster consisting of Hadoop and MongoDB.
Some information about my data pipeline:
Questions:
Spark is somewhat coupled to the Hadoop ecosystem. Which is the preferred way to run the cluster?
running Hadoop and MongoDB on my virtual nodes together
running Hadoop and MongoDB separately, but only have Spark on the Hadoop nodes
running Hadoop and MongoDB separately, and have Spark on both the Hadoop and MongoDB nodes
I couldn't find sufficient information about how to build a cluster like this. Thanks in advance!
Are there downsides of running the Hadoop ecosystem (HDFS, YARN, Hive, Spark to name a few services) on the same virtualized machine as MongoDB?
Hi Koko,
Be mindful of resource contention when both systems share the same machine, and make sure your workload is suitable for running them together. See also MongoDB: Hardware Considerations
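To make the contention point concrete, here is a rough memory budget for a node that hosts both mongod and Hadoop services. This is only a sketch: it assumes the WiredTiger storage engine with its default cache size (the larger of 50% of (RAM - 1 GB) or 256 MB, as of MongoDB 3.4), and `os_reserve_gb` is a made-up allowance for the OS, not a recommended value. mongod also uses memory beyond the cache (connections, filesystem cache), so treat the result as an upper bound on what is left for YARN/Spark.

```python
def default_wiredtiger_cache_gb(ram_gb):
    """Default WiredTiger cache: larger of 50% of (RAM - 1 GB) or 256 MB."""
    return max(0.5 * (ram_gb - 1.0), 0.25)


def memory_budget(ram_gb, os_reserve_gb=2.0):
    """Rough split of a shared node's RAM.

    os_reserve_gb is a hypothetical allowance for the OS and other daemons.
    """
    cache_gb = default_wiredtiger_cache_gb(ram_gb)
    # Whatever is not claimed by the cache or the OS reserve is the most
    # that could be handed to YARN containers / Spark executors.
    left_gb = max(ram_gb - cache_gb - os_reserve_gb, 0.0)
    return {"wiredtiger_cache_gb": cache_gb, "left_for_hadoop_gb": left_gb}


if __name__ == "__main__":
    print(memory_budget(32))
```

If the remainder is too small for your Spark executors, either cap the WiredTiger cache explicitly or keep the two systems on separate nodes.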
Are the virtualized Hadoop and MongoDB nodes supposed to be “always-on” or can the nodes be shut down when in longer idle state?
This may vary based on the nature of your application use case.
Generally, data processing nodes (Spark workers, Hadoop MR workers) can run in burst mode: start more when there is work and shut them down while idle. Data storage nodes (databases), however, are generally kept on. For example, a database may receive a steady trickle of inserts during the day, followed by aggregation processing (lots of database reads) once at the end of the day.
Are there downsides of shutting them down - except the longer restart time for the next analytic tasks?
Reading from disk is quite an expensive operation compared to reading from memory. After a restart the database cache is empty, so, depending on your analytics tasks, you may have to pull in much of the same data again (e.g. user ids) that would otherwise still be loaded in memory.
Which is the preferred way to run the cluster?
Assuming that you’re referring to HDFS when you say ‘Hadoop’, and that you only intend to use Spark for processing rather than Hadoop MR, I would recommend trying out Spark on the MongoDB nodes to achieve data locality. See MongoDB Spark Connector: FAQ for more information.
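As a rough illustration of reading a collection from Spark, here is a minimal PySpark sketch. It assumes MongoDB Spark Connector 10.x (the `mongodb` format name and the `spark.mongodb.read.*` settings; older connector releases use different names) and that the connector package is already on the Spark classpath. The URI, database, and collection names are placeholders, not values from your setup.

```python
def mongo_read_conf(uri):
    """Session-level settings for the connector (10.x naming assumed)."""
    return {"spark.mongodb.read.connection.uri": uri}


def load_collection(uri, database, collection):
    """Return a DataFrame backed by one MongoDB collection."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("mongo-read-sketch")
    for key, value in mongo_read_conf(uri).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    return (
        spark.read.format("mongodb")
        .option("database", database)
        .option("collection", collection)
        .load()
    )


if __name__ == "__main__":
    # Placeholder URI and names; replace with your own deployment's values.
    df = load_collection("mongodb://127.0.0.1:27017", "analytics", "events")
    df.printSchema()
```

This only shows the read path. Whether you actually get data locality depends on where the Spark workers run relative to the mongod processes, which the connector FAQ covers.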
You may also find the following useful:
Regards,
Wan.