Query regarding MongoDB and Spark integration


Hemanta Baruah

Jul 4, 2017, 10:45:41 AM7/4/17
to mongodb-user


Hi everyone,

I need some information regarding MongoDB and Spark integration. I have a 3-node Spark cluster (1 worker and two data nodes) with the YARN scheduler, and a MongoDB server that is outside the cluster but on the same LAN. I now need to load all the data stored in the MongoDB database into that cluster (HDFS). Is this possible?

Using the mongo-hadoop connector and pymongo_spark, I can load data from the mongo server, process it on the Spark cluster, and write the computed result back to the mongo server. But my problem is that I want to transfer my whole dataset to the Spark cluster.

Wan Bachtiar

Jul 18, 2017, 1:12:20 AM7/18/17
to mongodb-user

> Is it possible?

Hi Hemanta,

Yes, as long as the cluster has network access to your MongoDB instance.
I would also recommend checking out the MongoDB Connector for Spark.

As a Python example, you could load data from your MongoDB instance into a Spark DataFrame as below:

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

See also Spark Connector Python Guide for examples and tutorials.
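In context, the one-liner above assumes a SparkSession that already knows where MongoDB lives. A minimal sketch of that setup, untested against a live cluster (the host, port, database, and collection names are placeholders for your own deployment, and the `spark.mongodb.input.uri` key is the connector's 2.x-style configuration):

```python
from pyspark.sql import SparkSession

# Point the MongoDB Spark Connector at the source collection.
# Replace mongo-host/mydb.mycollection with your own values.
spark = SparkSession.builder \
    .appName("mongo-to-hdfs") \
    .config("spark.mongodb.input.uri",
            "mongodb://mongo-host:27017/mydb.mycollection") \
    .getOrCreate()

# With the input URI set on the session, load() needs no extra options.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
```

Note that the connector jar must be on the classpath, e.g. by launching with `--packages org.mongodb.spark:mongo-spark-connector_2.11:<version>` (Scala/connector versions must match your Spark build).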

> But my problem is that I want to transfer my whole data to the spark cluster.

Once you have loaded the collection data into a Spark RDD or DataFrame, you can store it in HDFS.
See also: pyspark.RDD and pyspark.sql.DataFrame
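For instance, a DataFrame can be persisted to HDFS directly with its writer. A minimal sketch, assuming `df` is the DataFrame loaded from MongoDB and that the NameNode address and path below (placeholders) match your cluster:

```python
# Write the whole collection to HDFS as Parquet files.
# "overwrite" replaces any previous dump at that path.
df.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/user/hemanta/mongo_dump")

# An RDD can be persisted similarly as plain text part-files:
# rdd.saveAsTextFile("hdfs://namenode:8020/user/hemanta/mongo_dump_txt")
```

Parquet keeps the DataFrame's schema, so the data can later be read back with `spark.read.parquet(...)` without re-inferring types.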

Regards,
Wan.
