Difference between MongoDB Spark connector and PyMongo when using PySpark


sivapra...@gmail.com

Jun 15, 2017, 5:51:55 PM
to mongodb-user
I'm working on some large-scale processing and using Python (PySpark) for much of the development. In that case, what is the difference between using PyMongo and the MongoDB Spark connector?

Bernie Hackett

Jun 15, 2017, 6:21:06 PM
to mongodb-user
PyMongo doesn't provide any support for PySpark, but the MongoDB Spark connector does.
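
To illustrate the difference (a minimal sketch; the host, database, and collection names are placeholders): PyMongo is a general-purpose driver that pulls every document through a single client process, while the connector exposes a collection to Spark as a distributed DataFrame read in parallel by the executors.

        # PyMongo: a plain client; all documents flow through one process,
        # with no Spark parallelism involved.
        from pymongo import MongoClient

        client = MongoClient("mongodb://127.0.0.1:27017")
        docs = list(client.marketdata.minbars.find())

        # MongoDB Spark connector (2.x API): the collection becomes a Spark
        # DataFrame, partitioned and read in parallel across the cluster.
        # Assumes pyspark was launched with
        # --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .config("spark.mongodb.input.uri",
                         "mongodb://127.0.0.1/marketdata.minbars")
                 .getOrCreate())
        df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

In short: use PyMongo for ordinary application access, and the connector when the data needs to participate in Spark jobs.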

Hemanta Baruah

Jun 25, 2017, 3:22:21 AM
to mongodb-user
Sir, I am working in an organization that previously used a MongoDB database for storing social media data. But since the data has gradually grown and accessing it has become slow, we need to move to Spark immediately for real-time processing and some distributed ML tasks. For this I have set up Spark experimentally on a cluster of 3 nodes (1 namenode and 2 datanodes) under the YARN resource manager. So far my cluster works perfectly in distributed mode. But as our data was previously stored in MongoDB, we have to migrate it to Spark for further processing. However, I am not able to set up the mongo-hadoop connector, or I am doing something wrong in the configuration. Since I am new to this distributed field, Sir, please help me out with some detailed step-by-step explanation. Thanks in advance.

Wan Bachtiar

Jun 26, 2017, 9:28:59 PM
to mongodb-user

please help me out with some detailed step-by-step explanation . Thanks in advance.

Hi Hemanta,

You can get started by reviewing the MongoDB Connector for Spark or following the free online course at MongoDB University M233: Getting Started with Spark and MongoDB.
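
As a rough starting point, the connector documentation boils down to launching the PySpark shell with the connector package and input/output URIs, along these lines (the versions and URIs here are illustrative, not a recommendation for any particular cluster):

        ./bin/pyspark \
            --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection" \
            --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
            --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0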

Afterward, if you’re encountering a specific issue, please open a new discussion thread describing your problem along with:

  • MongoDB version and topology
  • Spark version
  • MongoDB Connector for Spark version
  • The issue that you’re having and any error messages that you’re getting.

Regards,

Wan.

Hemanta Baruah

Jun 27, 2017, 9:31:56 AM
to mongodb-user

Sir, I am using:
1. MongoDB 3.4.5
2. Spark 2.1.1
3. Hadoop 2.6.4
4. pymongo_spark

In the terminal I am using this command:

./bin/pyspark \
    --jars mongo-hadoop-spark.jar --jars mongo-java-driver.jar \
    --driver-class-path mongo-hadoop-spark.jar --driver-class-path mongo-java-driver.jar \
    --py-files /home/hduser/mongo-hadoop/spark/src/main/python/pymongo_spark.py \
    --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/marketdata.minbars?readPreference=primaryPreferred" \
    --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/marketdata.people" \
    --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0
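
One note on these flags: Spark treats a repeated --jars or --driver-class-path option as an overwrite rather than an append, so in the command above only mongo-java-driver.jar actually takes effect for each. The usual form is a single comma-separated --jars list and a single colon-separated classpath, roughly:

        ./bin/pyspark \
            --jars mongo-hadoop-spark.jar,mongo-java-driver.jar \
            --driver-class-path mongo-hadoop-spark.jar:mongo-java-driver.jar \
            ...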
 

I have put both the jar files in the Spark home directory, and for pymongo_spark I set the path to pymongo_spark.py. I am using Spark in a cluster of 3 nodes (1 master and 2 slaves). I am running the above command from the namenode terminal.


The Spark Python shell starts successfully.


I want to import the documents stored in the minbars collection of the marketdata database into an RDD, and save the RDD contents back to another collection, people, in the marketdata database.
In my pyspark shell I ran this:


        >>> import pymongo_spark
        >>> pymongo_spark.activate()

        >>> rdd = sc.mongoRDD('mongodb://127.0.0.1:27017/marketdata.minbars')
        >>> rdd.saveToMongoDB('mongodb://127.0.0.1:27017/marketdata.people')


When I call rdd.saveToMongoDB(), it gives me the following error:

 
Traceback (most recent call last):                                             
  File "<stdin>", line 1, in <module>
  File "/tmp/spark-5faf68c7-4628-4fb4-924a-1254338f73eb/userFiles-fe40d5dc-634c-4e28-98c4-c92424576732/pymongo_spark.py", line 26, in saveToMongoDB
    sample = self.first()
  File "/usr/local/spark/python/pyspark/rdd.py", line 1366, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty

I have also attached snapshots of my collections. My output collection "people" is empty here.
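
The traceback shows that saveToMongoDB() begins by calling self.first() on the RDD, so the ValueError means mongoRDD() returned no documents: it is the read, not the write, that needs debugging (for example, whether the mongo-hadoop jars and configuration are actually reaching the executors). One detail worth checking separately: Spark 2.1.1 is built against Scala 2.11, while the mongo-spark-connector_2.10:1.1.0 package in the launch command is a Scala 2.10 build targeting Spark 1.x. As a sketch of an alternative that sidesteps mongo-hadoop and pymongo_spark entirely, the same copy could be done with the official connector's DataFrame API (assuming pyspark is launched with --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0):

        # Read marketdata.minbars in parallel into a DataFrame ...
        df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
            .option("uri", "mongodb://127.0.0.1:27017/marketdata.minbars") \
            .load()

        # ... and append its contents to marketdata.people.
        df.write.format("com.mongodb.spark.sql.DefaultSource") \
            .mode("append") \
            .option("uri", "mongodb://127.0.0.1:27017/marketdata.people") \
            .save()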