How to configure MongoDB-Spark connector for Windows


Shubham Shetty

Jun 29, 2017, 1:32:52 AM
to mongodb-user
I am a beginner at Spark, and I need to use Spark with MongoDB as a data store for a project at the company I am interning at. I have set up Spark as standalone and have set up MongoDB on my Windows machine, but I am not able to configure the Spark-MongoDB connector. Can someone please list the steps to do so? I read the documentation on the official site, but I am not able to figure it out.

Wan Bachtiar

Jul 3, 2017, 11:31:36 PM
to mongodb-user

Hi Shubham,

You can start by following the MongoDB Spark Connector documentation.

If you already have a standalone Spark deployment and a MongoDB instance running, you could start by testing the connector by invoking spark-shell as in the example below:

spark-shell --conf "spark.mongodb.input.uri=mongodb://<HOST>:<PORT>/<DATABASE>.<COLLECTION>" --conf "spark.mongodb.output.uri=mongodb://<HOST>:<PORT>/<DATABASE>.<COLLECTION>" --packages org.mongodb.spark:mongo-spark-connector_<SCALA_VERSION>:<MONGODB_SPARK_CONNECTOR_VERSION>

(Replace the values in < and > with values relevant to your environment.)
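
For example, a hypothetical invocation for a MongoDB deployment running locally on the default port might look like the following (the database/collection names and the connector version here are only examples; check the connector documentation for the release that matches your Spark and Scala versions):

spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/test.myCollection" --conf "spark.mongodb.output.uri=mongodb://127.0.0.1:27017/test.myCollection" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.1.0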

If you’re using Scala, you could run the simple example code below to read from a MongoDB collection:

import com.mongodb.spark._

val rdd = MongoSpark.load(sc)  // reads from spark.mongodb.input.uri
println("Number of documents read from collection: " + rdd.count)
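
To verify writes as well, here is a minimal sketch assuming the same spark-shell session and that spark.mongodb.output.uri points at a test collection (the documents themselves are made up):

import com.mongodb.spark._
import org.bson.Document

// Parse two throwaway documents and save them to the collection
// named in spark.mongodb.output.uri
val docs = sc.parallelize(Seq("{test: 1}", "{test: 2}").map(Document.parse))
MongoSpark.save(docs)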

If you have any specific questions, please provide:

  • MongoDB version
  • Apache Spark version
  • MongoDB Spark Connector version
  • Example code (whether it’s Scala, Python, Java)
  • What you’re trying to achieve

You may also find the Spark quickstart docs a useful reference for basic operations in Spark.
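
For instance, once the rdd above is loaded, ordinary Spark transformations apply, and the connector also lets you push an aggregation pipeline down to MongoDB. A small sketch (the qty field is hypothetical):

import org.bson.Document

// Run a $match stage inside MongoDB rather than filtering in Spark
val filtered = rdd.withPipeline(Seq(Document.parse("{ $match: { qty: { $gte: 5 } } }")))
println(filtered.count)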

Regards,

Wan.

Shubham Shetty

Jul 6, 2017, 2:37:22 AM
to mongodb-user
I'm running Spark as standalone on Windows 7 without underlying Hadoop. When I run the Spark shell as you have specified, Spark loads up but shows the following log messages:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/06 11:53:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/06 11:54:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/07/06 11:54:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/07/06 11:54:10 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016 20:42:59)
SparkSession available as 'spark'.

Are Hadoop and Hive necessary for this to work?

Shubham Shetty

Jul 6, 2017, 2:40:46 AM
to mongodb-user
I am using Scala version 2.11.8, Spark version 2.1.1, and MongoDB version 3.4.4.

Wan Bachtiar

Jul 11, 2017, 9:12:53 PM
to mongodb-user

When I run the Spark shell as you have specified, Spark loads up but shows the following log messages:

Hi Shubham,

Those are only warning messages; you should still be able to use the MongoDB Spark Connector from the (Python) Spark shell.

Are you having a specific issue with the MongoDB Spark Connector via the Spark shell? Are you seeing error messages? If so, please provide the action that you're attempting and the error message.

If you want to, you could also change the logging level by editing the conf/log4j.properties file. See also log4j v1.2 Level.
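
For example, assuming you copied the conf/log4j.properties.template that ships with Spark to conf/log4j.properties, raising the root logger threshold suppresses the WARN noise:

# conf/log4j.properties: raise the default INFO threshold
log4j.rootCategory=ERROR, console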

Regards,

Wan.
