MongoDB Spark Connector - how to create an RDD using the Python connector

kwb

Jul 25, 2017, 10:24:36 PM
to mongodb-user
I'm doing a prototype using the MongoDB Spark Connector to load Mongo documents into Spark. We have a large existing code base
written in Python that processes input Mongo documents and produces multiple output documents per input document.
The Mongo documents do not have a defined schema.

I've tried out the MongoDB Spark Connector and have run into issues with the Python connector.

Using the Scala connector I can easily read a large number of documents from a Mongo collection into a Spark RDD
of BSON Documents, then map that to an RDD of JSON strings.

I want to use the Python connector, since all our existing code is Python. The documentation for the Python connector seems to indicate
that documents read into Spark via the Python connector must have a defined schema.
The example shows them being read into a Spark DataFrame: https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/

Is it possible to use the Python connector to get a Spark RDD of BSON documents (or JSON strings)? Or is this planned?




Wan Bachtiar

Aug 4, 2017, 12:32:31 AM
to mongodb-user

The documentation for the Python connector seems to indicate that documents read into Spark via the Python connector must have a defined schema.

Hi,

You don’t have to define a schema. For example, in PySpark you can do the following:

# Read from MongoDB; the connector infers the schema by sampling documents
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
                    .option("spark.mongodb.input.uri", "mongodb://host:port/dbname.collection") \
                    .load()
# Print the first record
df.first()
# Get an RDD of Rows from the DataFrame
myRDD = df.rdd
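
If you specifically want JSON strings rather than Rows, DataFrame.toJSON() returns an RDD of JSON strings directly. A minimal sketch building on the DataFrame above (the URI there is a placeholder):

# toJSON() serialises each Row back into a JSON string, giving an RDD of strings
jsonRDD = df.toJSON()
# Inspect a few documents as JSON
for doc in jsonRDD.take(5):
    print(doc)

Note that the connector infers the schema by sampling the collection, so fields missing from a given document come back as null rather than causing an error.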

I’ve tried out the MongoDB Spark Connector and have run into issues with the Python connector.

If you run into an issue using the MongoDB Spark Connector (Python), please provide:

  • MongoDB Spark Connector version
  • Spark version
  • A code snippet sufficient to reproduce the issue (see the sketch below)
  • Any error messages that you’re getting
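
For example, a minimal self-contained script along these lines (assuming Spark 2.x with SparkSession; the URI is a placeholder for the failing collection) usually gives us everything needed to reproduce:

from pyspark.sql import SparkSession

# Placeholder URI - point this at the collection that triggers the problem
spark = SparkSession.builder \
    .appName("mongo-connector-repro") \
    .config("spark.mongodb.input.uri", "mongodb://host:port/dbname.collection") \
    .getOrCreate()

# Include this output in the report, along with the connector version
print(spark.version)

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()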

Regards,
Wan.
