from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pymongo_spark

# Register mongoRDD() and the other pymongo_spark helpers on SparkContext
pymongo_spark.activate()

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# dbhost, dbport and collection_name are set earlier in my script
connection_string = 'mongodb://%s:%s/randdb.%s' % (dbhost, dbport, collection_name)
trainrdd = sc.mongoRDD(connection_string)
# traindf = sqlContext.createDataFrame(trainrdd)
# traindf = sqlContext.read.json(trainrdd)
traindf = sqlContext.jsonRDD(trainrdd)
I have also tried the variants that are commented out in the code, but all of them are equally slow. For a collection of about 2 GB (100,000 rows and 1,000 columns), it takes around 6 hours (holy moly :/) on a cluster of 3 machines, each with 12 cores and 72 GB of RAM (using all the cores in the Spark cluster). The MongoDB server is also running on one of these machines.
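One alternative I have been looking at (but have not benchmarked yet) is skipping pymongo_spark entirely and loading the collection straight into a DataFrame with the official MongoDB Spark connector, so Spark never has to go through a Python dict RDD and JSON conversion. A rough sketch, assuming the mongo-spark-connector package is supplied to spark-submit and reusing the same host/port/collection variables as above:

# Hypothetical sketch: read the collection directly as a DataFrame via the
# MongoDB Spark connector (assumes the connector is added with --packages)
uri = 'mongodb://%s:%s/randdb.%s' % (dbhost, dbport, collection_name)
traindf = sqlContext.read \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .option('uri', uri) \
    .load()

Would that be expected to be noticeably faster than going through mongoRDD + jsonRDD?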
I am not sure if I am doing it correctly. Any pointers on how to optimize this code would be really helpful.