How to efficiently read a MongoDB collection into Spark's DataFrame


Bhanu

Apr 20, 2016, 4:13:09 PM
to mongodb-user
I am trying to load a MongoDB collection into Spark's DataFrame using the mongo-hadoop connector. Here is a snippet of the relevant code:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pymongo_spark

pymongo_spark.activate()
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlcontext = SQLContext(sc)

# dbhost, dbport and collection_name are set elsewhere in my script
connection_string = 'mongodb://%s:%s/randdb.%s' % (dbhost, dbport, collection_name)
trainrdd = sc.mongoRDD(connection_string)

# traindf = sqlcontext.createDataFrame(trainrdd)
# traindf = sqlcontext.read.json(trainrdd)
traindf = sqlcontext.jsonRDD(trainrdd)

I have also tried the variants that are commented out in the code above, but all of them are equally slow. For a collection of about 2 GB (100,000 rows and 1,000 columns), the load takes around 6 hours (holy moly :/) on a cluster of 3 machines, each with 12 cores and 72 GB RAM (using all the cores in the Spark cluster). The MongoDB server is also running on one of these machines.
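
For reference, here is roughly what I think a more explicit, tunable version of the same load would look like. The config dict passed to mongoRDD and the mongo.input.split_size option are assumptions on my part from reading the mongo-hadoop docs (a smaller split size should mean more input splits and hence more Spark partitions), so please correct me if that is not the right knob:

from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()
sc = SparkContext(conf=SparkConf())

# placeholder connection string; substitute the real host, port and collection
connection_string = 'mongodb://localhost:27017/randdb.train'

# assumption: mongo-hadoop reads these Hadoop-style options when building
# input splits, so a smaller split size should give more Spark partitions
config = {
    'mongo.input.split_size': '8',  # split size in MB
}

trainrdd = sc.mongoRDD(connection_string, config)
print(trainrdd.getNumPartitions())  # sanity check on the resulting parallelism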

I am not sure if I am doing it correctly. Any pointers on how to optimize this code would be really helpful.


