from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pymongo_spark

# Register mongoRDD() and the other pymongo_spark helpers on SparkContext
pymongo_spark.activate()

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# dbhost, dbport and collection_name are set earlier in my script
connection_string = 'mongodb://%s:%s/randdb.%s' % (dbhost, dbport, collection_name)
trainrdd = sc.mongoRDD(connection_string)
# traindf = sqlContext.createDataFrame(trainrdd)
# traindf = sqlContext.read.json(trainrdd)
traindf = sqlContext.jsonRDD(trainrdd)
I have also tried the variants that are commented out in the code, but all of them are equally slow. For a collection of about 2 GB (100,000 rows and 1,000 columns), it takes around 6 hours (holy moly :/) on a cluster of 3 machines, each with 12 cores and 72 GB of RAM (using all the cores in the Spark cluster). The MongoDB server is also running on one of these machines.
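One alternative I have been looking at (but have not benchmarked yet) is skipping pymongo_spark entirely and loading the collection straight into a DataFrame with the official MongoDB Spark connector, so Spark never has to go through a Python dict RDD and JSON conversion. A rough sketch, assuming the mongo-spark-connector package is supplied to spark-submit and reusing the same host/port/collection variables as above:

# Hypothetical sketch: read the collection directly as a DataFrame via the
# MongoDB Spark connector (assumes the connector is added with --packages)
uri = 'mongodb://%s:%s/randdb.%s' % (dbhost, dbport, collection_name)
traindf = sqlContext.read \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .option('uri', uri) \
    .load()

Would that be expected to be noticeably faster than going through mongoRDD + jsonRDD?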
I am not sure if I am doing it correctly. Any pointers on how to optimize this code would be really helpful.