Empty RDD while reading compressed BSON files


Shantanu Alshi

Apr 27, 2016, 7:39:47 AM
to mongodb-user
Hi,


I am using PySpark to analyse a BSON file. When I run my program on the uncompressed BSON file, it runs perfectly fine.
However, running the same program on the same file compressed with gzip gives me an empty RDD.

from datetime import datetime
from json import JSONEncoder

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'really_large_bson.gz'


class BsonEncoder(JSONEncoder):
    """JSON encoder for the BSON-specific types in the documents."""
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)


def setup_spark_with_pymongo(app_name='PySparkApp'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)  # ship pymongo_spark to the executors
    return sc


def main():
    spark_context = setup_spark_with_pymongo()
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()  # adds BSONFileRDD to SparkContext
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())   # ValueError: RDD is empty



I am using mongo-java-driver.jar 3.2.2, mongo-hadoop-spark.jar 1.5.2, pymongo_spark, and pymongo 3.2.2.
The deployed Spark version is 1.6.1 and Hadoop is 2.6.4.
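
For reference, this is the kind of local sanity check I have in mind to confirm the gzip stream itself contains valid BSON; a minimal sketch, assuming local access to one of the .gz files (decode_file_iter is PyMongo's streaming BSON decoder):

import gzip
from bson import decode_file_iter

LOCAL_FILE = 'really_large_bson.gz'  # hypothetical local copy of one file

with gzip.open(LOCAL_FILE, 'rb') as f:
    # decode_file_iter yields BSON documents one at a time from any
    # file-like object, so the whole file never has to fit in memory
    for i, doc in enumerate(decode_file_iter(f)):
        print(doc)
        if i >= 2:  # a few documents are enough to prove the stream is valid
            break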


I am aware that the current library does not support splitting compressed BSON files; however, in my opinion, it should still work as a single split. I have hundreds of these files to analyse, so decompressing all of them does not seem like a viable option.
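
One fallback I am considering is to bypass BSONFileRDD entirely and decompress in Python via SparkContext.binaryFiles, which yields one (path, bytes) pair per file, so each gzip stream stays intact. A rough, untested sketch (note it materialises each whole file on one executor, which may not scale to very large files):

import gzip
import io
from bson import decode_file_iter

def decode_gzipped_bson(pair):
    path, data = pair  # binaryFiles yields (path, contents-as-bytes) pairs
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        for doc in decode_file_iter(f):
            yield doc

# One record per file, decompressed and decoded in plain Python
rdd = spark_context.binaryFiles(HDFS_HOME + '*.gz').flatMap(decode_gzipped_bson)
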
Can anyone please point me in a direction to proceed?