Empty RDD while reading compressed BSON files


Shantanu Alshi

Apr 27, 2016, 7:39:47 AM
to mongodb-user
Hi,


I am using PySpark to analyse a BSON file. When I run my program on the uncompressed BSON file, it runs perfectly fine.
However, running the same program on the same file compressed with gzip gives me an empty RDD.

from datetime import datetime
from json import JSONEncoder

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'really_large_bson.gz'


class BsonEncoder(JSONEncoder):
    """JSON encoder for the BSON-specific types in the documents."""
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)


def setup_spark_with_pymongo(app_name='PySparkApp'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)  # ship pymongo_spark to the executors
    return sc


def main():
    spark_context = setup_spark_with_pymongo()
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()  # adds BSONFileRDD to SparkContext
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())   # ValueError: RDD is empty



I am using mongo-java-driver.jar 3.2.2, mongo-hadoop-spark.jar 1.5.2, pymongo_spark, and pymongo 3.2.2.
The deployed Spark version is 1.6.1 and Hadoop is 2.6.4.
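
For reference, this is the kind of local sanity check I have in mind to confirm the gzip stream itself contains valid BSON; a minimal sketch, assuming local access to one of the .gz files (decode_file_iter is PyMongo's streaming BSON decoder):

import gzip
from bson import decode_file_iter

LOCAL_FILE = 'really_large_bson.gz'  # hypothetical local copy of one file

with gzip.open(LOCAL_FILE, 'rb') as f:
    # decode_file_iter yields BSON documents one at a time from any
    # file-like object, so the whole file never has to fit in memory
    for i, doc in enumerate(decode_file_iter(f)):
        print(doc)
        if i >= 2:  # a few documents are enough to prove the stream is valid
            break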


I am aware that the current library does not support splitting compressed BSON files; however, in my opinion, it should still work as a single split. I have hundreds of these files to analyse, so decompressing all of them does not seem like a viable option.
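
One fallback I am considering is to bypass BSONFileRDD entirely and decompress in Python via SparkContext.binaryFiles, which yields one (path, bytes) pair per file, so each gzip stream stays intact. A rough, untested sketch (note it materialises each whole file on one executor, which may not scale to very large files):

import gzip
import io
from bson import decode_file_iter

def decode_gzipped_bson(pair):
    path, data = pair  # binaryFiles yields (path, contents-as-bytes) pairs
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        for doc in decode_file_iter(f):
            yield doc

# One record per file, decompressed and decoded in plain Python
rdd = spark_context.binaryFiles(HDFS_HOME + '*.gz').flatMap(decode_gzipped_bson)
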
Can anyone please point me in a direction to proceed?