Error reading compressed BSON with Apache Spark

Rafael Aguiar

Jan 26, 2016, 1:32:47 AM
to mongodb-user
I'm using PySpark (on Spark 1.3.1) along with the mongo-hadoop jar, both built from the master branch.

rdd = sc.newAPIHadoopRDD(
    inputFormatClass='com.mongodb.hadoop.BSONFileInputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf={
        'mapred.input.dir': 's3n://my-bucket/compressed_bson.gz'
    }
)


When I try to create the RDD above I get the following error:

INFO hadoop.BSONFileInputFormat: File s3n://my-bucket/compressed_bson.gz is compressed so cannot be split.
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/home/hadoop/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IllegalArgumentException: Wrong FS: s3n://my-bucket/compressed_bson.gz, expected: hdfs://10.0.2.139:9000

Has anyone faced something similar?

Luke Lovett

Jan 26, 2016, 3:02:41 PM
to mongodb-user
It doesn't look like the problem is that the BSON is compressed, but rather that Hadoop has not been configured to use the S3 filesystem (it's expecting HDFS, apparently). The connector's failure to recognize that "s3n://" means S3 is a bug in the connector (I just filed HADOOP-253), but you can work around it by configuring Hadoop to use S3 (and only S3) by setting "fs.default.name" and "fs.defaultFS".
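
A rough sketch of that workaround in PySpark, passing both filesystem keys alongside the input path in the job configuration. The helper name build_bson_conf and the value 's3n://my-bucket' are assumptions for illustration only. Whether a per-job setting is enough, or the keys need to go into the cluster's core-site.xml instead, will depend on the setup.

```python
def build_bson_conf(input_path, default_fs):
    """Build a Hadoop job conf that makes `default_fs` the default filesystem.

    Hypothetical helper: both key spellings are set because
    'fs.default.name' is the older name for 'fs.defaultFS'.
    """
    return {
        'mapred.input.dir': input_path,
        'fs.default.name': default_fs,
        'fs.defaultFS': default_fs,
    }


# Assumed value: the bucket root is used as the default filesystem URI.
conf = build_bson_conf('s3n://my-bucket/compressed_bson.gz', 's3n://my-bucket')

# With a live SparkContext `sc`, the dict would be passed as before:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass='com.mongodb.hadoop.BSONFileInputFormat',
#     keyClass='org.apache.hadoop.io.Text',
#     valueClass='org.apache.hadoop.io.MapWritable',
#     conf=conf,
# )
```

Note that this makes S3 the default for the whole job, so any paths without a scheme would then resolve against the bucket rather than HDFS.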

Rafael Aguiar

Jan 27, 2016, 10:16:35 AM
to mongodb-user
Luke,

I can read a regular BSON file from S3; it's only when I try the compressed ones that I see that error.

I'll try your suggestion, though.

Luke Lovett

Jan 27, 2016, 12:29:14 PM
to mongodb-user
I think it has to do with the code path that compressed BSON takes: the way the FileSystem is retrieved there ignores the scheme in the URI. A fix is already in the works, so by the time 1.5 comes out this will no longer be a problem.
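
To illustrate the kind of mismatch described here, the following is a toy model in plain Python, not the connector's or Hadoop's actual code. The function names filesystem_for and check_path are made up for the sketch. In Hadoop's Java API the analogous difference is between FileSystem.get(conf), which returns the default filesystem, and path.getFileSystem(conf), which resolves it from the path's own URI; whether the connector's fix takes exactly that form is an assumption.

```python
from urllib.parse import urlsplit


def filesystem_for(path, default_fs, use_scheme):
    """Return the URI of the filesystem that would be used to open `path`."""
    if not use_scheme:
        # Scheme ignored: every path is handed to the default filesystem.
        return default_fs
    parts = urlsplit(path)
    if not parts.scheme:
        return default_fs
    return "%s://%s" % (parts.scheme, parts.netloc)


def check_path(fs_uri, path):
    """Reject a path whose scheme/authority differ from the filesystem's."""
    fs, p = urlsplit(fs_uri), urlsplit(path)
    if p.scheme and (p.scheme, p.netloc) != (fs.scheme, fs.netloc):
        raise ValueError("Wrong FS: %s, expected: %s" % (path, fs_uri))


default_fs = 'hdfs://10.0.2.139:9000'
path = 's3n://my-bucket/compressed_bson.gz'

# Ignoring the scheme picks HDFS, so the check fails for the S3 path.
try:
    check_path(filesystem_for(path, default_fs, use_scheme=False), path)
    error = None
except ValueError as exc:
    error = str(exc)

# Honoring the scheme picks the S3 filesystem, and the check passes.
resolved = filesystem_for(path, default_fs, use_scheme=True)
check_path(resolved, path)
```

This would also be consistent with uncompressed files working: only the code path that ignores the scheme ends up asking the default filesystem about an S3 path.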

Luke Lovett

Jan 27, 2016, 2:18:33 PM
to mongodb-user
I just resolved HADOOP-253; this should be fixed now in the master branch.

Rafael Aguiar

Feb 1, 2016, 11:05:55 AM
to mongod...@googlegroups.com
I tested it and it works. Thanks again, Luke!
