Rafael Aguiar

Jan 26, 2016, 1:32:47 AM

to mongodb-user

I'm using PySpark (on Spark 1.3.1) along with the mongo-hadoop jar, both built from the master branch.

rdd = sc.newAPIHadoopRDD(
    inputFormatClass='com.mongodb.hadoop.BSONFileInputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf={
        'mapred.input.dir': 's3n://my-bucket/compressed_bson.gz'
    }
)

When I try to create the RDD above I get the following error:

INFO hadoop.BSONFileInputFormat: File s3n://my-bucket/compressed_bson.gz is compressed so cannot be split.
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/home/hadoop/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IllegalArgumentException: Wrong FS: s3n://my-bucket/compressed_bson.gz, expected: hdfs://10.0.2.139:9000

Has anyone faced something similar?

Luke Lovett

Jan 26, 2016, 3:02:41 PM

to mongodb-user

The problem doesn't appear to be that the BSON is compressed, but that Hadoop has not been configured to use the S3 filesystem (it's apparently expecting HDFS). The connector's failure to recognize that "s3n://" means S3 is a bug in the connector (I just filed HADOOP-253), but you can work around it by configuring Hadoop to use S3 (and only S3) by setting "fs.default.name" and "fs.defaultFS".
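
A minimal sketch of that workaround in PySpark terms: derive the filesystem root from the input URI and set both keys alongside "mapred.input.dir". The helper name `hadoop_conf_for` is hypothetical (it is not part of mongo-hadoop or PySpark), and it is an assumption that passing these keys in the per-job conf is sufficient; they may need to go in the cluster's core-site.xml instead.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def hadoop_conf_for(input_uri):
    """Build a job conf whose default filesystem matches the input URI.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(input_uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(
            "expected a full URI such as s3n://bucket/key, got %r" % input_uri)
    fs_root = "%s://%s" % (parsed.scheme, parsed.netloc)
    return {
        "mapred.input.dir": input_uri,
        "fs.default.name": fs_root,  # older, deprecated key
        "fs.defaultFS": fs_root,     # newer key
    }


conf = hadoop_conf_for("s3n://my-bucket/compressed_bson.gz")

# In the pyspark shell, where `sc` already exists:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass='com.mongodb.hadoop.BSONFileInputFormat',
#     keyClass='org.apache.hadoop.io.Text',
#     valueClass='org.apache.hadoop.io.MapWritable',
#     conf=conf
# )
```

Note that pointing the default filesystem at S3 affects every unqualified path in the job, which is why this is a workaround rather than a fix.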

Rafael Aguiar

Jan 27, 2016, 10:16:35 AM

to mongodb-user

Luke,

I can read a regular BSON file from S3; I only see that error when I try the compressed ones.

I'll try your suggestion, though.

Luke Lovett

Jan 27, 2016, 12:29:14 PM

to mongodb-user

I think it has to do with the code path that compressed BSON takes: the FileSystem is retrieved in a way that ignores the scheme in the URI. A fix is already in the works, so by the time 1.5 comes out this will no longer be a problem.
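
A toy model of that failure mode, in plain Python rather than the connector's Java. All function names here are hypothetical. In Hadoop terms the difference is presumably between `FileSystem.get(conf)`, which returns the default filesystem, and `Path.getFileSystem(conf)`, which resolves it from the path's own URI; that mapping is an assumption and has not been checked against the connector's source.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Cluster default filesystem, taken from the error message in this thread.
DEFAULT_FS = "hdfs://10.0.2.139:9000"


def check_path(fs_uri, path):
    """Mimic Hadoop's path check: reject paths owned by another filesystem."""
    fs, p = urlparse(fs_uri), urlparse(path)
    if p.scheme and (p.scheme, p.netloc) != (fs.scheme, fs.netloc):
        raise ValueError("Wrong FS: %s, expected: %s" % (path, fs_uri))
    return fs_uri


def filesystem_ignoring_scheme(path):
    # Always uses the default filesystem, whatever the path says.
    return check_path(DEFAULT_FS, path)


def filesystem_from_path(path):
    # Resolves the filesystem from the path's own scheme and authority,
    # falling back to the default only for unqualified paths.
    p = urlparse(path)
    fs_uri = "%s://%s" % (p.scheme, p.netloc) if p.scheme else DEFAULT_FS
    return check_path(fs_uri, path)
```

With these definitions, `filesystem_ignoring_scheme("s3n://my-bucket/compressed_bson.gz")` raises a "Wrong FS" error like the one reported, while `filesystem_from_path` accepts the same path.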

Luke Lovett

Jan 27, 2016, 2:18:33 PM

to mongodb-user

I just resolved HADOOP-253; this should be fixed now in the master branch.

Rafael Aguiar

Feb 1, 2016, 11:05:55 AM

to mongod...@googlegroups.com

I tested it and it works. Thanks again, Luke!

--

Rafael Aguiar, Data Science Engineer
Mobile: +55 81 99730.0415
Skype: rafael_aguiar_
Office: +55 81 3127.0881
Website: inlocomedia.com

Error reading compressed BSON with Apache Spark