Hey Koert,
Just to explain a bit more. The Hadoop library that handles reading
Snappy-compressed files is this:
org.apache.hadoop.io.compress.SnappyCodec
This is pulled in based on whichever version of Hadoop you are using
when you compile Spark.
The snappy codec we use internally for shuffle outputs is this:
org.apache.spark.io.SnappyCompressionCodec
It uses the following library:
org.xerial.snappy
This is a completely distinct implementation of snappy that uses
separate libraries from Hadoop's. The error you mentioned is a
well-known issue with the xerial library. It can be fixed by setting
"spark.io.compression.codec" to a codec other than snappy. At least, I
think it can. Did you try changing this setting?
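For example (a sketch, assuming a Spark version where codecs are
configured by full class name, and that you're building your own
SparkConf), switching the shuffle codec to LZF would look like:

    conf.set("spark.io.compression.codec",
      "org.apache.spark.io.LZFCompressionCodec")

That way shuffle outputs bypass the xerial snappy library entirely,
while Hadoop's SnappyCodec still handles reading Snappy input files.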
The version of xerial we are using (1.0.5) is the newest version I'm aware of:
http://mvnrepository.com/artifact/org.xerial.snappy/snappy-java
You said that you upgraded snappy - what version did you upgrade it to?
- Patrick