snappy


Koert Kuipers

unread,
Oct 18, 2013, 7:29:42 PM10/18/13
to spark...@googlegroups.com
the snappy bundled with spark 0.8 is causing trouble on CentOS 5:

 java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5-libsnappyjava.so)

Patrick Wendell

unread,
Oct 18, 2013, 9:20:40 PM10/18/13
to spark...@googlegroups.com
This has to do with the xerial compression library; on some
architectures it doesn't work due to the way they compile it. We
actually switched the default away from snappy for this reason. I'd
recommend changing back to the default compression codec.
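For anyone hitting this: Spark 0.8 reads its settings from Java system properties set before the SparkContext is created. Below is a minimal sketch of switching the codec back to the default; note that LZF being the 0.8 default and the exact class name are my assumptions, not confirmed in this thread:

```java
// Sketch: selecting an I/O compression codec for Spark 0.8.
// Assumption: Spark 0.8 reads configuration from Java system
// properties set before the SparkContext is constructed, and the
// default codec at the time was org.apache.spark.io.LZFCompressionCodec.
public class CodecConfig {
    public static void main(String[] args) {
        System.setProperty("spark.io.compression.codec",
                "org.apache.spark.io.LZFCompressionCodec");
        // ... create the SparkContext here; shuffle data would then be
        // compressed with LZF rather than the xerial snappy library.
        System.out.println(System.getProperty("spark.io.compression.codec"));
        // → org.apache.spark.io.LZFCompressionCodec
    }
}
```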

http://code.google.com/p/snappy-java/issues/detail?id=12

- Patrick

Koert Kuipers

unread,
Oct 18, 2013, 11:36:51 PM10/18/13
to spark...@googlegroups.com

I need to read existing avro datasets that are snappy compressed

Patrick Wendell

unread,
Oct 19, 2013, 2:04:13 AM10/19/13
to spark...@googlegroups.com
Are you setting "spark.io.compression.codec" to snappy? That was my
assumption, because I've seen this same error when doing that. If so,
know that this setting only affects data that is internally generated
by Spark (e.g. shuffle data). It doesn't affect input data that you
are reading from an external system.

If you are reading external data and the file ends in ".snappy", I
think the Hadoop InputFormat should figure it out and use the Hadoop
codecs automatically. Otherwise, you may need to specify this when
reading the input using a special InputFormat.

- Patrick

Patrick Wendell

unread,
Oct 19, 2013, 2:11:22 AM10/19/13
to spark...@googlegroups.com
I just submitted a patch to make this more clear. I realized in the
docs it's not clear that this property is orthogonal to whether the
user's input data is compressed:

https://github.com/apache/incubator-spark/pull/76/files

Koert Kuipers

unread,
Oct 19, 2013, 12:15:18 PM10/19/13
to spark...@googlegroups.com
I should have been more clear: the error I mentioned shows up in the Spark tasks on the slaves while I am reading data from Avro files on HDFS that are internally compressed with snappy.

Adding a newer version of snappy to my job does not solve the problem. I still see the same error, which indicates that the snappy version bundled with Spark takes precedence on the workers' classpath.

Rebuilding Spark with a newer snappy and deploying that on the cluster does solve the problem. However, I do not think a rebuild of Spark should generally be necessary for every versioning issue, so I suggest there should be a way to put the user's jars ahead of Spark's jars on the classpath in tasks.


Patrick Wendell

unread,
Oct 19, 2013, 1:54:18 PM10/19/13
to spark...@googlegroups.com
Hey Koert,

Just to explain a bit more. The Hadoop snappy library that deals with
reading in Snappy files is this:

org.apache.hadoop.io.compress.SnappyCodec

This is pulled in based on whichever version of Hadoop you are using
when you compile Spark.

The snappy codec we use internally for shuffle outputs is this:

org.apache.spark.io.SnappyCompressionCodec

it uses the following library:

org.xerial.snappy

This is a completely distinct implementation of snappy that uses
separate libraries from Hadoop's. The error you mentioned is a
well-known issue with the xerial library. It can be fixed by setting
"spark.io.compression.codec" to something other than snappy. At
least, I think it can. Did you try changing this setting?

The version of xerial we are using (1.0.5) is the newest version I'm aware of:

http://mvnrepository.com/artifact/org.xerial.snappy/snappy-java

You said that you upgraded snappy - what version did you upgrade it to?

- Patrick

Koert Kuipers

unread,
Oct 19, 2013, 2:18:12 PM10/19/13
to spark...@googlegroups.com
Avro also uses org.xerial.snappy for its compression. So when I read Avro files within Spark using a subclass of HadoopRDD, I run into the fact that xerial 1.0.5 does not work on CentOS 5 (which is what our cluster slaves run on).

xerial 1.1.0-M4 does work on CentOS 5, so I tried adding that to my job (and to the jars for SparkContext), but the workers don't seem to pick it up; they continue to use 1.0.5, which blows up.

So I ended up rebuilding and redeploying Spark 0.8 with xerial bumped to 1.1.0-M4.

I now have a similar (but unrelated) issue with Avro itself: I need Avro 1.7.5 for a certain feature. Again, ideally I would just add it to the jars for my SparkContext, but I suspect it will not work since Avro 1.7.4 is included with Spark, so once more I am going to have to rebuild and redeploy Spark, which I do not think is ideal. Spark needs something similar to Hadoop's mapreduce.user.classpath.first (see https://issues.apache.org/jira/browse/MAPREDUCE-4521).

Patrick Wendell

unread,
Oct 19, 2013, 5:59:49 PM10/19/13
to spark...@googlegroups.com
Hey Koert,

Ah, I understand now. So this is yet another way in which snappy can
be used... as part of the Avro snappy library.

For now the only option here is to build Spark yourself and change the
dependencies. I've explored the issue a bit and added a JIRA
explaining how we can support this in the future. It's a bit trickier
for us because we don't launch each task inside of its own JVM like
Hadoop does.

https://spark-project.atlassian.net/browse/SPARK-939

- Patrick

Koert Kuipers

unread,
Oct 24, 2013, 9:03:37 AM10/24/13
to spark...@googlegroups.com, pwen...@gmail.com
fiddling with classpath would be easy if a jvm was.

You say no JVM is launched for the tasks from a SparkContext? That surprises me. The docs say: "When running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application."

Koert Kuipers

unread,
Oct 24, 2013, 9:04:19 AM10/24/13
to spark...@googlegroups.com, pwen...@gmail.com
sorry, i meant to say:
fiddling with classpath would be easy if a jvm was LAUNCHED.

Patrick Wendell

unread,
Oct 24, 2013, 7:51:12 PM10/24/13
to Koert Kuipers, spark...@googlegroups.com
Hey Koert. That documentation is correct. Each application has one long-lived JVM (the Executor) for all of its tasks.

I just meant that the way we do classloading means a naive solution isn't possible. For instance, at any given time, I can call:

sc.addJar(XX)

And this will get added to the slaves and needs to then be instantly visible to all slaves (without restarting a JVM). This means that we use dynamic classloading instead of just setting the classpath when we launch the JVM.

So we can't do something simple like "add the user jars before the Spark jars in the classpath when the JVM is initialized". This is how I presume Hadoop does it, because the classpath is constant over the course of a single task.
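As a side note, the "user jars first" behaviour requested above is typically implemented with a child-first (parent-last) classloader, which is the kind of delegation MAPREDUCE-4521 enables in Hadoop. A JDK-only sketch of that delegation order; this is illustrative only, not Spark's or Hadoop's actual implementation:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of a child-first ("user classpath first") classloader.
// It tries the user's jars before delegating to the parent, inverting
// the standard parent-first order. Illustrative only: not Spark's or
// Hadoop's real code.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Try the user's jars (this loader's URLs) first...
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // ...then fall back to the parent (e.g. Spark's jars).
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no user jars, every lookup falls through to the parent.
        ChildFirstClassLoader cl = new ChildFirstClassLoader(
                new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class);
        // → true
    }
}
```

The hard part Patrick describes remains: jars added via sc.addJar arrive while the executor JVM is already running, so the loader's URL set has to change dynamically rather than being fixed at JVM launch.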