Understanding Scalding serialization


Alex Zhicharevich

Jan 12, 2015, 6:56:36 AM
to cascadi...@googlegroups.com
Hi,

I'm trying to read, in Spark, a sequence file written by Scalding.
I have my data in a TypedPipe[MyClass] and I'm writing it like this:
          data.toPipe('myClass).write(SequenceFile(outPath))
MyClass is just a regular Scala class that does not implement Writable.

I'm trying to read this sequence file back using Spark's hadoopFile API, setting the Hadoop conf to use Scalding's serializations:
       conf.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization,org.apache.hadoop.io.serializer.WritableSerialization,com.twitter.chill.hadoop.KryoSerialization,org.apache.hadoop.io.serializer.JavaSerialization")

It looks like TupleSerialization is deserializing my tuples, but then I get: com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): MyClass.

I've tried making MyClass a case class, but I still get the same error.

There must be something I'm missing about how serialization works in Scalding, since Scalding deserializes these classes with no problem even though there is no no-arg constructor.
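For context, the "missing no-arg constructor" error comes from plain reflective instantiation, which Kryo uses by default. This JDK-only sketch (no Kryo involved; class names are made up for illustration) shows the same failure mode:

```scala
// Reflective instantiation via getDeclaredConstructor().newInstance()
// requires a no-arg constructor; Kryo's default strategy hits the same wall.
class NoDefault(val x: Int)           // only constructor takes an Int
class WithDefault() { var x: Int = 0 } // has an explicit no-arg constructor

object CtorDemo {
  def canInstantiate(c: Class[_]): Boolean =
    try { c.getDeclaredConstructor().newInstance(); true }
    catch { case _: NoSuchMethodException => false }

  def main(args: Array[String]): Unit = {
    println(canInstantiate(classOf[WithDefault]))
    println(canInstantiate(classOf[NoDefault]))
  }
}
```

Making MyClass a case class does not help because a case class constructor still takes its fields as arguments; the compiler does not add a no-arg constructor.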

Thanks,
Alex

Oscar Boykin

Jan 12, 2015, 11:17:40 AM
to cascadi...@googlegroups.com
SequenceFile is not a good choice here. Unfortunately (as I just checked) that class is not well documented, so I apologize. It uses Cascading's SequenceFileScheme, which encodes the Cascading tuple and its contents into a Hadoop sequence file. The issue is that to read it back you will have to set up the Hadoop serializations exactly as you did in the Scalding job. While this is doable, it might be easier to just run a Scalding job to convert the data.
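A hedged sketch of such a conversion job, assuming the field name 'myClass from the original post; MyClass and its fields (id, name) are stand-ins for the poster's actual class:

```scala
import com.twitter.scalding._

// Stand-in for the poster's class; adjust fields to your real schema.
case class MyClass(id: Int, name: String)

// Reads the Cascading sequence file with the same serializations the writing
// job used (they are on the classpath inside a Scalding job), then re-emits
// the data as plain TSV, which Spark can read with no special setup.
class ConvertJob(args: Args) extends Job(args) {
  SequenceFile(args("input"), 'myClass)
    .mapTo('myClass -> ('id, 'name)) { m: MyClass => (m.id, m.name) }
    .write(Tsv(args("output")))
}
```

On the Spark side, the TSV output is then just `sc.textFile(path)` plus a split on tabs.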

I try to encourage people, as often as I can, to use Thrift, Avro, or protobuf for output data. This is exactly the reason.

Also note that we recently added a TypedJson source and sink, which is safe to use (but slower than the options mentioned above). With JSON you could also read the data in Spark.
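A hedged sketch of the TypedJson route (the exact constructor may differ across Scalding versions, and MyClass with its fields is an assumption from the thread):

```scala
import com.twitter.scalding._
import com.twitter.scalding.TypedJson // lives in the scalding-json module

// Stand-in for the poster's class.
case class MyClass(id: Int, name: String)

class JsonOutJob(args: Args) extends Job(args) {
  TypedPipe.from(List(MyClass(1, "a"), MyClass(2, "b")))
    .write(TypedJson[MyClass](args("output")))
}
```

Since the output is newline-delimited JSON text, Spark can read it with `sc.textFile(path)` and decode each line with any JSON library.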

Lastly, you might look at WritableSequenceFile. For that you will need to implement Hadoop's Writable interface for your output data. This should be readable from Spark as well.
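A hedged sketch of such a Writable, again assuming MyClass holds an Int and a String. Hadoop's Writable contract uses java.io.DataInput/DataOutput, and the framework needs a no-arg constructor so it can create an instance before calling readFields:

```scala
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// Mutable fields, because readFields fills in an existing instance.
class MyClassWritable(var id: Int, var name: String) extends Writable {
  def this() = this(0, "") // required no-arg constructor

  override def write(out: DataOutput): Unit = {
    out.writeInt(id)
    out.writeUTF(name)
  }

  override def readFields(in: DataInput): Unit = {
    id = in.readInt()
    name = in.readUTF()
  }
}
```

Once the data is written this way, reading it in Spark with `sc.sequenceFile(path, keyClass, classOf[MyClassWritable])` should work without any Cascading or Kryo configuration.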


--
Oscar Boykin :: @posco :: http://twitter.com/posco
