Serialization errors


Sowmitra Thallapragada

May 19, 2015, 10:13:18 AM
to cascadi...@googlegroups.com
Hello,

I am trying to write Avro objects to HDFS using the PackedAvroScheme. I have an aggregator function which sets the filled Avro object in a Tuple and adds it to the output collector.

When I run my job, I see errors like: "Could not load serializer for java.util.ArrayList" from Hadoop Serializer ...

Following some forum discussions, I added "cascading.avro.serialization.AvroSpecificRecordSerialization" and "org.apache.hadoop.io.serializer.JavaSerialization" to the "io.serializations" configuration parameter. But now I am seeing errors like the one below in my task logs (during the reduce phase):

java.io.IOException: java.lang.ClassNotFoundException: cascading.tuple.Tuple
	at org.apache.hadoop.io.serializer.JavaSerialization$JavaSerializationDeserializer.deserialize(JavaSerialization.java:61)
	at org.apache.hadoop.io.serializer.JavaSerialization$JavaSerializationDeserializer.deserialize(JavaSerialization.java:40)
	at cascading.tuple.hadoop.TupleSerialization$SerializationElementReader.read(TupleSerialization.java:628)
	at cascading.tuple.hadoop.io.HadoopTupleInputStream.readType(HadoopTupleInputStream.java:105)
	at cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement(HadoopTupleInputStream.java:52)
	at cascading.tuple.io.TupleInputStream.readTuple(TupleInputStream.java:78)
	at cascading.tuple.io.TupleInputStream.readTuple(TupleInputStream.java:67)
	at cascading.tuple.hadoop.io.HadoopTupleInputStream.readIndexTuple(HadoopTupleInputStream.java:58)
	at cascading.tuple.io.TupleInputStream.readIndexTuple(TupleInputStream.java:106)
	at cascading.tuple.hadoop.io.IndexTupleDeserializer.deserialize(IndexTupleDeserializer.java:38)
	at cascading.tuple.hadoop.io.IndexTupleDeserializer.deserialize(IndexTupleDeserializer.java:28)
	at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1421)
	at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1361)


I am using cascading 2.6.1 and cascading-avro-2.1.2. Any idea what's going on here?

Thanks!

Sowmi



Ken Krugler

May 19, 2015, 10:22:54 AM
to cascadi...@googlegroups.com
Hi Sowmi,

Did you *add* those two serializations to io.serializations, or did you accidentally replace what was in io.serializations?

Because it looks like you're missing the standard Cascading serialization support, so it's trying to use Java serialization to handle the Tuple.
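
For reference, the safe pattern is to append to whatever io.serializations already holds rather than set it outright. A minimal sketch (the class and helper names here are mine, not Cascading's; in a real job you would read and write the value through the job's Properties or Configuration object):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SerializationConfig {

    // Appends extra serialization class names to an existing comma-separated
    // io.serializations value, preserving order and dropping duplicates, so
    // the stock Cascading/Hadoop serializations are never clobbered.
    public static String appendSerializations(String existing, String... extras) {
        Set<String> classes = new LinkedHashSet<>();
        if (existing != null && !existing.isEmpty()) {
            for (String c : existing.split(",")) {
                classes.add(c.trim());
            }
        }
        for (String extra : extras) {
            classes.add(extra);
        }
        return String.join(",", classes);
    }

    public static void main(String[] args) {
        String defaults = "cascading.tuple.hadoop.TupleSerialization,"
                + "org.apache.hadoop.io.serializer.WritableSerialization";
        String merged = appendSerializations(defaults,
                "cascading.avro.serialization.AvroSpecificRecordSerialization",
                "org.apache.hadoop.io.serializer.JavaSerialization");
        // TupleSerialization stays first; the extras are appended after it.
        System.out.println(merged);
    }
}
```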

-- Ken





--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Sowmitra Thallapragada

May 19, 2015, 10:35:40 AM
to cascadi...@googlegroups.com
I am pretty sure I added them. This is what the final io.serializations looked like:

cascading.tuple.hadoop.TupleSerialization,org.apache.hadoop.io.serializer.WritableSerialization,cascading.avro.serialization.AvroSpecificRecordSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.hadoop.io.serializer.JavaSerialization

I notice that TupleSerialization and WritableSerialization always come first in the list, even if I set AvroSpecificRecordSerialization ahead of them.

Ken Krugler

May 19, 2015, 10:52:46 AM
to cascadi...@googlegroups.com
Hi Sowmi,

Sorry, in that case I don't know why you're getting this error - I haven't run into it myself.

I notice you haven't mentioned the execution environment - is this distributed or Hadoop local mode? And if distributed, what version?

Also what version of Hadoop are you building against?

Not sure if any of that is relevant, but it's typically useful information to include.

Regards,

-- Ken




Sowmitra Thallapragada

May 19, 2015, 12:49:02 PM
to cascadi...@googlegroups.com
Ken,

I apologize, here's more information:

We run the jobs on a distributed Hadoop 2 cluster. The code is compiled against Hadoop 2.6.0.5, and I believe the runtime is the same.

The original exception with the ArrayList is below; this is from before I added the AvroSpecificRecordSerialization and JavaSerialization to the list.

Error: cascading.CascadingException: unable to load serializer for: java.util.ArrayList from: org.apache.hadoop.io.serializer.SerializationFactory
	at cascading.tuple.hadoop.TupleSerialization.getNewSerializer(TupleSerialization.java:447)
	at cascading.tuple.hadoop.TupleSerialization$SerializationElementWriter.write(TupleSerialization.java:743)
	at cascading.tuple.io.TupleOutputStream.writeElement(TupleOutputStream.java:114)
	at cascading.tuple.io.TupleOutputStream.write(TupleOutputStream.java:89)
	at cascading.tuple.io.TupleOutputStream.writeTuple(TupleOutputStream.java:64)
	at 

Please let me know if you need any other information. Thanks for your help!

Sowmi

--
You received this message because you are subscribed to a topic in the Google Groups "cascading-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/cascading-user/ROsxUjQvVfo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/66155670-785F-40D5-A888-55B2BB7E91F6%40transpac.com.

For more options, visit https://groups.google.com/d/optout.

Ken Krugler

May 19, 2015, 2:19:24 PM
to cascadi...@googlegroups.com
Hi Sowmi,

I can only suggest some random things to try…

1. Only add the JavaSerialization, not AvroSpecificRecordSerialization, and see what error you get.

2. Try running the job using Hadoop local mode (removes some possible classpath errors on the cluster)

-- Ken
 


Sowmitra Thallapragada

May 19, 2015, 6:11:04 PM
to cascadi...@googlegroups.com
Hi Ken,

I tried JavaSerialization but it did not help. Does the following error log make any sense to you?

2015-05-19 21:25:35,022 INFO [Thread-5] cascading.tuple.collect.SpillableTupleList: attempting to load codec: org.apache.hadoop.io.compress.GzipCodec
2015-05-19 21:25:35,022 INFO [Thread-5] cascading.tuple.collect.SpillableTupleList: found codec: org.apache.hadoop.io.compress.GzipCodec
2015-05-19 21:25:35,031 ERROR [Thread-5] cascading.tuple.hadoop.TupleSerialization$SerializationElementReader: failed deserializing token: 32 with classname: java.util.ArrayList
java.io.IOException: java.lang.ClassNotFoundException: cascading.tuple.Tuple
	at org.apache.hadoop.io.serializer.JavaSerialization$JavaSerializationDeserializer.deserialize(JavaSerialization.java:61)
	at org.apache.hadoop.io.serializer.JavaSerialization$JavaSerializationDeserializer.deserialize(JavaSerialization.java:40)
	at cascading.tuple.hadoop.TupleSerialization$SerializationElementReader.read(TupleSerialization.java:628)
	at cascading.tuple.hadoop.io.HadoopTupleInputStream.readType(HadoopTupleInputStream.java:105)
	at cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement(HadoopTupleInputStream.java:52)
	at cascading.tuple.io.TupleInputStream.readTuple(TupleInputStream.java:78)
	at cascading.tuple.io.TupleInputStream.readTuple(TupleInputStream.java:67)


Sowmitra Thallapragada

May 20, 2015, 12:15:32 AM
to cascadi...@googlegroups.com

Hi Ken,

Finally zeroed in on the problem. For one of the datasets, I was using AvroScheme instead of PackedAvroScheme, and the data I was reading had a schema like the one below. It looks like AvroScheme does not handle the ser/de of the array field well. Are you aware of any issues around that?

{
  "type" : "record",
  "name" : "Object",
  "fields" : [ {
    "name" : "field1",
    "type" : "int"
  }, {
    "name" : "field2",
    "type" : "string"
  }, {
    "name" : "field3",
    "type" : "boolean"
  }, {
    "name" : "field4",
    "type" : {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "field4_type",
        "fields" : [ {
          "name" : "subField1",
          "type" : "string"
        }, {
          "name" : "subField2",
          "type" : "string"
        } ]
      }
    }
  } ]
}
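
My rough mental model of the difference, as a sketch (made-up code using plain Java types; the real cascading.avro schemes work on Avro records, not Maps):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Made-up illustration of the two schemes' behavior, not the actual
// cascading.avro implementation.
public class SchemeSketch {

    // AvroScheme-style: the record is unpacked into individual Tuple fields,
    // so an array field like field4 surfaces as a raw java.util.ArrayList
    // that Hadoop must then find its own serializer for.
    public static List<Object> unpack(Map<String, Object> record) {
        return new ArrayList<>(record.values());
    }

    // PackedAvroScheme-style: the record travels as one opaque value, so
    // only the Avro serialization for the record type is ever needed.
    public static List<Object> pack(Map<String, Object> record) {
        return Arrays.asList((Object) record);
    }
}
```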

Ken Krugler

May 20, 2015, 10:49:59 AM
to cascadi...@googlegroups.com
Hi Sowmi,

No, I hadn't heard of any issues with this - arrays are turned into a Tuple for serialization.

I'll try to add this as a unit test for the cascading.avro project to see what happens.

-- Ken


