"unread block data" error when reading from HBase in a cluster a shell, but works locally


Aki Matsukawa

Apr 13, 2013, 11:53:16 PM
to spark...@googlegroups.com
I am trying to read data out of HBase from the Spark shell, doing something along the lines of:

// imports ...
val test = sc.newAPIHadoopRDD(conf, classOf[AvroTableInputFormat[Data]], classOf[ImmutableBytesWritable], classOf[Data])
test.count()

AvroTableInputFormat is something I wrote; it wraps a TableInputFormat and deserializes the bytes that come out of HBase with Avro. This works in a shell launched locally, like so:

$ MASTER=local[4] ./spark-shell -cp <some jars to include>
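
For context, a rough sketch of the shape of such a wrapper (this is only an illustration, not the actual AvroTableInputFormat from this post; the Data type and the Avro decoding step are placeholders):

import org.apache.hadoop.conf.{Configurable, Configuration}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.mapreduce.{InputFormat, InputSplit, JobContext, RecordReader, TaskAttemptContext}

// Wraps HBase's TableInputFormat and converts each Result into an Avro record.
class AvroTableInputFormat[T] extends InputFormat[ImmutableBytesWritable, T] with Configurable {
  private val delegate = new TableInputFormat  // does the actual HBase scanning

  // TableInputFormat reads the table name, scan, etc. from the job Configuration,
  // so pass the configuration through to it.
  override def setConf(conf: Configuration) { delegate.setConf(conf) }
  override def getConf: Configuration = delegate.getConf

  override def getSplits(context: JobContext) = delegate.getSplits(context)

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext) = {
    val inner = delegate.createRecordReader(split, context)
    new RecordReader[ImmutableBytesWritable, T] {
      def initialize(s: InputSplit, c: TaskAttemptContext) { inner.initialize(s, c) }
      def nextKeyValue() = inner.nextKeyValue()
      def getCurrentKey() = inner.getCurrentKey()
      def getCurrentValue(): T = decode(inner.getCurrentValue())  // Result -> Avro record
      def getProgress() = inner.getProgress()
      def close() { inner.close() }
    }
  }

  // Placeholder: decode the HBase cell bytes with an Avro DatumReader for T.
  private def decode(result: Result): T =
    throw new UnsupportedOperationException("Avro deserialization goes here")
}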

But when I hook the shell up to an existing cluster, like so:

$ MASTER=<master_ip> ./spark-shell -cp <some jars to include>

I get the following error:

java.lang.IllegalStateException: unread block data
	at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2400)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1379)
	at java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1935)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1829)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
	at spark.JavaDeserializationStream.readObject(JavaSerializer.scala:23)
	at spark.JavaSerializerInstance.deserialize(JavaSerializer.scala:45)
	at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)

I've tested that the cluster is able to run the example programs. I can also run simple things from the shell, like
 
scala> sc.parallelize(1 to 10000, 100).count()
... 
res0: Long = 10000 

Has anyone encountered this type of error before?

Thanks!
 

Matei Zaharia

Apr 14, 2013, 10:43:14 PM
to spark...@googlegroups.com
That's a serialization error -- it means you might have a different version of Spark or of your user code on each node.
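
A minimal sketch of one way to keep the user code consistent across nodes when submitting a standalone app to a standalone cluster (the master URL and jar paths are placeholders, and this is an illustration rather than a confirmed fix for this particular error):

import spark.SparkContext

// The fourth argument lists jars that Spark ships to the worker nodes,
// so the executors run the same user code as the driver.
val sc = new SparkContext(
  "spark://<master_host>:<port>",            // standalone cluster master
  "hbaseTest",                               // job name
  System.getenv("SPARK_HOME"),               // Spark location on the workers
  Seq("/path/to/your-app.jar", "/path/to/hbase.jar"))  // placeholder jar paths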

Matei


Jamjae Kim

Jul 15, 2013, 9:25:56 AM
to spark...@googlegroups.com
Hi,

I have the same problem.
It works locally, but fails when running on 3 VMs (1 driver node, 2 worker nodes).

My Spark version is 0.7.2, and every node has the same code, since I set up the VMs with scp -r.

My code looks like this:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import spark.SparkContext
import spark.rdd.NewHadoopRDD

val sc = new SparkContext("spark://sparkdriveHost:PORT", "hbaseTest")
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tb_test")

val hBaseRDD = new NewHadoopRDD(sc, classOf[TableInputFormat],
                        classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
                        classOf[org.apache.hadoop.hbase.client.Result], conf)

println( "rdd id : " + hBaseRDD.id )
println( "rdd count : " + hBaseRDD.count() )        <----------- error 


The error occurs on the worker nodes when an RDD action (like count()) is executed.


How can I solve this?



On Monday, April 15, 2013 at 11:43:14 AM UTC+9, Matei Zaharia wrote: