Read protobuf data into Hive


VinNi

Feb 8, 2016, 7:28:18 AM
to elephantbird-dev

I'd like to process the output of a MapReduce job with Hive. The output is a SequenceFile with NullWritable as the key and ProtobufWritable as the value; the sketch below shows roughly how such a file is produced.
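For reference, a minimal sketch of writing that kind of file (not my actual job; Session and its setId field are just hypothetical stand-ins for my generated protobuf class):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

import com.twitter.elephantbird.mapreduce.io.ProtobufWritable;

public class WriteSessions {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Wrap the generated protobuf message in ProtobufWritable, as the MR job does.
    ProtobufWritable<Session> value = ProtobufWritable.newInstance(Session.class);

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path(args[0])),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(ProtobufWritable.class))) {
      value.set(Session.newBuilder().setId("example").build());
      // Each record: NullWritable key, ProtobufWritable value.
      writer.append(NullWritable.get(), value);
    }
  }
}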

I try to read the data into Hive:


ADD JAR elephant-bird-core-4.7-SNAPSHOT.jar;
ADD JAR elephant-bird-hadoop-compat-4.7-SNAPSHOT.jar;
ADD JAR elephant-bird-hive-4.7-SNAPSHOT.jar;
ADD JAR ProtobufGeneratedClass.jar;
CREATE EXTERNAL TABLE sessions
    ROW FORMAT SERDE "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    WITH SERDEPROPERTIES ("serialization.class"="Serialization ClassPath")
    STORED AS 
    INPUTFORMAT "org.apache.hadoop.mapred.SequenceFileInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    LOCATION 'PathToDirectoryOfMROutput';

SELECT COUNT(*) FROM sessions;

and I get this exception:

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.twitter.elephantbird.mapreduce.io.ProtobufWritable@e35fcc5{could not be deserialized} 
....
Caused by: java.lang.ClassCastException: com.twitter.elephantbird.mapreduce.io.ProtobufWritable cannot be cast to org.apache.hadoop.io.BytesWritable

How can I read my output structure into Hive? From the stack trace it looks like the serde tries to cast the SequenceFile value to BytesWritable, but my values are ProtobufWritable.
