Protobuf-Hive


jcrke

Aug 24, 2012, 12:31:02 PM
to elephant...@googlegroups.com
Hi,

I'm new to Elephant-Bird and Hive, and I have spent days trying to figure out how to create a Hive table on top of EB protobuf output.
I'm running Hive 0.7.1 and using EB core/hive 3.0.3.

To start off easy, I tried to create an external Hive table using the tutorial proto "example.proto".
I ran the sample code with the -Dproto.test=lzoOut -Dproto.test.format=Block args, which created an output dir containing /user/jcroke/output/part-m-00000.lzo

Now I want to create the hive table.

Added some jars to hive:

add jar elephant-bird-examples-2.2.0.jar;
add jar lib/elephant-bird-core-3.0.3-SNAPSHOT.jar;
add jar lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar;

Create table:

create external table boo
(
        name STRING,
        age  INT

)
partitioned by (dt string)
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
 with serdeproperties (
"serialization.class"="com.twitter.elephantbird.examples.proto.Examples")
stored as
inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat"
outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Output:

FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>


Now I know that this class is in elephant-bird-examples-2.2.0.jar which was added to the jars.

jar tf elephant-bird-examples-2.2.0.jar

...
com/twitter/elephantbird/examples/proto/Examples$1.class
com/twitter/elephantbird/examples/proto/Examples$Age$Builder.class
com/twitter/elephantbird/examples/proto/Examples$Age.class
com/twitter/elephantbird/examples/proto/Examples.class
...

So this is one problem.

The other is that I'm not even sure the create table arguments are right.

I've looked at https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive/d0bca978ab5f1dfe73d79ca62447dc65fae442a8
and https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
but can't seem to find the right combination of classes to make this work.

Can someone guide me?

Thanks,
Jon

Raghu Angadi

Aug 24, 2012, 1:38:10 PM
to elephant...@googlegroups.com, jcrke

The objects would be of class 'Examples$Age', not 'Examples'.

Can you post the stacktrace?
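
For reference, 'Examples$Age' is the binary name of the nested message class, and it maps directly onto the jar entry names that `jar tf` prints. A minimal sketch of that mapping, assuming bash; the helper name class_to_jar_entry is made up for illustration:

```shell
#!/usr/bin/env bash
# Single quotes keep the shell from treating "$Age" as a variable.
class_name='com.twitter.elephantbird.examples.proto.Examples$Age'

# Hypothetical helper: map a Java binary class name to the entry path that
# `jar tf` lists. Dots become slashes, '$' is kept, ".class" is appended.
class_to_jar_entry() {
  printf '%s.class\n' "${1//.//}"
}

class_to_jar_entry "$class_name"
```

The printed path can then be matched against a jar listing, for example with `jar tf elephant-bird-examples-2.2.0.jar | grep -F 'Examples$Age.class'`. Note that finding the entry only shows the class is in the jar, not that Hive has the jar on its classpath.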


--
You received this message because you are subscribed to the Google Groups "elephantbird-dev" group.
To view this discussion on the web visit https://groups.google.com/d/msg/elephantbird-dev/-/kHPGMyc9dtUJ.
To post to this group, send email to elephant...@googlegroups.com.
To unsubscribe from this group, send email to elephantbird-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/elephantbird-dev?hl=en.

jcrke

Aug 24, 2012, 3:16:13 PM
to elephant...@googlegroups.com

Hi Raghu,

Still getting the same error. Hive can't find this class.


hive> create external table boo

    > (
    >         name STRING,
    >         age  INT
    >
    > )
    > partitioned by (dt string)
    > row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >  with serdeproperties (
    > "serialization.class"="com.twitter.elephantbird.examples.proto.Examples$Age")

    > stored as
    > inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat"
    > outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples$Age)

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I hope this is the stack trace you wanted.

Thanks,
Jon

jcrke

Aug 24, 2012, 4:46:52 PM
to elephant...@googlegroups.com
Hi,

Tried this also:

hive> create external table boo

    > (
    >         name STRING,
    >         age  INT
    >
    > )
    > partitioned by (dt string)
    > ;
OK
Time taken: 0.115 seconds
hive> alter table boo
    >     set fileformat inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat"
    >     outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.084 seconds
hive> alter table boo
    >     set serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >     with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.proto.Examples");
OK
Time taken: 0.097 seconds



hive> describe boo;
Failed with exception MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples)

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> drop table boo;
FAILED: Hive Internal Error: java.lang.RuntimeException(MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples))

java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:255)
        at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:485)
        at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:161)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:871)
        at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:676)
        at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:190)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:238)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:340)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:736)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: com.twitter.elephantbird.examples.proto.Examples)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:207)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253)
        ... 16 more

hive>





Raghu Angadi

Aug 24, 2012, 5:04:20 PM
to elephant...@googlegroups.com, jcrke
None of the EB classes are involved in the stack trace. Hive itself is not able to find the class.

I will try this out myself (most likely tomorrow).

Raghu.


jcrke

Aug 27, 2012, 5:45:03 PM
to elephant...@googlegroups.com, jcrke
Hi Raghu,

It seems to be happening inside ProtobufDeserializer.java.

->        protobufClass = conf.getClassByName(className).asSubclass(Message.class);

getClassByName is throwing the exception. I checked the className string and it is not null.

To make things cleaner than in my previous post, I put example.proto into the latest 3.0.3 source and ran a Maven package build.

Here is the latest output:

[jcroke@daca2 kevinweil-elephant-bird-4b28225]$ hive
Hive history file=/tmp/jcroke/hive_job_log_jcroke_201208272139_1019372938.txt
hive> use posit;
OK
Time taken: 1.531 seconds
hive> add jar lib/elephant-bird-core-3.0.3-SNAPSHOT.jar;
Added lib/elephant-bird-core-3.0.3-SNAPSHOT.jar to class path
Added resource: lib/elephant-bird-core-3.0.3-SNAPSHOT.jar
hive> add jar lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar;
Added lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar to class path
Added resource: lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar
hive> create external table test1

    > partitioned by (dt string)
    > row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >  with serdeproperties (
    > "serialization.class"="com.twitter.elephantbird.examples.proto.Examples$Age")
    > stored as
    > inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    > outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Something happened here: getClassByName failed: com.twitter.elephantbird.examples.proto.Examples$Age)

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> list jars;
lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar
lib/elephant-bird-core-3.0.3-SNAPSHOT.jar
hive> quit
    > ;

[jcroke@daca2 kevinweil-elephant-bird-4b28225]$ jar tf lib/elephant-bird-core-3.0.3-SNAPSHOT.jar | less
...
com/twitter/elephantbird/examples/proto/Examples.class
com/twitter/elephantbird/examples/proto/Examples$Age$Builder.class
com/twitter/elephantbird/examples/proto/ThriftFixtures$1.class
com/twitter/elephantbird/examples/proto/ThriftFixtures$OneOfEach.class
com/twitter/elephantbird/examples/proto/Examples$1.class
com/twitter/elephantbird/examples/proto/ThriftFixtures$OneOfEach$Builder.class
com/twitter/elephantbird/examples/proto/ThriftFixtures.class
com/twitter/elephantbird/examples/proto/Examples$Age.class
...

Have you had a chance to look at this since Friday?

Thanks for any help.

Jon

Raghu Angadi

Aug 28, 2012, 12:08:14 PM
to elephant...@googlegroups.com, jcrke
Hi Jon,

We are using Hive 0.10.0 and ProtobufDeserializer has been working fine. I want to try current official releases (0.9.0 or 0.8.1).

I was very close to getting a complete example working yesterday to post on the EB GitHub. I will do it today. Sorry about the delay.

Meanwhile, you can try one thing: set 'HIVE_AUX_JARS_PATH' or 'HADOOP_CLASSPATH' to include all the jars, and do not depend on 'add jar'.
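
A rough sketch of what that could look like, assuming bash. EB_LIB and the join_jars helper are hypothetical names, and the ':' separator is only an assumption here; check what your Hive version expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join the given jar paths with ':'.
join_jars() {
  local IFS=':'
  printf '%s\n' "$*"
}

# EB_LIB is an assumed location for the Elephant-Bird jars; adjust as needed.
EB_LIB="${EB_LIB:-$PWD/lib}"

# Set the variable before starting the hive CLI so the jars are on Hive's
# own classpath instead of being added later with 'add jar'.
HIVE_AUX_JARS_PATH="$(join_jars \
  "$EB_LIB/elephant-bird-core-3.0.3-SNAPSHOT.jar" \
  "$EB_LIB/elephant-bird-hive-3.0.3-SNAPSHOT.jar")"
export HIVE_AUX_JARS_PATH
```

Absolute paths are probably the safer choice, since the variable is read when hive starts and not relative to wherever a later query runs.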

Raghu.


jcrke

Aug 28, 2012, 1:26:48 PM
to elephant...@googlegroups.com, jcrke
Hi Raghu,

Thanks, that seemed to work for creating the table, but now I'm getting a new exception about casting a mapred.Reporter to a mapreduce.StatusReporter.

[jcroke@daca2 kevinweil-elephant-bird-4b28225]$ export HIVE_AUX_JARS_PATH="/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-core-3.0.3-SNAPSHOT.jar:/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar"

hive> use posit;
OK
Time taken: 0.01 seconds

hive> create external table test1
    > partitioned by (dt string)
    > row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >  with serdeproperties (
    > "serialization.class"="com.twitter.elephantbird.examples.proto.Examples$Age")
    > stored as
    > inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    > outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.699 seconds
hive>
    >
    > ALTER TABLE test1 ADD IF NOT EXISTS PARTITION (dt='2012/07/16/12')
    >                         LOCATION '/user/jcroke/output';

OK
Time taken: 0.178 seconds

hive> select * from test1;
OK
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.mapred.Reporter$1 cannot be cast to org.apache.hadoop.mapreduce.StatusReporter
Time taken: 0.348 seconds
hive>

Unfortunately, I can't easily change the Hive version we are using because it's on our production server, so I'm stuck with Cloudera's Hive version 0.7.1-cdh3u3.

Thanks,
Jon

jcrke

Aug 29, 2012, 7:17:41 PM
to elephant...@googlegroups.com, jcrke
Hi,
 
Tried a different approach, but to no avail.
 
Here's how I generate the data:
 
hadoop dfs -rmr output
hadoop jar ./lib/elephant-bird-core-3.0.3-SNAPSHOT.jar com/twitter/elephantbird/examples/ProtobufMRExample -libjars ./lib/protobuf-java-2.3.0.jar,./build/elephant-bird-2.2.0.jar,./lib/hadoop-lzo-0.4.15.jar,./lib/guava-r06.jar -D proto.test=lzoOut input output
 
Include the jars:
 
export HIVE_AUX_JARS_PATH="/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-core-3.0.3-SNAPSHOT.jar:/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar"

[jcroke@daca2 kevinweil-elephant-bird-4b28225]$
[jcroke@daca2 kevinweil-elephant-bird-4b28225]$ hive
Hive history file=/tmp/jcroke/hive_job_log_jcroke_201208292306_1471504768.txt

hive> create external table test1
    > partitioned by (dt string)
    > row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >  with serdeproperties (
    > "serialization.class"="com.twitter.elephantbird.examples.proto.Examples$Age")
    > stored as
    > inputformat "com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat"
    > outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 2.199 seconds

hive>
    >
    > ALTER TABLE test1 ADD IF NOT EXISTS PARTITION (dt='2012/07/16/12')
    >                         LOCATION '/user/jcroke/output';
OK
Time taken: 0.222 seconds

hive> select * from test1;
FAILED: Error in semantic analysis: Line 1:14 Input format must implement InputFormat test1

Using inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat" doesn't work either, as I showed in my previous post.

Thanks. Still floundering.
Jon

Raghu Angadi

Aug 30, 2012, 2:41:48 AM
to elephant...@googlegroups.com, jcrke
> Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.mapred.Reporter$1 cannot be cast to org.apache.hadoop.mapreduce.StatusReporter

You are close :). See this thread, where Shanzhong saw the same error. He said it went away after clearing out some old elephant-bird jars. Also try a query other than 'select *'. I remember Travis mentioning that 'select *' takes a slightly different code path in (older versions of) Hive: it does not result in an actual MR job and does not set up the inputformat properly. If the error goes away with a Hive query that results in an MR job, then we will see what we can do to tolerate an incorrectly set up inputformat. We saw this error in a couple of other unrelated cases using similar 'wrapped' inputformats.

Can you try Hive 0.10 in a dev environment? 'select *' should work there.

> [...] inputformat "com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat" [...]

This won't work, since RawMultiInputFormat is written for a different MapReduce interface from the one Hive expects. [Long story; essentially Hadoop has two different MR interfaces, the 'old/deprecated' one and the 'new' one.]
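
As a rough rule of thumb only: in the class names used in this thread, the package segment hints at which interface a class targets ('mapred' for the old one, 'mapreduce' for the new one). A small bash sketch of that heuristic follows; mr_api_of is a made-up helper, and a naming convention is not proof of what a class actually implements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the MapReduce API from the package segments of
# a fully qualified class name. This is a naming heuristic, nothing more.
mr_api_of() {
  case ".$1." in
    *.mapred.*)    echo old ;;      # org.apache.hadoop.mapred style
    *.mapreduce.*) echo new ;;      # org.apache.hadoop.mapreduce style
    *)             echo unknown ;;
  esac
}

mr_api_of com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
mr_api_of com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat
```

Going by the "Input format must implement InputFormat" error above, the table's inputformat has to be a class on the old-interface side.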

Raghu.


jcrke

Aug 30, 2012, 5:11:34 PM
to elephant...@googlegroups.com, jcrke
Hi Raghu,

Got it to work. Yeah!

I decided to have a look at jobtracker logs after the following hive query failed:
hive> describe test1;
OK
name_   string  from deserializer
age_    int     from deserializer
memoizedserializedsize  int     from deserializer
dt      string
Time taken: 0.117 seconds
Why does every element end with a _ ?

hive> select name_ from test1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201208160854_27843, Tracking URL = http://dacd002.us.msudev.noklab.net:50030/jobdetails.jsp?jobid=job_201208160854_27843
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=dacd002.us.msudev.noklab.net:8021 -kill job_201208160854_27843
2012-08-30 20:15:55,103 Stage-1 map = 0%,  reduce = 0%
2012-08-30 20:16:23,253 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160854_27843 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Looking at the log:

Caused by: java.lang.NoClassDefFoundError: com/google/common/base/Function
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:185)
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:87)
    at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:230)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:92)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
 
I found that this class is in guava-r06.jar, and I thought adding protobuf-java-2.3.0.jar would be a good idea too.
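
The same lookup can be scripted. This is only a sketch, assuming bash and the JDK `jar` tool on PATH; missing_classes and jars_with_class are made-up helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique class names reported as missing in a
# log read from stdin, in '/'-separated form. NoClassDefFoundError already
# uses '/', ClassNotFoundException uses '.', so dots are converted.
missing_classes() {
  sed -n -E 's#.*(NoClassDefFoundError|ClassNotFoundException): ([A-Za-z0-9_.$/]+).*#\2#p' \
    | tr '.' '/' | sort -u
}

# Hypothetical helper: print each jar in directory $2 that contains class $1
# (given '/'-separated, without the ".class" suffix).
jars_with_class() {
  local entry="$1.class" dir="$2" f
  for f in "$dir"/*.jar; do
    [ -e "$f" ] || continue
    if jar tf "$f" 2>/dev/null | grep -qxF -- "$entry"; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, piping the task log through missing_classes gives com/google/common/base/Function for the error above, and `jars_with_class com/google/common/base/Function lib` then lists the jars under lib that provide it.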

export HIVE_AUX_JARS_PATH="/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-core-3.0.3-SNAPSHOT.jar:/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/elephant-bird-hive-3.0.3-SNAPSHOT.jar:/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/protobuf-java-2.3.0.jar:/nfs_home/jcroke/elephant/kevinweil-elephant-bird-4b28225/lib/guava-r06.jar"

[jcroke@daca2 kevinweil-elephant-bird-4b28225]$ hive
Hive history file=/tmp/jcroke/hive_job_log_jcroke_201208302103_1688217610.txt
hive> use posit;
OK
Time taken: 1.507 seconds
hive> select name_ from test1 where dt = '2012/07/16/12';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201208160854_28016, Tracking URL = http://dacd002.us.msudev.noklab.net:50030/jobdetails.jsp?jobid=job_201208160854_28016
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=dacd002.us.msudev.noklab.net:8021 -kill job_201208160854_28016
2012-08-30 21:04:19,435 Stage-1 map = 0%,  reduce = 0%
2012-08-30 21:04:23,466 Stage-1 map = 100%,  reduce = 0%
2012-08-30 21:04:24,475 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160854_28016
OK
Jon
Jenn
David
Duncan
Time taken: 11.016 seconds
hive> select count(*) from test1;                       
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208160854_28017, Tracking URL = http://dacd002.us.msudev.noklab.net:50030/jobdetails.jsp?jobid=job_201208160854_28017
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=dacd002.us.msudev.noklab.net:8021 -kill job_201208160854_28017
2012-08-30 21:04:39,394 Stage-1 map = 0%,  reduce = 0%
2012-08-30 21:04:43,419 Stage-1 map = 100%,  reduce = 0%
2012-08-30 21:04:52,473 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160854_28017
OK
4
Time taken: 19.078 seconds
hive>

So I was missing some jar files and that was all.

Thanks for your help :)

Jon