Uploading thrift serialized data to Hive

ian.von...@rd.io

Mar 20, 2013, 7:36:59 PM
to elephant...@googlegroups.com
Hi, I am looking to take some log data, serialize it with Thrift, and upload it to a Hive database.

To this end I created a table with the following command:

CREATE TABLE IF NOT EXISTS test_db.test_thrift4
  -- no need to specify a schema - it will be discovered at runtime
  PARTITIONED BY (month STRING, day INT)
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer"
  WITH SERDEPROPERTIES (
    "serialization.class"="PlayEvent",
    "serialization.format"="org.apache.thrift.protocol.TBinaryProtocol")
  STORED AS
    INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Then, to test, I serialized one record using a Python script and wrote it to a file using the Thrift TFileObjectTransport.

I loaded that with

load data local inpath 'file:///home/ivonseggern/ian-test/sample_data.tempthrift1' overwrite into table test_db.test_thrift4 partition(month='2013-02', day=6);

But when I run select * from test_thrift4; I get:

Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getSplits(DeprecatedInputFormatWrapper.java:99)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:281)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:320)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:154)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1382)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:269)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

I don't know why 'getSplits' can't instantiate JobContext. Does this mean I serialized the data incorrectly? Is it even possible to upload records in the manner I am attempting? I have seen your recommendations to others to use sequence files and place each Thrift record into a value with an empty key. Do I have to do this instead? Or is there a way to simply upload the Thrift-serialized data?


Thanks for your help! Best,

Ian

Travis Crawford

Mar 20, 2013, 10:12:51 PM
to elephant...@googlegroups.com
This feels like it might be a classpath issue. What's the full
classpath you're using? Do you have multiple hadoop jars on there?

--travis
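
One way to answer the question about multiple Hadoop jars is to list every Hadoop jar on the classpath together with its version. The Python sketch below is only an illustration and is not part of elephant-bird or Hive; the function names `hadoop_jar_versions` and `conflicting_artifacts` and the file-name pattern are assumptions made for this example. For context, the JVM raises InstantiationError when code tries to instantiate an interface or an abstract class, which would be consistent with a jar compiled against one Hadoop version running against another.

```python
import os
import re
from collections import defaultdict

# Assumed naming convention: "<artifact>-<version>.jar", where the artifact
# starts with "hadoop" and the version starts with a digit, for example
# "hadoop-core-1.0.4.jar" or "hadoop-common-2.0.0-cdh4.1.2.jar".
JAR_PATTERN = re.compile(
    r"^(?P<artifact>hadoop[\w-]*?)-(?P<version>\d[\w.\-]*)\.jar$"
)


def hadoop_jar_versions(classpath, sep=os.pathsep):
    """Map each Hadoop artifact on the classpath to the set of versions found."""
    versions = defaultdict(set)
    for entry in classpath.split(sep):
        # Only the file name matters; directories and wildcards never match.
        match = JAR_PATTERN.match(os.path.basename(entry.strip()))
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return dict(versions)


def conflicting_artifacts(classpath, sep=os.pathsep):
    """Return only the artifacts that appear with more than one version."""
    return {
        artifact: sorted(found)
        for artifact, found in hadoop_jar_versions(classpath, sep).items()
        if len(found) > 1
    }
```

The classpath string could come from the `hadoop classpath` command, plus any jars added inside the Hive session. Wildcard entries such as `lib/*` are not expanded by this sketch, and jars from different Hadoop generations can carry different artifact names (for example `hadoop-core` versus `hadoop-common`), so it is worth reading the full output of `hadoop_jar_versions` rather than relying on `conflicting_artifacts` alone.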

ian.von...@rd.io

Mar 21, 2013, 4:43:14 PM
to elephant...@googlegroups.com
Gotcha. When I open the Hive command line, it starts with just the built-in jar:

hive> list jars;
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar

Then I add three jars, so I have:

hive> list jars;
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar
/home/ivonseggern/elephant-bird/hive/target/elephant-bird-hive-3.0.8-SNAPSHOT.jar
/home/ivonseggern/elephant-bird/core/target/elephant-bird-core-3.0.8-SNAPSHOT.jar
PlayEvent.jar
hive> select * from test_db.test_thrift4;                                                       
OK
Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getSplits(DeprecatedInputFormatWrapper.java:99)

Does that all seem correct? Thanks for the quick reply!
Ian

ian.von...@rd.io

Mar 21, 2013, 4:52:39 PM
to elephant...@googlegroups.com
P.S. Just to check: this should work as a strategy, right? I shouldn't need to write to sequence files or anything?