Loading nested JSON into a Hive table with elephant-bird without specifying a schema

Ken Dale

Apr 25, 2013, 7:00:50 PM
to elephant...@googlegroups.com
Are there any examples of or instructions for loading nested json into a hive table using elephant-bird without having to specify the json schema? I've poked around with it a good amount but haven't been able to get it to load any rows. However, the json file works great when I process it with

load '/myFolder/myFile.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

in a pig script.

Ken

Russell Jurney

Apr 25, 2013, 7:02:42 PM
to elephant...@googlegroups.com
This tool may be of interest. It works well with Hive and JSON: it works out the schema of the JSON for you, which you can then plug into Hive.

Ken Dale

Apr 29, 2013, 9:27:30 AM
to elephant...@googlegroups.com
Thanks for the suggestion! I opened an issue for it throwing a NullPointerException on empty arrays; if that gets fixed, it could be a viable solution.

Is it possible to use elephant-bird with Hive in some manner without having to specify a schema, though? I wasn't able to get it working with the instructions provided at https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive

Thanks!
Ken

Namit Jain

Jul 10, 2013, 6:55:11 AM
to elephant...@googlegroups.com
I am also trying to read data created through protobuf via Hive, following the article https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive

Both

create table users
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
  "serialization.class"=
  "org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex");

and

create table users
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
  "serialization.class"=
  "org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex")
  stored as
  inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
;

gave an error.

Has anyone gotten it to work? I compiled elephant-bird and ran 'add jar elephant-bird-hive-4.1-SNAPSHOT.jar' from the hive shell.

Thanks,
-namit

Dmitriy Ryaboy

Jul 10, 2013, 12:27:05 PM
to elephant...@googlegroups.com
We have production tables reading protocol buffers (via HCat).
What error do you get? Do you have the right jars registered?






--
Dmitriy V Ryaboy
Twitter Analytics
http://twitter.com/squarecog

Namit Jain

Jul 10, 2013, 12:46:58 PM
to elephant...@googlegroups.com
Thanks for your reply, Dmitriy.

I don't want to use HCat.

I compiled elephant-bird and added the jars to the aux path:

namit@namit-VirtualBox:~/hive/build/dist/bin$ echo $HIVE_AUX_JARS_PATH
/tmp/elephant-bird-hive-4.1-SNAPSHOT.jar:/tmp/elephant-bird-core-4.1-SNAPSHOT.jar



The command:

namit@namit-VirtualBox:~/hive/build/dist/bin$ ./hive

Logging initialized using configuration in jar:file:/home/namit/hive/build/dist/lib/hive-common-0.12.0-SNAPSHOT.jar!/hive-log4j.properties
Hive history file=/tmp/namit/hive_job_log_bb23f054-e0c4-4ff7-8ac9-0374bac118f6_1393565580.txt
hive> add jar /tmp/elephant-bird-hive-4.1-SNAPSHOT.jar;
Added /tmp/elephant-bird-hive-4.1-SNAPSHOT.jar to class path
Added resource: /tmp/elephant-bird-hive-4.1-SNAPSHOT.jar
hive> add jar /tmp/elephant-bird-core-4.1-SNAPSHOT.jar;
Added /tmp/elephant-bird-core-4.1-SNAPSHOT.jar to class path
Added resource: /tmp/elephant-bird-core-4.1-SNAPSHOT.jar
hive> create table users
    >   row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    >   with serdeproperties (
    >   "serialization.class"=
    >   "org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex")
    >   stored as
    >   inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    >   outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    > ;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassCastException: class org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex)



throws the error above.

Do I need to do something else?

Thanks,
-namit





Namit Jain

Jul 10, 2013, 12:50:43 PM
to elephantbird-dev
Is there a required Java or Thrift version for compiling elephant-bird?

Dmitriy Ryaboy

Jul 10, 2013, 12:53:26 PM
to elephant...@googlegroups.com
Namit, 
Java 6 and 7 both work. Thrift 0.7 works.
Given that you have a ClassCastException, I am guessing that perhaps you have the wrong version of protocol buffers on the classpath. Can you get a full stack trace?
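
One quick way to narrow it down is a throwaway check along these lines (purely a hypothetical diagnostic, not elephant-bird code), compiled and run against the same protobuf jar and generated-class jar that Hive has on its classpath. If the class isn't a protobuf Message as far as those jars are concerned, asSubclass throws a ClassCastException naming the class, which is the same shape of error Hive is reporting:

import com.google.protobuf.Message;

// Hypothetical diagnostic: load the serialization.class exactly as Hive would
// and check that it really is a protobuf Message with the jars on this classpath.
public class CheckProtobufClass {
  public static void main(String[] args) throws Exception {
    Class<?> clazz =
        Class.forName("org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex");
    // asSubclass throws java.lang.ClassCastException: class ... (same message shape
    // as the Hive error) when the class does not extend Message on this classpath.
    Class<? extends Message> messageClass = clazz.asSubclass(Message.class);
    System.out.println("OK: " + messageClass.getName() + " is a protobuf Message");
  }
}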

Namit Jain

Jul 10, 2013, 1:01:24 PM
to elephant...@googlegroups.com
The full stack trace is:

2013-07-10 22:15:33,926 ERROR hive.log (MetaStoreUtils.java:printStackTrace(90)) - java.lang.reflect.Method.invoke(Method.java:616)
2013-07-10 22:15:33,926 ERROR hive.log (MetaStoreUtils.java:printStackTrace(90)) - org.apache.hadoop.util.RunJar.main(RunJar.java:156)
2013-07-10 22:15:33,942 ERROR exec.DDLTask (DDLTask.java:execute(434)) - org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassCastException: class org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex)
        at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:603)
        at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3640)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:251)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1424)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1204)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1009)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:881)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:782)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassCastException: class org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:275)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:266)
        at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:592)
        at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:577)
        ... 19 more
Caused by: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassCastException: class org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:226)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:273)
        ... 22 more

2013-07-10 22:15:33,943 ERROR ql.Driver (SessionState.java:printError(383)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassCastException: class org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex)


I was using thrift version 0.8.0.

and

namit@namit-VirtualBox:~/elephant-bird$ protoc --version
libprotoc 2.4.1
namit@namit-VirtualBox:~/elephant-bird$ 


Dmitriy Ryaboy

Jul 10, 2013, 1:06:25 PM
to elephant...@googlegroups.com
The protoc version is right; the Thrift version is not, but it shouldn't matter since you are trying to use Protocol Buffers, not Thrift.

What version of Hive? We are on 0.10 and have yet to test 0.11. Though for some reason "show create table" doesn't work, which is supposed to be in 0.10, iirc :-/.

Is there a way to get Hive to tell you what version it is? --version didn't work; I'm just looking at the version in the jar, but maybe we have a custom build.

D

Namit Jain

Jul 10, 2013, 1:10:53 PM
to elephantbird-dev
I was using Hive from trunk.

No, there is no way to tell which version of Hive you are on. I will file a JIRA; I never thought about that.


Dmitriy Ryaboy

Jul 10, 2013, 1:12:16 PM
to elephant...@googlegroups.com
I'm getting lost in Hive's wrapping of exceptions.. you have a lot more experience with this -- where is the actual cast happening? I keep finding stuff like 

    } catch (Exception e) {
      throw new HiveException(e);
    }

...

Dmitriy Ryaboy

Jul 10, 2013, 1:13:47 PM
to elephant...@googlegroups.com
FYI, here's what Pig does for --version:

[dmitriy@host ~]$ pig --version
Apache Pig version 0.11.2+91 (rexported) 
compiled May 17 2013, 16:28:06

Namit Jain

Jul 12, 2013, 6:43:32 AM
to elephantbird-dev
I re-compiled with Java 6 and Thrift 0.7, but the error is the same.

I am compiling elephant-bird trunk. Is there a stable elephant-bird jar that I can use? I tried elephant-bird-2.2.3.jar, but it doesn't look like a valid jar file:

namit@namit-VirtualBox:~/hive/testProto$ jar -tvf elephant-bird-2.2.3.jar
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:131)
at java.util.zip.ZipFile.<init>(ZipFile.java:92)
at sun.tools.jar.Main.list(Main.java:997)
at sun.tools.jar.Main.run(Main.java:242)
at sun.tools.jar.Main.main(Main.java:1167)

Namit Jain

Jul 12, 2013, 6:44:09 AM
to elephantbird-dev
Switching to the Hive 0.11 branch instead of Hive trunk did not help.

Namit Jain

Jul 16, 2013, 12:24:52 PM
to elephantbird-dev
Hi Dmitriy,

I was busy with some other things and could not get to this earlier.

Is there any way I can get the elephant-bird jars that work for you? Is there a stable version I can use? Can we talk sometime?

Thanks,
-namit

Raghu Angadi

Jul 16, 2013, 1:57:45 PM
to elephant...@googlegroups.com
Hi Namit,

We have seen multiple issues reported with Hive. We don't actively use Hive, which is the cause of the delay. I will try to install Hive and run the example listed on the elephant-bird wiki.

The EB trunk is the right version to use.

Raghu.

Dmitriy Ryaboy

Jul 16, 2013, 2:16:28 PM
to elephant...@googlegroups.com
Raghu -- we do have Hive as part of HCat, though.

Namit Jain

Jul 16, 2013, 10:44:22 PM
to elephantbird-dev
Raghu/Dmitriy,

It might be something really simple that I am not doing correctly. We can get on a quick phone call if you want and I can tell you exactly what I am doing.


Thanks,
-namit

Dmitriy Ryaboy

Jul 18, 2013, 1:22:14 AM
to elephant...@googlegroups.com
Hi Namit,
Here's what Feng Peng, who owns our DAL and HCat stuff, tells me:

We are using EB 4.1; here is the output of "describe extended":

Table(tableName:xxxx, dbName:hdfstables, owner:null, createTime:1351061003, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://hadoop-dw-nn.twitter.com/tables/xxxx, inputFormat:com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ProtobufDeserializer, parameters:{serialization.class=com.twitter.data.proto.Tables$xxxx, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1352160284}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

I would suggest checking how the Hive CLI deals with the protobuf inner class name. EB does the following, and it may not play nice with the Hive CLI:

https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/util/Protobufs.java#L58
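
For illustration only (this is not the code at that link, just the naming wrinkle it deals with): a protobuf message generated as a nested class has a JVM name that uses "$" (Complexpb$Complex), while the canonical name uses a dot, so a loader that accepts either spelling has to probe both forms, roughly like this hypothetical sketch:

import com.google.protobuf.Message;

// Hypothetical sketch: resolve a protobuf message class whether the caller wrote
// "...Complexpb$Complex" (JVM inner-class form) or "...Complexpb.Complex".
public class ResolveMessageClass {
  static Class<? extends Message> resolve(String name) throws ClassNotFoundException {
    try {
      return Class.forName(name).asSubclass(Message.class);
    } catch (ClassNotFoundException e) {
      // Treat the last "." as an inner-class separator and retry with "$".
      int lastDot = name.lastIndexOf('.');
      String innerName = name.substring(0, lastDot) + "$" + name.substring(lastDot + 1);
      return Class.forName(innerName).asSubclass(Message.class);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(resolve(args[0]).getName());
  }
}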

Also, just FYI, we don't use the Hive CLI to create tables; we set StorageDescriptor properties directly in our programs instead.
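
A rough sketch of that programmatic route, in case it helps -- this is not our actual code; the database, table name, and location below are hypothetical, and it assumes the stock Hive metastore client API plus the EB jars on the classpath:

import java.util.ArrayList;
import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.SerDeInfo;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;

public class CreateProtobufTable {
  public static void main(String[] args) throws Exception {
    // SerDe: elephant-bird's ProtobufDeserializer pointed at the generated message class.
    SerDeInfo serde = new SerDeInfo();
    serde.setSerializationLib("com.twitter.elephantbird.hive.serde.ProtobufDeserializer");
    serde.setParameters(new HashMap<String, String>());
    serde.getParameters().put("serialization.class",
        "org.apache.hadoop.hive.serde2.proto.test.Complexpb$Complex");

    // Storage descriptor: columns left empty; the deserializer derives them from the protobuf.
    StorageDescriptor sd = new StorageDescriptor();
    sd.setSerdeInfo(serde);
    sd.setCols(new ArrayList<FieldSchema>());
    sd.setInputFormat("com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat");
    sd.setOutputFormat("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat");
    sd.setLocation("hdfs:///tmp/protobuf_users");   // hypothetical data location
    sd.setNumBuckets(-1);

    Table table = new Table();
    table.setDbName("default");                      // hypothetical database
    table.setTableName("users");                     // hypothetical table name
    table.setTableType("EXTERNAL_TABLE");
    table.setParameters(new HashMap<String, String>());
    table.getParameters().put("EXTERNAL", "TRUE");
    table.setSd(sd);

    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    client.createTable(table);
    client.close();
  }
}

The "describe extended" output above maps almost field-for-field onto these setters.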