Hive loading with RCFileThriftOutputFormat

szzhu

Jul 13, 2012, 3:02:02 AM
to elephant...@googlegroups.com

I am interested in using RCFileProtobufOutputFormat to load my log data into Hive, to take advantage of both RCFile efficiency and Protocol Buffer flexibility.

Right now, I am able to generate RCFileProtobufOutputFormat data with an MR job. Can anyone provide an example Hive schema showing how to load the data into Hive? I noticed there is a recent effort to separate the SerDe from the InputFormat. I am not sure if it also applies to RCFileThriftOutputFormat.

I am currently using EB 3.0.1 checked out from trunk, and my Hive version is 0.9.0.

Thanks,
Shanzhong

Raghu Angadi

Jul 13, 2012, 12:55:21 PM
to elephant...@googlegroups.com, szzhu

Did you mean InputFormat?

It is not possible yet without a little bit of code. The usage would be similar to how Lzo-Thrift was supported: https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive/d0bca978ab5f1dfe73d79ca62447dc65fae442a8 (note that this wiki page is an older version, but still valid).

Travis, I think we should keep the previous method of accessing from Hive also on the wiki.

I will try to get an example usage working. Note that you get the maximum advantage only once we support projection pushdown, so that only the required fields are read.

Raghu



--
You received this message because you are subscribed to the Google Groups "elephantbird-dev" group.
To view this discussion on the web visit https://groups.google.com/d/msg/elephantbird-dev/-/-P-0SXmt33AJ.
To post to this group, send email to elephant...@googlegroups.com.
To unsubscribe from this group, send email to elephantbird-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/elephantbird-dev?hl=en.

szzhu

Jul 13, 2012, 2:49:10 PM
to elephant...@googlegroups.com, szzhu

Thanks, Raghu.

So right now, I cannot really take advantage of column-oriented placement in the RCFileProtobuf format, since the fields will be deserialized into the full protobuf object before the query executes. Performance could actually be worse than using the Protobuf format directly: in that case the serialized bytes of a protobuf object sit together on disk, while RCFileProtobuf stores them column by column. Is my understanding correct?
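
To make the question above concrete, here is a toy Python sketch. It is only an illustration under stated assumptions: plain lists stand in for RCFile columns, the field names are made up, and nothing here is Elephant Bird or Hive code. It compares how many bytes a reader touches when it must rebuild every full record with how many it touches when only the requested column is read.

```python
# Toy model only: plain Python lists stand in for RCFile columns.
# None of these names come from Elephant Bird or Hive.

records = [
    {"user_id": 1, "url": "/home", "agent": "Mozilla/5.0"},
    {"user_id": 2, "url": "/cart", "agent": "curl/7.21"},
]


def to_columns(rows):
    """Store each field in its own column, as a columnar file would."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(str(value).encode("utf-8"))
    return columns


def read(columns, wanted=None):
    """Return (rows, bytes_touched); wanted=None models a reader with no projection."""
    names = list(columns) if wanted is None else list(wanted)
    bytes_touched = sum(len(cell) for name in names for cell in columns[name])
    row_count = len(next(iter(columns.values())))
    rows = [
        {name: columns[name][i].decode("utf-8") for name in names}
        for i in range(row_count)
    ]
    return rows, bytes_touched


columns = to_columns(records)
full_rows, full_bytes = read(columns)                # every column is read
url_rows, url_bytes = read(columns, wanted=["url"])  # only the needed column
print(full_bytes, url_bytes)
```

Without projection the reader pays for every column even though the query needs only `url`, and it additionally has to stitch the columns back into records.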

I checked the older version of the Hive page. It seems only the Thrift SerDe is supported. So protobuf is not yet supported for Hive?

Thanks,
Shanzhong



Raghu Angadi

Jul 13, 2012, 7:15:34 PM
to elephant...@googlegroups.com, szzhu
Shanzhong,

We do take full advantage of column-oriented placement in Pig.

We don't have that support in Hive. Yes, we have only the ThriftSerDe. Adding a ProtobufSerDe would be pretty simple; I will try to add that. But most likely it won't take advantage of column storage, just like ThriftSerde. We will see.

Raghu.


Raghu Angadi

Jul 17, 2012, 11:59:15 AM
to elephant...@googlegroups.com, szzhu
FWIW, added a ProtobufDeserializer (https://github.com/kevinweil/elephant-bird/pull/234). This helps with reading protobufs from other formats.

But RCFile support is not there yet. The main issue is passing the protobuf class name to RCFileProtobufInputFormat. Apparently Hive does not support passing any info to the input format.

We can sort of cheat and use the classname stored in the RCFile metadata as an alternative...
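
A rough sketch of that fallback idea, in Python for brevity. The config key, the metadata key, and the class name below are hypothetical placeholders, not actual Elephant Bird or Hive properties; the point is only the lookup order: use an explicitly configured class name if one exists, otherwise use the name stored with the file.

```python
# Hypothetical names throughout: these keys are placeholders, not real
# Elephant Bird or Hive properties.
CONF_KEY = "example.protobuf.class"
METADATA_KEY = "example.stored.class"


def resolve_class_name(job_conf, file_metadata):
    """Prefer a configured class name, else fall back to the file's metadata."""
    name = job_conf.get(CONF_KEY) or file_metadata.get(METADATA_KEY)
    if not name:
        raise ValueError("no protobuf class name in job config or file metadata")
    return name


# Hive passes nothing to the input format, so the config lookup comes back
# empty and the name written alongside the data is used instead.
hive_conf = {}
metadata = {METADATA_KEY: "com.example.proto.LogEvent"}
resolved = resolve_class_name(hive_conf, metadata)
print(resolved)
```

The trade-off is that the reader then trusts whatever class name the writer recorded, so the class must still be on the classpath at query time.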

Raghu.

szzhu

Jul 17, 2012, 3:05:16 PM
to elephant...@googlegroups.com, szzhu

Great news! Thanks a lot, Raghu.

To load into Hive, we defined "serialization.format" for Thrift.

"serialization.format"="org.apache.thrift.protocol.TBinaryProtocol"

What is the setting for protobuf?

Thanks,
Shanzhong

Raghu Angadi

Jul 17, 2012, 3:51:24 PM
to elephant...@googlegroups.com, szzhu
It is not required. There is only one deserialization format for protobufs.


chao wang

Jan 15, 2015, 10:19:37 PM
to elephant...@googlegroups.com

Hi, I want to use protobuf to compress column contents, then use RCFile to store all the columns.
When I read one column, does it deserialize just that column?

Can RCFileProtobufOutputFormat do that job? Thank you.


On Friday, July 13, 2012 at 3:02:02 PM UTC+8, szzhu wrote:
