EB 4.3 and ThriftPigLoader with Pig 0.12


Rohan

Nov 29, 2013, 12:09:49 PM
to elephant...@googlegroups.com

Hi

I am trying to read a file of base64-encoded, Thrift-serialized records, where every line is one base64-encoded serialized Thrift object.

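Code used to generate the data:
        // Serialize the Thrift object t with the binary protocol.
        TSerializer ts = new TSerializer(new TBinaryProtocol.Factory());
        // Line length 0 means no chunking (assuming commons-codec Base64), so each record stays on one line.
        Base64 encoder = new Base64(0);
        String s = new String(encoder.encode(ts.serialize(t)));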
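
For anyone reproducing the input file, here is a minimal sketch of the one-record-per-line layout. It is an illustration only: it uses java.util.Base64 from the JDK rather than the Base64 class above, the byte arrays are placeholders standing in for ts.serialize(t), and the class and method names (Base64Lines, encodeRecords) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class Base64Lines {
    // Encode each record as a single Base64 string with no embedded line breaks,
    // so that one line of the output file corresponds to exactly one record.
    static List<String> encodeRecords(List<byte[]> records) {
        Base64.Encoder encoder = Base64.getEncoder(); // the basic encoder never inserts newlines
        List<String> lines = new ArrayList<>();
        for (byte[] record : records) {
            lines.add(new String(encoder.encode(record), StandardCharsets.US_ASCII));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add(new byte[] {1, 2, 3});                      // placeholder for serialized Thrift bytes
        records.add("hello".getBytes(StandardCharsets.UTF_8)); // another placeholder payload
        for (String line : encodeRecords(records)) {
            System.out.println(line); // one record per line
        }
    }
}
```

The point of the sketch is that the encoder must not wrap long output, otherwise a line-oriented reader would see a single record split across several lines.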

Pig script used to read it:


REGISTER ./elephant-bird-core-4.3.jar;
REGISTER ./elephant-bird-pig-4.3.jar;
REGISTER ./elephant-bird-hadoop-compat-4.3.jar;
REGISTER ./libthrift-0.9.0.jar;
REGISTER ./thrift-struct-1.0.0.jar;
raw_data = load '$INPUT_FILES' using com.twitter.elephantbird.pig.load.ThriftPigLoader('XXXXX');
DUMP raw_data;


But it never produces any output.

Command used to run it:
pig -x local -f test.pig  --param INPUT_FILES='input.txt'


Any help would be appreciated.

Regards
Rohan

Dmitriy Ryaboy

Nov 29, 2013, 4:57:50 PM
to elephant...@googlegroups.com
Unfortunately, EB assumes your input is not only serialized Thrift, but LZO-compressed serialized Thrift.

If someone did the work to make MultiInputFormat not extend LzoInputFormat, or have LzoInputFormat not actually insist on Lzo, that would be a great contrib.


--
You received this message because you are subscribed to the Google Groups "elephantbird-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elephantbird-d...@googlegroups.com.
To post to this group, send email to elephant...@googlegroups.com.
Visit this group at http://groups.google.com/group/elephantbird-dev.
For more options, visit https://groups.google.com/groups/opt_out.



--
Dmitriy V Ryaboy
Twitter Analytics
http://twitter.com/squarecog

rohan rai

Nov 29, 2013, 11:42:39 PM
to elephant...@googlegroups.com
Thanks Dmitriy, you are always a help.

In the meantime, here is a hack that achieves the same result, for anyone who is interested:


REGISTER ./elephant-bird-core-4.3.jar;
REGISTER ./elephant-bird-pig-4.3.jar;
REGISTER ./elephant-bird-hadoop-compat-4.3.jar;
REGISTER ./libthrift-0.9.0.jar;
REGISTER ./data-struct-1.0.0.jar;
REGISTER ./internal_pigudf-1.0.0.jar;

DEFINE ThriftBytesToTupleDef com.twitter.elephantbird.pig.piggybank.ThriftBytesToTuple('XXXXX');

raw_data = load '$INPUT_FILES' using TextLoader() as (record:chararray);
decoded = FOREACH raw_data GENERATE FLATTEN(pigudf.B64Decode(record));
decoded_data = FOREACH decoded GENERATE ThriftBytesToTupleDef($0);

DUMP decoded_data;

Here the pigudf.B64Decode UDF converts a chararray into the base64-decoded bytearray.
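
The UDF itself is internal and not shown in this thread, so the following is only a sketch of what its decoding core might look like, not the actual implementation. The names B64DecodeSketch and decodeLine are hypothetical, and java.util.Base64 is an assumption. In a real Pig UDF this logic would presumably sit inside the exec method of a class extending org.apache.pig.EvalFunc, with the bytes wrapped in an org.apache.pig.data.DataByteArray; the exact return shape that makes FLATTEN work in the script above is not known from this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class B64DecodeSketch {
    // Decode one text line into raw bytes.
    // Returns null for null, blank, or malformed input so that a bad line
    // can be skipped instead of failing the whole job.
    static byte[] decodeLine(String line) {
        if (line == null) {
            return null;
        }
        String trimmed = line.trim(); // drop stray whitespace or a trailing carriage return
        if (trimmed.isEmpty()) {
            return null;
        }
        try {
            return Base64.getDecoder().decode(trimmed);
        } catch (IllegalArgumentException e) {
            return null; // input was not valid Base64
        }
    }

    public static void main(String[] args) {
        byte[] decoded = decodeLine("aGVsbG8=");
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

Returning null on malformed lines is a design choice for the sketch; failing loudly may be preferable if corrupt records should never occur.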

Regards
Rohan


