Pig Protobuf Error


Samir Madhavan

Jun 6, 2013, 4:50:53 AM
to elephant...@googlegroups.com
Hi,

I'm trying to read protobuf data using Pig, but I'm getting the following error and am not able to narrow down the problem.

grunt> raw = LOAD '/prt/LogCopy.bin' USING Logs;
2013-06-06 13:16:52,555 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: 
<line 2, column 6> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple' with arguments '[com.protobuf.LogProtos.Logs]'
2013-06-06 13:16:52,555 [main] WARN  org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2013-06-06 13:16:52,555 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse: 
<line 2, column 6> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple' with arguments '[com.protobuf.LogProtos.Logs]'
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1572)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1545)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: 
<line 2, column 6> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple' with arguments '[com.protobuf.LogProtos.Logs]'
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:835)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3236)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1315)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:799)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:517)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:392)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
... 15 more
Caused by: java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple' with arguments '[com.protobuf.LogProtos.Logs]'
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:618)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:823)
... 21 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:586)
... 22 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.protobuf.LogProtos
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:188)
at com.twitter.elephantbird.util.Protobufs.getInnerClass(Protobufs.java:92)
at com.twitter.elephantbird.util.Protobufs.getInnerProtobufClass(Protobufs.java:87)
at com.twitter.elephantbird.util.Protobufs.getProtobufClass(Protobufs.java:69)
at com.twitter.elephantbird.util.Protobufs.getProtobufClass(Protobufs.java:55)
at com.twitter.elephantbird.pig.util.PigUtil.getProtobufClass(PigUtil.java:55)
at com.twitter.elephantbird.pig.util.PigUtil.getProtobufTypeRef(PigUtil.java:89)
at com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple.<init>(ProtobufBytesToTuple.java:37)
... 27 more

Regards,
samir

Dmitriy Ryaboy

Jun 6, 2013, 9:07:13 AM
to elephant...@googlegroups.com
It sounds like your class com.protobuf.LogProtos is either missing from the classpath or was generated with the wrong version of the protoc binary.
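
A quick way to tell those two apart, outside Pig, is a small probe like the sketch below (a hypothetical helper, not part of elephant-bird; run it with the same jars on the classpath that you REGISTER in grunt, passing the class name from your error):

  // ProtoClassCheck.java -- distinguishes "not on the classpath" from
  // "found, but its static initializer failed", which is the usual
  // symptom of a protoc / protobuf-java version mismatch.
  public class ProtoClassCheck {
    public static void main(String[] args) {
      String name = args.length > 0 ? args[0] : "com.protobuf.LogProtos";
      try {
        Class<?> c = Class.forName(name);  // forName() runs the static initializer
        System.out.println("loaded OK: " + c.getName());
      } catch (ClassNotFoundException e) {
        System.out.println("not on the classpath: " + name);
      } catch (LinkageError e) {
        System.out.println("found, but failed to initialize: " + e);
      }
    }
  }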


--
Dmitriy V Ryaboy
Twitter Analytics
http://twitter.com/squarecog

Samir Madhavan

Jun 6, 2013, 12:09:46 PM
to elephant...@googlegroups.com
Thanks Dmitriy, I'll check that out. 

I was also trying to use the default AddressBook example, but I get the following error. The data is in a binary file. Am I passing in the wrong data, or is something else going wrong?

Sorry if the question sounds naive, but I'm new to protobufs.

raw = LOAD 'AddressBook.bin' USING AddressBook;
2013-06-06 20:52:56,475 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: 
<line 2, column 6> pig script failed to validate: java.lang.ClassCastException: com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple cannot be cast to org.apache.pig.LoadFunc
2013-06-06 20:52:56,475 [main] WARN  org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2013-06-06 20:52:56,475 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse: 
<line 2, column 6> pig script failed to validate: java.lang.ClassCastException: com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple cannot be cast to org.apache.pig.LoadFunc
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1572)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1545)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: 
<line 2, column 6> pig script failed to validate: java.lang.ClassCastException: com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple cannot be cast to org.apache.pig.LoadFunc
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:835)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3236)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1315)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:799)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:517)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:392)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
... 15 more
Caused by: java.lang.ClassCastException: com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple cannot be cast to org.apache.pig.LoadFunc
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:823)
... 21 more


Raghu Angadi

Jun 6, 2013, 12:18:51 PM
to elephant...@googlegroups.com
Oh, to load you need to use the protobuf loader:

a = load 'input.lzo' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('class_name');

What is the format of your input? If it is a sequence file, there is another loader for that.

Raghu.

Samir Madhavan

Jun 6, 2013, 12:42:05 PM
to elephant...@googlegroups.com
Thanks a lot. I was misled by the README at https://github.com/daggerrz/Pig-Protobuf

jobillouis joseph

Jun 7, 2013, 3:32:42 AM
to elephant...@googlegroups.com
Hi,

I'm facing an issue after the load. Following the example code, I get the error below. I also tried removing the phone number and keeping just person.name; the MapReduce job starts, but it doesn't map to any fields in the data. I've added the log after the following block, and attached the data I'm using.


grunt> raw = LOAD 'AddressBook.data' USING com.twitter.elephantbird.pig.load.ProtobufPigLoader('com.twitter.elephantbird.examples.proto.AddressBookProtos.AddressBook');
grunt> person_phone_numbers = foreach raw generate name, FLATTEN(phone.phone_tuple.number) as phone_number;
2013-06-07 12:32:51,881 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1128: Cannot find field phone_tuple in name:chararray,id:int,email:chararray,phone:bag{phone_tuple:tuple(number:chararray,type:chararray)}
2013-06-07 12:32:51,881 [main] WARN  org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2013-06-07 12:32:51,881 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse: 
<line 22, column 23> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1128: Cannot find field phone_tuple in name:chararray,id:int,email:chararray,phone:bag{phone_tuple:tuple(number:chararray,type:chararray)}
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)

grunt> person_phone_numbers = foreach raw generate person.name;
grunt> dump person_phone_numbers;
2013-06-07 12:34:36,338 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2013-06-07 12:34:36,342 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-07 12:34:36,343 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-06-07 12:34:36,343 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-06-07 12:34:36,344 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-07 12:34:36,344 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-07 12:34:36,344 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-07 12:34:36,347 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=81
2013-06-07 12:34:36,347 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-07 12:34:36,573 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5349291827032820997.jar
2013-06-07 12:34:39,371 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5349291827032820997.jar created
2013-06-07 12:34:39,378 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2013-06-07 12:34:39,379 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-07 12:34:39,379 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-07 12:34:39,379 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-07 12:34:39,388 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-07 12:34:39,405 [JobControl] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-06-07 12:34:39,632 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-07 12:34:39,632 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 0
2013-06-07 12:34:39,889 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-07 12:34:40,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306062036_0012
2013-06-07 12:34:40,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases person_phone_numbers,raw
2013-06-07 12:34:40,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: raw[21,6],person_phone_numbers[22,23] C:  R: 
2013-06-07 12:34:40,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201306062036_0012
2013-06-07 12:34:49,996 [main] WARN  mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2013-06-07 12:34:49,998 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-07 12:34:49,999 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.3.0 0.11.0-cdh4.3.0 hduser 2013-06-07 12:34:36 2013-06-07 12:34:49 UNKNOWN

Success!

Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201306062036_0012 0 0 0 0 0 0 0 0 0 0 person_phone_numbers,raw MAP_ONLY hdfs://localhost:8020/tmp/temp976511903/tmp-690803741,

Input(s):
Successfully read 0 records from: "hdfs://localhost:8020/user/hduser/input/AddressBook.data"

Output(s):
Successfully stored 0 records in: "hdfs://localhost:8020/tmp/temp976511903/tmp-690803741"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201306062036_0012


2013-06-07 12:34:50,016 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-06-07 12:34:50,017 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2013-06-07 12:34:50,029 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 0
2013-06-07 12:34:50,029 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 0


Attachment: AddressBook.data

Raghu Angadi

Jun 7, 2013, 11:21:53 AM
to elephant...@googlegroups.com
First issue: could you try 'FLATTEN(phone.number)' (without phone_tuple)? Here 'phone' is the array (a 'bag' in Pig), and 'phone_tuple' is just the name of the tuple type. It is similar to accessing a Java array:

  // if you have
  PhoneInfo[] phones = new PhoneInfo[10]; ...
  // you would access the number of the first phone as:
  phones[0].number  // not phones[0].PhoneInfo.number

Second issue: your input AddressBook.data is not in the right format. How did you create it? It needs to be base64 lines or elephant-bird's 'binary format', or it could be a Hadoop sequence file (loaded with another loader). I could give you an example if you like.

Raghu.

jobillouis joseph

Jun 8, 2013, 8:45:58 AM
to elephant...@googlegroups.com
Thanks Raghu. For the second issue, it would be great if you could give me an example.

ramgo...@gmail.com

Jun 9, 2013, 7:16:15 AM
to elephant...@googlegroups.com
It would be great to have the example; I'm facing the same issue. I'm able to create the data using the standard protobuf tutorial, which produces a file similar to the one above. I also modified the code to generate a base64-encoded file, but it's not mapping. I've attached the Python code with this post, just for reference.
Attachment: add_person2.py

Samir Madhavan

Jun 9, 2013, 10:14:50 AM
to elephant...@googlegroups.com
Raghu, could you elaborate on the "elephant bird binary format"? How do we store the protobuf messages in that format?

Also, just curious: won't base64 line encoding increase the size of the messages?

Raghu Angadi

Jun 9, 2013, 2:35:50 PM
to elephant...@googlegroups.com, ramgo...@gmail.com, samir.m...@gmail.com

First I will describe the simplest method, base64 encoding, since it does not have to deal with binary data. I am attaching a sample file with two 'Person' records.
  • What's in the input file:
    $ lzop -dc  persons.lzo | cat
    CgtKYWNrIERvcnNleRAUGgxqYWNrQHR3aXR0ZXIiEAoMNDE1LTU1NS0xMjM0EAAiEAoMNDE1LTU1NS01Njc4EAE=
    CgpKb2huIFNtaXRoEB4aDGpvaG5AdHdpdHRlciIQCgw2NTAtNTU1LTEyMzQQACIQCgw2NTAtNTU1LTU2NzgQAQ==

  • pig script to load:
    $ pig -x local
    > register [ elephant-bird jars ]
    > a = load 'persons.lzo' using ProtobufPigLoader('com.twitter.data.proto.tutorial.AddressBookProtos.Person');
    > dump a;

  • Output of 'dump a':
    (Jack Dorsey,20,jack@twitter,{(415-555-1234,MOBILE),(415-555-5678,HOME)})
    (John Smith,30,john@twitter,{(650-555-1234,MOBILE),(650-555-5678,HOME)})
This should get you going. You could create these base64-encoded files in various ways, starting from the binary protobufs. The reason you can't just have a series of binary protobufs without any delimiter is that these input formats don't always process a file from beginning to end: a split needs to start at an arbitrary position in the input file and be able to find the start of the next protobuf. The newline in a base64-encoded file serves as that delimiter.
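
For example, here is a minimal Java sketch of one way to write such a file (this assumes the generated Person class from the protobuf tutorial, as used above, and Java 8's java.util.Base64; any base64 codec works):

  import java.io.PrintWriter;
  import java.util.Base64;
  import com.twitter.data.proto.tutorial.AddressBookProtos.Person;

  public class WriteBase64Persons {
    public static void main(String[] args) throws Exception {
      try (PrintWriter out = new PrintWriter("persons.txt")) {
        Person p = Person.newBuilder()
            .setName("Jack Dorsey").setId(20).setEmail("jack@twitter")
            .build();
        // one record per line: base64 of the serialized message
        out.println(Base64.getEncoder().encodeToString(p.toByteArray()));
      }
      // lzop persons.txt afterwards to get something like persons.lzo
    }
  }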

Yes, base64 increases the size, by roughly 30% (every 3 bytes of binary become 4 ASCII characters). There are three other options in elephant-bird: the block format (written with ProtobufBlockWriter), Hadoop sequence files, and RCFile.

Can you tell us whether you already use any Hadoop formats (i.e., how you store and process data in Hadoop)? And describe the higher-level problem a bit more (what you have and what you need to do).

I need to add examples like this to the EB wiki.

Raghu.

Attachment: persons.lzo

Samir Madhavan

Jun 10, 2013, 10:34:30 AM
to Raghu Angadi, ramgo...@gmail.com, elephant...@googlegroups.com

Thanks Raghu, this is really helpful.

We are basically doing some analytics on application server data.
We plan to store the protobuf messages in an MQ system and then transfer them to Hadoop. We wanted something compact, which puts less stress on bandwidth and keeps computation fast.

After going through elephant-bird, and based on our understanding of it, we plan to structure the data pipeline as follows.

1. Read the data from the MQ on an hourly basis into a file on the local file system. While reading, we'll convert the protobuf messages to the elephant-bird block format using ProtobufBlockWriter (see the sketch after this list).

2. Once the data file is populated, we'll lzop it.

3. Transfer the data to Hadoop

4. Run the lzo indexer on it

5. Pig will take care of the rest; we'll query it using the elephant-bird library.
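
For step 1, here's a minimal sketch of what we have in mind, using the tutorial's Person message as a stand-in for our log protobuf (assuming ProtobufBlockWriter's (OutputStream, message class) constructor and write()/finish(), as in the elephant-bird examples):

  import java.io.FileOutputStream;
  import java.io.OutputStream;
  import com.twitter.data.proto.tutorial.AddressBookProtos.Person;
  import com.twitter.elephantbird.util.ProtobufBlockWriter;

  public class BlockWriterSketch {
    public static void main(String[] args) throws Exception {
      try (OutputStream out = new FileOutputStream("logs.block")) {
        ProtobufBlockWriter<Person> writer =
            new ProtobufBlockWriter<Person>(out, Person.class);
        // in practice this loop would drain an hour's worth of MQ messages
        writer.write(Person.newBuilder()
            .setName("Jack Dorsey").setId(20).setEmail("jack@twitter").build());
        writer.finish();  // flush the last partial block
      }
      // steps 2-4: lzop logs.block, copy it to HDFS, run the LZO indexer
    }
  }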

I hope this data pipeline makes sense.

P.S. We are using Kafka for the MQ. We could have written a MapReduce job to pull the data from the MQ, but our data is relatively small and our deliverables are due soon. Camus is great for Avro; something similar for protobuf seems needed. We have approached their community, and I think someone is working on a plugin for protobufs too.

Raghu Angadi

Jun 10, 2013, 11:32:36 AM
to Samir Madhavan, ramgo...@gmail.com, elephant...@googlegroups.com
Hi Samir,

The scheme looks fine. A couple of notes:

While compressing the file, the output stream can create the index inline (set the 'elephantbird.lzo.output.index' Hadoop conf to true). This avoids the extra indexing step.

If you are willing to spend more CPU while compressing, you can increase the LZO compression level: set 'io.compression.codec.lzo.compression.level' to 7.
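
For reference, a sketch of both knobs applied to a Hadoop Configuration (the keys and values are exactly the ones above; the wrapper class is just for illustration):

  import org.apache.hadoop.conf.Configuration;

  public class LzoConf {
    public static Configuration lzoConf() {
      Configuration conf = new Configuration();
      // write the LZO index inline while compressing (skips the indexer step)
      conf.setBoolean("elephantbird.lzo.output.index", true);
      // higher level: more CPU at write time, smaller output
      conf.setInt("io.compression.codec.lzo.compression.level", 7);
      return conf;
    }
  }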

If your protobuf is relatively flat (many fields at the top level), the RCFile option gives the best space savings and access speed.

Raghu.

Samir Madhavan

Jun 10, 2013, 11:39:26 AM
to Raghu Angadi, elephant...@googlegroups.com, Samir Madhavan

Thanks Raghu for all the info :)
