OutOfMemory when using hive+elephant bird+protobuf

Haiying Huang

Oct 16, 2014, 7:37:17 PM

to elephant...@googlegroups.com

Hi,

Has anybody tried using EB with Hive and protobuf binary data in sequence files? I got an OOM error for a simple count query on 2 TB of data, whereas EB + Pig ran successfully on the same data set.

Here is my table schema:

create external table userprofile
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="UserProfileProtos$UserProfile")
STORED AS SEQUENCEFILE
LOCATION '...';

My query is select count(*) from userprofile;

To keep memory growth in check, I applied the following settings:

set mapreduce.map.memory.mb=2560;
set mapreduce.reduce.memory.mb=2560;
set mapred.child.java.opts=-server -Xmx2048m;
set java.net.preferIPv4Stack=true;
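
One thing that may be worth checking with these settings (an assumption about the cluster, not something confirmed in this thread): on Hadoop 2.x, mapred.child.java.opts is the older, shared property, and the per-task properties mapreduce.map.java.opts and mapreduce.reduce.java.opts take precedence when they are set, for example in the cluster's mapred-site.xml. If that is the case here, the -Xmx2048m above may not be reaching the tasks at all. A minimal sketch that sets the per-task properties explicitly and then prints what the session is using:

```sql
-- Container sizes requested from YARN (values taken from the post above).
set mapreduce.map.memory.mb=2560;
set mapreduce.reduce.memory.mb=2560;

-- Per-task JVM options; the heap stays below the container size
-- to leave headroom for non-heap memory.
set mapreduce.map.java.opts=-server -Xmx2048m;
set mapreduce.reduce.java.opts=-server -Xmx2048m;

-- "set <name>;" with no value prints the value currently in effect.
set mapreduce.map.java.opts;
set mapreduce.reduce.java.opts;
```

If the printed values differ from what was set, something else in the session or cluster configuration is overriding them.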

I also increased the number of mappers by setting a smaller split size:

set mapreduce.input.fileinputformat.split.maxsize=134217728;
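
For a sense of scale, 134217728 bytes is 128 MiB, so roughly 2 TB of input works out to something on the order of 15,000 to 16,000 splits, each handled by one mapper. The actual count depends on file boundaries and on which input format Hive is using. A small sketch for checking what the session really picks up (the property names are standard Hive/Hadoop ones; whether they are overridden on this cluster is an assumption):

```sql
-- 134217728 bytes = 128 * 1024 * 1024 = 128 MiB per split (value from the post above).
set mapreduce.input.fileinputformat.split.maxsize=134217728;

-- Print the effective values; "set <name>;" with no value echoes the current setting.
set mapreduce.input.fileinputformat.split.maxsize;
set hive.input.format;
```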

I am using EB 4.0.4 with protobuf 2.4 and Hadoop 2.5.

Can anybody shed some light on this?

Thanks,

-- Haiying

Cristi Calugaru

Oct 17, 2014, 4:08:05 AM

to elephant...@googlegroups.com

Hi Haiying,

I am facing similar issues with the Elephant Bird libraries. My protobuf message is rather complex, so I had to take Rahul Ravindran's repeatedFieldFix branch and make some updates to it. For example, I had issues with fields declared as enums and byte arrays, as well as with messages that have certain levels of nesting. Still, Rahul's branch proved a very good starting point.

Now, about the issues I am facing: I ran a simple count(*) on one 52 TB table, where the data is compressed with LZO. On a quite large cluster (200+), mappers start to time out and eventually throw OOM errors. I have tried an -Xmx of up to 32 GB, without any obvious improvement. I feel that the code has a very nasty memory leak somewhere, or we are just missing something obvious in the configs.

If you can share your message definition and some sample data, I would gladly try it out with my current codebase, just to see if I get anything different. I have been working on the codebase for more than 3 weeks now, and while all basic queries work fine (this wasn't working out of the box, even with all of Rahul's fixes), there seem to be serious performance issues.

Did you try to partition your data, to see if it makes a difference?
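
To make the partitioning question concrete, here is a minimal sketch of a partitioned variant of the table from the first post. The table name userprofile_part, the partition column dt, and the date value are all hypothetical, and the locations are left elided as in the original; it assumes the data can be laid out in one directory per partition.

```sql
-- Hypothetical partitioned variant of the original table; "dt" is an assumed partition column.
create external table userprofile_part
partitioned by (dt string)
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="UserProfileProtos$UserProfile")
STORED AS SEQUENCEFILE
LOCATION '...';

-- Register one directory of data as a partition (path elided, as in the original post).
alter table userprofile_part add partition (dt='2014-10-16') LOCATION '...';

-- Count a single partition, so the job reads only that directory.
select count(*) from userprofile_part where dt='2014-10-16';
```

If a single partition also runs out of memory, the problem is less likely to be about total input size and more likely to be in how individual records or splits are handled.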

Also, you say that with pig everything worked fine?
