PigPen hangs for large files


Punit Naik

Mar 10, 2016, 5:16:06 AM
to PigPen Support
I wrote a program using PigPen and tested it with a small subset of the actual data (which is 30 GB in my case), and got valid results. Then I ran the same script on the actual large file, and it doesn't even produce an output because it gets timed out. The log file generated by Pig shows only a timed-out error. What could the problem be, and how can I fix it?

Matt Bossenbroek

Mar 10, 2016, 11:27:03 AM
to Punit Naik, PigPen Support
It's most likely data skew. There are some tips here [1] on how to deal with data skew, but the best approach is to create a histogram of the problematic join to see which key is skewed.
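One way to build that histogram is a sketch like the following, using PigPen's `pigpen.core` API; the field name `:key` and the output path are placeholders for whatever you actually join on:

```clojure
(require '[pigpen.core :as pig])

;; Count how many records share each candidate join key, then sort
;; descending so the skewed key surfaces at the top of the output.
;; :key is a placeholder for your real join field.
(defn key-histogram [data]
  (->> data
       (pig/group-by :key)
       (pig/map (fn [[k values]] {:key k, :count (count values)}))
       (pig/sort-by :count :desc)))
```

Writing the result of `key-histogram` out with `pig/store-clj` (or similar) and inspecting the top few rows should show whether one key dominates.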

If you haven't already, make sure you've increased the parallelism on your joins enough that only one reducer (the one handling the skewed key) is timing out. If you have many reducers and they all time out, that instead points to slow user code being executed.
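In PigPen, parallelism is an option on the grouping and join operators; a minimal sketch (the operator choice, field name, and reducer count are placeholders):

```clojure
(require '[pigpen.core :as pig])

;; :parallel sets the number of reducers (Pig's PARALLEL clause) for
;; this operation. 200 is an arbitrary example value; tune it so that
;; only the skewed key's reducer remains slow.
(defn grouped [data]
  (pig/group-by :key {:parallel 200} data))
```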

As a last resort, you can increase that timeout (see the Pig or Hadoop docs) and just let it run longer, but that might be a very long time.
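For example, the timeout can be raised from the Pig script itself with a `SET` statement; this config fragment assumes an older Hadoop version, where the property is named `mapred.task.timeout` (newer versions use `mapreduce.task.timeout`):

```
-- Raise the task timeout to 30 minutes (value is in milliseconds;
-- the default is 600000, i.e. 10 minutes).
SET mapred.task.timeout 1800000;
```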

Also, it would be useful to see at least a stack trace for the exception you're seeing; steps to reproduce the problem would be even more useful.



Punit Naik

Mar 10, 2016, 2:33:44 PM
to PigPen Support
But I don't use joins in my code. My code only groups by a key and then aggregates.

Matt Bossenbroek

Mar 10, 2016, 2:48:25 PM
to Punit Naik, PigPen Support
It's still the same fundamental problem (though easier to debug). One of the keys likely has too many values and takes too long to process. The sequence of values is presented as a lazy seq, but if you hold on to the head of the sequence, the whole list is kept in memory.

I would recommend changing your group-by to just use (fold/count) to count the results and see which key is too large. If possible, it would be advantageous to write the aggregation itself as a fold; that way your code would never be presented with the full list.
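A sketch of both suggestions using PigPen's fold API; this assumes `pigpen.fold` combinators compose with `->>` as in the PigPen folding docs, and `:key` and `:amount` are placeholder field names:

```clojure
(require '[pigpen.core :as pig]
         '[pigpen.fold :as fold])

;; 1. Diagnose: count the values per group without ever materializing
;;    them, so the oversized key shows up directly in the output.
(defn group-sizes [data]
  (pig/group-by :key {:fold (fold/count)} data))

;; 2. Fix: express the aggregation itself as a fold (here, summing a
;;    field), so the reducer incrementally combines values and your
;;    code is never handed the full list for the skewed key.
(defn aggregate [data]
  (pig/group-by :key {:fold (->> (fold/map :amount) (fold/sum))} data))
```

The fold runs in the combiner as well as the reducer, which is what makes it resilient to a single oversized group.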


-Matt


Punit Naik

Mar 10, 2016, 2:50:44 PM
to PigPen Support
Thanks Matt. Will try that.