Secondary-Sort Optimisation

Punit Naik

Mar 28, 2016, 7:32:04 AM

to PigPen Support

If I sort an entire relation, will PigPen optimise it internally? That is, will it run a secondary sort under the hood?

Punit Naik

Mar 28, 2016, 7:46:55 AM

to PigPen Support

Also, I think (pig/fold (fold/sort ...)) optimises the sort directly. If that is true, what is the use of pig/sort-by?

Matt Bossenbroek

Mar 28, 2016, 11:49:55 AM

to Punit Naik, PigPen Support

I'm not sure what you mean by optimizing it internally or by a secondary sort, but the pig/sort command uses Pig's ORDER BY operator [1] under the hood.

This is probably the command you want. It does a parallel sort by partitioning the data across n reducers, sorting each partition individually, and then merging the results.
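
To illustrate the partition, sort, and merge steps, here is a minimal single-process sketch in Python. It is not Pig's or PigPen's actual code: the function name `range_partitioned_sort`, the sampling strategy, and the `sample_size` and `seed` parameters are all assumptions made for the example. It only shows the general idea of a range-partitioned sort, where each partition stands in for one reducer.

```python
import bisect
import random


def range_partitioned_sort(records, n_partitions, sample_size=100, seed=0):
    """Sort by range-partitioning, sorting each partition, then concatenating."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    if not records:
        return []

    # Sample the input to estimate where the partition boundaries should fall
    # (hypothetical strategy, chosen only to keep partitions roughly balanced).
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))

    # Pick n_partitions - 1 boundaries from evenly spaced positions in the sample.
    boundaries = [
        sample[(i * len(sample)) // n_partitions]
        for i in range(1, n_partitions)
    ]

    # Route each record to the partition whose value range contains it.
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record)].append(record)

    # Sort every partition independently (on a cluster, one per reducer).
    # The ranges do not overlap, so concatenating them in order is the merge.
    result = []
    for partition in partitions:
        result.extend(sorted(partition))
    return result
```

Calling `range_partitioned_sort(data, 3)` on a list of comparable values returns the same result as `sorted(data)`. The difference on a cluster is that the per-partition sorts can run in parallel, which a single fold over the whole dataset cannot do.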

The pig/fold command is intended to reduce all of the data into a single value, so it is probably the worst case for performance when sorting a large dataset. The fold/sort function is included there for completeness, but it probably has very little real-world use.

-Matt

[1] https://pig.apache.org/docs/r0.15.0/basic.html#order-by


Punit Naik

Mar 28, 2016, 1:50:42 PM

to PigPen Support, naik.p...@gmail.com

Okay, thanks a lot for the reply. Sorry, I was talking in Hadoop terms. Anyway, your response cleared up my doubt, so thank you.