Bucket join in Scalding

Saket Kumar

May 23, 2019, 5:34:01 PM

to Scalding Development

There is a feature in Hive to do Sorted Merge Bucket join. How can this be implemented in Scalding?

Alex Levenson

May 23, 2019, 5:58:45 PM

to Saket Kumar, Scalding Development

I'm not very familiar with that. From some googling, it looks like it is for merging two already sorted datasets. Is that right?

--
You received this message because you are subscribed to the Google Groups "Scalding Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalding-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scalding-dev/fc7f4c54-651c-4ef8-aae1-5798c206b9fa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Alex Levenson

@THISWILLWORK

Saket Kumar

May 23, 2019, 6:07:57 PM

to Scalding Development

No, it is not a merge of sorted datasets. Hive lets you bucket a table on the join columns, and it writes the table out as that many bucket files. When the join is performed, only matching buckets are joined, and this happens on the map side because a single bucket is small enough to load into memory. https://learning.oreilly.com/library/view/apache-hive-cookbook/9781782161080/ch07s06.html
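
To make the bucketing idea above concrete, here is a rough sketch using plain Scala collections. It is not Hive or Scalding code, and every name in it (`BucketJoinSketch`, `bucketOf`, `bucketize`, `bucketJoin`) is made up for illustration. It assumes both tables are bucketed with the same hash function and the same bucket count, so that rows with equal keys always land in the same bucket number.

```scala
object BucketJoinSketch {
  // Hypothetical helper: map a key to one of numBuckets buckets.
  // Both tables must use the same function and the same bucket count.
  def bucketOf[K](key: K, numBuckets: Int): Int =
    Math.floorMod(key.hashCode, numBuckets)

  // Split a table into bucket "files", indexed by bucket id.
  def bucketize[K, V](rows: Seq[(K, V)], numBuckets: Int): Map[Int, Seq[(K, V)]] =
    rows.groupBy { case (k, _) => bucketOf(k, numBuckets) }

  // Join bucket i of the left table only with bucket i of the right table.
  // Each right-hand bucket is loaded into an in-memory lookup map,
  // which stands in for the map-side part of the join.
  def bucketJoin[K, A, B](
      left: Seq[(K, A)],
      right: Seq[(K, B)],
      numBuckets: Int
  ): Seq[(K, (A, B))] = {
    val leftBuckets = bucketize(left, numBuckets)
    val rightBuckets = bucketize(right, numBuckets)
    (0 until numBuckets).flatMap { b =>
      val lookup: Map[K, Seq[B]] =
        rightBuckets
          .getOrElse(b, Seq.empty)
          .groupBy(_._1)
          .map { case (k, kvs) => k -> kvs.map(_._2) }
      for {
        (k, a) <- leftBuckets.getOrElse(b, Seq.empty)
        bv <- lookup.getOrElse(k, Seq.empty)
      } yield (k, (a, bv))
    }
  }
}
```

Because the bucket assignment is done once when the tables are written, no shuffle is needed at join time in this scheme; each pair of matching buckets can be processed independently.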

Rajat Ahuja

May 23, 2019, 6:11:28 PM

to Alex Levenson, Saket Kumar, Scalding Development

@Alex It is efficient when the datasets are already partitioned, because then we do not have to pass them through reducers to partition them.

@Saket The Scalding library does not support sorted bucketed joins as of now.

Thanks

Rajat Ahuja

Alex Levenson

May 23, 2019, 6:15:12 PM

to Rajat Ahuja, Saket Kumar, Scalding Development

Right, unfortunately we don't have that feature in Scalding.

Saket Kumar

May 23, 2019, 6:40:40 PM

to Scalding Development

Thanks for replying to this. Is there any other technique in Scalding for joining two large tables?

Alex Levenson

May 23, 2019, 6:44:09 PM

to Saket Kumar, Scalding Development

Scalding's `join` methods (join, joinLeft, etc.) are the way to join two large tables. They are implemented as a standard map/reduce shuffle join and scale horizontally, though this does require sending the full dataset across the network from the mappers to the reducers.

If you have skew in your keyspace (some keys appear far more often than others), you can use a skew join, which has special handling for frequently appearing keys. You can tell whether you have skew from your Hadoop counters, and from the symptom of a small number of your (many) reducers taking much, much longer than the others.
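
As a rough illustration of what "special handling for frequently appearing keys" can mean, the sketch below uses key salting on plain Scala collections. This is one common technique, chosen here as an assumption; it is not Scalding's actual skew join implementation and does not use the Scalding API. The names `SkewJoinSketch`, `hotKeys`, `saltedJoin`, `threshold`, and `fanOut` are all hypothetical.

```scala
object SkewJoinSketch {
  // Keys with more than `threshold` rows on the large side are treated as hot.
  def hotKeys[K, V](rows: Seq[(K, V)], threshold: Int): Set[K] =
    rows.groupBy(_._1).collect { case (k, vs) if vs.size > threshold => k }.toSet

  def saltedJoin[K, A, B](
      large: Seq[(K, A)],
      small: Seq[(K, B)],
      threshold: Int,
      fanOut: Int
  ): Seq[(K, (A, B))] = {
    val hot = hotKeys(large, threshold)

    // Large side: spread each hot key over `fanOut` salted sub-keys
    // (round-robin by row index). Other keys keep salt 0.
    val saltedLarge: Seq[((K, Int), A)] =
      large.zipWithIndex.map { case ((k, a), i) =>
        val salt = if (hot(k)) i % fanOut else 0
        ((k, salt), a)
      }

    // Small side: replicate rows of hot keys once per salt value,
    // so every salted sub-key still finds all of its matches.
    val saltedSmall: Seq[((K, Int), B)] =
      small.flatMap { case (k, b) =>
        val salts = if (hot(k)) 0 until fanOut else Seq(0)
        salts.map(s => ((k, s), b))
      }

    // Ordinary inner join on the salted key. In a distributed job, each
    // (key, salt) pair could be routed to a different reducer.
    val lookup = saltedSmall.groupBy(_._1).map { case (ks, kvs) => ks -> kvs.map(_._2) }
    for {
      ((k, s), a) <- saltedLarge
      b <- lookup.getOrElse((k, s), Seq.empty)
    } yield (k, (a, b))
  }
}
```

The result is the same set of pairs an ordinary inner join would produce; the difference is that the rows for a hot key are split across `fanOut` groups instead of all landing in one, at the cost of replicating the matching rows from the smaller side.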

Alex Levenson

May 23, 2019, 6:44:55 PM

to Saket Kumar, Scalding Development

Sorry, I meant `leftJoin`, not `joinLeft`.
