Bucket join in Scalding

Saket Kumar

May 23, 2019, 5:34:01 PM
to Scalding Development
There is a feature in Hive to do Sorted Merge Bucket join. How can this be implemented in Scalding? 

Alex Levenson

May 23, 2019, 5:58:45 PM
to Saket Kumar, Scalding Development
I'm not very familiar with that. I did some googling, and it looks like that's for merging two already-sorted datasets. Is that right?

--
You received this message because you are subscribed to the Google Groups "Scalding Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalding-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scalding-dev/fc7f4c54-651c-4ef8-aae1-5798c206b9fa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Alex Levenson
@THISWILLWORK

Saket Kumar

May 23, 2019, 6:07:57 PM
to Scalding Development
No, it is not a merge of sorted datasets. Hive lets you bucket a table on its join columns, which writes one file per bucket. When the join is performed, only matching buckets are joined, and this happens on the map side because a bucket is small enough to load into memory. https://learning.oreilly.com/library/view/apache-hive-cookbook/9781782161080/ch07s06.html
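
To make the idea concrete, here is a minimal sketch of a bucketed map-side join written with plain Scala collections rather than Hive or Scalding. The object and function names are hypothetical, an in-memory `Map` stands in for the bucket files, and the per-bucket hash lookup is a simplification: a real sorted merge bucket join would merge pre-sorted buckets instead.

```scala
// Illustrative only: plain Scala collections standing in for bucketed table files.
object BucketJoinSketch {
  // Assign a key to one of numBuckets buckets. Both tables must use the same function.
  def bucketOf[K](key: K, numBuckets: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numBuckets)

  // "Write" a table as buckets keyed by bucket index (in Hive, one file per bucket).
  def bucketize[K, V](rows: Seq[(K, V)], numBuckets: Int): Map[Int, Seq[(K, V)]] =
    rows.groupBy { case (k, _) => bucketOf(k, numBuckets) }

  // Join bucket i of the left table only with bucket i of the right table.
  def bucketJoin[K, V, W](
      left: Map[Int, Seq[(K, V)]],
      right: Map[Int, Seq[(K, W)]],
      numBuckets: Int): Seq[(K, (V, W))] =
    (0 until numBuckets).flatMap { i =>
      // Load the right-hand bucket into memory as a key -> values lookup.
      val lookup: Map[K, Seq[W]] =
        right.getOrElse(i, Seq.empty).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
      // Stream the left-hand bucket past the lookup; no shuffle is involved.
      for {
        (k, v) <- left.getOrElse(i, Seq.empty)
        w <- lookup.getOrElse(k, Seq.empty)
      } yield (k, (v, w))
    }
}
```

Because both tables are bucketed with the same function and bucket count, a key can only ever match rows in the bucket with the same index, which is why no cross-bucket comparison is needed.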


Rajat Ahuja

May 23, 2019, 6:11:28 PM
to Alex Levenson, Saket Kumar, Scalding Development
@Alex It is efficient when the datasets are already partitioned, because they do not have to pass through reducers to be partitioned.
@Saket The Scalding library does not support sorted bucketed joins as of now.

Thanks
Rajat Ahuja

Alex Levenson

May 23, 2019, 6:15:12 PM
to Rajat Ahuja, Saket Kumar, Scalding Development
Yes, unfortunately we don't have that feature in Scalding.

Saket Kumar

May 23, 2019, 6:40:40 PM
to Scalding Development
Thanks for replying to this. Is there any other technique in Scalding to join two large tables?



Alex Levenson

May 23, 2019, 6:44:09 PM
to Saket Kumar, Scalding Development
Scalding's `join` methods (join, joinLeft, etc.) are the way to join two large tables. They are implemented as a standard map/reduce shuffle join and scale horizontally, though they do require sending the full dataset across the network from the mappers to the reducers.
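
As a rough illustration of such a shuffle join, here is a minimal sketch using Scalding's typed API, where the left-join method is named `leftJoin`. It assumes Scalding is on the classpath; the object name and the `users` and `orders` pipes are hypothetical example data, not anything from this thread.

```scala
import com.twitter.scalding.typed.TypedPipe

// Hypothetical example data and names; only the join calls are the point here.
object ShuffleJoinSketch {
  val users: TypedPipe[(Long, String)] =
    TypedPipe.from(List((1L, "ann"), (2L, "bob"), (3L, "cy")))
  val orders: TypedPipe[(Long, Double)] =
    TypedPipe.from(List((1L, 9.5), (3L, 2.0), (3L, 4.0)))

  // Inner join: both sides are grouped by key and shuffled to the reducers.
  val inner: TypedPipe[(Long, (String, Double))] =
    users.group.join(orders.group).toTypedPipe

  // Left join: every user is kept; users without orders are paired with None.
  val left: TypedPipe[(Long, (String, Option[Double]))] =
    users.group.leftJoin(orders.group).toTypedPipe
}
```

Both pipes are grouped on the same key type, so the only requirement is an `Ordering` for the key, which Scala provides for `Long`.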

If you have skew in your keyspace (some keys appear far more often than others) you can use a skew join, which has special handling for frequently appearing keys. You can tell whether you have skew in your keyspace from your Hadoop counters and from the symptom of a small number of your (many) reducers taking much longer than the others.
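
The message above does not name the API for a skew join. The sketch below assumes it refers to the `sketch` method on typed pipes, which estimates key frequencies before joining; the data, the object name, and the reducer count of 20 are hypothetical, and the exact `sketch` signature should be checked against the Scalding version in use.

```scala
import com.twitter.scalding.typed.TypedPipe

// Hypothetical data and names; verify `sketch` against your Scalding version.
object SkewJoinSketch {
  // `sketch` needs a way to turn keys into bytes for its frequency estimate.
  implicit val keyBytes: String => Array[Byte] = _.getBytes("UTF-8")

  // "hot" appears far more often than the other key.
  val clicks: TypedPipe[(String, Int)] =
    TypedPipe.from(List.fill(1000)(("hot", 1)) ++ List(("cold", 1)))
  val pages: TypedPipe[(String, String)] =
    TypedPipe.from(List(("hot", "/home"), ("cold", "/about")))

  // Sketch the key frequencies of the large, skewed side, then join.
  // 20 is an arbitrary reducer count chosen for illustration.
  val joined: TypedPipe[(String, (Int, String))] =
    clicks.sketch(20).join(pages).toTypedPipe
}
```

The result should contain the same pairs as a plain inner join; the difference is only in how the work for frequent keys is spread across reducers.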

Alex Levenson

May 23, 2019, 6:44:55 PM
to Saket Kumar, Scalding Development
Sorry, I meant leftJoin, not joinLeft.