Hey,
We have the following scenario that we are trying to solve.
I have X tuples (where X is typically ~10M).
I need to send these tuples to an external resource, but I need to control the number of threads doing the sending (so as not to overload the resource).
There may be repeated keys; if so, the tuples sharing a key need to be sent together.
What we have attempted is:
GroupBy(KEY)
Every(Fields.ALL, new UniformGroupingBuffer())
where UniformGroupingBuffer adds a GROUP field holding the current sliceNum.
We set the number of reducers on this step to match the number of threads we want sending.
Then:
GroupBy(GROUP)
Every(Fields.ALL, new Sendingbuffer())
Again, we set the number of reducers to the number of threads we want sending.
The problem is that in the first buffer we see a nice distribution of tuples, but in the second buffer there is a massive skew (one reducer receives several times as many groups as the others).
Is this expected?
Any advice/experience on how to achieve this?
The data is still evenly distributed across the groups, just the groups are not distributed across the reducers.
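To make the symptom concrete, here is a rough sketch of how the GROUP values could be checked against the reducer count. It assumes the reducer is chosen as (hashCode & Integer.MAX_VALUE) % numReducers, which is what Hadoop's default HashPartitioner does; whether Cascading hashes the GROUP tuple the same way is an assumption on my part. PartitionCheck and groupsPerReducer are made-up names for illustration only, not part of our flow.

```java
import java.util.Arrays;
import java.util.List;

class PartitionCheck {
    // Counts how many distinct group keys land on each reducer, ASSUMING the
    // reducer index is (hashCode & Integer.MAX_VALUE) % numReducers.
    static int[] groupsPerReducer(List<?> groupKeys, int numReducers) {
        int[] counts = new int[numReducers];
        for (Object key : groupKeys) {
            int reducer = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            counts[reducer]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four groups, four reducers, keys held as Integers: one group each.
        System.out.println(Arrays.toString(
            groupsPerReducer(Arrays.asList(0, 1, 2, 3), 4)));         // [1, 1, 1, 1]
        // Same four values held as Doubles: every hash is a multiple of 4,
        // so all four groups end up on reducer 0.
        System.out.println(Arrays.toString(
            groupsPerReducer(Arrays.asList(0.0, 1.0, 2.0, 3.0), 4))); // [4, 0, 0, 0]
    }
}
```

Under that assumption, having as many groups as reducers does not by itself give one group per reducer; it depends on how the GROUP values hash. Plugging in the real GROUP values and types would show whether this accounts for the skew we see.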
Cheers!
Paul