Hey scalding pros,
I've got a strange Java heap space issue in my mapper. I have a fix that helps, but I'd like to understand better what is going on under the hood: why the fix helps, and whether there is an alternative solution (e.g. changing job parameters). This is the code in question:
pipe
.map { candidateSet => (candidateSet.key, candidateSet.candidates) }
.collect { case (Some(key), candidates) => (key, candidates) }
.group
//.forceToReducers - adding this line solves the problem
.toList // this does not cause the issue, the rows have unique keys
.mapValues {_.flatten}
After this group the pipe is joined with another pipe on the same key, so I keep it as an UnsortedGrouped[K, V].
The data has unique keys, so there are no map-side reductions, and the .toList call is actually redundant. My guess is that the mapper tries to do some map-side sorting / data optimization, and that is what causes the problem. The default amount of memory is sufficient for all job overheads (it works fine for lots of other jobs); just to be sure, I increased the heap size significantly and it did not help.
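To make my mental model concrete, here is a toy sketch in plain Scala of how I imagine a map-side combiner with a bounded per-key cache behaves (this is an assumption, not actual scalding/cascading internals; MapSideCacheSketch, combine and cacheSize are made-up names):

```scala
import scala.collection.mutable

// Toy sketch (assumed behaviour, not scalding internals): a map-side
// combiner keeps a per-key cache and merges values when it sees a key
// again. With unique keys nothing ever merges, so the cache churns
// through the whole input and the "combined" output is as big as the
// input -- all that buffering for zero reduction.
object MapSideCacheSketch {
  def combine[K, V](input: Iterator[(K, V)],
                    merge: (V, V) => V,
                    cacheSize: Int): Iterator[(K, V)] = {
    val cache = mutable.LinkedHashMap.empty[K, V]
    val flushed = mutable.Buffer.empty[(K, V)]
    for ((k, v) <- input) {
      cache.get(k) match {
        case Some(old) =>
          cache(k) = merge(old, v)          // duplicate key: merged in place
        case None =>
          if (cache.size >= cacheSize) {    // cache full: flush oldest entry
            val (ek, ev) = cache.head
            cache.remove(ek)
            flushed += ((ek, ev))
          }
          cache(k) = v
      }
    }
    flushed.iterator ++ cache.iterator
  }

  def main(args: Array[String]): Unit = {
    // Unique keys: every record is a cache miss, so the combiner emits
    // exactly as many records as it read -- no map-side reduction at all.
    val unique = (1 to 1000).iterator.map(i => (i, List(i)))
    val out = combine(unique, (a: List[Int], b: List[Int]) => a ++ b, 100).toList
    println(out.size) // 1000
  }
}
```

If .toList is implemented as a map-side aggregation of singleton lists, this would at least explain where the per-mapper memory goes.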
.forceToReducers solves the problem. It was a semi-educated guess: I expected the call to turn off some mapper logic that is redundant when the keys are unique, but I still don't understand why exactly it helps. It could be the way the input data is buffered and sorted in memory.
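One job-parameter alternative I'm wondering about (untested, and I may be wrong about the mechanism): if the map-side combining comes from Cascading's AggregateBy, its cache size should be tunable per job via the cascading.aggregateby.threshold property, e.g. in a scalding Job:

```scala
// Untested sketch: shrink the map-side AggregateBy cache instead of
// calling .forceToReducers. I'm not sure whether a tiny threshold
// effectively disables the cache or just bounds it -- corrections welcome.
override def config: Map[AnyRef, AnyRef] =
  super.config + ("cascading.aggregateby.threshold" -> "1")
```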
Any ideas?
Thanks,
Kostya