Hey scalding pros,
I've got a strange Java heap space issue in my mapper. I have a fix that helps, but I'd like to understand better what is going on under the hood: why the fix helps, and whether there is an alternative solution (e.g. changing job parameters). This is the code in question:
pipe
.map { candidateSet => (candidateSet.key, candidateSet.candidates) }
.collect { case (Some(key), candidates) => (key, candidates) }
.group
//.forceToReducers - adding this line solves the problem
.toList // this does not cause the issue, the rows have unique keys
.mapValues {_.flatten}
After this group the pipe is joined with another pipe on the same key, so I keep it as an UnsortedGrouped[K, V].
The data has unique keys, so there are no map-side reductions, and the .toList call is actually redundant. My guess is that the mapper tries to do some map-side sorting / data optimization, and that is what causes the problem. The default amount of memory is sufficient for all job overheads (it works fine for lots of other jobs); just to be sure, I increased the heap size significantly and it did not help.
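To make my mental model concrete, here is a toy sketch in plain Scala of how I imagine a map-side combiner with a bounded per-key cache behaves (this is an assumption, not actual scalding/cascading internals; MapSideCacheSketch, combine and cacheSize are made-up names):

```scala
import scala.collection.mutable

// Toy sketch (assumed behaviour, not scalding internals): a map-side
// combiner keeps a per-key cache and merges values when it sees a key
// again. With unique keys nothing ever merges, so the cache churns
// through the whole input and the "combined" output is as big as the
// input -- all that buffering for zero reduction.
object MapSideCacheSketch {
  def combine[K, V](input: Iterator[(K, V)],
                    merge: (V, V) => V,
                    cacheSize: Int): Iterator[(K, V)] = {
    val cache = mutable.LinkedHashMap.empty[K, V]
    val flushed = mutable.Buffer.empty[(K, V)]
    for ((k, v) <- input) {
      cache.get(k) match {
        case Some(old) =>
          cache(k) = merge(old, v)          // duplicate key: merged in place
        case None =>
          if (cache.size >= cacheSize) {    // cache full: flush oldest entry
            val (ek, ev) = cache.head
            cache.remove(ek)
            flushed += ((ek, ev))
          }
          cache(k) = v
      }
    }
    flushed.iterator ++ cache.iterator
  }

  def main(args: Array[String]): Unit = {
    // Unique keys: every record is a cache miss, so the combiner emits
    // exactly as many records as it read -- no map-side reduction at all.
    val unique = (1 to 1000).iterator.map(i => (i, List(i)))
    val out = combine(unique, (a: List[Int], b: List[Int]) => a ++ b, 100).toList
    println(out.size) // 1000
  }
}
```

If .toList is implemented as a map-side aggregation of singleton lists, this would at least explain where the per-mapper memory goes.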
.forceToReducers solves the problem. It was a semi-educated guess: I expected the call to turn off some mapper logic that is redundant when the keys are unique, but I still don't understand why exactly it helps. It could be the way the input data is buffered and sorted in memory.
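One job-parameter alternative I'm wondering about (untested, and I may be wrong about the mechanism): if the map-side combining comes from Cascading's AggregateBy, its cache size should be tunable per job via the cascading.aggregateby.threshold property, e.g. in a scalding Job:

```scala
// Untested sketch: shrink the map-side AggregateBy cache instead of
// calling .forceToReducers. I'm not sure whether a tiny threshold
// effectively disables the cache or just bounds it -- corrections welcome.
override def config: Map[AnyRef, AnyRef] =
  super.config + ("cascading.aggregateby.threshold" -> "1")
```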
Any ideas?
Thanks,
Kostya