My idea was just to periodically push a dummy value through to keep Hadoop happy, e.g. using Either or Option, and then filter it out later. Something in this style:
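    x: DList[T]

    x.parallelDo { new DoFn[T, Option[V]] {
      override def process(...) {
        (1 to 100).foreach { i =>
          doWorkPart(i)
          emit(None)             // dummy value, only there to show progress
        }
        emit(Some(realResults))  // the one value we actually care about
      }
    }
    }.map_reduce_boundary        // placeholder: whatever forces a map/reduce boundary
     .flatMap(x => x)            // get rid of the Nones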
That way, as far as Hadoop is concerned, data is continually going through. If you can break up your work into parts like I did in my example, nice. Otherwise, with a lot of care, you can use a thread to emit a dummy value every 5 minutes or so.
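For the thread variant, here is a rough sketch of what I mean, in plain Scala with no Scoobi or Hadoop involved. Everything in it is made up for illustration: `HeartbeatSketch` and `withHeartbeat` are hypothetical names, and `emit` is just a function standing in for the real emitter. I am also assuming the emitter is not safe to call from two threads at once, which is why every emit goes through a lock.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeatSketch {
  // Hypothetical helper, not part of Scoobi: runs `work` while a background
  // thread calls emit(None) every `periodMillis`, then emits Some(result).
  def withHeartbeat[V](periodMillis: Long)(emit: Option[V] => Unit)(work: => V): Unit = {
    val lock = new Object
    // Serialise all emits (assumption: the emitter must not be used concurrently).
    def safeEmit(v: Option[V]): Unit = lock.synchronized { emit(v) }

    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = safeEmit(None) },
      periodMillis, periodMillis, TimeUnit.MILLISECONDS)
    try {
      val result = work
      // Stop the heartbeats and wait for any in-flight one before the real emit.
      scheduler.shutdown()
      scheduler.awaitTermination(1, TimeUnit.MINUTES)
      safeEmit(Some(result))
    } finally {
      // Harmless if already shut down; also stops the thread if `work` throws.
      scheduler.shutdownNow()
    }
  }
}
```

Inside a DoFn you would wrap the long-running work in something like this and keep the same flatMap afterwards to drop the Nones. The part that needs the care is the ordering at the end: the heartbeat thread has to be fully stopped before process returns, otherwise it may emit after the emitter is no longer valid.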
But a question for the core Scoobi developers: I assume the dummy data has to make it all the way to the end of the mapper or reducer (e.g. if you filtered it out immediately, it would have no effect). Is there a good way of ensuring this? Maybe DList.groupBarrier has this property?