Dear all,
I have been coding in Scala professionally for over a year. In that time, one pattern
in particular keeps popping up, and I would like to know whether anybody else has run
into it and how they deal with it:
1. there is some very cheap rule for generating starting states
2. we have to apply some costly mapping to each state
3. there is a filtering or reduction step
This is trivial: it's just a variation of MapReduce with a generation step, e.g.
(1 until 100).map(_ * 2).reduce(_ + _)
and we can perform our "expensive" map in parallel
(1 until 100).par.map(_ * 2).reduce(_ + _)
However, throw in this spanner and it gets interesting:
4. there are far too many states to store them all
(1 until Int.MaxValue).map(_ * 2).reduce(_ + _)
--> Human sacrifice, dogs and cats living together... mass hysteria!
(and, most certainly, OutOfMemoryError)
Of course, the way to get around the OOM is to implement a custom
Iterator/TraversableOnce: it holds no elements itself, so the
garbage collector can clean up older elements as we go.
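For example (untested, and assuming the cheap generation rule can itself be expressed
as an Iterator), the toy pipeline becomes:
val states: Iterator[Int] = Iterator.range(1, Int.MaxValue)   // the cheap generation rule, element by element
states.map(_ * 2).reduce(_ + _)                               // no more OOM, but strictly sequential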
However, there is no simple way (that I know of) to wrap an Iterator as a
GenIterable/ParIterable, which would allow the map to occur in parallel.
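The obvious conversion doesn't help either; as far as I can tell, .par has to copy every
element into a parallel collection up front, so something like
Iterator.range(1, Int.MaxValue).toIterable.par.map(_ * 2).reduce(_ + _)   // materialises everything first
just lands us back at the OOM.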
Arguably, the correct solution is to use Akka: one actor to generate states and gather results
(ensuring that only X states are being processed at any one time), a heavily distributed actor
that does the expensive piece, and another that does the reduction
(with failure handling so that nothing gets lost, etc.).
But using Akka involves a lot of code and setup, which is only really justifiable if the project
is already making heavy use of it (and it is perhaps overkill in any case).
What I'd really like is a way to get the same benefits that .par gives me for strict
(fully materialised) Collection implementations, e.g.
val myIterator = new MyLazyForgetfulIterator
parallelIterable(myIterator).map(_ * 2).reduce(_ + _)
I can imagine how this might work: invocations of `map` would grab N states from
the underlying Iterator (N being the buffer size, which bounds the maximum memory
usage) and apply a Future-based map to each of them, putting the resulting Futures into a buffer.
When `next` is called, the head of the buffered Futures would be popped and a newly
Future-mapped state would be pushed onto the end. The `next` method would block until that
Future finishes (without the buffer, the map step would block anyway, but at least this way
it gets a bit of a head start) and then return the result. So long as `next` is called quickly
enough, there can be as many as N Futures on the go at once.
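To make that a little more concrete, here is a rough, completely untested sketch of what I
have in mind (the class name and the buffer size are made up, and it ignores failure handling
and most of the Iterator contract):
import scala.collection.immutable.Queue
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Wraps an Iterator so that up to bufferSize applications of the expensive
// function f are in flight as Futures at any one time.
class BufferedFutureIterator[A, B](underlying: Iterator[A], bufferSize: Int)(f: A => B)(implicit ec: ExecutionContext) extends Iterator[B] {

  private var buffer = Queue.empty[Future[B]]

  // Keep the buffer topped up to bufferSize in-flight Futures.
  private def refill(): Unit =
    while (buffer.size < bufferSize && underlying.hasNext)
      buffer = buffer.enqueue(Future(f(underlying.next())))

  refill()

  def hasNext: Boolean = buffer.nonEmpty

  def next(): B = {
    val (head, rest) = buffer.dequeue
    buffer = rest
    refill()                              // push a freshly Future-mapped state onto the end
    Await.result(head, Duration.Inf)      // block until the head Future finishes
  }
}

// e.g. the same toy pipeline again, with 64 as an arbitrary buffer size:
import ExecutionContext.Implicits.global
new BufferedFutureIterator(Iterator.range(1, Int.MaxValue), 64)(_ * 2).reduce(_ + _)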
But this sounds like a recipe for a bug-ridden piece of code that probably doesn't
handle all the other things that an Iterable is supposed to handle, so I've not been inclined
to write it out properly yet.
I'd really appreciate this group's thoughts (both on this suggested implementation and
on any alternatives I've not covered).
Best regards,
Sam