in parallel programs it is a common pattern to accumulate a list of results in parallel using an associative operator. This can be seen as a simple form of the map-reduce pattern where each element of the list is mapped into a monoid before combining the results using `mconcat`. This pattern can be abstracted by the `foldMap` function of the `Foldable` class and we can give a simple parallel implementation using `par` and `pseq`:
~~~ import Data.Monoid import Control.Parallel
foldMap :: Monoid m => (a -> m) -> [a] -> m foldMap f [] = mempty foldMap f [x] = f x foldMap f xs = let (ys,zs) = splitAt (length xs `div` 2) xs ys' = foldMap f ys zs' = foldMap f zs in zs' `par` (ys' `pseq` (ys' `mappend` zs')) ~~~
How can this pattern be implemented in Haskell efficiently?
I plan to investigate the following options and am interested in previous work.
1. Data.Sequence (from containers)
As finger trees are already balanced they look like a good candidate for parallel consumption. However, the current implementation of the `Foldable` instance is sequential. I wonder what would be the overhead and how it could be reduced if it is is too large. I think, I would first go for an implementation outside of `Foldable` and later consider the overhead of overloading.
2. Data.Array.Repa (from repa)
provides a `map` and (sequential) `fold` functions. I think that the main advantage of repa is fusion of successive traversals which is not necessary for the pattern in question. Hence, I'm not sure whether it's a good candidate for the job. Also I don't know how to implement the pattern using repa. Can it be done using `slice` or should it be done by changing repa itself?
3. Data.Monoid.Reducer (from monoids)
`foldMapReduce` looks promising. I wonder whether it could be used for parallel reduction and how efficient it would be.
Which approach do you think is most promising? What other options are there?
Sebastian Fischer <federigo.pescatore@...> writes: > foldMap :: Monoid m => (a -> m) -> [a] -> m > foldMap f [] = mempty > foldMap f [x] = f x > foldMap f xs = > let (ys,zs) = splitAt (length xs `div` 2) xs > ys' = foldMap f ys > zs' = foldMap f zs > in zs' `par` (ys' `pseq` (ys' `mappend` zs')) > ~~~
> How can this pattern be implemented in Haskell efficiently?
0. use actual lists: does not really work: length and splitAt are expensive. (Note: it does work, sort of, in cases like mergesort, when you foldMap the "sort" part only. It seems the overhead is small compared to the work that "merge" does.)
> 1. Data.Sequence (from containers) > As finger trees are already balanced they look like a good candidate > for parallel consumption. However, the current implementation of the > `Foldable` instance is sequential.
Indeed. But you have Data.Sequence.splitAt (in log time), so you can take the above code. When I tried this, I found the problem is elsewhere: You generally want to avoid splitting and sparking (par) deep down the tree (once you exhausted the actual number of capabilities (= cores)). Then you want to use a sequential fold - and I found Data.Sequence.foldl (rather, the Foldable instance) quite expensive. You could instead use toList, but that's expensive as well (should it be?)
>> foldMap :: Monoid m => (a -> m) -> [a] -> m >> foldMap f [] = mempty >> foldMap f [x] = f x >> foldMap f xs = >> let (ys,zs) = splitAt (length xs `div` 2) xs >> ys' = foldMap f ys >> zs' = foldMap f zs >> in zs' `par` (ys' `pseq` (ys' `mappend` zs')) >> ~~~
>> How can this pattern be implemented in Haskell efficiently?
This looks like a classic divide & conquer pattern to me. What you need to do is to
a) only spark of computations when the granularity is high enough to make it worth while b) only spark of computations that are evaluated to normal forms. Otherwise, Haskell will only apply WHNF evaluation to the lists, which won't be very useful for parallelisation
I would use a version using the new strategies:
>> foldMap :: Monoid m => (a -> m) -> [a] -> m >> foldMap f [] = mempty >> foldMap f [x] = f x >> foldMap f xs = >> let (ys,zs) = splitAt (length xs `div` 2) xs >> ys' = foldMap f ys >> zs' = foldMap f zs
In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).
> a) only spark of computations when the granularity is high enough to make it worth while > b) only spark of computations that are evaluated to normal forms.
thanks! The version with unboxed vectors looks like a good baseline to benchmarks other solutions.
>> a) only spark of computations when the granularity is high enough to make it > worth while >> b) only spark of computations that are evaluated to normal forms.
> sure, that's what I do:
Not exactly. Your code sparks of computations as long as there are capabilities. You will use two sparks for a vector of length two.
The threshold approach can create more sparks than there are capabilities which is good if the amount of work for each spark fluctuates and unnecessary if all sparks require the same amount of work. It creates less sparks for small inputs but that may be negligible as small inputs can be processed fast anyway..