# More generic parfoldr than this...

60 views

### Prabhat Totoo

Jul 23, 2011, 11:15:39 AM7/23/11
My aim is to have a parallel foldr function. At first, it seemed
rather straight forward to achieve and this is what I had in mind:

First break up the input list into partitions based on the number of
cores (numCapabilities). Then apply foldr to each partition, which
will result in a list of folded values for each partition. Then do a
foldr again on that list to obtain the final value.

listChunkSize = numCapabilities

chunk n [] = []
chunk n xs = ys : chunk n zs
where (ys,zs) = splitAt n xs

parfoldr f z [] = z
parfoldr f z xs = res
where
parts = chunk listChunkSize xs
partsRs = map (foldr f z) parts `using` parList rdeepseq
res = foldr f z partsRs

The above code does not work because obviously the definition of
foldr, (a -> b -> b) -> b -> [a] -> b, implies that the input list
type is (well, can be) different from the accumulator and result type.

For example,
1) foldr (+) 0 [1..10] => list type = accumulator type (Integer)
2) foldr (\i acc -> (i>5) && acc) True [1..10] => list type (Integer) !
= accumulator type (Bool)

So, looking at my code above, the map will generate a list of type `b`
which is then passed as argument to the second foldr. But the second
foldr accepts list of type `a`. So, that won't work.

An ugly solution would be to provide a different type signature for
the parfoldr, e.g.
parfoldr :: (NFData a) => (a -> a -> a) -> a -> [a] -> a

This will work but then it is not exactly equivalent to foldr. Example
1 above will do just fine, but not example 2.
So, question 1 is: how to define parfoldr to have same type signature
as foldr?

Comparing the 2 folds:
input = [1..1000000]
seqfold = foldr (+) 0
parfold = parfoldr (+) 0

I get the foll. times on a dual core machine:
\$ ./test
seqfold: 4.99s
parfold: 25.16s

\$ ./test
seqfold: 5.32s
parfold: 25.55s
\$ ./test +RTS -N1
seqfold: 5.32s
parfold: 25.53s
\$ ./test +RTS -N2
seqfold: 3.48s
parfold: 3.68s
\$ ./test +RTS -N3
seqfold: 3.57s
parfold: 2.36s
\$ ./test +RTS -N4
seqfold: 3.03s
parfold: 1.70s

Observations from these measurements:
- foldr seems to give lower runtime when num of cores is increased.
why is that?
- parfold gives better runtimes for N => 3.

Any suggestions and ideas for improvement is appreciated :)

Prabhat

### Kevin Hammond

Jul 23, 2011, 11:37:40 AM7/23/11
My first reaction is that you should be defining parFold rather than parFoldr. This will naturally have a different type signature (the function f should be both commutative and associative). But I'll look on Monday when I don't just have my phone.

Best Wishes,
Kevin

---------

Kevin Hammond
Professor in Computer Science
University of St Andrews

### Conal Elliott

Jul 23, 2011, 1:15:34 PM7/23/11
If I'm understanding your intent, you want to parallelize by re-associating, which is only correct for an associative operation, in which case you'll have the narrower type a -> a -> a.  - Conal

### Daniel Peebles

Jul 23, 2011, 2:04:04 PM7/23/11
It doesn't really need to be commutative, unless you want to process the accumulators in an arbitrary order later on. It seems like it's generally not too difficult to process them in the correct order though.

Dan

### Kevin Hammond

Jul 23, 2011, 3:24:32 PM7/23/11
Processing "in the correct order" usually works against parallelism.. But, as you say, associativity may be enough.

Best Wishes,
Kevin

--------

Kevin Hammond, Professor of Computer Science, University of St Andrews

In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).

The University of St Andrews is a charity registered in Scotland : No SC013532

### Sebastian Fischer

Jul 24, 2011, 8:15:24 AM7/24/11
On Sun, Jul 24, 2011 at 4:24 AM, Kevin Hammond wrote:
Processing "in the correct order" usually works against parallelism..  But, as you say, associativity may be enough.

I think the crucial point is indeed reassociating the calls to the operator in a balanced way. Or is there an implementation of a parallel fold that benefits from (additionally) making use of commutativity?

Sebastian

### Conal Elliott

Jul 27, 2011, 7:53:04 PM7/27/11
I've seen commutativity exploited in CUDA code that use an *atomic* load-add-store (like C's "+=") operation and fast on-chip memory. Commutativity implies that these operations can happen in any order and still yield a correct result.

- Conal

### Sebastian Fischer

Jul 27, 2011, 8:19:37 PM7/27/11
>>> Processing "in the correct order" usually works against parallelism..
>>>  But, as you say, associativity may be enough.
>>
>> I think the crucial point is indeed reassociating the calls to the
>> operator in a balanced way. Or is there an implementation of a parallel fold
>> that benefits from (additionally) making use of commutativity?

> I've seen commutativity exploited in CUDA code that use an *atomic*

> Commutativity implies that these operations can happen in any order and
> still yield a correct result.

That's an interesting point. If side effects are involved, not only
the results matter but also the order in which they are *computed*
(rather than combined).

So, for a (hypothetical) monadic parallel fold, commutativity (of the
combining operation with respect to effects) would be important.

Sebastian

### Kevin Hammond

Jul 28, 2011, 5:08:40 AM7/28/11
to Conal Elliott, parallel...@googlegroups.com, Prabhat Totoo
Conal,

Yes but remember that this is pure code anyway. Execution order is not an issue.

Best Wishes,
Kevin

---------

Kevin Hammond
Professor in Computer Science
University of St Andrews

### Kevin Hammond

Jul 28, 2011, 5:07:16 AM7/28/11
to Sebastian Fischer, parallel...@googlegroups.com, Prabhat Totoo
Sebastian et al,

Generally it's better to avoid side effects in parallel computations. If the code is pure then we have complete freedom over execution order modulo dependencies.

Commutativity Is a nice property because it allows me to break some of those dependencies which may allow better task granularity.

I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter). You've motivated me to add a section :)

Best Wishes,
Kevin

---------

Kevin Hammond
Professor in Computer Science
University of St Andrews

### Sebastian Fischer

Jul 28, 2011, 8:04:02 AM7/28/11
Hello,

On Thu, Jul 28, 2011 at 6:07 PM, Kevin Hammond <k...@cs.st-andrews.ac.uk> wrote:
> I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter).

Specializing Conal's post on composable parallel scanning [1] reveals
that parallel scans can be implemented in terms of a parallel
fold(Map):

parfold :: (a -> a -> a) -> a -> [a] -> a
-- definition omitted

parscanr :: (a -> a -> a) -> a -> [a] -> [a]
parscanr (<>) e l = z:zs
where
(z,zs) = parfold (\ (x,xs) (y,ys) -> (x<>y,map (<>y) xs++ys)) (e,[])
\$ map (\x -> (x,[e])) l

parscanl :: (a -> a -> a) -> a -> [a] -> [a]
parscanl (<>) e l = zs++[z]
where
(z,zs) = parfold (\ (x,xs) (y,ys) -> (x<>y,xs++map (x<>) ys)) (e,[])
\$ map (\x -> (x,[e])) l

However, these are not as efficient as their sequential counterparts
`scanr` and `scanl`. How are parallel scans implemented efficiently?
(Sorry if I'm not smart enough to use Google properly but it didn't
help me..)

Sebastian

### Conal Elliott

Jul 28, 2011, 1:43:22 PM7/28/11
to Kevin Hammond, parallel...@googlegroups.com, Prabhat Totoo
I was also assuming pure functions and so wasn't talking about execution order, but rather accumulation/combination order. If two threads want to accumulate with something like an atomic "total += x", for a shared 'total' and different 'x', then the order of += doesn't matter only because + is commutative, even though + is pure. For a pure-but-noncommutative operation, one would have to change algorithms.

Regards,  - Conal

### Conal Elliott

Jul 28, 2011, 2:46:51 PM7/28/11
Hi Sebastian,

I don't think one would choose lists for parallel folding, as they're expensive to split and to re-append. Balanced trees work well, especially bottom-up ones, as explored in some of my recent blog posts. Guy Blelloch's work-efficient parallel scan for arrays can be understood elegantly as a simple optimization of the parallel scan on bottom-up trees that falls out of composable parallel scanning, as described in http://conal.net/blog/posts/parallel-tree-scanning-by-composition . In contrast, the much more common choice of top-down trees leads to a work-inefficient algorithm (also discussed in that post).

- Conal

### Sebastian Fischer

Jul 28, 2011, 10:04:14 PM7/28/11
sorry, I used a mail address not registered with this group when sending this:

On Fri, Jul 29, 2011 at 9:55 AM, Sebastian Fischer <ma...@sebfisch.de> wrote:

> On Fri, Jul 29, 2011 at 2:43 AM, Conal Elliott <co...@conal.net> wrote:
>> I was also assuming pure functions and so wasn't talking about execution
>> order, but rather accumulation/combination order. If two threads want to
>> accumulate with something like an atomic "total += x", for a shared 'total'
>> and different 'x', then the order of += doesn't matter only because + is
>> commutative, even though + is pure. For a pure-but-noncommutative operation,
>> one would have to change algorithms.
>

> I see. My monadic fold example did not fit what you described because
> in your example the *computation* of intermediate results does not use
> side effects, only their *accumulation* does.
>
> Sebastian
>

### Sebastian Fischer

Jul 29, 2011, 1:17:56 AM7/29/11
Hi Conal,

thank you for the pointers. I'll write what I learned in case others
are interested.

I think my previous implementations are exactly the top-down tree
scans of your post on parallel tree scanning by composition.

Work-efficient scans can be implemented in terms of a function that
merges list elements pairwise:

mergePairs :: (a -> a -> a) -> a -> [a] -> [a]
mergePairs _ _ [] = []
mergePairs _ _ [x] = [x]
mergePairs (<>) u (x:y:zs) = (x<>y) : mergePairs (<>) u zs

With a parallel implementation of this function one can implement both
parallel folds and scans in a bottom-up way:

parfold :: (a -> a -> a) -> a -> [a] -> a

parfold _ u [] = u
parfold (<>) u ms = case mergePairs (<>) u ms of
[m] -> m
ns -> parfold (<>) u ns

I think the following definition corresponds to Guy Blelloch's
algorithm (and your bottom-up prefix scan):

-- only correct for input of length 2^n
parscanl :: (a -> a -> a) -> a -> [a] -> (a,[a])
parscanl _ u [m] = (m,[u])
parscanl (<>) u ms = (m,concatMap (\(n,e) -> [n,n<>e]) (zip ns
(dropOdds ms)))
where
(m,ns) = parscanl (<>) u \$ mergePairs (<>) u ms

dropOdds :: [a] -> [a]
dropOdds [] = []
dropOdds [x] = [x]
dropOdds (x:_:xs) = x : dropOdds xs

It may be even possible to implement `mergePairs` with a (parallel)
fold using a tupling transformation but it's probably more efficient
not to.

Sebastian

### Kevin Hammond

Jul 29, 2011, 4:34:04 AM7/29/11
Sebastian,

Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
I've tried to show this in my book chapter. This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce
additional dependencies. Conal, yes, these classically are applied to array operators, but a flattened list works equally well as an input (building the output non-serially and
dealing with intermediates are usually the issues if lists are used throughout, but you can always build a better internal structure [tree, array, ...] and then convert to/from a list if you need
it externally, provided this doesn't happen too often!).

I'll make a note to look at Guy's definitions again. Note that most data parallel operations are strict/hyper-strict. This usually isn't a problem in real code that I've seen. but you
can't just naively replace all folds/scans by parallel versions in Haskell.

I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).
Essentially, this is a strategy I think - your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across
the results.

gph-book.pdf

### Sebastian Fischer

Jul 29, 2011, 7:23:37 AM7/29/11
Kevin,

> Parallel scans work well with SIMD implementations.  Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
> I've tried to show this in my book chapter.  This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce

I'm afraid I cannot follow you here.. What dependencies do you mean?

> I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).

Sorry, this parameter can be removed. I mechanically replaced a Monoid
type-class context with explicit parameters and realized too late that
the identity element is not used.

> Essentially, this is a strategy I think - your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across
> the results.

No, <> is just the associative operator used for scanning. For example
(eliding the unused argument)

mergePairs (+) [1,2,3,4] = [3,7]

Here is a derivation that shows how the scan proceeds. The parallelism
is in the first three steps when the list is incrementally reduced
using mergePairs:

parscanl (+) 0 [1,2,3,4]
parscanl (+) 0 [3,7]
parscanl (+) 0 [10]
(10,[0])
(10,concatMap (\(n,e) -> [n,n+e]) [(0,3)])
(10,concat [[0,0+3]])
(10,[0,3])
(10,concatMap (\(n,e) -> [n,n+e]) [(0,1),(3,3)])
(10,concat [[0,0+1][3,3+3]])
(10,[0,1,3,6])

Looks like magic, but Conal's post convinces me that it does what it
says it does..

Sebastian

### Sebastian Fischer

Jul 29, 2011, 10:08:41 AM7/29/11
Kevin,

> Parallel scans work well with SIMD implementations.  Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).

I translated your Figure 2 to Haskell. The result is a scan that does
not use an identity element:

parscanl :: (a -> a -> a) -> [a] -> [a]

parscanl (<>) = go 1
where
go n l = let (xs,ys) = splitAt n l
in if null ys then xs
else go (2*n) (xs ++ zipWith (<>) l ys)

I think SIMD instructions are like the parallel zipWith function that
would be necessary for this implementation.

The parscanl based on mergePairs has linear run time as is. The
version above needs to be implemented using a list type with more
efficient split and append functions.

Sebastian

### Sebastian Fischer

Jul 29, 2011, 6:22:12 PM7/29/11
> The parscanl based on mergePairs has linear run time as is. The
> version above needs to be implemented using a list type with more
> efficient split and append functions.

I realized that this would not change it's complexity because the
version based on zipWith performs

(n-1) + (n-2) + (n-4) + (n-8) + ... + (n-n) (which is in O(nlog(n)))

applications of the associative operator.

So, both versions of parscanl don't benefit (regarding complexity)
from more efficient lists and only the one based on mergePairs is work
efficient.

Hope I got it right this time..

Sebastian

### Kevin Hammond

Aug 3, 2011, 5:21:04 AM8/3/11
Hi Sebastian,

On 29 Jul 2011, at 15:08, Sebastian Fischer wrote:

> Kevin,
>
>> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
>
> I translated your Figure 2 to Haskell. The result is a scan that does
> not use an identity element:
>
> parscanl :: (a -> a -> a) -> [a] -> [a]
> parscanl (<>) = go 1
> where
> go n l = let (xs,ys) = splitAt n l
> in if null ys then xs
> else go (2*n) (xs ++ zipWith (<>) l ys)
>
> I think SIMD instructions are like the parallel zipWith function that
> would be necessary for this implementation.

As a model, what you have looks OK, but it will be basically sequential, I think:
your version has a number of sequential dependencies (e.g. the splitAt, the append, and the
embedded zipWith as an argument to the recursive). Eliminating dependencies like these is the key
to getting good parallel performance.

Basically, you need to turn your definition inside-out.

We have defined a parallel zipWith as a strategic function, but I haven't used it much. As you've noted, it's basically a limited scan pattern.

> The parscanl based on mergePairs has linear run time as is. The
> version above needs to be implemented using a list type with more
> efficient split and append functions

I don't think that's the real issue (unless you're worried about sequential performance)...

### Sebastian Fischer

Aug 3, 2011, 8:33:21 PM8/3/11
Hi Kevin,

As a model, what you have looks OK, but it will be basically sequential,

Yes, both functions are only meant as a model. I basically wrote them to clarify for myself how the different algorithms work and do not propose to actually use such implementations based on normal Haskell lists.

Basically, you need to turn your definition inside-out.

Wouldn't it suffice to have a list type with constant time splitAt, append and parallel zipWith?

> The parscanl based on mergePairs has linear run time as is. The
> version above needs to be implemented using a list type with more
> efficient split and append functions

I don't think that's the real issue (unless you're worried about sequential performance)...

I did not run my functions, just used them to compare how often the combining operator is used (so probably my use of the term "run time" was misleading). One algorithm calls the combining function O(n) times, the other O(n*log(n)) times and I expect the linear algorithm to have better performance (after properly implementing both).

Best regards,
Sebastian

### Kevin Hammond

Aug 4, 2011, 4:13:05 AM8/4/11
Hi Sebastian,

On 4 Aug 2011, at 01:33, Sebastian Fischer wrote:

> Basically, you need to turn your definition inside-out.
>
> Wouldn't it suffice to have a list type with constant time splitAt, append and parallel zipWith?

That would help (these operations are usually used on in-place arrays), but you'd need to pull some tricks with the append (pre-construct the result
and then fill in the elements, for example). You'd also need to make the go function strict, and even then you couldn't set up the later phases in
parallel with the earlier ones. I think this is because your null test is probably wrong - you know the length of the input list, so you could use this to bound the
number of stages. I'm sure I can have constant time length :) You could then spark off the parzipwith for each stage in parallel with activating the next stage.

Ideally, I'd want a hardware implementation of this using SIMD vector ops, in which case you probably need to split the input into fixed size chunks (hardware limit on vector size),
then pipeline each of the stages internally, then sort out boundary conditions to rejoin the results (I think this can also be done using a scan, though it's a while since
I looked at the literature!).

> > The parscanl based on mergePairs has linear run time as is. The
> > version above needs to be implemented using a list type with more

> > efficient split and append functions
>
> I don't think that's the real issue (unless you're worried about sequential performance)...
>
> I did not run my functions, just used them to compare how often the combining operator is used (so probably my use of the term "run time" was misleading). One algorithm calls the combining function O(n) times, the other O(n*log(n)) times and I expect the linear algorithm to have better performance (after properly implementing both).

It can be dangerous to do this kind of complexity analysis for parallel algorithms, since you can introduce a sequential bottleneck. log time algorithms can fit parallel hardware better (in
fact parallel array access is arguably O(log n) anyway). In this case, isn't the parallel scan algorithm inherently O(n*log n) (log n stages over input of size n).
The log n is a small fixed constant that allows the use of multiple pipeline stages, and the cost of the actual operation is usually small.

Best Wishes,
Kevin

--------

Kevin Hammond, Professor of Computer Science, University of St Andrews

In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).

The University of St Andrews is a charity registered in Scotland : No SC013532

Best Wishes,