52 views

Skip to first unread message

Jul 23, 2011, 11:15:39 AM7/23/11

to parallel-haskell

My aim is to have a parallel foldr function. At first, it seemed

rather straight forward to achieve and this is what I had in mind:

First break up the input list into partitions based on the number of

cores (numCapabilities). Then apply foldr to each partition, which

will result in a list of folded values for each partition. Then do a

foldr again on that list to obtain the final value.

listChunkSize = numCapabilities

chunk n [] = []

chunk n xs = ys : chunk n zs

where (ys,zs) = splitAt n xs

parfoldr f z [] = z

parfoldr f z xs = res

where

parts = chunk listChunkSize xs

partsRs = map (foldr f z) parts `using` parList rdeepseq

res = foldr f z partsRs

The above code does not work because obviously the definition of

foldr, (a -> b -> b) -> b -> [a] -> b, implies that the input list

type is (well, can be) different from the accumulator and result type.

For example,

1) foldr (+) 0 [1..10] => list type = accumulator type (Integer)

2) foldr (\i acc -> (i>5) && acc) True [1..10] => list type (Integer) !

= accumulator type (Bool)

So, looking at my code above, the map will generate a list of type `b`

which is then passed as argument to the second foldr. But the second

foldr accepts list of type `a`. So, that won't work.

An ugly solution would be to provide a different type signature for

the parfoldr, e.g.

parfoldr :: (NFData a) => (a -> a -> a) -> a -> [a] -> a

This will work but then it is not exactly equivalent to foldr. Example

1 above will do just fine, but not example 2.

So, question 1 is: how to define parfoldr to have same type signature

as foldr?

Comparing the 2 folds:

input = [1..1000000]

seqfold = foldr (+) 0

parfold = parfoldr (+) 0

I get the foll. times on a dual core machine:

(no -threaded flag)

$ ./test

seqfold: 4.99s

parfold: 25.16s

(-threaded flag on)

$ ./test

seqfold: 5.32s

parfold: 25.55s

$ ./test +RTS -N1

seqfold: 5.32s

parfold: 25.53s

$ ./test +RTS -N2

seqfold: 3.48s

parfold: 3.68s

$ ./test +RTS -N3

seqfold: 3.57s

parfold: 2.36s

$ ./test +RTS -N4

seqfold: 3.03s

parfold: 1.70s

Observations from these measurements:

- foldr seems to give lower runtime when num of cores is increased.

why is that?

- parfold gives better runtimes for N => 3.

Any suggestions and ideas for improvement is appreciated :)

Prabhat

rather straight forward to achieve and this is what I had in mind:

First break up the input list into partitions based on the number of

cores (numCapabilities). Then apply foldr to each partition, which

will result in a list of folded values for each partition. Then do a

foldr again on that list to obtain the final value.

listChunkSize = numCapabilities

chunk n [] = []

chunk n xs = ys : chunk n zs

where (ys,zs) = splitAt n xs

parfoldr f z [] = z

parfoldr f z xs = res

where

parts = chunk listChunkSize xs

partsRs = map (foldr f z) parts `using` parList rdeepseq

res = foldr f z partsRs

The above code does not work because obviously the definition of

foldr, (a -> b -> b) -> b -> [a] -> b, implies that the input list

type is (well, can be) different from the accumulator and result type.

For example,

1) foldr (+) 0 [1..10] => list type = accumulator type (Integer)

2) foldr (\i acc -> (i>5) && acc) True [1..10] => list type (Integer) !

= accumulator type (Bool)

So, looking at my code above, the map will generate a list of type `b`

which is then passed as argument to the second foldr. But the second

foldr accepts list of type `a`. So, that won't work.

An ugly solution would be to provide a different type signature for

the parfoldr, e.g.

parfoldr :: (NFData a) => (a -> a -> a) -> a -> [a] -> a

This will work but then it is not exactly equivalent to foldr. Example

1 above will do just fine, but not example 2.

So, question 1 is: how to define parfoldr to have same type signature

as foldr?

Comparing the 2 folds:

input = [1..1000000]

seqfold = foldr (+) 0

parfold = parfoldr (+) 0

I get the foll. times on a dual core machine:

(no -threaded flag)

$ ./test

seqfold: 4.99s

parfold: 25.16s

(-threaded flag on)

$ ./test

seqfold: 5.32s

parfold: 25.55s

$ ./test +RTS -N1

seqfold: 5.32s

parfold: 25.53s

$ ./test +RTS -N2

seqfold: 3.48s

parfold: 3.68s

$ ./test +RTS -N3

seqfold: 3.57s

parfold: 2.36s

$ ./test +RTS -N4

seqfold: 3.03s

parfold: 1.70s

Observations from these measurements:

- foldr seems to give lower runtime when num of cores is increased.

why is that?

- parfold gives better runtimes for N => 3.

Any suggestions and ideas for improvement is appreciated :)

Prabhat

Jul 23, 2011, 11:37:40 AM7/23/11

to Prabhat Totoo, parallel-haskell

My first reaction is that you should be defining parFold rather than parFoldr. This will naturally have a different type signature (the function f should be both commutative and associative). But I'll look on Monday when I don't just have my phone.

Best Wishes,

Kevin

---------

Kevin Hammond

Professor in Computer Science

University of St Andrews

E: k...@cs.st-andrews.ac.uk

T: +44-1334-473241

F: +44-1334-463278

Jul 23, 2011, 1:15:34 PM7/23/11

to parallel...@googlegroups.com

If I'm understanding your intent, you want to parallelize by re-associating, which is only correct for an associative operation, in which case you'll have the narrower type a -> a -> a. - Conal

Jul 23, 2011, 2:04:04 PM7/23/11

to parallel...@googlegroups.com, Prabhat Totoo

It doesn't really need to be commutative, unless you want to process the accumulators in an arbitrary order later on. It seems like it's generally not too difficult to process them in the correct order though.

Dan

Jul 23, 2011, 3:24:32 PM7/23/11

to parallel...@googlegroups.com, Prabhat Totoo

Processing "in the correct order" usually works against parallelism.. But, as you say, associativity may be enough.

Best Wishes,

Kevin

--------

Kevin Hammond, Professor of Computer Science, University of St Andrews

T: +44-1334 463241 F: +44-1334-463278 W: http://www.cs.st-andrews.ac.uk/~kh

In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).

The University of St Andrews is a charity registered in Scotland : No SC013532

Jul 24, 2011, 8:15:24 AM7/24/11

to parallel...@googlegroups.com, Prabhat Totoo

On Sun, Jul 24, 2011 at 4:24 AM, Kevin Hammond <k...@cs.st-andrews.ac.uk> wrote:

Processing "in the correct order" usually works against parallelism.. But, as you say, associativity may be enough.

I think the crucial point is indeed reassociating the calls to the operator in a balanced way. Or is there an implementation of a parallel fold that benefits from (additionally) making use of commutativity?

Sebastian

Jul 27, 2011, 7:53:04 PM7/27/11

to parallel...@googlegroups.com, Prabhat Totoo

I've seen commutativity exploited in CUDA code that use an *atomic* load-add-store (like C's "+=") operation and fast on-chip memory. Commutativity implies that these operations can happen in any order and still yield a correct result.

- Conal

- Conal

Jul 27, 2011, 8:19:37 PM7/27/11

to parallel...@googlegroups.com, Prabhat Totoo

>>> Processing "in the correct order" usually works against parallelism..

>>> But, as you say, associativity may be enough.

>>

>> I think the crucial point is indeed reassociating the calls to the

>> operator in a balanced way. Or is there an implementation of a parallel fold

>> that benefits from (additionally) making use of commutativity?

>>> But, as you say, associativity may be enough.

>>

>> I think the crucial point is indeed reassociating the calls to the

>> operator in a balanced way. Or is there an implementation of a parallel fold

>> that benefits from (additionally) making use of commutativity?

> I've seen commutativity exploited in CUDA code that use an *atomic*

> load-add-store (like C's "+=") operation and fast on-chip memory.

> Commutativity implies that these operations can happen in any order and

> still yield a correct result.

That's an interesting point. If side effects are involved, not only

the results matter but also the order in which they are *computed*

(rather than combined).

So, for a (hypothetical) monadic parallel fold, commutativity (of the

combining operation with respect to effects) would be important.

Sebastian

Jul 28, 2011, 5:08:40 AM7/28/11

to Conal Elliott, parallel...@googlegroups.com, Prabhat Totoo

Conal,

Yes but remember that this is pure code anyway. Execution order is not an issue.

Best Wishes,

Kevin

---------

Kevin Hammond

Professor in Computer Science

University of St Andrews

Jul 28, 2011, 5:07:16 AM7/28/11

to Sebastian Fischer, parallel...@googlegroups.com, Prabhat Totoo

Sebastian et al,

Generally it's better to avoid side effects in parallel computations. If the code is pure then we have complete freedom over execution order modulo dependencies.

Commutativity Is a nice property because it allows me to break some of those dependencies which may allow better task granularity.

I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter). You've motivated me to add a section :)

Best Wishes,

Kevin

---------

Kevin Hammond

Professor in Computer Science

University of St Andrews

Jul 28, 2011, 8:04:02 AM7/28/11

to k...@cs.st-andrews.ac.uk, parallel...@googlegroups.com

Hello,

On Thu, Jul 28, 2011 at 6:07 PM, Kevin Hammond <k...@cs.st-andrews.ac.uk> wrote:

> I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter).

Specializing Conal's post on composable parallel scanning [1] reveals

that parallel scans can be implemented in terms of a parallel

fold(Map):

parfold :: (a -> a -> a) -> a -> [a] -> a

-- definition omitted

parscanr :: (a -> a -> a) -> a -> [a] -> [a]

parscanr (<>) e l = z:zs

where

(z,zs) = parfold (\ (x,xs) (y,ys) -> (x<>y,map (<>y) xs++ys)) (e,[])

$ map (\x -> (x,[e])) l

parscanl :: (a -> a -> a) -> a -> [a] -> [a]

parscanl (<>) e l = zs++[z]

where

(z,zs) = parfold (\ (x,xs) (y,ys) -> (x<>y,xs++map (x<>) ys)) (e,[])

$ map (\x -> (x,[e])) l

However, these are not as efficient as their sequential counterparts

`scanr` and `scanl`. How are parallel scans implemented efficiently?

(Sorry if I'm not smart enough to use Google properly but it didn't

help me..)

Sebastian

[1]: http://conal.net/blog/posts/composable-parallel-scanning

Jul 28, 2011, 1:43:22 PM7/28/11

to Kevin Hammond, parallel...@googlegroups.com, Prabhat Totoo

I was also assuming pure functions and so wasn't talking about execution order, but rather accumulation/combination order. If two threads want to accumulate with something like an atomic "total += x", for a shared 'total' and different 'x', then the order of += doesn't matter only because + is commutative, even though + is pure. For a pure-but-noncommutative operation, one would have to change algorithms.

Regards, - Conal

Regards, - Conal

Jul 28, 2011, 2:46:51 PM7/28/11

to parallel...@googlegroups.com, k...@cs.st-andrews.ac.uk

Hi Sebastian,

I don't think one would choose lists for parallel folding, as they're expensive to split and to re-append. Balanced trees work well, especially bottom-up ones, as explored in some of my recent blog posts. Guy Blelloch's work-efficient parallel scan for arrays can be understood elegantly as a simple optimization of the parallel scan on bottom-up trees that falls out of composable parallel scanning, as described in http://conal.net/blog/posts/parallel-tree-scanning-by-composition . In contrast, the much more common choice of top-down trees leads to a work-inefficient algorithm (also discussed in that post).

- Conal

I don't think one would choose lists for parallel folding, as they're expensive to split and to re-append. Balanced trees work well, especially bottom-up ones, as explored in some of my recent blog posts. Guy Blelloch's work-efficient parallel scan for arrays can be understood elegantly as a simple optimization of the parallel scan on bottom-up trees that falls out of composable parallel scanning, as described in http://conal.net/blog/posts/parallel-tree-scanning-by-composition . In contrast, the much more common choice of top-down trees leads to a work-inefficient algorithm (also discussed in that post).

- Conal

Jul 28, 2011, 10:04:14 PM7/28/11

to parallel...@googlegroups.com

sorry, I used a mail address not registered with this group when sending this:

On Fri, Jul 29, 2011 at 9:55 AM, Sebastian Fischer <ma...@sebfisch.de> wrote:

> On Fri, Jul 29, 2011 at 2:43 AM, Conal Elliott <co...@conal.net> wrote:

>> I was also assuming pure functions and so wasn't talking about execution

>> order, but rather accumulation/combination order. If two threads want to

>> accumulate with something like an atomic "total += x", for a shared 'total'

>> and different 'x', then the order of += doesn't matter only because + is

>> commutative, even though + is pure. For a pure-but-noncommutative operation,

>> one would have to change algorithms.

>

> I see. My monadic fold example did not fit what you described because

> in your example the *computation* of intermediate results does not use

> side effects, only their *accumulation* does.

>

> Sebastian

>

Jul 29, 2011, 1:17:56 AM7/29/11

to parallel...@googlegroups.com

Hi Conal,

thank you for the pointers. I'll write what I learned in case others

are interested.

I think my previous implementations are exactly the top-down tree

scans of your post on parallel tree scanning by composition.

Work-efficient scans can be implemented in terms of a function that

merges list elements pairwise:

mergePairs :: (a -> a -> a) -> a -> [a] -> [a]

mergePairs _ _ [] = []

mergePairs _ _ [x] = [x]

mergePairs (<>) u (x:y:zs) = (x<>y) : mergePairs (<>) u zs

With a parallel implementation of this function one can implement both

parallel folds and scans in a bottom-up way:

parfold :: (a -> a -> a) -> a -> [a] -> a

parfold _ u [] = u

parfold (<>) u ms = case mergePairs (<>) u ms of

[m] -> m

ns -> parfold (<>) u ns

I think the following definition corresponds to Guy Blelloch's

algorithm (and your bottom-up prefix scan):

-- only correct for input of length 2^n

parscanl :: (a -> a -> a) -> a -> [a] -> (a,[a])

parscanl _ u [m] = (m,[u])

parscanl (<>) u ms = (m,concatMap (\(n,e) -> [n,n<>e]) (zip ns

(dropOdds ms)))

where

(m,ns) = parscanl (<>) u $ mergePairs (<>) u ms

dropOdds :: [a] -> [a]

dropOdds [] = []

dropOdds [x] = [x]

dropOdds (x:_:xs) = x : dropOdds xs

It may be even possible to implement `mergePairs` with a (parallel)

fold using a tupling transformation but it's probably more efficient

not to.

Sebastian

Jul 29, 2011, 4:34:04 AM7/29/11

to parallel...@googlegroups.com

Sebastian,

Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).

I've tried to show this in my book chapter. This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce

additional dependencies. Conal, yes, these classically are applied to array operators, but a flattened list works equally well as an input (building the output non-serially and

dealing with intermediates are usually the issues if lists are used throughout, but you can always build a better internal structure [tree, array, ...] and then convert to/from a list if you need

it externally, provided this doesn't happen too often!).

I'll make a note to look at Guy's definitions again. Note that most data parallel operations are strict/hyper-strict. This usually isn't a problem in real code that I've seen. but you

can't just naively replace all folds/scans by parallel versions in Haskell.

I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).

Essentially, this is a strategy I think - your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across

the results.

Jul 29, 2011, 7:23:37 AM7/29/11

to parallel...@googlegroups.com

Kevin,

> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).

> I've tried to show this in my book chapter. This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce

> additional dependencies.

I'm afraid I cannot follow you here.. What dependencies do you mean?

> I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).

Sorry, this parameter can be removed. I mechanically replaced a Monoid

type-class context with explicit parameters and realized too late that

the identity element is not used.

> Essentially, this is a strategy I think - your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across

> the results.

No, <> is just the associative operator used for scanning. For example

(eliding the unused argument)

mergePairs (+) [1,2,3,4] = [3,7]

Here is a derivation that shows how the scan proceeds. The parallelism

is in the first three steps when the list is incrementally reduced

using mergePairs:

parscanl (+) 0 [1,2,3,4]

parscanl (+) 0 [3,7]

parscanl (+) 0 [10]

(10,[0])

(10,concatMap (\(n,e) -> [n,n+e]) [(0,3)])

(10,concat [[0,0+3]])

(10,[0,3])

(10,concatMap (\(n,e) -> [n,n+e]) [(0,1),(3,3)])

(10,concat [[0,0+1][3,3+3]])

(10,[0,1,3,6])

Looks like magic, but Conal's post convinces me that it does what it

says it does..

Sebastian

Jul 29, 2011, 10:08:41 AM7/29/11

to parallel...@googlegroups.com

Kevin,

> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).

I translated your Figure 2 to Haskell. The result is a scan that does

not use an identity element:

parscanl :: (a -> a -> a) -> [a] -> [a]

parscanl (<>) = go 1

where

go n l = let (xs,ys) = splitAt n l

in if null ys then xs

else go (2*n) (xs ++ zipWith (<>) l ys)

I think SIMD instructions are like the parallel zipWith function that

would be necessary for this implementation.

The parscanl based on mergePairs has linear run time as is. The

version above needs to be implemented using a list type with more

efficient split and append functions.

Sebastian

Jul 29, 2011, 6:22:12 PM7/29/11

to parallel...@googlegroups.com

> The parscanl based on mergePairs has linear run time as is. The

> version above needs to be implemented using a list type with more

> efficient split and append functions.

> version above needs to be implemented using a list type with more

> efficient split and append functions.

I realized that this would not change it's complexity because the

version based on zipWith performs

(n-1) + (n-2) + (n-4) + (n-8) + ... + (n-n) (which is in O(nlog(n)))

applications of the associative operator.

So, both versions of parscanl don't benefit (regarding complexity)

from more efficient lists and only the one based on mergePairs is work

efficient.

Hope I got it right this time..

Sebastian

Aug 3, 2011, 5:21:04 AM8/3/11

to parallel...@googlegroups.com

Hi Sebastian,

On 29 Jul 2011, at 15:08, Sebastian Fischer wrote:

> Kevin,

>

>> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).

>

> I translated your Figure 2 to Haskell. The result is a scan that does

> not use an identity element:

>

> parscanl :: (a -> a -> a) -> [a] -> [a]

> parscanl (<>) = go 1

> where

> go n l = let (xs,ys) = splitAt n l

> in if null ys then xs

> else go (2*n) (xs ++ zipWith (<>) l ys)

>

> I think SIMD instructions are like the parallel zipWith function that

> would be necessary for this implementation.

As a model, what you have looks OK, but it will be basically sequential, I think:

your version has a number of sequential dependencies (e.g. the splitAt, the append, and the

embedded zipWith as an argument to the recursive). Eliminating dependencies like these is the key

to getting good parallel performance.

Basically, you need to turn your definition inside-out.

We have defined a parallel zipWith as a strategic function, but I haven't used it much. As you've noted, it's basically a limited scan pattern.

> The parscanl based on mergePairs has linear run time as is. The

> version above needs to be implemented using a list type with more

> efficient split and append functions

I don't think that's the real issue (unless you're worried about sequential performance)...

Aug 3, 2011, 8:33:21 PM8/3/11

to parallel...@googlegroups.com

Hi Kevin,

As a model, what you have looks OK, but it will be basically sequential,

Yes, both functions are only meant as a model. I basically wrote them to clarify for myself how the different algorithms work and do not propose to actually use such implementations based on normal Haskell lists.

Basically, you need to turn your definition inside-out.

Wouldn't it suffice to have a list type with constant time splitAt, append and parallel zipWith?

> The parscanl based on mergePairs has linear run time as is. TheI don't think that's the real issue (unless you're worried about sequential performance)...

> version above needs to be implemented using a list type with more

> efficient split and append functions

I did not run my functions, just used them to compare how often the combining operator is used (so probably my use of the term "run time" was misleading). One algorithm calls the combining function O(n) times, the other O(n*log(n)) times and I expect the linear algorithm to have better performance (after properly implementing both).

Best regards,

Sebastian

Aug 4, 2011, 4:13:05 AM8/4/11

to parallel...@googlegroups.com

Hi Sebastian,

On 4 Aug 2011, at 01:33, Sebastian Fischer wrote:

> Basically, you need to turn your definition inside-out.

>

> Wouldn't it suffice to have a list type with constant time splitAt, append and parallel zipWith?

That would help (these operations are usually used on in-place arrays), but you'd need to pull some tricks with the append (pre-construct the result

and then fill in the elements, for example). You'd also need to make the go function strict, and even then you couldn't set up the later phases in

parallel with the earlier ones. I think this is because your null test is probably wrong - you know the length of the input list, so you could use this to bound the

number of stages. I'm sure I can have constant time length :) You could then spark off the parzipwith for each stage in parallel with activating the next stage.

Ideally, I'd want a hardware implementation of this using SIMD vector ops, in which case you probably need to split the input into fixed size chunks (hardware limit on vector size),

then pipeline each of the stages internally, then sort out boundary conditions to rejoin the results (I think this can also be done using a scan, though it's a while since

I looked at the literature!).

> > The parscanl based on mergePairs has linear run time as is. The

> > version above needs to be implemented using a list type with more

> > efficient split and append functions

>

> I don't think that's the real issue (unless you're worried about sequential performance)...

>

> I did not run my functions, just used them to compare how often the combining operator is used (so probably my use of the term "run time" was misleading). One algorithm calls the combining function O(n) times, the other O(n*log(n)) times and I expect the linear algorithm to have better performance (after properly implementing both).

It can be dangerous to do this kind of complexity analysis for parallel algorithms, since you can introduce a sequential bottleneck. log time algorithms can fit parallel hardware better (in

fact parallel array access is arguably O(log n) anyway). In this case, isn't the parallel scan algorithm inherently O(n*log n) (log n stages over input of size n).

The log n is a small fixed constant that allows the use of multiple pipeline stages, and the cost of the actual operation is usually small.

Best Wishes,

Kevin

--------

Kevin Hammond, Professor of Computer Science, University of St Andrews

In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).

The University of St Andrews is a charity registered in Scotland : No SC013532

Best Wishes,

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu