Best Wishes,
Kevin
---------
Kevin Hammond
Professor in Computer Science
University of St Andrews
E: k...@cs.st-andrews.ac.uk
T: +44-1334-473241
F: +44-1334-463278
In accordance with University policy on electronic mail, this email reflects the opinions of the individual concerned, may contain confidential or copyright information that should not be copied or distributed without permission, may be of a private or personal nature unless explicitly indicated otherwise, and should not under any circumstances be taken as an official statement of University policy or procedure (see http://www.st-and.ac.uk).
The University of St Andrews is a charity registered in Scotland : No SC013532
Processing "in the correct order" usually works against parallelism. But, as you say, associativity may be enough.
> I've seen commutativity exploited in CUDA code that use an *atomic*
> load-add-store (like C's "+=") operation and fast on-chip memory.
> Commutativity implies that these operations can happen in any order and
> still yield a correct result.
That's an interesting point. If side effects are involved, not only
the results matter but also the order in which they are *computed*
(rather than combined).
So, for a (hypothetical) monadic parallel fold, commutativity (of the
combining operation with respect to effects) would be important.
Sebastian
Generally it's better to avoid side effects in parallel computations. If the code is pure then we have complete freedom over execution order modulo dependencies.
Commutativity is a nice property because it allows me to break some of those dependencies, which may allow better task granularity.
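That algebraic point can be checked in a few lines of pure Haskell (nothing here actually runs in parallel; the chunk values are made up for illustration):

```haskell
import Data.List (foldl', permutations)

-- True when all elements of a list are equal.
allEqual :: Eq a => [a] -> Bool
allEqual xs = and (zipWith (==) xs (drop 1 xs))

-- Hypothetical per-chunk results, as they might arrive from worker tasks.
chunkResults :: [Int]
chunkResults = [10, 20, 30]

-- (+) is commutative, so combining in any arrival order gives one total,
-- which is what lets a scheduler break the combination-order dependencies.
orderInsensitive :: Bool
orderInsensitive = allEqual [foldl' (+) 0 p | p <- permutations chunkResults]

-- (++) is associative but not commutative: combining out of order changes
-- the result, so those dependencies cannot be broken.
orderSensitive :: Bool
orderSensitive = not (allEqual [foldl' (++) [] p | p <- permutations [[1],[2 :: Int]]])
```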
I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter). You've motivated me to add a section :)
Best Wishes,
Kevin
On Thu, Jul 28, 2011 at 6:07 PM, Kevin Hammond <k...@cs.st-andrews.ac.uk> wrote:
> I checked my book chapter and I'd omitted parallel folds because I hadn't often seen them used (scans are a different matter).
Specializing Conal's post on composable parallel scanning [1] reveals
that parallel scans can be implemented in terms of a parallel
fold(Map):
parfold :: (a -> a -> a) -> a -> [a] -> a
-- definition omitted

parscanr :: (a -> a -> a) -> a -> [a] -> [a]
parscanr (<>) e l = z:zs
  where
    (z,zs) = parfold (\(x,xs) (y,ys) -> (x<>y, map (<>y) xs ++ ys)) (e,[])
           $ map (\x -> (x,[e])) l

parscanl :: (a -> a -> a) -> a -> [a] -> [a]
parscanl (<>) e l = zs ++ [z]
  where
    (z,zs) = parfold (\(x,xs) (y,ys) -> (x<>y, xs ++ map (x<>) ys)) (e,[])
           $ map (\x -> (x,[e])) l
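Since the definition of parfold is omitted, a plain sequential foldr serves as a stand-in for checking (a real implementation would split the list and combine the halves in parallel; associativity makes the two agree). With that stand-in, both functions match the Prelude scans:

```haskell
-- Sequential stand-in for parfold, for checking only; a parallel version
-- would combine sublists in parallel and relies on associativity.
parfold :: (a -> a -> a) -> a -> [a] -> a
parfold f e = foldr f e

parscanr :: (a -> a -> a) -> a -> [a] -> [a]
parscanr (<>) e l = z:zs
  where
    (z,zs) = parfold (\(x,xs) (y,ys) -> (x<>y, map (<>y) xs ++ ys)) (e,[])
           $ map (\x -> (x,[e])) l

parscanl :: (a -> a -> a) -> a -> [a] -> [a]
parscanl (<>) e l = zs ++ [z]
  where
    (z,zs) = parfold (\(x,xs) (y,ys) -> (x<>y, xs ++ map (x<>) ys)) (e,[])
           $ map (\x -> (x,[e])) l

-- parscanr (+) 0 [1,2,3] == scanr (+) 0 [1,2,3] == [6,5,3,0]
-- parscanl (+) 0 [1,2,3] == scanl (+) 0 [1,2,3] == [0,1,3,6]
```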
However, these are not as efficient as their sequential counterparts
`scanr` and `scanl`. How are parallel scans implemented efficiently?
(Sorry if I'm not smart enough to use Google properly, but it didn't
help me...)
Sebastian
[1]: http://conal.net/blog/posts/composable-parallel-scanning
On Fri, Jul 29, 2011 at 9:55 AM, Sebastian Fischer <ma...@sebfisch.de> wrote:
> On Fri, Jul 29, 2011 at 2:43 AM, Conal Elliott <co...@conal.net> wrote:
>> I was also assuming pure functions and so wasn't talking about execution
>> order, but rather accumulation/combination order. If two threads want to
>> accumulate with something like an atomic "total += x", for a shared 'total'
>> and different 'x', then the order of += doesn't matter only because + is
>> commutative, even though + is pure. For a pure-but-noncommutative operation,
>> one would have to change algorithms.
>
> I see. My monadic fold example did not fit what you described because
> in your example the *computation* of intermediate results does not use
> side effects, only their *accumulation* does.
>
> Sebastian
>
Thank you for the pointers. I'll write what I learned in case others
are interested.
I think my previous implementations are exactly the top-down tree
scans of your post on parallel tree scanning by composition.
Work-efficient scans can be implemented in terms of a function that
merges list elements pairwise:
mergePairs :: (a -> a -> a) -> a -> [a] -> [a]
mergePairs _    _ []       = []
mergePairs _    _ [x]      = [x]
mergePairs (<>) u (x:y:zs) = (x<>y) : mergePairs (<>) u zs
With a parallel implementation of this function one can implement both
parallel folds and scans in a bottom-up way:
parfold :: (a -> a -> a) -> a -> [a] -> a
parfold _    u [] = u
parfold (<>) u ms = case mergePairs (<>) u ms of
                      [m] -> m
                      ns  -> parfold (<>) u ns
I think the following definition corresponds to Guy Blelloch's
algorithm (and your bottom-up prefix scan):
-- only correct for input of length 2^n
parscanl :: (a -> a -> a) -> a -> [a] -> (a,[a])
parscanl _    u [m] = (m,[u])
parscanl (<>) u ms  =
    (m, concatMap (\(n,e) -> [n, n<>e]) (zip ns (dropOdds ms)))
  where
    (m,ns) = parscanl (<>) u $ mergePairs (<>) u ms

dropOdds :: [a] -> [a]
dropOdds []       = []
dropOdds [x]      = [x]
dropOdds (x:_:xs) = x : dropOdds xs
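Assembled into one self-contained sketch (run sequentially here, just to check that the bottom-up scheme computes the right values on a power-of-two input):

```haskell
mergePairs :: (a -> a -> a) -> a -> [a] -> [a]
mergePairs _    _ []       = []
mergePairs _    _ [x]      = [x]
mergePairs (<>) u (x:y:zs) = (x<>y) : mergePairs (<>) u zs

parfold :: (a -> a -> a) -> a -> [a] -> a
parfold _    u [] = u
parfold (<>) u ms = case mergePairs (<>) u ms of
                      [m] -> m
                      ns  -> parfold (<>) u ns

dropOdds :: [a] -> [a]
dropOdds []       = []
dropOdds [x]      = [x]
dropOdds (x:_:xs) = x : dropOdds xs

-- only correct for input of length 2^n
parscanl :: (a -> a -> a) -> a -> [a] -> (a,[a])
parscanl _    u [m] = (m,[u])
parscanl (<>) u ms  =
    (m, concatMap (\(n,e) -> [n, n<>e]) (zip ns (dropOdds ms)))
  where
    (m,ns) = parscanl (<>) u $ mergePairs (<>) u ms

-- mergePairs (+) 0 [1,2,3,4] == [3,7]
-- parfold    (+) 0 [1..8]    == 36
-- parscanl   (+) 0 [1,2,3,4] == (10,[0,1,3,6])
```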
It may even be possible to implement mergePairs with a (parallel)
fold using a tupling transformation, but it's probably more efficient
not to.
Sebastian
Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
I've tried to show this in my book chapter. This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce
additional dependencies. Conal, yes, these are classically applied to array operators, but a flattened list works equally well as an input (building the output non-serially and
dealing with intermediates are usually the issues if lists are used throughout, but you can always build a better internal structure [tree, array, ...] and then convert to/from a list if you need
it externally, provided this doesn't happen too often!).
I'll make a note to look at Guy's definitions again. Note that most data-parallel operations are strict/hyper-strict. This usually isn't a problem in real code that I've seen, but you
can't just naively replace all folds/scans by parallel versions in Haskell.
I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).
Essentially, this is a strategy, I think: your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across
the results.
> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
> I've tried to show this in my book chapter. This may not fit well with the usual definition of a scan, though (e.g. mergePairs below), since you probably introduce
> additional dependencies.
I'm afraid I cannot follow you here... What dependencies do you mean?
> I didn't quite understand the mergePairs definition: the u parameter is unused (should be the merging operator and applied to the results instead of cons?).
Sorry, this parameter can be removed. I mechanically replaced a Monoid
type-class context with explicit parameters and realized too late that
the identity element is not used.
> Essentially, this is a strategy I think - your <> operator is an evaluation strategy (here parallel composition) that you're applying pairwise across
> the results.
No, <> is just the associative operator used for scanning. For example
(eliding the unused argument)
mergePairs (+) [1,2,3,4] = [3,7]
Here is a derivation that shows how the scan proceeds. The parallelism
is in the first three steps when the list is incrementally reduced
using mergePairs:
parscanl (+) 0 [1,2,3,4]
parscanl (+) 0 [3,7]
parscanl (+) 0 [10]
(10,[0])
(10,concatMap (\(n,e) -> [n,n+e]) [(0,3)])
(10,concat [[0,0+3]])
(10,[0,3])
(10,concatMap (\(n,e) -> [n,n+e]) [(0,1),(3,3)])
(10,concat [[0,0+1],[3,3+3]])
(10,[0,1,3,6])
Looks like magic, but Conal's post convinces me that it does what it
says it does.
Sebastian
> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
I translated your Figure 2 to Haskell. The result is a scan that does
not use an identity element:
parscanl :: (a -> a -> a) -> [a] -> [a]
parscanl (<>) = go 1
  where
    go n l = let (xs,ys) = splitAt n l
             in if null ys then xs
                else go (2*n) (xs ++ zipWith (<>) l ys)
I think SIMD instructions are like the parallel zipWith function that
would be necessary for this implementation.
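As a quick sequential sanity check, this version behaves like an inclusive scan (the Prelude's scanl1) and, unlike the mergePairs version, is not restricted to power-of-two lengths:

```haskell
parscanl :: (a -> a -> a) -> [a] -> [a]
parscanl (<>) = go 1
  where
    -- At each stage, every element past offset n absorbs the element
    -- n positions to its left; the offset doubles until it covers l.
    go n l = let (xs,ys) = splitAt n l
             in if null ys then xs
                else go (2*n) (xs ++ zipWith (<>) l ys)

-- parscanl (+) [1,2,3,4] == scanl1 (+) [1,2,3,4] == [1,3,6,10]
-- parscanl (+) [1,2,3]   == [1,3,6]
```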
The parscanl based on mergePairs has linear run time as is. The
version above needs to be implemented using a list type with more
efficient split and append functions.
Sebastian
I realized that this would not change its complexity, because the
version based on zipWith performs

  (n-1) + (n-2) + (n-4) + (n-8) + ... + (n-n)

applications of the associative operator, which is in O(n log n).
So, regarding complexity, neither version of parscanl benefits from
more efficient lists, and only the one based on mergePairs is
work-efficient.
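The operator counts can be replayed directly from the stage structure of each algorithm (a counting sketch only; it says nothing about memory traffic or scheduling):

```haskell
-- zipWith-based scan: the stage with offset n applies the operator
-- (len - n) times, and the offset doubles until it reaches len.
zipWithOps :: Int -> Int
zipWithOps len = go 1
  where
    go n | n >= len  = 0
         | otherwise = (len - n) + go (2*n)

-- mergePairs-based scan (n a power of two): n/2 + n/4 + ... + 1 = n-1
-- applications reducing on the way down, and the same number again
-- rebuilding the prefixes on the way up, so 2*(n-1) in total.
mergePairsOps :: Int -> Int
mergePairsOps n = 2 * (n - 1)

-- zipWithOps    8 == (8-1) + (8-2) + (8-4) == 17
-- mergePairsOps 8 == 14
```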
Hope I got it right this time...
Sebastian
On 29 Jul 2011, at 15:08, Sebastian Fischer wrote:
> Kevin,
>
>> Parallel scans work well with SIMD implementations. Basically, each level of the scan can be run in parallel, and a full scan proceeds over a few iterations (accumulating down and up).
>
> I translated your Figure 2 to Haskell. The result is a scan that does
> not use an identity element:
>
> parscanl :: (a -> a -> a) -> [a] -> [a]
> parscanl (<>) = go 1
>   where
>     go n l = let (xs,ys) = splitAt n l
>              in if null ys then xs
>                 else go (2*n) (xs ++ zipWith (<>) l ys)
>
> I think SIMD instructions are like the parallel zipWith function that
> would be necessary for this implementation.
As a model, what you have looks OK, but it will be basically sequential, I think:
your version has a number of sequential dependencies (e.g. the splitAt, the append, and the
embedded zipWith as an argument to the recursive call). Eliminating dependencies like these is the key
to getting good parallel performance.
Basically, you need to turn your definition inside-out.
We have defined a parallel zipWith as a strategic function, but I haven't used it much. As you've noted, it's basically a limited scan pattern.
> The parscanl based on mergePairs has linear run time as is. The
> version above needs to be implemented using a list type with more
> efficient split and append functions
I don't think that's the real issue (unless you're worried about sequential performance)...
On 4 Aug 2011, at 01:33, Sebastian Fischer wrote:
> Basically, you need to turn your definition inside-out.
>
> Wouldn't it suffice to have a list type with constant time splitAt, append and parallel zipWith?
That would help (these operations are usually used on in-place arrays), but you'd need to pull some tricks with the append (pre-construct the result
and then fill in the elements, for example). You'd also need to make the go function strict, and even then you couldn't set up the later phases in
parallel with the earlier ones. I think this is because your null test is probably wrong: you know the length of the input list, so you could use it to bound the
number of stages (and I'm sure you can have constant time length :). You could then spark off the parzipwith for each stage in parallel with activating the next stage.
Ideally, I'd want a hardware implementation of this using SIMD vector ops, in which case you probably need to split the input into fixed size chunks (hardware limit on vector size),
then pipeline each of the stages internally, then sort out boundary conditions to rejoin the results (I think this can also be done using a scan, though it's a while since
I looked at the literature!).
> > The parscanl based on mergePairs has linear run time as is. The
> > version above needs to be implemented using a list type with more
> > efficient split and append functions
>
> I don't think that's the real issue (unless you're worried about sequential performance)...
>
> I did not run my functions, just used them to compare how often the combining operator is used (so probably my use of the term "run time" was misleading). One algorithm calls the combining function O(n) times, the other O(n*log(n)) times and I expect the linear algorithm to have better performance (after properly implementing both).
It can be dangerous to do this kind of complexity analysis for parallel algorithms, since you can introduce a sequential bottleneck. Log-time algorithms can fit parallel hardware better (in
fact, parallel array access is arguably O(log n) anyway). In this case, isn't the parallel scan algorithm inherently O(n log n) (log n stages over input of size n)?
The log n is a small fixed constant that allows the use of multiple pipeline stages, and the cost of the actual operation is usually small.
Best Wishes,
Kevin