An alternative deterministic pure API for parallel computation of repa array elements?


Rob Stewart

Apr 16, 2014, 7:39:33 PM
to haskel...@googlegroups.com
Hi,

I have a short question on the monadic API for parallel repa
computations.

computeP :: (Load r1 sh e, Target r2 e, Source r2 e, Monad m) => Array r1 sh e -> m (Array r2 sh e)
computeS :: (Load r1 sh e, Target r2 e) => Array r1 sh e -> Array r2 sh e

I understand that `computeP` must be monadic, due to the use of `forkOn`
in Data.Array.Repa.Eval.Gang.forkGang. My question is: have the authors
of repa considered a deterministic pure library for parallel computation
of repa arrays? For example, using `runPar` from the monad-par library,
or `runEval` from the parallel library:

runPar  :: Par a  -> a
runEval :: Eval a -> a
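
As a rough sketch of the style I mean (plain lists rather than repa
arrays; `parSum` is only an illustration):

import Control.Parallel.Strategies (runEval, rpar, rseq)

-- Parallelism is expressed inside Eval, and runEval returns a plain
-- value, so no monad leaks into the caller's types.
parSum :: [Int] -> Int
parSum xs = runEval $ do
  a <- rpar (sum lo)   -- spark the first half in parallel
  b <- rseq (sum hi)   -- evaluate the second half now
  _ <- rseq a          -- wait for the spark before combining
  return (a + b)
  where
    (lo, hi) = splitAt (length xs `div` 2) xs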

I realise that you lose the ability to pin array segments to specific
processing elements with `forkOn`, and in both cases rely on runtime
scheduling to (maybe) achieve parallel computation of array elements.
I have introduced repa code throughout a reasonably large code base in
recent months, picking and choosing between `computeS` and `computeP`
where it seemed appropriate at the time. I find it difficult to refactor
these decisions later, i.e. when I want to swap a `computeS` for a
`computeP`, the changes needed to accommodate the monadic call are
sometimes quite invasive on the surrounding code. So:

1. Have any of the deterministic pure libraries for parallelism been
   considered as an alternative pure `computeP` implementation?

2. Is it a feasible idea?

3. What advice can you give for people in my position, who are forced to
   commit at an early stage whether it is acceptable to use the monadic
   repa API or use the sequential pure `computeS`?

Thanks,

--
Rob

Ben Lippmeier

Apr 16, 2014, 9:15:31 PM
to Rob Stewart, haskel...@googlegroups.com

On 17 Apr 2014, at 9:39, Rob Stewart <robste...@gmail.com> wrote:

> I have a short question on the monadic API for parallel repa
> computations.
>
> computeP :: (Load r1 sh e, Target r2 e, Source r2 e, Monad m) => Array r1 sh e -> m (Array r2 sh e)
> computeS :: (Load r1 sh e, Target r2 e) => Array r1 sh e -> Array r2 sh e
>
> I understand that `computeP` must be monadic, due to the use of `forkOn`
> in Data.Array.Repa.Eval.Gang.forkGang. My question is: have the authors
> of repa considered a deterministic pure library for parallel computation
> of repa arrays?

The implementation of computeP is already pure and deterministic. Note that its type is polymorphic in the monad, so (almost) any monad will do. The monad constraint in the signature does not encapsulate any particular user-visible effect.

computeP has a monadic wrapper to enforce sequencing between consecutive parallel operations, which fights off Haskell's default lazy evaluation mechanism. This is described in Section 6.2 of the paper "Guiding parallel array fusion with indexed types".
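
For instance (just a sketch: instantiating the monad at Identity
type-checks, but it abandons exactly the sequencing the wrapper is there
to provide, so it carries the same hazards as suspendedComputeP below):

import Data.Functor.Identity (runIdentity)
import Data.Array.Repa       (Array, Source, computeP)
import Data.Array.Repa.Eval  (Load, Target)

-- computeP is polymorphic in its monad, so Identity gives a
-- pure-looking wrapper.
computePure :: (Load r1 sh e, Target r2 e, Source r2 e)
            => Array r1 sh e -> Array r2 sh e
computePure = runIdentity . computeP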


> I have introduced repa code throughout a reasonably large code base in
> recent months, picking and choosing between `computeS` and `computeP`
> where it seemed appropriate at the time. I find it difficult to refactor
> these decisions later, i.e. when I want to swap a `computeS` for a
> `computeP`, the changes needed to accommodate the monadic call are
> sometimes quite invasive on the surrounding code. So:
>
> 1. Have any of the deterministic pure libraries for parallelism been
> considered as an alternative pure `computeP` implementation?

This is provided by the suspendedComputeP function in Data.Array.Repa.Eval, but pay attention to the warning in the comments.
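
In use it looks like this (a sketch; `doubleAll` is a made-up example):

import qualified Data.Array.Repa as R
import Data.Array.Repa           (Array, D, U, DIM1)
import Data.Array.Repa.Eval      (suspendedComputeP)

-- The result looks like a plain value, but the parallel evaluation
-- hides inside a thunk until the array is demanded -- hence the
-- warning about forcing it while another parallel computation runs.
doubleAll :: Array D DIM1 Double -> Array U DIM1 Double
doubleAll = suspendedComputeP . R.map (* 2)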


> 2. Is it a feasible idea?
>
> 3. What advice can you give for people in my position, who are forced to
> commit at an early stage whether it is acceptable to use the monadic
> repa API or use the sequential pure `computeS`?

I would just write all the code that invokes compute{P,S} in the IO monad. It's a sledgehammer, but it works.

Haskell's default lazy evaluation mechanism is fine when one only considers "functional correctness", but is a disaster with respect to "performance correctness".
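
Sketched concretely (a made-up stage; unboxed Double arrays assumed):

import qualified Data.Array.Repa as R
import Data.Array.Repa           (Array, U, DIM1)

-- Instantiating computeP's monad at IO makes the IO monad itself
-- sequence consecutive parallel operations.
step :: Array U DIM1 Double -> IO (Array U DIM1 Double)
step arr = do
  a <- R.computeP (R.map (+ 1) arr) :: IO (Array U DIM1 Double)
  R.computeP (R.map (* 2) a)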

You might instead want to use "meta-repa", as described in the paper "An EDSL Approach to High Performance Haskell Programming", but I don't know how well baked it is.

Ben.


Rob Stewart

Apr 17, 2014, 4:53:19 AM
to Ben Lippmeier, haskel...@googlegroups.com
Hi Ben,

Thank you for your comments. More below...

On 17 April 2014 02:15, Ben Lippmeier <be...@ouroborus.net> wrote:

> This is provided by the suspendedComputeP function in Data.Array.Repa.Eval, but pay attention to the warning in the comments.
>
> I would just write all the code that invokes compute{P,S} in the IO monad. It's a sledgehammer, but it works.

So my ongoing work is about writing a DSL for image processing, with
repa as a backend for parallel image processing. Consider the
following pseudo code:

a := readFile("lolcats.png")
b := convolve(a,Sobel)
c := blur(b)
d := darken(c)
writeFile("processed-meow.png",d)

The interpretation of this code is serial, in that no Haskell
concurrency primitives are used to fork threads; any parallelism comes
from the repa implementations of "convolve", "blur", and "darken". What
I'd like to do is take advantage of repa's delayed arrays for these
three functions, using computeP only in "writeFile". That is:

readFile :: FilePath -> IO (Array D DIM2 (Word8,Word8,Word8))
convolve, blur, darken :: Array D DIM2 (Word8,Word8,Word8)
                       -> Array D DIM2 (Word8,Word8,Word8)
writeFile :: FilePath -> Array D DIM2 (Word8,Word8,Word8)
          -> IO (Array U DIM2 (Word8,Word8,Word8))

My thinking is that by delaying `computeP` to the very end, i.e. in the
`writeFile` call, there would be no synchronisation between the
execution of convolve, blur and darken. So I'm happy to just have all
my code in the IO monad. My question has changed to: when using repa,
why not delay `computeP` calls to the very last function call over an
array?
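
Sketched out (stubbed kernels; the pixel type and stage bodies are
placeholders, not my real code):

import qualified Data.Array.Repa as R
import Data.Array.Repa           (Array, D, U, DIM2)
import Data.Word                 (Word8)

type RGB = (Word8, Word8, Word8)

-- Stub stages standing in for the real kernels.
convolve, blur, darken :: Array D DIM2 RGB -> Array D DIM2 RGB
convolve = R.map id
blur     = R.map id
darken   = R.map (\(r, g, b) -> (r `div` 2, g `div` 2, b `div` 2))

-- Only this final step forces the array, so the three delayed stages
-- fuse into a single parallel traversal with no intermediate arrays.
process :: Array D DIM2 RGB -> IO (Array U DIM2 RGB)
process = R.computeP . darken . blur . convolve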

Thanks,

--
Rob

Ben Lippmeier

Apr 29, 2014, 7:12:56 AM
to Rob Stewart, haskel...@googlegroups.com

On 17 Apr 2014, at 18:53, Rob Stewart <robste...@gmail.com> wrote:

> Hi Ben,
>
> Thank you for your comments. More below...
>
> On 17 April 2014 02:15, Ben Lippmeier <be...@ouroborus.net> wrote:
>
>> This is provided by the suspendedComputeP function in Data.Array.Repa.Eval, but pay attention to the warning in the comments.
>>
>> I would just write all the code that invokes compute{P,S} in the IO monad. It's a sledgehammer, but it works.
>
> My thinking is that by delaying `computeP` to the very end, i.e. in the
> `writeFile` call, there would be no synchronisation between the
> execution of convolve, blur and darken. So I'm happy to just have all
> my code in the IO monad. My question has changed to: when using repa,
> why not delay `computeP` calls to the very last function call over an
> array?

computeP is also used to control sharing of results.

Consider:

arr0 = ...
arr1 = map f arr0
arr2 = map g arr1
arr3 = map h arr1
arr4 = zipWith i arr2 arr3

Note that arr1 is consumed in two places. In the Repa fusion model, if you don't computeP arr1 then all its elements will be evaluated twice.

In most cases you'll want something like:

arr0 = ...
arr1 <- computeP $ map f arr0
arr2 = map g arr1
arr3 = map h arr1
arr4 <- computeP $ zipWith i arr2 arr3

This forces arr1 to be materialised at that intermediate point.
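
In compilable form (placeholder workers f, g, h, i; the monad left
abstract) that looks like:

import qualified Data.Array.Repa as R
import Data.Array.Repa           (Array, D, U, DIM1)

-- Placeholder element functions, just so the sketch compiles.
f, g, h :: Int -> Int
f = (+ 1)
g = (* 2)
h = subtract 3

i :: Int -> Int -> Int
i = (+)

-- Fixing the intermediate representation to unboxed arrays pins down
-- the type of arr1 at its computeP.
forceU :: Monad m => Array D DIM1 Int -> m (Array U DIM1 Int)
forceU = R.computeP

pipeline :: Monad m => Array U DIM1 Int -> m (Array U DIM1 Int)
pipeline arr0 = do
  arr1 <- forceU (R.map f arr0)   -- materialise the shared array once
  let arr2 = R.map g arr1         -- delayed; fuses into the final pass
      arr3 = R.map h arr1
  forceU (R.zipWith i arr2 arr3)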

However, if computing arr1 is very cheap, then maybe you'll just want to compute it twice. It really needs to be under user control.

The Data Flow Fusion model goes a step further, in that it can evaluate all of arr[1..4] in one go without recomputation.

Ben.

Rob Stewart

Apr 29, 2014, 7:44:30 AM
to Ben Lippmeier, haskel...@googlegroups.com
Hi Ben,

On 29 April 2014 12:12, Ben Lippmeier <be...@ouroborus.net> wrote:

> In most cases you'll want something like:
>
> arr0 = ...
> arr1 <- computeP $ map f arr0
> arr2 = map g arr1
> arr3 = map h arr1
> arr4 <- computeP $ zipWith i arr2 arr3
>
> This forces arr1 to be materialised at that intermediate point.

Thanks, understood. An interesting project would be a repa-specific
refactorer: one that searched for arrays whose value is later used
more than once in separate computeS/computeP calls, and suggested a
refactoring to materialise those values at an earlier intermediate
point. Likewise, a refactorer could suggest eliminating (or delaying
the use of) computeP on arrays whose values are used only once and
where fusion on this now-delayed array could be exploited, e.g. with
zipWith. Is a repa-specific Haskell refactorer something that has
been considered?

--
Rob

Ben Lippmeier

Apr 29, 2014, 7:54:53 AM
to Rob Stewart, haskel...@googlegroups.com
I think it would be easy to write a simple tool that suggested refactorings but wasn't actually useful. For the suggestion to be useful it would need to be based on some cost model of how expensive each array is to compute -- but that's a whole research area with an existing line of work. E.g. see [1] and [2].

Ben.


[1]
@inproceedings{megiddo1997optimal,
  author    = {Nimrod Megiddo and Vivek Sarkar},
  title     = {Optimal Weighted Loop Fusion for Parallel Programs},
  booktitle = {SPAA: Symposium on Parallel Algorithms and Architectures},
  year      = {1997}
}

[2]
@inproceedings{ashby2006iterative,
  author    = {Thomas J. Ashby and Michael F. P. O'Boyle},
  title     = {Iterative Collective Loop Fusion},
  booktitle = {CC: Compiler Construction},
  year      = {2006}
}


