transition of accelerate-cuda to accelerate-llvm-ptx


Henning Thielemann

Dec 7, 2017, 4:07:40 PM
to Accelerate Project

I am adapting my "accelerate"-related packages from 0.15 to 1.0. So far,
most changes were simple renamings. Now I want to update accelerate-cufft.
accelerate-cuda is gone and accelerate-llvm-ptx is its replacement. Is there
a transition guide? I always try to refer to Changes.md for this
purpose but have not found the required information there. :-(

E.g. if I simply replace import Data.Array.Accelerate.CUDA.Foreign by
Data.Array.Accelerate.LLVM.PTX.Foreign, these things are missing: CIO,
allocateArray, devicePtrsOfArray.

Henning Thielemann

Dec 7, 2017, 4:53:07 PM
to Accelerate Project

On Thu, 7 Dec 2017, Henning Thielemann wrote:

> E.g. if I simply replace import Data.Array.Accelerate.CUDA.Foreign by
> Data.Array.Accelerate.LLVM.PTX.Foreign, these things are missing: CIO,
> allocateArray, devicePtrsOfArray.

It seems that CIO is now LLVM PTX, allocateArray must now be imported from
Sugar, and devicePtrsOfArray has apparently become withDevicePtr, but I still
have to find out how to convert between ArrayData and Array sh.
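For later readers, the findings above can be summarised as an import-level sketch. Module and function names below are inferred from this thread only and are hedged, not checked against a particular accelerate-llvm-ptx release:

```haskell
-- Migration sketch, accelerate-cuda (0.15) -> accelerate-llvm-ptx (1.0).
-- Old imports were roughly:
--   import Data.Array.Accelerate.CUDA.Foreign
--     (CIO, allocateArray, devicePtrsOfArray)
-- New imports, as far as this thread establishes:
import Data.Array.Accelerate.LLVM.PTX.Foreign (LLVM, PTX, withDevicePtr)
import Data.Array.Accelerate.Array.Sugar      (Array(..), allocateArray)

-- Correspondences:
--   CIO               ~>  LLVM PTX        (the PTX state monad)
--   allocateArray     ~>  allocateArray   (now from Sugar), or allocateRemote
--   devicePtrsOfArray ~>  withDevicePtr
```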

Trevor McDonell

Dec 7, 2017, 9:26:58 PM
to accelerat...@googlegroups.com

Hi Henning,

You probably want to use the allocateRemote function to get an array on the GPU.

Array is defined in Sugar.hs as:

data Array sh e where
  Array :: (Shape sh, Elt e)
        => EltRepr sh                 -- extent of dimensions = shape
        -> ArrayData (EltRepr e)      -- array payload
        -> Array sh e

Some relevant examples for how to use it might be in accelerate-fft here or accelerate-blas here. If you point me to your code I could give you a more specific example.

Hope that helps!
-Trevor
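Given that definition, converting between Array sh e and its ArrayData payload is just pattern matching on the constructor. A hedged sketch (it relies on internal names from the Sugar and Data modules, which may shift between versions):

```haskell
-- Sketch only: these are internal accelerate modules, so the exact
-- export lists may differ between releases.
import Data.Array.Accelerate.Array.Sugar
         ( Array(..), EltRepr, Shape, Elt, fromElt )
import Data.Array.Accelerate.Array.Data ( ArrayData )

-- Extract the raw payload from an array.
payload :: Array sh e -> ArrayData (EltRepr e)
payload (Array _ adata) = adata

-- Rebuild an array from a shape and a payload.
rebuild :: (Shape sh, Elt e) => sh -> ArrayData (EltRepr e) -> Array sh e
rebuild sh adata = Array (fromElt sh) adata
```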

--
You received this message because you are subscribed to the Google Groups "Accelerate" group.
To unsubscribe from this group and stop receiving emails from it, send an email to accelerate-hask...@googlegroups.com.
Visit this group at https://groups.google.com/group/accelerate-haskell.
For more options, visit https://groups.google.com/d/optout.

Henning Thielemann

Dec 8, 2017, 2:46:41 AM
to Accelerate Project

On Fri, 8 Dec 2017, Trevor McDonell wrote:

> You probably want to use the allocateRemote function to get an array on the GPU.
>
> Array is defined in Sugar.hs as:
>
> data Array sh e where
>   Array :: (Shape sh, Elt e)
>         => EltRepr sh                 -- extent of dimensions = shape
>         -> ArrayData (EltRepr e)      -- array payload
>         -> Array sh e
>
> Some relevant examples for how to use it might be in accelerate-fft here or accelerate-blas here.

ok

> If you point me to your code I could give you a more specific example.

I am currently updating this module:
https://hub.darcs.net/thielema/accelerate-cufft/browse/src/Data/Array/Accelerate/CUFFT/Private.hs


Many thanks for the hints!
Henning

Henning Thielemann

Dec 8, 2017, 3:15:16 AM
to Accelerate Project

On Fri, 8 Dec 2017, Henning Thielemann wrote:

> On Fri, 8 Dec 2017, Trevor McDonell wrote:
>
>> Some relevant examples for how to use it might be in accelerate-fft here or
>> accelerate-blas here.
>
> ok

Btw. the unqualified imports make it unnecessarily hard to track down the
identifiers. E.g. Data.Array.Accelerate.Math.FFT.LLVM.PTX.withArrayData
calls a function 'checkpoint', and I see 20 unqualified imports, each of
which is a potential source of 'checkpoint'. :-(

Henning Thielemann

Dec 8, 2017, 5:38:25 AM
to Accelerate Project

On Fri, 8 Dec 2017, Trevor McDonell wrote:

> Some relevant examples for how to use it might be in accelerate-fft here
> or accelerate-blas here.

Both packages provide helper functions like withArray, withScalarArrayPtr.
Wouldn't these be nice additions to accelerate-llvm-ptx?

Henning Thielemann

Dec 8, 2017, 6:35:47 AM
to Accelerate Project

On Fri, 8 Dec 2017, Trevor McDonell wrote:

> Some relevant examples for how to use it might be in accelerate-fft here
> or accelerate-blas here. If you point me to your code I could give you a
> more specific example.

What has happened to CUDA.Foreign.inDefaultContext? It seems to have
vanished. Can I simply and safely remove it from my code?

Henning Thielemann

Dec 8, 2017, 7:46:55 AM
to Accelerate Project

On Fri, 8 Dec 2017, Trevor McDonell wrote:

> Some relevant examples for how to use it might be in accelerate-fft here
> or accelerate-blas here. If you point me to your code I could give you a
> more specific example.

I have managed to get my adapted modules past the type-checker, but the
program crashes. I have attached a minimal example. Can you reproduce the
problem?
Separate.hs

Trevor McDonell

Dec 9, 2017, 3:20:16 AM
to accelerat...@googlegroups.com
Hi Henning,

Yes, I get the error "failed to execute an FFT on the GPU", which I assume is what you mean.

Looking at the code this is more or less what I expect. The cuFFT context created in line 57 is probably being created within the context used by the CUDA Runtime API (at a guess).

You need a cuFFT context associated with whichever device context Accelerate happens to be running with when it calls your function. You might be able to use something like what I did in this module: https://github.com/tmcdonell/accelerate-blas/blob/master/Data/Array/Accelerate/Numeric/LinearAlgebra/LLVM/PTX/Context.hs

Enjoy the weekend (:
-Trev


Henning Thielemann

Dec 9, 2017, 4:43:38 AM
to Accelerate Project

Hi Trevor,


On Sat, 9 Dec 2017, Trevor McDonell wrote:

> Yes, I get the error "failed to execute an FFT on the GPU", which I assume is what you mean.

Right.

> Looking at the code this is more or less what I expect. The cuFFT
> context created in line 57 is probably being created within the context
> used by the CUDA Runtime API (at a guess).

So far I wrapped plan1D in inDefaultContext, which does not exist anymore.
:-(

> You need a cuFFT context associated with whichever device context
> Accelerate happens to be running with when it calls your function. You
> might be able to use something like what I did in this module:
https://github.com/tmcdonell/accelerate-blas/blob/master/Data/Array/Accelerate/Numeric/LinearAlgebra/LLVM/PTX/Context.hs

I am not sure I understand the problem and your solution. Does run1 create
its own context? Or is there still some default context that I can re-use
for the preceding CUFFT.plan1D? How would I bind CUFFT.plan1D to the
right context, if I have one?


Best,
Henning

Henning Thielemann

Dec 9, 2017, 5:07:15 AM
to Accelerate Project

On Sat, 9 Dec 2017, Henning Thielemann wrote:

> Hi Trevor,
>
>
> On Sat, 9 Dec 2017, Trevor McDonell wrote:
>
>> Yes, I get the error "failed to execute an FFT on the GPU", which I assume
>> is what you mean.
>
> Right.
>
>> Looking at the code this is more or less what I expect. The cuFFT context
>> created in line 57 is probably being created within the context used by the
>> CUDA Runtime API (at a guess).
>
> So far I wrapped plan1D in inDefaultContext, which does not exist anymore.
> :-(

I see that defaultContext has been replaced by defaultTarget, but
defaultTarget is not exported anymore and there is no replacement for
inDefaultContext. :-(

Is there still a way to do something on a default device without carrying
a Handle around?

Trevor McDonell

Dec 10, 2017, 9:49:13 PM
to accelerat...@googlegroups.com

Hi Henning,

CUDA operations are executed within a given context. Code and data are specific to a context. Two contexts might exist on the same device, or separate devices, but in both cases they are entirely distinct; addresses are not transferable.

Exporting inDefaultContext was not a good idea; I can’t remember when or why that was done. I think we have always had the run*With functions, which allow you to supply the context in which to run, so assuming that you are running on the default context is not valid.

The context that you are actually running on is part of the state of the LLVM PTX state monad within which all operations are executed. The code I linked above just queries what the current context is (L47), and then uses this as a key into a map structure which serves as a cache (L74).

For your particular use case, you’ll probably need to key not only on the execution context but also the size & type the cuFFT was created for.

I hope that helps explain it?

-Trev
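The scheme described here is independent of any GPU API, so it can be illustrated with a self-contained sketch: a thread-safe cache keyed on a (context, size) pair. ContextId and Plan below are stand-ins for the real CUDA context and cuFFT plan types, not the actual library types:

```haskell
import qualified Data.Map.Strict as Map
import Control.Concurrent.MVar

-- Stand-ins for the real types: a CUDA context identifier and a cuFFT plan.
type ContextId = Int
type FFTSize   = Int
newtype Plan   = Plan (ContextId, FFTSize) deriving (Eq, Show)

type PlanCache = MVar (Map.Map (ContextId, FFTSize) Plan)

newPlanCache :: IO PlanCache
newPlanCache = newMVar Map.empty

-- Look up the plan for (context, size); create and cache it on a miss.
-- In the real library the Nothing branch would call the cuFFT planner.
withPlan :: PlanCache -> ContextId -> FFTSize -> (Plan -> IO a) -> IO a
withPlan cache ctx n k = do
  plan <- modifyMVar cache $ \m ->
    case Map.lookup (ctx, n) m of
      Just p  -> return (m, p)
      Nothing -> let p = Plan (ctx, n)
                 in  return (Map.insert (ctx, n) p m, p)
  k plan

main :: IO ()
main = do
  cache <- newPlanCache
  p1 <- withPlan cache 0 1024 return
  p2 <- withPlan cache 0 1024 return   -- same context and size: cache hit
  p3 <- withPlan cache 1 1024 return   -- different context: a new plan
  putStrLn (if p1 == p2 && p1 /= p3 then "ok" else "cache broken")
```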


Henning Thielemann

Dec 11, 2017, 5:20:10 AM
to Accelerate Project

Hi Trevor,


On Mon, 11 Dec 2017, Trevor McDonell wrote:

> For your particular use case, you’ll probably need to key not only on
> the execution context but also the size & type the cuFFT was created
> for.

That's what I use the Plan/Handle structure for.

> I hope that helps explain it?

Yes, although I still have this question: if I want to run a cuFFT-based
program on simply the best available device, as 'run' does, how do I get
the corresponding context? I want to plan an FFT once and re-use it in
several 'run's. Would it hurt to export defaultContext/defaultTarget and
an inContext function (but not inDefaultContext)? That way, I could ensure
that all plans and runWiths are performed in the same context, without
assuming that 'run' runs in the defaultContext.


Btw. I am not convinced by the solution with a global context cache. This
way, all BLAS functions have to synchronize access to that global cache,
where in principle no synchronization is necessary. Sure, the accesses are
short and infrequent, but in principle it does not feel right.

Trevor McDonell

Dec 11, 2017, 9:27:20 PM
to accelerat...@googlegroups.com

Hi Henning,

If I want to run a cuFFT-based program on simply the best available device, as ‘run’ does, how do I get the corresponding context?

If you want to (outside of accelerate) determine what the best available device is, there are functions from the cuda ffi bindings you can use. “Best” is pretty easy to estimate based on the device properties, but “available” is trickier. Once you have figured out which device you want, you can create a context for it which you can pass to the run*With functions (createTargetForDevice or createTargetFromContext).
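A hedged sketch of that route, combining the cuda bindings with the two functions named above. The signatures are from memory of the 1.x-era APIs and may differ slightly, and it needs an actual GPU to run:

```haskell
import qualified Foreign.CUDA.Driver  as CUDA
import Data.Array.Accelerate.LLVM.PTX ( PTX, createTargetForDevice )

-- Create a PTX execution target for device 0, which the driver
-- usually orders first among the capable devices.
pickTarget :: IO PTX
pickTarget = do
  CUDA.initialise []
  dev <- CUDA.device 0
  prp <- CUDA.props dev
  createTargetForDevice dev prp []
```

The resulting PTX value is then what the run*With family expects.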

If you just wanted to get the context that accelerate decided was best (at that particular time), you can do it in the way I showed previously.

Would it hurt to export defaultContext/defaultTarget

I think it would.

There is no reason to expect that run will always use the same context. For example, once we encounter an error (even a minor one), generally the only way to recover is to destroy the context and start again. This seems to be a limitation of the CUDA API: once a call fails, all subsequent calls in that context fail. So exporting a default context either ties us into an unfortunate position, or isn’t constant and thus not useful for you anyway.

Perhaps what you want for your library is to provide your own PTX target and tell your users “you must use the run*With functions with this particular context, because all the FFT state is tied to it”… I don’t know, maybe that design works better for you?

Btw. I am not convinced by the solution with a global context cache. This way, all BLAS functions have to synchronize access to that global cache, where in principle no synchronization is necessary. Sure, the accesses are short and infrequent, but in principle it feels not right.

I am open to suggestions.

Cheers,
-Trev

Henning Thielemann

Dec 12, 2017, 2:58:28 AM
to Accelerate Project

On Tue, 12 Dec 2017, Trevor McDonell wrote:

> Would it hurt to export defaultContext/defaultTarget
>
> I think it would.

Then, how about exporting a function that searches for a good default
context for me? I think this would be useful for wrappers of all the CUDA
libraries (cuFFT, cuBLAS, cuRAND, ...).

> Perhaps what you want for your library is to provide your own PTX target
> and tell your users “you must use the run*With functions with this
> particular context, because all the FFT state is tied to it”… I don’t
> know, maybe that design works better for you?

How could we then write algorithms that use, say, both cuFFT and cuBLAS?

So far, I use explicit handles for cuFFT. This is natural for cuFFT
because the handles are bound to particular data sizes. I will try to
stick to this scheme. The problem I see is: how can I ensure that plan
creation and the transformation are performed in the same context?


Best,
Henning

Trevor McDonell

Dec 12, 2017, 8:28:02 AM
to accelerat...@googlegroups.com

Hey Henning,

Then, how about exporting a function that searches for a good default
context for me?

Just selecting device 0 is usually fine. I believe the CUDA driver already orders the devices for you. However… [continues after break]

So far, I use explicit handles for cuFFT. This is natural for cuFFT
because the handles are bound to particular data sizes. I will try to
stick to this scheme. The problem I see is: how can I ensure that plan
creation and the transformation are performed in the same context?

…If you are not doing automatic plan management, then I don’t think you should be doing automatic context management either.

I think you just want your plan creation function to be explicitly tied to a given target; i.e.:

createPlanForTarget :: PTX -> {- FFT parameters -} -> IO Plan

And then the user supplies that Plan together with its associated PTX target when they call run*With. This seems to be closer to what you want, and as you mention, avoids any overheads associated with the automatic management methods I use.
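Spelled out, that calling convention might look like the following. createPlanForTarget, Plan, dft, and fftParams are hypothetical names belonging to this proposal; only runWith, PTX, and createTargetForDevice come from accelerate-llvm-ptx:

```haskell
-- Hypothetical wrapper API, as proposed above:
--
--   createPlanForTarget :: PTX -> {- FFT parameters -} -> IO Plan
--   dft :: Plan -> Acc (Vector (Complex Float))
--              -> Acc (Vector (Complex Float))
--
-- The caller keeps the plan and its target together:
--
--   do target <- createTargetForDevice dev prp []
--      plan   <- createPlanForTarget target fftParams
--      print (runWith target (dft plan xs))
```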

All the best,
-Trev

Henning Thielemann

Dec 13, 2017, 8:27:50 AM
to Accelerate Project

Hi Trevor,


On Tue, 12 Dec 2017, Trevor McDonell wrote:

> Then, how about exporting a function that searches for a good default
> context for me?
>
> Just selecting device 0 is usually fine. I believe the CUDA driver already orders the devices for you. However…
> [continues after break]

I was not aware that Device is simply a number and that 0 has special
meaning. However, the function behind defaultTarget is not that simple.

I have adapted my example according to your advice, see the attachment. I
would fuse inContext with CuFFT.plan in the library. Still, I would prefer
that bestTarget (the function behind defaultTarget) were exported by
accelerate-llvm-ptx. Only this way can I ensure that cuFFT-based code and
code without cuFFT use the same device-selection strategy, although they do
not necessarily use the same device.
Separate.hs