Concurrent GPU operations with accelerate-cuda?


Rob Stewart

Jun 27, 2014, 10:01:55 AM6/27/14
to accelerat...@googlegroups.com
Hi,

Does the accelerate-cuda backend support concurrent kernel execution and memory transfers? I was interested to read that the XKaapi [1] framework does.

"Recent GPUs, such as NVIDIA's Fermi and Kepler, support new features for asynchronisation. For instance, Fermi GPUS have one execution engine and two copy engines, enabling it to concurrently perform a kernel execution and memory-transfers (two-way host-to-device and device-to-host), under the condition that no explicit nor implicit synchronisation occurs."

Moreover...

"Once a task implementation has launched computation on a GPU, the [XKaapi] scheduler starts the execution of the next selected tak by sending its input data in advance. This enables it to overlap data transfers with kernel executions .. We empirically found that the best performance gain is obtained when having two tasks being processed per GPU".

How does this behaviour compare with `runAsync` in accelerate-cuda? Is the user of this library able to specify whether or not they intend kernel execution and memory transfer to take place concurrently? Does the library allow the user to specify how many tasks should be processed per GPU when the hardware supports it (e.g. Fermi and Kepler)? Or does the CUDA backend not currently support asynchronous concurrency?

[1] XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures. Thierry Gautier. IEEE IPDPS, 2013.

Thanks,

--
Rob

Trevor L. McDonell

Jun 28, 2014, 9:01:26 PM6/28/14
to accelerat...@googlegroups.com
Hi Rob,

I haven’t read through the XKaapi paper yet, so I can’t provide a comparison to that, but I’ll answer your questions…

Does the accelerate-cuda backend support concurrent kernel execution and memory transfers? I was interested to read that the XKaapi [1] framework does.

Yes and no, respectively.

1) Kernels are executed concurrently with each other. Accelerate-CUDA does not place any limit on how many kernels are scheduled to execute concurrently; I just provide the information to the CUDA driver and let it decide how best to execute. This all happens automatically, and has been part of accelerate-cuda for a while now. We haven’t talked about this in any of our papers yet, but here’s a screenshot of it in action:


2) Memory transfers aren’t overlapped, because we transfer all data before beginning the computation. This is done asynchronously with other CPU-side tasks, but by the time the computation phase begins, all the data is on the device already. I think this is in contrast to XKaapi, which looks like it streams data in and out during computation (based on your quote below). So for us, this is not so critical.

I haven’t yet experimented with multiple concurrent memory transfers in this first phase (for devices with multiple copy engines), because there are some restrictions at the moment due to the separation between the front-end accelerate language and the back-end accelerate-cuda execution engine. Specifically, the former is the thing doing memory allocations, but to get proper asynchronous host/device transfers, you need the special allocator from the CUDA library.


"Recent GPUs, such as NVIDIA's Fermi and Kepler, support new features for asynchronisation. For instance, Fermi GPUS have one execution engine and two copy engines, enabling it to concurrently perform a kernel execution and memory-transfers (two-way host-to-device and device-to-host), under the condition that no explicit nor implicit synchronisation occurs."

Moreover...

"Once a task implementation has launched computation on a GPU, the [XKaapi] scheduler starts the execution of the next selected tak by sending its input data in advance. This enables it to overlap data transfers with kernel executions .. We empirically found that the best performance gain is obtained when having two tasks being processed per GPU".

How does this behaviour compare with `runAsync` in accelerate-cuda? Is the use of this library able to specify whether or not they intend for kernel execution and memory-transfer to take place concurrently? Does the library allow the user to specify how many tasks should be processed  per GPU when the hardware supports it e.g. Fermi and Kepler? Or does the CUDA backend not currently support asynchronuous concurrency?

When you use ‘run’ or ‘run1’, there is an implicit synchronisation at the end of the computation to copy the final result back to the host.

With ‘runAsync’, we don’t have this synchronisation point. Instead we return as soon as possible, and it is the programmer’s job to check when the result is actually available. I expect that you could use multiple ‘runAsync’s to increase the amount of concurrent kernel execution.
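
Roughly, the usage pattern looks like this. This is only a minimal sketch; I’m assuming the signatures `runAsync :: Arrays a => Acc a -> Async a` and `wait :: Async a -> IO a` here, so check the haddocks of the version you have installed:

  import Control.Exception                    ( evaluate )
  import Data.Array.Accelerate                ( Vector, Z(..), (:.)(..), fromList )
  import qualified Data.Array.Accelerate      as A
  import qualified Data.Array.Accelerate.CUDA as CUDA

  main :: IO ()
  main = do
    let xs = fromList (Z :. 1000000) [0..] :: Vector Float

    -- launch the computation; runAsync hands back an Async handle without
    -- blocking (evaluate just forces the lazy handle so the launch happens now)
    job <- evaluate (CUDA.runAsync (A.fold (+) 0 (A.map (* 2) (A.use xs))))

    -- ... do other host-side work here while the GPU is busy ...

    -- it is up to the programmer to demand the result when it is needed
    result <- CUDA.wait job
    print result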


[1] XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures. Thierry Gautier. IEEE IPDPS, 2013.

Thanks for the link, looks interesting.


Hope that helps!
-Trev

Rob Stewart

Jul 1, 2014, 12:38:21 PM7/1/14
to accelerat...@googlegroups.com
Hi Trev,

A very clear response, thanks. More below...

2) Memory transfers aren’t overlapped, because we transfer all data before beginning the computation. This is done asynchronously with other CPU-side tasks, but by the time the computation phase begins, all the data is on the device already. I think this is in contrast to XKaapi, which looks like it streams data in and out during computation (based on your quote below). So for us, this is not so critical.

Is it possible to bypass this constraint? That is, is there an API call I could use to transfer memory to a GPU in advance of the run* family of primitives? For example, I could imagine an image processing program that starts with a number of readFile calls on image files. Rather than waiting for accelerate's `run` call to move the data, is there a staging function that transfers data ahead of kernel execution? In this case, it would transfer all images across to the GPU as soon as they are read from file by the CPU, avoiding memory transfer latency when `run` is eventually called. Something analogous to `par :: a -> b -> b`, which gives the RTS a hint that it may spark `a`; that is, support for a similar primitive `stage :: (Arrays a) => a -> b -> b`, or something along those lines?

I expect that you could use multiple ‘runAsync’s to increase the amount of concurrent kernel execution.
 
OK. So the authors of the XKaapi paper found that executing two kernels concurrently on a Fermi or a Kepler resulted in the shortest runtimes. As you say Trev, this could be mirrored using `runAsync`, e.g. by implementing a simple double-ended taskpool of `Acc a` computations on the CPU, which would throttle concurrent kernel executions by forking two threads that: 1) pop the leftmost task, 2) call `runAsync` on the `Acc a` computation, 3) call a blocking `wait` on the `Async a`, and 4) go back to step (1). In the meantime, other threads would push `Acc a` tasks onto the right of the taskpool, rather than using `runAsync` directly. Increasing the number of threads that consume from the left increases the number of concurrent kernel executions.

Have you explored such a CPU-based taskpool for throttling accelerate computations to the GPU? What were your results, or what might you expect the results to look like? Is this kind of mechanism something that should be built around `runAsync` calls, or might taskpool dispatching instead be something you could conceive of as an optional addition to accelerate-cuda itself?

--
Rob

Trevor L. McDonell

Jul 1, 2014, 8:57:21 PM7/1/14
to accelerat...@googlegroups.com
Hi Rob,


Is it possible to bypass this constraint? That is, is there an API call I could use to transfer memory to a GPU in advance of the run* family of primitives? For example, I could imagine an image processing program that starts with a number of readFile calls on image files. Rather than waiting for accelerate's `run` call to move the data, is there a staging function that transfers data ahead of kernel execution? In this case, it would transfer all images across to the GPU as soon as they are read from file by the CPU, avoiding memory transfer latency when `run` is eventually called. Something analogous to `par :: a -> b -> b`, which gives the RTS a hint that it may spark `a`; that is, support for a similar primitive `stage :: (Arrays a) => a -> b -> b`, or something along those lines?

There is nothing like that at the moment, but your idea for this par/seq-like stage function sounds interesting.

You could hack it together yourself, though. Once data is transferred to the device, it will remain there until the host-side array it was copied from gets garbage collected. So this has a larger scope than `run`, but you still need to call `run` at least once to trigger it. This might work for you…

  -- assumes: import Data.Array.Accelerate            ( Arrays, use )
  --          import qualified Data.Array.Accelerate.CUDA as CUDA
  stage :: Arrays a => a -> b -> b
  stage arrs next = CUDA.run (use arrs) `seq` next
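
As a usage sketch (purely hypothetical glue: `readImage` and `blur` below stand in for whatever your application actually provides), you could push each image to the device as soon as it is read, and only pay for kernel execution when `run1` is eventually called:

  -- hypothetical helpers, e.g.
  --   readImage :: FilePath -> IO (Array DIM2 Word8)
  --   blur      :: Acc (Array DIM2 Word8) -> Acc (Array DIM2 Word8)
  example :: IO ()
  example = do
    img <- readImage "frame0.bmp"
    img `stage` putStrLn "staged"       -- the host->device copy is forced here
    -- ... other host-side work while the data sits on the device ...
    print (CUDA.run1 blur img)          -- later: no transfer latency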


OK. So the authors of the XKaapi paper found that executing two kernels concurrently on a Fermi or a Kepler resulted in the shortest runtimes. As you say Trev, this could be mirrored using `runAsync`, e.g. by implementing a simple double-ended taskpool of `Acc a` computations on the CPU, which would throttle concurrent kernel executions by forking two threads that: 1) pop the leftmost task, 2) call `runAsync` on the `Acc a` computation, 3) call a blocking `wait` on the `Async a`, and 4) go back to step (1). In the meantime, other threads would push `Acc a` tasks onto the right of the taskpool, rather than using `runAsync` directly. Increasing the number of threads that consume from the left increases the number of concurrent kernel executions.

To clarify, a single `run` will already execute kernels concurrently, so depending on your application you might have >1 kernel running at any given time. The example I linked earlier was a single `run` operation that resulted in 9 concurrent kernels. Mind you, it was a very contrived example program (essentially the same program used to demonstrate the concept in the CUDA SDK examples, and in order to keep it simple I disabled fusion). I don’t have a “real world” application that demonstrates the effect so cleanly.

Anyway, having multiple `runAsync` calls in the manner that you describe should definitely work to have more kernels being executed at once, which you could control by changing the number of threads that pop tasks from the work pool. That sounds like a solid plan.
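
For reference, here is a rough sketch of such a work pool built only from `runAsync`/`wait` plus standard concurrency primitives. It is untested and deliberately monomorphic (the `Task` type over `Vector Float` is just made up to keep the example small), and it assumes `wait :: Async a -> IO a` as above:

  import Control.Concurrent                   ( forkIO )
  import Control.Concurrent.Chan              ( Chan, newChan, readChan, writeChan )
  import Control.Concurrent.MVar              ( MVar, newEmptyMVar, putMVar )
  import Control.Monad                        ( forever, replicateM_ )
  import Data.Array.Accelerate                ( Acc, Vector )
  import qualified Data.Array.Accelerate.CUDA as CUDA

  -- a task pairs an Accelerate computation with a slot for its result
  type Task = (Acc (Vector Float), MVar (Vector Float))

  -- fork n workers; each repeatedly pops a task, launches it with runAsync,
  -- blocks on the result, and hands it back via the MVar. With n = 2 this
  -- mirrors the "two tasks per GPU" policy from the XKaapi paper.
  startPool :: Int -> IO (Chan Task)
  startPool n = do
    queue <- newChan
    replicateM_ n . forkIO . forever $ do
      (acc, slot) <- readChan queue
      result      <- CUDA.wait (CUDA.runAsync acc)   -- assumed: wait :: Async a -> IO a
      putMVar slot result
    return queue

  -- submit a task; the returned MVar will eventually hold the result
  submit :: Chan Task -> Acc (Vector Float) -> IO (MVar (Vector Float))
  submit queue acc = do
    slot <- newEmptyMVar
    writeChan queue (acc, slot)
    return slot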


Have you explored such a CPU-based taskpool for throttling accelerate computations to the GPU? What were your results, or what might you expect the results to look like? Is this kind of mechanism something that should be built around `runAsync` calls, or might taskpool dispatching instead be something you could conceive of as an optional addition to accelerate-cuda itself?

I have not experimented with this myself, but I think building the task pool out of `runAsync` should be fine. I can’t think of anything that would be substantially better by having it baked in, but do let us know if you think of / encounter something that should be done better. If you do find it useful, then we should definitely package it up somehow.

Using something like monad-par to describe the program in task-graph style might be nicer in some cases than using an explicit task pool. We could look into having better integration with monad-par (or similar) so that it uses `runAsync` internally. I’m not sure of the details of monad-par though, so I don’t know where this falls on the spectrum from ‘nothing to do’ to ‘almost impossible’.


Cheers,
-Trev

Ryan Newton

Jul 2, 2014, 11:13:53 AM7/2/14
to accelerat...@googlegroups.com
One thing I didn't catch from this: if you have multiple completely separate "run"s on different Haskell IO threads, is there anything stopping the data transfer of one overlapping the compute kernels of the other? (I.e. is there a global lock they're effectively competing for in the underlying accelerate-cuda runtime?)




Trevor L. McDonell

Jul 2, 2014, 6:43:25 PM7/2/14
to accelerat...@googlegroups.com
On 3 Jul 2014, at 1:13 am, Ryan Newton <rrne...@gmail.com> wrote:

One thing I didn't catch from this: if you have multiple completely separate "run"s on different Haskell IO threads, is there anything stopping the data transfer of one overlapping the compute kernels of the other? (I.e. is there a global lock they're effectively competing for in the underlying accelerate-cuda runtime?)

There is no global lock in the Accelerate-CUDA runtime, but the memory transfers are currently assigned to the default CUDA stream (0). I _think_ stream zero is restricted from overlapping with anything else, but I’m hazy on the details at the moment. All compute kernels happen in non-default streams.


-Trev

Ryan Newton

Jul 6, 2014, 8:20:54 PM7/6/14
to accelerat...@googlegroups.com
So it sounds like bumping memory transfers to stream "1" is the easy fix then ;-)?

Trevor L. McDonell

Jul 6, 2014, 9:21:09 PM7/6/14
to accelerat...@googlegroups.com
well, ye-es…

To make this actually useful, a couple of things need to be done:

(a) the front-end accelerate package needs to provide some hook that a backend can later use to override what allocator is used. We need to use the CUDA runtime to allocate pinned host memory. Without that, an asynchronous transfer is really done in two stages: (1) the CUDA runtime allocates new pinned host memory and copies the data to that (synchronous); then (2) the GPU does DMA (asynchronous). Actually, all transfers that don’t originate from pinned memory are done in this manner.

(b) there is a phase distinction here. The first compilation phase that does these asynchronous transfers needs to communicate the appropriate event information to the later execution phase. I’d just use AST tagging, I think.

(c) of course, always using stream 1 won’t work, but we should be able to reuse / extend the existing machinery that manages execution streams and asynchronous events.


This is all in ticket #53.

-T