Does the accelerate-cuda backend support concurrent kernel execution and memory transfers? I was interested to read that the XKaapi [1] framework does.
"Recent GPUs, such as NVIDIA's Fermi and Kepler, support new features for asynchronisation. For instance, Fermi GPUS have one execution engine and two copy engines, enabling it to concurrently perform a kernel execution and memory-transfers (two-way host-to-device and device-to-host), under the condition that no explicit nor implicit synchronisation occurs."Moreover..."Once a task implementation has launched computation on a GPU, the [XKaapi] scheduler starts the execution of the next selected tak by sending its input data in advance. This enables it to overlap data transfers with kernel executions .. We empirically found that the best performance gain is obtained when having two tasks being processed per GPU".How does this behaviour compare with `runAsync` in accelerate-cuda? Is the use of this library able to specify whether or not they intend for kernel execution and memory-transfer to take place concurrently? Does the library allow the user to specify how many tasks should be processed per GPU when the hardware supports it e.g. Fermi and Kepler? Or does the CUDA backend not currently support asynchronuous concurrency?
[1] XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures. Thierry Gautier. IEEE IPDPS, 2013.
2) Memory transfers aren’t overlapped, because we transfer all data before beginning the computation. This is done asynchronously with other CPU-side tasks, but by the time the computation phase begins, all the data is already on the device. I think this is in contrast to XKaapi, which looks like it streams data in and out during computation (based on your quote below). So for us, this is not so critical.
I expect that you could use multiple `runAsync` calls to increase the amount of concurrent kernel execution.
Is it possible to bypass this constraint? That is, is there an API call I could use to transfer memory to the GPU in advance of the run* family of primitives? For example, I could imagine an image-processing program that starts with a number of readFile calls on image files. Rather than waiting for accelerate's `run` call to move the data, is there a staging function that transfers the data ahead of any kernels executing on it? In this case, that would mean transferring all images across to the GPU as soon as they are read from file by the CPU, avoiding the memory-transfer latency when `run` is eventually called. Something analogous to `par :: a -> b -> b`, which hints to the RTS that it may spark `a`: could there be support for a similar primitive, say `stage :: Arrays a => a -> b -> b` (sketched below), or something along those lines?
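To make the intent concrete, here is a small sketch of how such a hint might be used. To be clear, `stage` is not an existing accelerate primitive and `loadImage` is a stand-in; `stage` is defined as a no-op placeholder purely so the example compiles, and a real version would live inside the CUDA backend and eagerly start the host-to-device copy:

```haskell
module StageSketch where

import Data.Array.Accelerate           (Array, Arrays, DIM2, Z(..), (:.)(..))
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA      (run)

-- Hypothetical staging hint, mirroring `par :: a -> b -> b` from
-- Control.Parallel. NOT part of accelerate; a no-op placeholder here.
-- A real version would begin copying the arrays of `a` to the device
-- before returning `b`.
stage :: Arrays a => a -> b -> b
stage _ b = b

-- Stand-in image loader; a real program would decode the file it reads
-- from disk into an accelerate array.
loadImage :: FilePath -> IO (Array DIM2 Float)
loadImage _ = return (A.fromList (Z :. 4 :. 4) [0 ..])

main :: IO ()
main = do
  img <- loadImage "example.png"
  -- par-style usage: hint that `img` be copied to the device now, while
  -- the CPU carries on; ideally the data is already resident by the time
  -- `run` fires.
  print (stage img (run (A.map (* 2) (A.use img))))
```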
OK. So the authors of the XKaapi paper found that executing two kernels concurrently on a Fermi or a Kepler resulted in the shortest runtimes. As you say Trev, this could be mirrored using `runAsync`, e.g. by implementing a simple double-ended task pool of `Acc a` computations on the CPU, which would throttle concurrent kernel executions by forking two threads that: 1) pop the leftmost task, 2) call `runAsync` on the `Acc a` computation, 3) call a blocking `wait` on the resulting `Async a`, and 4) go back to step (1). In the meantime, other threads would push `Acc a` tasks onto the right of the task pool, rather than calling `runAsync` directly. Increasing the number of threads that consume from the left increases the number of concurrent kernel executions. A sketch of this scheme follows below.
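Here is a minimal sketch of that scheme. It assumes `runAsync :: Arrays a => Acc a -> Async a` and `wait :: Async a -> IO a` from accelerate-cuda work as discussed in this thread (if the actual signatures differ, only the two lines in `worker` change), and it uses a plain STM queue as the pool: producers push on the right via `submit`, workers pop from the left.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
module TaskPool where

import Control.Concurrent         (forkIO)
import Control.Concurrent.STM     (TQueue, atomically, newTQueueIO, readTQueue, writeTQueue)
import Control.Monad              (forever, replicateM_)
import Data.Array.Accelerate      (Acc, Arrays)
import Data.Array.Accelerate.CUDA (runAsync, wait)

-- A task pairs an array computation with a continuation for its result,
-- hiding the result type so different computations can share one pool.
data Task = forall a. Arrays a => Task (Acc a) (a -> IO ())

type Pool = TQueue Task

newPool :: IO Pool
newPool = newTQueueIO

-- Producers push tasks onto the right of the pool instead of calling
-- runAsync themselves.
submit :: Arrays a => Pool -> Acc a -> (a -> IO ()) -> IO ()
submit pool acc done = atomically (writeTQueue pool (Task acc done))

-- Each worker pops the leftmost task, launches it with runAsync, blocks on
-- wait, hands the result to the continuation, and goes back for more. The
-- number of workers therefore bounds the number of in-flight computations.
worker :: Pool -> IO ()
worker pool = forever $ do
  Task acc done <- atomically (readTQueue pool)
  result        <- wait (runAsync acc)
  done result

-- Two workers matches the two-tasks-per-GPU sweet spot reported by the
-- XKaapi authors; raise the count to allow more concurrent launches.
startWorkers :: Int -> Pool -> IO ()
startWorkers n pool = replicateM_ n (forkIO (worker pool))
```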
Have you explored such a CPU-based task pool for throttling accelerate computations to the GPU? What were your results, or what might you expect the results to look like? Is this kind of mechanism something that should be built around `runAsync` calls, or might task-pool dispatching instead be something you could conceive of as an optional addition to accelerate-cuda itself?
One thing I didn't catch from this: if you have multiple completely separate `run`s on different Haskell IO threads, is there anything stopping the data transfer of one overlapping the compute kernels of the other? (I.e., is there a global lock they're effectively competing for in the underlying accelerate-cuda runtime?) The scenario I have in mind is sketched below.
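To make the scenario concrete, here is a small sketch; the array sizes and computations are arbitrary, just enough to give two independent workloads launched from separate threads through separate top-level `run`s:

```haskell
module TwoRuns where

import Control.Concurrent         (forkIO)
import Control.Concurrent.MVar    (newEmptyMVar, putMVar, takeMVar)
import Data.Array.Accelerate      (Vector, Z(..), (:.)(..))
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA (run)

-- Two unrelated computations over two unrelated inputs, each launched from
-- its own Haskell thread through its own `run`. The question is whether the
-- host-to-device copy for one can overlap the kernels of the other, or
-- whether they serialise on a lock inside the backend.
main :: IO ()
main = do
  let xs = A.fromList (Z :. 1000000) [0 ..] :: Vector Float
      ys = A.fromList (Z :. 1000000) [1 ..] :: Vector Float
  d1 <- newEmptyMVar
  d2 <- newEmptyMVar
  _  <- forkIO $ putMVar d1 $! run (A.fold (+) 0 (A.use xs))
  _  <- forkIO $ putMVar d2 $! run (A.fold (+) 0 (A.map (* 2) (A.use ys)))
  takeMVar d1 >>= print
  takeMVar d2 >>= print
```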