When the program runs, the Accelerate library evaluates the expression passed to run, producing a series of CUDA fragments (called kernels). Each kernel takes some arrays as inputs and produces arrays as outputs. In our example, each call to step will produce a kernel, and when we compose a sequence of step calls together, we get a series of kernels. Each kernel is a piece of CUDA code that has to be compiled and loaded onto the GPU; this can take a while, so Accelerate remembers the kernels it has seen before and tries to reuse them.
Our goal with step is to make a kernel that will be reused. If we don't reuse the same kernel for each step, the overhead of compiling new kernels will ruin the performance.
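The composition pattern can be sketched in plain Haskell. This is not the poster's actual code: the list-based step below is a hypothetical stand-in for a real step :: Acc (Vector Float) -> Acc (Vector Float), chosen so the example runs without the accelerate package, but the shape of the composition is the same.

```haskell
-- Plain-Haskell sketch: 'step' stands in for a real
-- 'step :: Acc (Vector Float) -> Acc (Vector Float)'.
-- Composing the *same* function n times means the program contains
-- n applications of one kernel, which Accelerate can cache and reuse.
step :: [Float] -> [Float]
step = map (* 0.5)

-- n applications of 'step', built by composing it with itself.
steps :: Int -> [Float] -> [Float]
steps n = foldr (.) id (replicate n step)

main :: IO ()
main = print (steps 3 [8, 4])   -- prints [1.0,0.5]
```

The key point is that every box in the composed program is an instance of the one kernel generated for step, not n distinct kernels.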
The general idea applies, but kernel caching was not part of the 1.0 release of the LLVM backends. That work is on this branch and is almost complete, so expect it soon. You can get a similar result by expressing your program in terms of run1, though that reuse does not persist across separate executions of your program. The -ddump-phases debug flag will tell you how much time you are spending in compilation.
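To see why run1 helps, here is a hedged plain-Haskell model (no GPU, no accelerate dependency): the compile function below is a hypothetical stand-in for CUDA kernel compilation, with an IORef counter recording how often it fires. Compiling once and reusing the result is the run1 pattern; compiling inside the loop is what a fresh run per step amounts to without caching.

```haskell
import Data.IORef

-- Hypothetical model: each call to 'compile' bumps a counter,
-- standing in for the expensive CUDA compile-and-load step.
-- Returns (results agree, total number of compilations).
demo :: IO (Bool, Int)
demo = do
  compiles <- newIORef (0 :: Int)
  let compile f = modifyIORef' compiles (+ 1) >> return f

  -- run-per-step style: compile on every iteration
  slow <- mapM (\x -> do f <- compile (* 2); return (f x)) [1 .. 5 :: Int]

  -- run1 style: compile once up front, reuse the compiled function
  g <- compile (* 2)
  let fast = map g [1 .. 5 :: Int]

  n <- readIORef compiles
  return (slow == fast, n)

main :: IO ()
main = demo >>= print   -- prints (True,6): same results, 5 + 1 compiles
```

The results are identical either way; only the number of (simulated) compilations differs, which is exactly the overhead -ddump-phases would show up as compilation time.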
The graphviz output is actually a good place to look: each box on the graph corresponds to a kernel that will be compiled and executed (modulo a few operations, such as aN for some integer N, which don't execute anything and are constant time), so those are the operations which caching is going to cover.
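For reference, the debug facilities are only available when accelerate is built with its debug cabal flag, and the exact flag-passing convention has varied between accelerate versions, so the invocation below is only a sketch; my-simulation is a hypothetical program name.

```shell
# Sketch only: assumes accelerate was built with its 'debug' flag enabled.
./my-simulation -ddump-phases   # report time spent in each compilation phase
./my-simulation -ddump-dot      # dump the program graph in graphviz form
```

The graph from the second command is the one described above: one box per kernel, so you can count how many distinct kernels caching needs to cover.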
Hope that helps.
You received this message because you are subscribed to the Google Groups "Accelerate" group.
Visit this group at https://groups.google.com/group/accelerate-haskell.
I discovered run1 a few hours after posting, so this was poor research on my part, sorry!