Reusing kernels, does this apply to the LLVM backend?

May 14, 2017, 6:23:29 PM
to Accelerate

The book Parallel and Concurrent Programming in Haskell[1] discusses how to write Accelerate code that generates reusable CUDA kernels:

When the program runs, the Accelerate library evaluates the expression passed to run to make a series of CUDA fragments (called kernels). Each kernel takes some arrays as inputs and produces arrays as outputs. In our example, each call to step will produce a kernel, and when we compose a sequence of step calls together, we get a series of kernels. Each kernel is a piece of CUDA code that has to be compiled and loaded onto the GPU; this can take a while, so Accelerate remembers the kernels it has seen before and tries to reuse them.

Our goal with step is to make a kernel that will be reused. If we don’t reuse the same kernel for each step, the overhead of compiling new kernels will ruin the performance.
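For concreteness, the pattern the book describes looks roughly like this (a toy sketch, not the book's actual N-body example; the function name step and its body are made up for illustration):

```haskell
import Data.Array.Accelerate as A

-- Toy stand-in for the book's 'step': one array in, one array out,
-- with a shape that does not change between calls.
step :: Acc (Vector Float) -> Acc (Vector Float)
step = A.map (\x -> x * 0.99 + 1)

-- Iterating 'step' builds a program out of repeated uses of the same
-- expression, so the backend can compile its kernel(s) once and then
-- reuse them for every iteration instead of recompiling each time.
iterateSteps :: Int -> Acc (Vector Float) -> Acc (Vector Float)
iterateSteps n = foldr (.) id (replicate n step)
```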

My first question is: does this also apply to the LLVM.Native backend?

My second question concerns an example. I have written the following code (apologies for the complex signature and my newbie coding in general):

gradientDescent :: forall e is os . (Prelude.Floating e, A.Floating e, Lift Exp e, e ~ Plain e) => e -> Sing is -> Sing os -> SomeNeuralNetwork e is os -> ([PList ('(1, os) ': '[]) (ValueAndDerivative e)] -> Acc (Scalar (ValueAndDerivative e))) -> [Acc (Vector e)] -> Acc (Vector e) -> [Acc (Vector e)]
gradientDescent eta sis sos nn f i p = let
    g  = gradient sis sos nn f i p
    p' = zipWith (updateParam (the $ unit $ constant $ eta)) p g
  in p' : gradientDescent eta sis sos nn f i p'
  where
    updateParam :: Exp e -> Exp e -> Exp e -> Exp e
    updateParam eta p g = p - eta * g

The type signature for gradient is:

gradient :: forall e is os . (Prelude.Floating e, A.Floating e, Lift Exp e, e ~ Plain e) => Sing is -> Sing os -> SomeNeuralNetwork e is os -> ([PList ('(1, os) ': '[]) (ValueAndDerivative e)] -> Acc (Scalar (ValueAndDerivative e))) -> [Acc (Vector e)] -> Acc (Vector e) -> Acc (Vector e)

So, my intention is to produce an ever-growing sequence of Accelerate programs that compute repeated iterations of the gradient-descent algorithm. My question is: how can I make sure that the gradient code is reused, say, 10 times for 10 iterations, rather than one gigantic Accelerate program being generated for all 10 iterations? In particular, should this reuse of program fragments be reflected in the Graphviz file produced with -ddump-(simpl-)dot? Right now it is certainly one gigantic graph.

Thank you, best regards, Panos


Trevor McDonell

May 14, 2017, 8:45:42 PM

Hi Panos,

The general idea applies, but the actual caching was not part of the 1.0 release of the LLVM backends. That work is on this branch and is almost complete, so expect it soon. You can get a similar result, though, if you express your program in terms of run1, just not across separate executions of your program. The -ddump-phases debug flag will tell you how much time you are spending in compilation.
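To illustrate what this looks like in practice (a minimal sketch, assuming the accelerate-llvm-native package; the step function here is just a placeholder, not anyone's real code):

```haskell
import Data.Array.Accelerate as A
import Data.Array.Accelerate.LLVM.Native as CPU

step :: Acc (Vector Float) -> Acc (Vector Float)
step = A.map (+ 1)

-- 'run1' compiles 'step' once, up front; the resulting plain function
-- can then be applied to many input arrays without recompiling.
compiledStep :: Vector Float -> Vector Float
compiledStep = CPU.run1 step

-- Driving the compiled function in a loop: only one compilation happens,
-- no matter how many iterations we take.
loop :: Int -> Vector Float -> Vector Float
loop n xs = iterate compiledStep xs !! n
```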

The Graphviz output is actually a good place to look: each box in the graph corresponds to a kernel that will be compiled and executed (modulo a few operations, such as reshape and #i aN for some integers i and N, which don't execute anything and run in constant time), so those are the operations that caching will cover.

Hope that helps.




May 18, 2017, 5:55:33 AM
to Accelerate
Hi Trev,

Thank you very much for your reply. I found out about run1 a few hours after posting, so this was poor research on my part; sorry!

Best regards, Panos