Thanks!
First off -- the enactor can be reused between reductions. (It's
intended to be reused, actually, since it maintains temporary scratch
storage which, for small problems, is relatively expensive to allocate
and free every time.)
To answer your question: the reduction call is asynchronous (unless
you have DEBUG set), i.e., the host thread may have returned before
the GPU computation is done. However, asynchrony alone won't give you
what you want. The GPU reduction is issued into the current CUDA
stream, and all operations (e.g., kernel invocations) within a stream
execute sequentially. Thus your independent reductions will not run
concurrently because they sit in the same stream, regardless of when
the host thread returns.
There are a couple of alternatives. You could set up and tear down
multiple streams (see the Programming Guide). Eh.
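For what it's worth, the multi-stream route might look roughly like the
sketch below. It is only a sketch under assumptions: block_sum is a naive
single-block kernel standing in for the real reduction, sum_in_streams is
a made-up helper name, and error checking is left out. Whether the
kernels actually overlap depends on the hardware and on how many
resources each launch consumes.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive single-block sum; a hypothetical stand-in for the real reduction.
// Must be launched with exactly 256 threads in one block.
__global__ void block_sum(const int* in, int n, int* out)
{
    __shared__ int partial[256];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input.
    int acc = 0;
    for (int i = tid; i < n; i += blockDim.x) acc += in[i];
    partial[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *out = partial[0];
}

// Issue one independent reduction per stream (hypothetical helper).
std::vector<int> sum_in_streams(const std::vector<std::vector<int>>& inputs)
{
    size_t k = inputs.size();
    std::vector<cudaStream_t> streams(k);
    std::vector<int*> d_in(k);
    int* d_out = nullptr;
    cudaMalloc(&d_out, k * sizeof(int));

    for (size_t i = 0; i < k; ++i) {
        size_t bytes = inputs[i].size() * sizeof(int);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i], bytes);
        // Work issued into different streams is allowed to overlap.
        cudaMemcpyAsync(d_in[i], inputs[i].data(), bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        block_sum<<<1, 256, 0, streams[i]>>>(
            d_in[i], static_cast<int>(inputs[i].size()), d_out + i);
    }

    // Wait for every stream, then tear everything down.
    for (size_t i = 0; i < k; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(d_in[i]);
    }

    std::vector<int> result(k);
    cudaMemcpy(result.data(), d_out, k * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

Note the bookkeeping: one stream, one input buffer, and one launch per
reduction, which is a lot of overhead when the individual problems are
small.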
Alternatively, it should actually be more efficient to do one giant
segmented reduction, i.e., reduce-by-key. Basically you just set up a
"keys" vector that looks like
"00000000000111111222222222233333333333333333......" to demarcate your
segments and then call either thrust's reduce-by-key or my consecutive
reduction enactor. This would throw the entire GPU at your problem in
one fell swoop.
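With Thrust, the reduce-by-key version could be sketched as below. This
is a minimal sketch, assuming int data and a plain sum; segmented_sum is
a made-up helper name, not part of Thrust. thrust::reduce_by_key treats
each run of equal consecutive keys as one segment, so the keys only need
to change at segment boundaries.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector>

// Sum every segment in one call (hypothetical helper, not a Thrust API).
// keys[i] labels the segment that values[i] belongs to; both vectors
// must have the same length.
std::vector<int> segmented_sum(const std::vector<int>& keys,
                               const std::vector<int>& values)
{
    thrust::device_vector<int> d_keys(keys.begin(), keys.end());
    thrust::device_vector<int> d_vals(values.begin(), values.end());

    // Worst case is one segment per element, so size outputs accordingly.
    thrust::device_vector<int> out_keys(keys.size());
    thrust::device_vector<int> out_sums(keys.size());

    // Returns a pair of iterators marking the ends of the two outputs.
    auto ends = thrust::reduce_by_key(d_keys.begin(), d_keys.end(),
                                      d_vals.begin(),
                                      out_keys.begin(), out_sums.begin());

    size_t num_segments = ends.second - out_sums.begin();

    // Copy only the sums that were actually produced back to the host.
    std::vector<int> result(num_segments);
    thrust::copy(out_sums.begin(), out_sums.begin() + num_segments,
                 result.begin());
    return result;
}
```

For example, keys 0,0,0,1,1,2 with values 1,2,3,4,5,6 describe three
segments, and the call returns their sums 6, 9, and 6.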
Hope that helps some,
Duane