Is reduction::Enactor Reduce asynchronous?


dev

Dec 30, 2011, 11:37:38 AM
to B40C
Hello,

I have this small code snippet.

for (int i = 0; i < rigid_body_count; i++)
{
    Sum reduction_op;
    reduction::Enactor reduction_enactor;
    reduction_enactor.ENACTOR_DEBUG = false;
    reduction_enactor.template Reduce<reduction::SMALL_SIZE>(
        total_force + i, rigid_force + rigidBodyStartIndices[i],
        rigidBodyLengths[i], reduction_op, 0);
    reduction_enactor.template Reduce<reduction::SMALL_SIZE>(
        total_torque + i, rigid_torque + rigidBodyStartIndices[i],
        rigidBodyLengths[i], reduction_op, 0);
}

Since every loop iteration operates on non-overlapping data and each
segment is short (< 300 elements), I want to make sure that the kernel
launches are asynchronous.
Is that possible?

Thanks in advance. I really appreciate this library, as it is much
faster than thrust.

Duane

Jan 5, 2012, 4:09:44 PM
to B40C
Thanks!
First off -- the enactor can be reused between reductions.  (It's
intended to be reused, actually, as it must maintain some temporary
scratch which, for small problems, is relatively expensive to allocate
and free all of the time).
To answer your question: the reduction call is asynchronous (unless
you have DEBUG set), i.e., the host thread may have returned before
the GPU computation is done.  However, asynchronicity won't give you
what you want.  The GPU reduction will occur in the current CUDA
stream, and all events (e.g., kernel invocations) in that stream are
sequentially executed.  Thus your independent reductions are not
concurrent because they exist in the same stream, regardless of when
the host thread returns.
There are a couple of alternatives.  You could set up and tear down
multiple streams (see the Programming Guide). Eh.
Alternatively, it should actually be more efficient to do one giant
segmented-reduction, i.e., reduce-by-key.  Basically you just set up a
"keys" vector that looks like
"00000000000111111222222222233333333333333333......" to demarcate your
segments and then call either thrust's reduce-by-key or my consecutive
reduction enactor.  This would throw the entire GPU at your problem in
one fell swoop.
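
As a rough illustration of that "keys" layout, here is a minimal host-only C++ sketch. It is not B40C or thrust code: the helper names build_segment_keys and reduce_by_key_reference are hypothetical, and it assumes the segments are stored back to back in the same order as their lengths. The second function is only a sequential reference for what a GPU reduce-by-key with addition would be expected to produce.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: emit one key per element, so that every element of
// segment i carries key i (e.g. lengths {3, 2, 4} -> 0 0 0 1 1 2 2 2 2).
// Assumes segments are contiguous and ordered like `lengths`.
std::vector<int> build_segment_keys(const std::vector<int>& lengths) {
    std::size_t total = 0;
    for (int len : lengths) total += static_cast<std::size_t>(len);

    std::vector<int> keys;
    keys.reserve(total);
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        // Append lengths[i] copies of the segment index i.
        keys.insert(keys.end(),
                    static_cast<std::size_t>(lengths[i]),
                    static_cast<int>(i));
    }
    return keys;
}

// Hypothetical sequential reference for reduce-by-key with addition:
// each run of consecutive equal keys is summed into one output value.
// Note that a zero-length segment contributes no keys and so no output.
std::vector<float> reduce_by_key_reference(const std::vector<int>& keys,
                                           const std::vector<float>& values) {
    std::vector<float> sums;
    for (std::size_t j = 0; j < keys.size(); ++j) {
        if (j == 0 || keys[j] != keys[j - 1]) {
            sums.push_back(0.0f);  // a new segment starts here
        }
        sums.back() += values[j];
    }
    return sums;
}
```

With lengths {3, 2, 4} and values {1, 2, 3, 10, 20, 0.5, 0.5, 0.5, 0.5}, the keys come out as 0 0 0 1 1 2 2 2 2 and the per-segment sums as 6, 30 and 2. On the GPU, the keys and values arrays would be handed to a single reduce-by-key call instead of looping over the segments.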
Hope that helps some,
Duane

dev

Jan 7, 2012, 4:55:30 AM
to B40C
Thank you very much for your reply, it really helps.

But that makes me worry about whether I am getting the result right. This is what I do:

reduction_enactor.template Reduce(*d_problem_storage,
    totalRigidParticles, p_num_compacted, gpu_reduce_num_compacted,
    reduction_op, equality_op, 0);
// Set the pointer to the result
mRigidBodyBuffers->Get(BufferTotalForce)
    ->SetPtr<float_vec>(d_problem_storage->d_values[1]);

Since the Reduce call is asynchronous, I am worried about whether I am
getting it right.
I will post a new question about another issue with Reduce.

Thanks.