Hello all,
I know that you can pass a custom functor when calling CUB's device-wide reduce from a host function, but I am trying to compute a sum from within a CUDA kernel and have not been able to do something similar. Can anyone tell me whether this is possible and, if so, how? The following is a code snippet from a CUDA kernel in which I would like to replace all float-based calls with cuFloatComplex-based calls, if possible:
__global__ void myKernel(...){
// would like a single cuFloatComplex 'mag' variable rather than float 'mag_x' and 'mag_y'
cuFloatComplex mag;
float mag_x = 0.0f;
float mag_y = 0.0f;
...
typedef cub::BlockReduce<float, 256> BlockReduceT; // would like to make this 'cuFloatComplex' rather than 'float'
__shared__ typename BlockReduceT::TempStorage temp_storage;
...
// The following 'Sum' is where I would like to sum a single cuFloatComplex variable 'mag'
float aggregatex = BlockReduceT(temp_storage).Sum(mag_x);
__syncthreads(); // barrier needed before temp_storage is reused by the second reduction
float aggregatey = BlockReduceT(temp_storage).Sum(mag_y);
...
}
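One direction that might work, shown here only as a minimal sketch: cub::BlockReduce also has a Reduce() member that takes a binary functor, so the block-wide sum could be expressed for cuFloatComplex with a small add functor built on cuCaddf from cuComplex.h. The names ComplexSum, complexBlockSum, runSingleBlockSum, and kBlockThreads are hypothetical and not part of my real kernel, and the single-block launch with a constant input exists only to make the sketch self-contained.

```cuda
#include <cuComplex.h>
#include <cub/cub.cuh>

// Hypothetical functor: component-wise addition of two cuFloatComplex values.
struct ComplexSum {
    __host__ __device__ __forceinline__
    cuFloatComplex operator()(const cuFloatComplex &a, const cuFloatComplex &b) const {
        return cuCaddf(a, b);
    }
};

constexpr int kBlockThreads = 256;  // must match the launch block size

// Each block reduces kBlockThreads complex values to one complex sum.
__global__ void complexBlockSum(const cuFloatComplex *in, cuFloatComplex *out) {
    typedef cub::BlockReduce<cuFloatComplex, kBlockThreads> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    cuFloatComplex mag = in[blockIdx.x * kBlockThreads + threadIdx.x];

    // Reduce() accepts a custom binary operator; the result is only valid in thread 0.
    cuFloatComplex aggregate = BlockReduceT(temp_storage).Reduce(mag, ComplexSum());

    if (threadIdx.x == 0) {
        out[blockIdx.x] = aggregate;
    }
}

// Host helper (assumption for illustration): sums 256 copies of (1, 2) in one block.
// Returns (-1, -1) if any CUDA call fails.
cuFloatComplex runSingleBlockSum() {
    const cuFloatComplex failure = make_cuFloatComplex(-1.0f, -1.0f);
    cuFloatComplex h_in[kBlockThreads];
    for (int i = 0; i < kBlockThreads; ++i) {
        h_in[i] = make_cuFloatComplex(1.0f, 2.0f);
    }

    cuFloatComplex *d_in = nullptr;
    cuFloatComplex *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return failure;
    if (cudaMalloc(&d_out, sizeof(cuFloatComplex)) != cudaSuccess) {
        cudaFree(d_in);
        return failure;
    }

    cuFloatComplex h_out = failure;
    bool ok = cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        complexBlockSum<<<1, kBlockThreads>>>(d_in, d_out);
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? h_out : failure;
}
```

With this input, calling runSingleBlockSum() from main should give a real part of 256 and an imaginary part of 512. As far as I can tell, Sum() cannot be used directly because cuFloatComplex (a float2) has no built-in operator+, which is why the sketch goes through Reduce() with an explicit functor.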
Thank you for any help or hints.