A few more questions.
1. Why don't you use #pragma unroll?
In scanwarp(), instead of:
if (0 <= maxlevel) { sData[idx] = t = t + sData[idx - 1]; }
if (1 <= maxlevel) { sData[idx] = t = t + sData[idx - 2]; }
if (2 <= maxlevel) { sData[idx] = t = t + sData[idx - 4]; }
if (3 <= maxlevel) { sData[idx] = t = t + sData[idx - 8]; }
if (4 <= maxlevel) { sData[idx] = t = t + sData[idx - 16]; }
this could be shorter:
#pragma unroll
for (int i = 0; i <= maxlevel; ++i) sData[idx] = t = t + sData[idx - (1 << i)]; // i <= maxlevel covers levels 0..maxlevel
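To check that a loop can replace the chain of tests, here is a minimal host-side C++ sketch that emulates one warp in lockstep. It is not the CUDPP kernel: the names scanWarpChain, scanWarpLoop, pad, unpad and applyLevel are hypothetical, the warp width of 32 is an assumption, and the zero-filled padding in front of the data is an assumption made so that idx - offset never reads out of bounds. The chain applies levels 0 through maxlevel inclusive, so an equivalent loop needs the bound i <= maxlevel; with i < maxlevel the last level is skipped.

```cpp
#include <array>
#include <cstdint>

constexpr int WARP_SIZE = 32;  // assumed warp width
using Warp = std::array<std::uint32_t, WARP_SIZE>;
// Zero-filled front half, so idx - offset never reads out of bounds.
using Padded = std::array<std::uint32_t, 2 * WARP_SIZE>;

// Hypothetical helper: place one warp of values behind WARP_SIZE zeros.
inline Padded pad(const Warp& in)
{
    Padded s{};  // value-initialized to zero
    for (int lane = 0; lane < WARP_SIZE; ++lane) s[WARP_SIZE + lane] = in[lane];
    return s;
}

// Hypothetical helper: drop the zero padding again.
inline Warp unpad(const Padded& s)
{
    Warp out{};
    for (int lane = 0; lane < WARP_SIZE; ++lane) out[lane] = s[WARP_SIZE + lane];
    return out;
}

// One scan level for all lanes in lockstep: every lane reads the old
// values (the snapshot) before any lane writes its new value.
inline void applyLevel(Padded& s, int offset)
{
    const Padded old = s;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        const int idx = WARP_SIZE + lane;
        s[idx] = old[idx] + old[idx - offset];
    }
}

// Chain form: one compile-time test per level, as in the quoted code.
template <int maxlevel>
Warp scanWarpChain(const Warp& in)
{
    static_assert(maxlevel >= 0 && maxlevel <= 4, "levels 0..4 only");
    Padded s = pad(in);
    if (0 <= maxlevel) applyLevel(s, 1);
    if (1 <= maxlevel) applyLevel(s, 2);
    if (2 <= maxlevel) applyLevel(s, 4);
    if (3 <= maxlevel) applyLevel(s, 8);
    if (4 <= maxlevel) applyLevel(s, 16);
    return unpad(s);
}

// Loop form: the bound is i <= maxlevel, so the same levels run.
template <int maxlevel>
Warp scanWarpLoop(const Warp& in)
{
    static_assert(maxlevel >= 0 && maxlevel <= 4, "levels 0..4 only");
    Padded s = pad(in);
    for (int i = 0; i <= maxlevel; ++i) applyLevel(s, 1 << i);
    return unpad(s);
}
```

For an input of all ones, scanWarpLoop<4> gives lane 31 the value 32, the same as the chain. Whether the device compiler fully unrolls the loop in the real kernel is a separate question that this host sketch does not answer.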
2. I also made ctasize a template parameter: radixSortBlock<ctasize, nbits, startbit ... and rank4<ctasize>
ctasize = 1 ... 1024 (array size 4 - 4096 Bytes)
3. I also get the total sum of the array directly in scan4:
static __device__ uint4 scan4(uint4 idata, uint &wholesum) {
...
wholesum = ptr[31 + WARP_SIZE]; // my addition
__syncthreads ();
val = ptr [idx >> 5];
...
}
static __device__ uint4 scan4(uint4 idata)
{
uint wholesum; // unused here; the optimizer will remove it
return scan4(idata, wholesum);
}
static __device__ uint4 rank4(uint4 preds)
{
uint numtrue;
uint4 address = scan4 (preds, numtrue);
__syncthreads ();
...
}
All of this is a bit faster and more convenient.
Or is there any reason not to do it this way?
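The pattern in point 3 (a scan that also hands back the total through a reference, plus an overload that discards it) can be sketched on the host as follows. This is only an illustration under assumptions: exclusiveScan and splitRanks are hypothetical names, the scan is assumed to be exclusive, and the rank formula is the usual split step (true elements keep their scanned address, false elements go after all the true ones). The elided bodies of scan4 and rank4 above may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum that also reports the total via an out-parameter.
inline std::vector<std::uint32_t> exclusiveScan(const std::vector<std::uint32_t>& in,
                                                std::uint32_t& wholeSum)
{
    std::vector<std::uint32_t> out(in.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before element i
        running += in[i];
    }
    wholeSum = running;     // total of the whole array
    return out;
}

// Overload for callers that do not need the total.
inline std::vector<std::uint32_t> exclusiveScan(const std::vector<std::uint32_t>& in)
{
    std::uint32_t unused = 0;
    return exclusiveScan(in, unused);
}

// Assumed split step: elements with a true predicate are packed first,
// in order; elements with a false predicate follow, also in order.
inline std::vector<std::uint32_t> splitRanks(const std::vector<std::uint32_t>& preds)
{
    std::uint32_t numTrue = 0;
    const std::vector<std::uint32_t> address = exclusiveScan(preds, numTrue);
    std::vector<std::uint32_t> rank(preds.size());
    for (std::size_t i = 0; i < preds.size(); ++i) {
        const auto idx = static_cast<std::uint32_t>(i);
        rank[i] = preds[i] ? address[i] : numTrue + idx - address[i];
    }
    return rank;
}
```

The caller gets the count of true predicates from the same call that produces the addresses, so no second read of the scan buffer is needed.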
On Thursday, October 4, 2012, 7:13:21 UTC+4, John Owens wrote: