Re: [CUDPP][935] Can I sort unsigned int elements in shared memory with cudpp and where can I find examples for this?

106 views
Skip to first unread message

John Owens

unread,
Sep 23, 2012, 3:59:10 PM9/23/12
to cu...@googlegroups.com
On Sunday, September 23, 2012 at 12:18 PM, exploys wrote:
> Hi. Can I sort unsigned int elements in shared memory with cudpp and where can I find examples for this?
> I can't find.


Probably radixSortBlocks in radixsort_kernel.cuh would be a good place to start. We don't expose this in the API, though (and we won't do so).

JDO

exploys

unread,
Sep 26, 2012, 2:13:14 PM9/26/12
to cu...@googlegroups.com
Thank you. And can you tell me where to find or in which library has API for shared memory to execution IntraBlock of basic algorithms: stable_sort_by_key, inclusive_scan, histogram ...?
Like Thrust but for shared memory.

exploys

unread,
Sep 26, 2012, 7:08:28 PM9/26/12
to cu...@googlegroups.com
By the way, it will come up for my task?
template<uint nbits, uint startbit>

__device__ void radixSortBlock(uint4 & key,
uint4 & value 

http://cudpp.googlecode.com/svn/tags/2.0/doc/html/group__cudpp__cta.html#gad6197a166da7197ec317c4373a8ff232 


On Sunday, September 23, 2012 11:59:12 PM UTC+4, John Owens wrote:

John Owens

unread,
Sep 26, 2012, 11:07:16 PM9/26/12
to cu...@googlegroups.com
I don't understand what you're asking here.

JDO

John Owens

unread,
Sep 27, 2012, 12:29:41 AM9/27/12
to cu...@googlegroups.com
Those sound like Thrust algorithms. You might dig through their sources to see if they have single-block components of their algorithms.

JDO
> --
> You received this message because you are subscribed to the Google Groups "CUDPP" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/cudpp/-/ZaWk6K15wqoJ.
> To post to this group, send email to cu...@googlegroups.com (mailto:cu...@googlegroups.com).
> To unsubscribe from this group, send email to cudpp+un...@googlegroups.com (mailto:cudpp+un...@googlegroups.com).
> For more options, visit this group at http://groups.google.com/group/cudpp?hl=en.



exploys

unread,
Oct 1, 2012, 7:56:47 PM10/1/12
to cu...@googlegroups.com
Thanks. I could use the radixSortBlock <nbits, startbit> (key, value);. But only for the size of 1024 bytes = (SORT_CTA_SIZE * 4). (SORT_CTA_SIZE = 256)
If I change the value of CTA_SIZE on more than 256, it is not working. For smaller values it works.
Show you how to use it for 4096 bytes in a single block?

I replaced
uint4 r = rank4 <256> (lsb);
on
uint4 r = rank4 <SORT_CTA_SIZE> (lsb);

Kernel I run so: MyKernel <<< 1, SORT_CTA_SIZE, 4 * SORT_CTA_SIZE * sizeof (uint) >>> (...);


On Sunday, September 23, 2012 11:59:12 PM UTC+4, John Owens wrote:

John Owens

unread,
Oct 2, 2012, 12:00:06 AM10/2/12
to cu...@googlegroups.com
You'd also have to increase the CTA size; the current SORT_CTA_SIZE is 256 (defined in cudpp_globals). But you can't multiply it by 4; that would lead to a CTA size of 1024 and that's not allowed for any devices before 2.0 compute capability. If I had to do this, I supposed I'd think about launching 512 threads per CTA and having 8 items per thread rather than 4 (which requires rank8 not rank4, etc.).

JDO

exploys

unread,
Oct 2, 2012, 10:00:38 AM10/2/12
to cu...@googlegroups.com
Okay, I'll try to use rank8 and 512 threads to sort 4096 elements.

By the way, I just use CC 2.0, but I still can not use either 512 or 1024 threads.
And what does means line in cudpp_globals.h "const int SORT_CTA_SIZE = 256; Number of threads per CTA for radix sort. Must equal 16 * number of radices.", Because I sort of uint (32 bits), (nbits = 32, startbit = 0), then radices = 32, and so should be SORT_CTA_SIZE = 16 * 32 = 512?
Но в cudpp_globals.h SORT_CTA_SIZE = 256.

John Owens

unread,
Oct 2, 2012, 10:38:46 AM10/2/12
to cu...@googlegroups.com
Right, so that comment says that the radix is 16, so if you go to 512 threads, you'll need to increase the radix to 32 (if you want to do 8 elements per thread, 4096 elements total).

If I were you, I'd strongly consider just writing my own, simple, fairly straightforward bitonic sort. Trying to take highly optimized code and change it to handle 4x as many elements is really tricky; there's a lot of interdependent pieces in the code, and I would certainly bet that writing a bitonic sort would be a lot quicker than changing this code. It does not seem likely to me that the performance of a block sorter would be critical.

http://www.nvidia.com/content/cudazone/cuda_sdk/Data-Parallel_Algorithms.html
Link #3 has a bitonic sort example.

JDO

exploys

unread,
Oct 2, 2012, 12:00:34 PM10/2/12
to cu...@googlegroups.com
I compared the stable sort for 4096 entries of 16-bit keys and values ​​to the GTX 460SE 288 Cores (6x48).
bitonic_sort: 211 MB / sec
RadixSort: 336 MB / sec

Unfortunately by your link is not downloaded bitonic_sort (bitonic.zip).
But I tried to use bitonic_sort from NVIDIA GPU Computing SDK Browser -> CUDA C Samples -> CUDA Sorting Networks.
But this implementation a non-stable sort, and i had to must the first 16 bits (32 bits uint) used to number the keys and next 16 bits for keys. I need to stable sort 16-bit keys and values.


And that I changed into RadixSort to it works for 4096 elements:
const int SORT_CTA_SIZE = 1024; / / const int SORT_CTA_SIZE = 256;

uint4 r = rank4 <SORT_CTA_SIZE> (lsb); / / uint4 r = rank4 <256> (lsb);

if (idx <WARP_SIZE)
  {
   ptr [idx] = scanwarp <uint, 4> (ptr [idx], ptr); / / ptr [idx] = scanwarp <uint, 2> (ptr [idx], ptr);
}
But you're right, it is better to replace rank4 on rank8 or rank16, because for each iteration used is only one warp (if (idx <WARP_SIZE)), which significantly slows down.

Or somewhere can I still download the implementation of a stable bitonic_sort?

John Owens

unread,
Oct 3, 2012, 12:04:47 AM10/3/12
to cu...@googlegroups.com
I might look at Batcher's odd-even merge sort. Don't rely on me that it's stable, but I think it's stable.

We spent a lot of time in this paper trying to keep things stable:

http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1085

and Andrew is working on integrating this into CUDPP now.

JDO
> --
> You received this message because you are subscribed to the Google Groups "CUDPP" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/cudpp/-/6ZtZA9wxAxUJ.

Erich

unread,
Oct 3, 2012, 12:29:36 AM10/3/12
to cu...@googlegroups.com
This a great paper!  Thanks for sharing.

exploys

unread,
Oct 3, 2012, 8:35:32 AM10/3/12
to cu...@googlegroups.com
It's interesting. An example for Variable-Length Keys (strings), but it's slower for fixed length keys than radix sort. Interesting will be look at this for sort intra block.

If it need, how can I commit the modified file radixsort_cta.cuh with my additional function scanwarp_nbit with which radixSortBlock 5~10% faster on Fermi?
I use here __ballot to scan by bit little values in a warp.

John Owens

unread,
Oct 3, 2012, 10:54:30 AM10/3/12
to cu...@googlegroups.com
Commit: Concerned about using an instruction that isn't available on all GPUs. So it'd be important that any patch has a path that WILL work on all GPUs, so make sure your patch has that. But I think we aren't quite ready to accept a patch at this time given that our efforts are concentrated on making a release; can you hold on until we move to github hopefully later this year?

JDO
> > > To post to this group, send email to cu...@googlegroups.com (javascript:) (mailto:cu...@googlegroups.com (javascript:)).
> > > To unsubscribe from this group, send email to cudpp+un...@googlegroups.com (javascript:) (mailto:cudpp+un...@googlegroups.com (javascript:)).
> > > For more options, visit this group at http://groups.google.com/group/cudpp?hl=en.
> >
>
> --
> You received this message because you are subscribed to the Google Groups "CUDPP" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/cudpp/-/eQmDAv5PxJkJ.

exploys

unread,
Oct 3, 2012, 11:31:31 AM10/3/12
to cu...@googlegroups.com
Sure. May I ask, why you move from googlecode to github? First Thrust, now CUDPP.

And yes, I use __ CUDA_ARCH__ to choose between with __ ballot and without.

Mark Harris

unread,
Oct 3, 2012, 7:51:45 PM10/3/12
to cu...@googlegroups.com
One good reason is that git has rapidly become a more capable, easier to use, and more popular source control system than SVN. But googlecode supports git also. The reason for github is because we like the features it provides, and the strong community.  Github makes collaboration on projects easier -- forking and submitting pull requests are first-class features of the site. Github has also been rapidly improving and innovating, whereas googlecode is pretty static. 

A less important reason, but still valid, for me, is that half my life is already stored on The Google, and I prefer to diversify a bit and support some other small players. (Hopefully Google won't buy github very soon. :)

Mark

To view this discussion on the web visit https://groups.google.com/d/msg/cudpp/-/zftX9JhQA7kJ.

To post to this group, send email to cu...@googlegroups.com.
To unsubscribe from this group, send email to cudpp+un...@googlegroups.com.

John Owens

unread,
Oct 3, 2012, 11:13:17 PM10/3/12
to cu...@googlegroups.com
What Mark said. :)

Github is the best repository I've worked with to date.

JDO


On Wednesday, October 3, 2012 at 4:51 PM, Mark Harris wrote:

> One good reason is that git has rapidly become a more capable, easier to use, and more popular source control system than SVN. But googlecode supports git also. The reason for github is because we like the features it provides, and the strong community. Github makes collaboration on projects easier -- forking and submitting pull requests are first-class features of the site. Github has also been rapidly improving and innovating, whereas googlecode is pretty static.
>
> A less important reason, but still valid, for me, is that half my life is already stored on The Google, and I prefer to diversify a bit and support some other small players. (Hopefully Google won't buy github very soon. :)
>
> Mark
>
> > To unsubscribe from this group, send email to cudpp+un...@googlegroups.com (mailto:cudpp%2Bunsu...@googlegroups.com).
> > For more options, visit this group at http://groups.google.com/group/cudpp?hl=en.
>
>
> --
> You received this message because you are subscribed to the Google Groups "CUDPP" group.

exploys

unread,
Oct 20, 2012, 1:07:16 PM10/20/12
to cu...@googlegroups.com, har...@gmail.com
Clear :) It will be interesting to use it.

четверг, 4 октября 2012 г., 3:51:47 UTC+4 пользователь Mark Harris написал:

exploys

unread,
Oct 20, 2012, 1:09:13 PM10/20/12
to cu...@googlegroups.com
And another some questions.
1. Why do not you use #pragma unroll?
in scanwarp() instead of:
     if (0 <= maxlevel) {sData [idx] = t = t   sData [idx - 1];}
     if (1 <= maxlevel) {sData [idx] = t = t   sData [idx - 2];}
     if (2 <= maxlevel) {sData [idx] = t = t   sData [idx - 4];}
     if (3 <= maxlevel) {sData [idx] = t = t   sData [idx - 8];}
     if (4 <= maxlevel) {sData [idx] = t = t   sData [idx -16];}

Can be shorter:
#pragma unroll
for (int i = 0; i <maxlevel; ++i) sData [idx] = t = t   sData [idx - (1 << i)];


2. And I done ability to change the ctasize as a template parameter . radixSortBlock <ctasize, nbits, startbit ... and rank4 <ctasize>
ctasize = 1 ... 1024 (array size 4 - 4096 Bytes)


3. And I get the full amount of the array directly in scan4
static __ device__ uint4 scan4 (uint4 idata, uint & wholesum) {
...
wholesum = ptr [31   WARP_SIZE]; / / my
__syncthreads ();
val   = ptr [idx >> 5];
...
}

static __ device__ uint4 scan4 (uint4 idata)
{
uint wholesum; // will removed this parameter by optimizer 
return scan4 (idata, wholesum);
}

static __ device__ uint4 rank4 (uint4 preds)
{
  uint numtrue;
  uint4 address = scan4 (preds, numtrue);
  __syncthreads ();
...
}

All this is a bit faster and more convenient.
Or is there any reason not to use all it?

четверг, 4 октября 2012 г., 7:13:21 UTC+4 пользователь John Owens написал:
> > To unsubscribe from this group, send email to cudpp+un...@googlegroups.com (mailto:cudpp%2Bu...@googlegroups.com).
Reply all
Reply to author
Forward
0 new messages