Sorting a large number of small arrays

diogo4u

unread,

Nov 24, 2014, 12:06:14 PM11/24/14

to cu...@googlegroups.com

Hi,

I need to sort about m=30000 arrays of n=200 elements each. These arrays are already in device memory, in a large m*n*sizeof(double) linear array. Let's call it 'd_array'. The first 200 elements of 'd_array' are the first small array, the next 200 elements are the second small array, and so on.

I was hoping to somehow do the sorting in parallel, and eventually this brought me to CUDPP. The idea was to use CUDPP_SORT_RADIX with CUDPP_DOUBLE and CUDPP_OPTION_KEYS_ONLY.

However, if I do something like this...

for (int k=0; k<m; k++)

{

cudppRadixSort(plan, d_array + k*n, NULL, n);

}

cudaDeviceSynchronize();

.. I find that it is extremely slow (10-20 seconds) compared to sorting on the CPU with everything in host memory.

My question is: is there a smart way to call cudppRadixSort() only once to sort the 30 000 arrays?

John Owens

unread,

Nov 24, 2014, 2:35:18 PM11/24/14

to cu...@googlegroups.com

Your proposed solution, as you note, will be slow. It will call
cudppRadixSort m times, and none of those times will be large enough
to fill the machine, since the small arrays are so small. Alas,
cudppRadixSort isn't designed to do what you're trying to do.

What you want instead is a segmented sort that leverages the fact you
have a number of segments already and wants to only sort within
segments. For this, there's a very nice implementation in Sean
Baxter's moderngpu library, and I'd recommend that's probably your
best solution.

http://nvlabs.github.io/moderngpu/segsort.html

JDO

diogo4u

unread,

Nov 24, 2014, 4:26:58 PM11/24/14

to cu...@googlegroups.com

Thank you so much for the answer!
I recall seeing moderngpu as a submodule of CUDPP, I will have a deeper look into it.

John Owens

unread,

Nov 24, 2014, 6:00:28 PM11/24/14

to cu...@googlegroups.com

Yes, that's true, we use some low-level primitives from moderngpu in
the latest release of CUDPP, and submodule the entire moderngpu
library. So the code you need is already compiled in for you to use.

JDO

> --
> You received this message because you are subscribed to the Google Groups "CUDPP" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to cudpp+un...@googlegroups.com.
> To post to this group, send email to cu...@googlegroups.com.
> Visit this group at http://groups.google.com/group/cudpp.
> For more options, visit https://groups.google.com/d/optout.

diogo4u

unread,

Nov 26, 2014, 9:31:31 AM11/26/14

to cu...@googlegroups.com

Hi!

Just to leave a note that with moderngpu I was able to do the segmented sort about 10x-20x faster when compared to sorting the arrays in the CPU.

It took me some time to figure out moderngpu, but in the end the code turned out to be quite simple, just something like:

mgpu::ContextPtr context = mgpu::CreateCudaDevice(0);

mgpu::SegSortKeysFromIndices(d_array, m*n, d_segheads, m-1, *context);

Thanks!

Diogo

Reply all

Reply to author

Forward