about the speed of DeviceRadixSort

Feng Sun

Apr 21, 2014, 10:12:33 AM

Hi~ I'm a newbie at parallel computing. The CUB documentation says that CUB radix sort can reach 1.41 billion inputs per second for 32M {uint32 uint32} pairs on a GeForce GTX Titan device.

http://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html

Does that mean that for an input of 32,000,000 {uint32 uint32} pairs, CUB can sort it in (32,000,000 / 1,410,000,000) * 1000 ≈ 22.70 ms? I tested the DeviceRadixSort example (CUDA55_example_device_radix_sort) from CUB 1.2.3 on my GeForce GTX Titan device, and it takes 37.93 ms for 32,000,000 input pairs. What's wrong?

Duane Merrill

May 6, 2014, 11:19:23 AM

to cub-...@googlegroups.com

Hi Feng, how are you compiling the example program? Specifically:

What platform are you compiling on (Windows or Linux)?
What compute-capability are you compiling for? (The Titan is SM35.)
Are you compiling in Debug mode (MSVC) or with debug features enabled (-G nvcc flag)? (Debug instrumentation will slow your program down.)
Are you compiling 32-bit or 64-bit binaries? (64-bit addressing on the GPU incurs a 3-5% overhead.)

My guess is that you've compiled the program for a lower compute-capability than SM35 (the compute-capability of the Titan). NVIDIA GPUs are versioned by compute-capability (e.g., SM10, SM20, SM30, SM35). Although compute-capabilities are backwards-compatible (e.g., an SM35-based GeForce Titan can run a program compiled for SM10), you probably won't get the best performance from that executable. When you compile for a compute-capability that exactly matches your target device, the compiler (and the CUB library) know to take advantage of unique instructions and tuning configurations that are specific to that processor family.

On Linux/Cygwin, you specify the compute-capability (in hundreds format) as a Makefile parameter:

dumerrill@MoochBot /cygdrive/c/Dev/workspace/PrivateCub/examples/device
$ make example_device_radix_sort sm=350

On Windows, you specify the compute-capability in the project properties.

Let me know if that helps!

Duane

Duane Merrill

May 6, 2014, 11:33:38 AM

Also, how are you measuring time? (The simple example program doesn't report elapsed time.)

If you are on Linux (or have Cygwin installed), you can use the "quick test" version of our DeviceRadixSort unit-test program in the "test" folder to report GPU-only timing. The following shows how to make and run it for my laptop (you would substitute "350" as the SM architecture):

dumerrill@MoochBot /cygdrive/c/Dev/workspace/PrivateCub/test
$ make test_device_radix_sort sm=350 quicktest=1
...
dumerrill@MoochBot /cygdrive/c/Dev/workspace/PrivateCub/test
$ ./bin/test_device_radix_sort_sm300_nvvm_6.0_abi_nocdp_quick_i386 --i=100 --n=32000000
Using device 0: GeForce GTX 760M (PTX version 300, SM300, 4 SMs, 1966 free / 2048 total MB physmem, ECC off)
CUB keys-only cub::DeviceRadixSort 32000000 items, unsigned int 4-byte keys 0-byte values, gen-mode RANDOM, descending 0, entropy_reduction 0, begin_bit 0, end_bit 32
...
8857.827 elapsed ms, 88.578 avg ms, 0.361 billion items/s, 2.890 logical GB/s
