Hi Feng, how are you compiling the example program? Specifically:
- What platform are you compiling on (Windows or Linux)?
- What compute-capability are you compiling for? (The Titan is SM35.)
- Are you compiling in Debug mode (MSVC) or with debug features enabled (-G nvcc flag)? (Debug instrumentation will slow your program down.)
- Are you compiling 32-bit or 64-bit binaries? (64-bit addressing on the GPU incurs a 3-5% overhead.)
My guess is that you've compiled the program for a lower compute-capability than SM35 (the compute-capability of the Titan). NVIDIA GPUs are versioned by compute-capability (e.g., SM10, SM20, SM30, SM35). Although compute-capabilities are backwards-compatible (e.g., an SM35-based GeForce Titan can run a program compiled for SM10), you probably won't get the best performance from that executable. When you compile for the compute-capability that exactly matches your target device, the compiler and the CUB library can take advantage of instructions and tuning configurations that are specific to that processor family.
On Linux/Cygwin, you specify the compute-capability (in hundreds format) as a Makefile parameter:
dumerrill@MoochBot /cygdrive/c/Dev/workspace/PrivateCub/examples/device
$ make example_device_radix_sort sm=350
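In case the "hundreds format" is unclear, here is a minimal sketch of the mapping, assuming it follows the same convention as `__CUDA_ARCH__` (major version times 100 plus minor version times 10, so SM35 becomes 350). The helper names `sm_hundreds` and `make_command` are hypothetical and only for illustration; they are not part of CUB or its Makefile.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: convert a compute capability (major.minor) into the
// "hundreds format" passed as sm=..., e.g. 3.5 -> 350.
// Assumes the major*100 + minor*10 convention.
constexpr int sm_hundreds(int major, int minor) {
    return major * 100 + minor * 10;
}

// Hypothetical helper: build the make invocation for a given example target.
std::string make_command(const std::string& target, int major, int minor) {
    // Compute-capability minor versions are a single digit.
    assert(major >= 1 && minor >= 0 && minor <= 9);
    return "make " + target + " sm=" + std::to_string(sm_hundreds(major, minor));
}

// Compile-time check: the Titan's SM35 maps to sm=350.
static_assert(sm_hundreds(3, 5) == 350, "SM35 should map to sm=350");
```

For the Titan, `make_command("example_device_radix_sort", 3, 5)` yields the same command line as above.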
On Windows, you specify the compute-capability in the Visual Studio project properties (under CUDA C/C++ > Device > Code Generation, e.g., compute_35,sm_35).
Let me know if that helps!
Duane