Your proposed solution, as you note, will be slow. It will call
cudppRadixSort m times, and none of those calls will be large enough
to fill the machine, since the individual arrays are so small. Alas,
cudppRadixSort isn't designed to do what you're trying to do.
What you want instead is a segmented sort, which takes advantage of
the fact that your data is already divided into segments and sorts
only within each segment, all in a single call. There's a very nice
implementation in Sean Baxter's moderngpu library, and I'd recommend
it as probably your best solution.
http://nvlabs.github.io/moderngpu/segsort.html
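
To make the semantics concrete, here is a minimal CPU reference sketch of what a segmented sort computes. It is not moderngpu code and does not use its API. The function name segmented_sort_reference and the seg_starts layout (one start index per segment, in increasing order) are my own assumptions for illustration. The GPU version should give the same result, but handles all the segments in one launch instead of one small sort per segment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for segmented-sort semantics (hypothetical helper, NOT a
// moderngpu or CUDPP function). seg_starts[i] is the index of the first
// element of segment i. Segment i ends where segment i+1 begins, and the
// last segment ends at data.size(). Elements before seg_starts[0] are
// left untouched.
void segmented_sort_reference(std::vector<int>& data,
                              const std::vector<std::size_t>& seg_starts) {
    for (std::size_t i = 0; i < seg_starts.size(); ++i) {
        const std::size_t begin = seg_starts[i];
        const std::size_t end =
            (i + 1 < seg_starts.size()) ? seg_starts[i + 1] : data.size();
        // Segment boundaries must be non-decreasing and inside the array.
        assert(begin <= end && end <= data.size());
        // Sort only within this segment; elements never cross a boundary.
        std::sort(data.begin() + static_cast<std::ptrdiff_t>(begin),
                  data.begin() + static_cast<std::ptrdiff_t>(end));
    }
}
```

With data = {5, 3, 9, 8, 1, 2, 7} and seg_starts = {0, 3, 5}, the three segments {5, 3, 9}, {8, 1} and {2, 7} are each sorted independently, giving {3, 5, 9, 1, 8, 2, 7}. This loop is exactly the "m small sorts" pattern that is slow on the GPU. It is here only to pin down the expected output.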
JDO