I also tried the radix sort from this page:
https://stackoverflow.com/questions/29019318/optimizing-qsort
I'll repeat it here in case that page goes down:
#include <stddef.h> // size_t

typedef int I32;
typedef unsigned int UI32;

I32 * RadixSort(I32 * pData, I32 * pTemp, size_t count)
{
    size_t mIndex[4][256] = {0}; // index matrix
    UI32 *pDst, *pSrc, *pTmp;
    size_t i,j,m,n;
    UI32 u;

    for(i = 0; i < count; i++){ // generate histograms
        u = pData[i];
        for(j = 0; j < 4; j++){
            if(j != 3) // signed integer handling
                mIndex[j][(size_t)(u & 0xff)]++;
            else
                mIndex[j][(size_t)((u^0x80) & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 4; j++){ // convert to indices
        n = 0;
        for(i = 0; i < 256; i++){
            m = mIndex[j][i];
            mIndex[j][i] = n;
            n += m;
        }
    }
    pDst = (UI32 *)pTemp; // radix sort
    pSrc = (UI32 *)pData;
    for(j = 0; j < 4; j++){
        for(i = 0; i < count; i++){
            u = pSrc[i];
            if(j != 3) // signed integer handling
                m = (size_t)(u >> (j<<3)) & 0xff;
            else
                m = (size_t)((u >> (j<<3))^0x80) & 0xff;
            pDst[mIndex[j][m]++] = u;
        }
        pTmp = pSrc; // swap source and destination for the next pass
        pSrc = pDst;
        pDst = pTmp;
    }
    return((I32 *)pSrc); // after 4 passes this points at pData again
}
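For anyone wondering about the "signed integer handling" branches: as far as I can tell, they flip the sign bit of the most significant byte so that negative values land in lower buckets than non-negative ones. Here is a minimal sketch of just that key computation, assuming a 32-bit two's-complement int. TopByteKey is a made-up helper name for illustration, not part of the code above.

```c
// Hypothetical helper (not from the original code): the bucket index
// used for the most significant byte of a signed value.
// Assumption: int is 32 bits wide.
unsigned TopByteKey(int v)
{
    unsigned u = (unsigned)v;            // reinterpret the bits as unsigned
    return ((u >> 24) ^ 0x80u) & 0xffu;  // flip the sign bit, keep one byte
}
```

Under that assumption the most negative int maps to bucket 0, -1 to 0x7f, 0 to 0x80 and the largest int to 0xff, so bucket order matches signed order.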
It takes 27.5 ms, while the unsigned version is even faster and takes a stable 26 ms. (A frame in my testing environment takes 3.5 ms by itself, since I clear the frame and draw plots of the times while I test. Each frame I also copy the randomly filled input table back into the input so the sort has something to sort; this radix sort doesn't need that, so it is slightly faster still.) So in fact the times are
24 ms and 22.5 ms.
Now that is a real improvement, especially since I guess integer sorting is quite usable and the most critical case. (That is my old-school style of coding, natural low-level C I would say, from when I was not so burnt out.) This makes sorting usable in some cases where it was not in play at all.
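For reference, here is a rough sketch of what I mean by the unsigned version and how the sort could be timed on its own, without the frame overhead. This is my assumption of the variant (the same four byte passes with the sign-bit flip removed), not a copy from the linked page. RadixSortU32 and TimeSortMs are hypothetical names, and clock() measures CPU time with coarse resolution, so treat the numbers as approximate.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

// Assumed unsigned variant: four 8-bit passes, no sign handling.
uint32_t *RadixSortU32(uint32_t *data, uint32_t *temp, size_t count)
{
    size_t index[4][256] = {{0}};
    uint32_t *src = data, *dst = temp, *tmp;
    size_t i, j, m, n;

    for (i = 0; i < count; i++) {        // histogram of every byte position
        uint32_t u = data[i];
        for (j = 0; j < 4; j++) {
            index[j][u & 0xff]++;
            u >>= 8;
        }
    }
    for (j = 0; j < 4; j++) {            // counts -> starting offsets
        n = 0;
        for (i = 0; i < 256; i++) {
            m = index[j][i];
            index[j][i] = n;
            n += m;
        }
    }
    for (j = 0; j < 4; j++) {            // one stable scatter pass per byte
        for (i = 0; i < count; i++) {
            uint32_t u = src[i];
            m = (u >> (j << 3)) & 0xff;
            dst[index[j][m]++] = u;
        }
        tmp = src; src = dst; dst = tmp; // swap buffers
    }
    return src;                          // equals data after 4 passes
}

// Hypothetical helper: time one sort of `count` random values, in ms.
// Returns -1.0 if allocation fails.
double TimeSortMs(size_t count)
{
    uint32_t *data = malloc(count * sizeof *data);
    uint32_t *temp = malloc(count * sizeof *temp);
    double ms = -1.0;
    size_t i;

    if (data && temp) {
        for (i = 0; i < count; i++)      // fill with random 32-bit values
            data[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        clock_t t0 = clock();
        RadixSortU32(data, temp, count);
        clock_t t1 = clock();
        ms = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
    }
    free(data);
    free(temp);
    return ms;
}
```

Timing only the sort call this way avoids having to subtract the 3.5 ms frame cost afterwards.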