cub::DeviceRadixSort::SortPairs end

RaulPPelaez

unread,

Sep 15, 2017, 1:08:59 PM9/15/17

to cub-...@googlegroups.com

Hello!, I am struggling here with the meaning of the end_bit argument to SortPairs.

I want to sort keys that go from 0 to a certain number, lets say a power of 2. And I want to use the least bits possible to increase performance.

The way I understood end_bit, it should suffice with the most significant bit index of the biggest hash. So if max_hash = 2^16, the sort should be the same for any end_bit > 17. Is this reasoning wrong?

According to the docs, end_bit is: "The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)"

But I do not know what is "needed"!.
I have empirically noticed that "needed" is the most significant bit of the maximum number between the number of elements to sort and the biggest key value.

So if N= 1<<16 and the keys vary between (say) 0 and 1<<26, end_bit must be at least 26 for SortPairs to work.
However, if N =1<<26 and max_key = 1<<16, end_bit must still be at least 26.

Is this expected?
Why is this needed?

Thanks!

Here is a little unit test I wrote to test this.
I have tried to compile it with: nvcc -arch=sm_52 -std=c++11 RadixSort.cu
I only tested it on a GTX980.

It uses thrust to compare the sorted results. It tries to sort an array of "N" reverse-ordered values from 0 to "max_key".

"end_bit" is the end_bit variable cub asks for, now it is set to be the msb of the maximum between "max_key" and "N" plus a certain number "cub_extra_bits"

You can play around with these values and, if you get what I get, you will see that if end_bit is the msb of max_key while N>2*max_key, cub will refuse to perform the sort without complaining. And viceversa.

#include<thrust/device_vector.h>
#include<thrust/host_vector.h>
#include<thrust/extrema.h>
#include<thrust/sort.h>
#include<cub/cub.cuh>
#include<iostream>
#include<cmath>
#include<limits>

using namespace std;


int main(){

  //Number of elements to sort
  int N = 1<<12;
  //Keys will go from max_key to zero
  uint max_key = 1<<26;
  //Number of extra bits to sum to end_bit
  int cub_extra_bits = 1;
  //Most significant bit
  //msb must be that of the max between N and max_key!! WHY??
  int msb = int(std::log2(std::max((uint)N, max_key)+0.5));
  int end_bit = (msb + cub_extra_bits);
  end_bit = std::min(end_bit, 32);


  cerr<<"Number of elements: "<<N<<endl;
  thrust::device_vector<uint> key(N), key_alt(N);
  thrust::device_vector<int> value(N), value_alt(N);



  auto db_value = cub::DoubleBuffer<int>(thrust::raw_pointer_cast(value.data()),
                                         thrust::raw_pointer_cast(value_alt.data()));
  auto db_key  = cub::DoubleBuffer<uint>(thrust::raw_pointer_cast(key.data()),
                                          thrust::raw_pointer_cast(key_alt.data()));

  //Fill keys with max_key...0 and values with 0...N-1, so that after sorting values are N-1...0
  {
    thrust::host_vector<int> valueCPU(N);
    thrust::host_vector<int> keyCPU(N);

    for(int i = 0; i<N; i++){
      valueCPU[i]  = i;

      keyCPU[i] = ((N-i-1)/(double)N)*(max_key);
    }
    value = valueCPU;
    key = keyCPU;
  }
  //Correct sort with thrust
  auto value_thrust = value;
  auto key_thrust = key;
  thrust::stable_sort_by_key(key_thrust.begin(), key_thrust.end(), value_thrust.begin());


  //Try to sort with cub, select end_bit as the msb of the largest hash + cub_extra_bits
  int min_key = *(thrust::min_element(key.begin(), key.end()));


  cerr<<"Most significant bit in max key value ("<<max_key<<"): "<<msb<<endl;
  cerr<<"Min key: "<<min_key<<endl;

  cerr<<"end_bit to cub::SortPairs: "<<end_bit<<endl;


  size_t temp_storage_bytes = 0;
  void * d_temp_storage = nullptr;

  /*On first call, this function only computes the size of the required temporal storage*/
  auto err = cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  db_key,
                                  db_value,
                                  N,
                                  0, end_bit);

  /*Allocate temporary storage*/
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  cerr<<temp_storage_bytes<<" bytes Allocated for cub"<<endl;
  /**Perform the Radix sort on the value/key pair**/
  err = cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  db_key,
                                  db_value,
                                  N,
                                  0, end_bit);

  key.swap(key_alt);
  value.swap(value_alt);

  cudaFree(d_temp_storage);

  {
    thrust::host_vector<int> valueCPU = key;
    thrust::host_vector<int> value_thrustCPU = key_thrust;
    //Check result
    for(int i = 0; i<20; i++){
      cerr<<valueCPU[i]<<" ";
    }
    cerr<<endl;
    for(int i = 0; i<20; i++){
      cerr<<value_thrustCPU[i]<<" ";
    }
    cerr<<endl;

    for(int i = 0; i<N; i++){
      if(valueCPU[i] != value_thrustCPU[i]){
        cerr<<endl<<" ERROR in "<<i<<"th element!!"<<endl;
        cerr<<"Result is: "<<valueCPU[i]<<endl;
        cerr<<"Should be:"<<value_thrustCPU[i]<<endl;
        exit(1);
      }
    }
  }
  cerr<<"SUCCESS!!"<<endl;

  return 0;
}

EDIT:

Now Solved. The problem with this sample is not about end_bit not being high enough for RadixSort. The problem is that I was assuming that DoubleBuffer is always swapped after a sort, when according to the docs, this only happens under certain conditions (as a function of end_bit, actually).

So the lines:

 key.swap(key_alt);
 value.swap(value_alt);

Should only be called if db_keys.Current() and/or db_values.Current() are different before and after the sorting operation.

Thanks for your guidance txbob!

For the record, end_bit should be:

end_bit = msb(max(key)) +1

EDIT2:

You can find out if a DoubleBuffer has been swapped (in other words, if .Current() has changed) by accessing the int DoubleBuffer::selector, it will be 0 if Current is the same as when it was created and 1 otherwise. However I do not think this is documented and may very well not work in all cub versions...

PS: I do not know how to change the title to be more descriptive now that the issue has been solved...

graphicsRat

unread,

Nov 7, 2017, 6:54:35 PM11/7/17

to cub-...@googlegroups.com

I too am having the same problem specifying the begin/end bit. I get an incorrect result when I specify it.

#include <iostream>
#include <cub/cub.cuh>

// nvcc -ccbin g++ -g -I/path/to/cub-1.7.4 -m64 -gencode arch=compute_30,code=sm_30 -o CubRadixSortTest CubRadixSortTest.cu

// Helper Functions Decl
void print( short* data , unsigned size , const char* name );
void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber );
void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber );
void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber );

#define CHECK_ERROR( error )                                                    \
    if( error != cudaSuccess )                                                  \
    {                                                                           \
        std::string msg( "[Error] " );                                          \
        msg += cudaGetErrorString( error );                                     \
        std::cerr << msg << std::endl;                                          \
        exit(-1);                                                               \
    }

////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)
{
    cudaError_t status;
    void* tmpStorage = 0;
    size_t tmpStorageSize = 0;

    static const unsigned COUNT = 20;
    short h_values[COUNT] = { 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0};
    short h_keys[COUNT] = { 17, 11, 16, 3, 10, 13, 1, 17, 14, 15, 18, 15, 10, 4, 8, 11, 15, 20, 10, 18 };

    short* d_keys = 0;
    short* d_values = 0;
    unsigned dataSize = COUNT * sizeof(short);

    allocateDeviceMemory( &d_keys , dataSize , __LINE__ );
    allocateDeviceMemory( &d_values , dataSize , __LINE__ ); 

    copyDataToDevice( d_keys , h_keys , dataSize , __LINE__ );
    copyDataToDevice( d_values , h_values , dataSize , __LINE__ );

    print( h_keys , COUNT , "Pre-sort Keys" );
    print( h_values , COUNT , "Pre-sort Values" );

    status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );            // Correct
    //status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );  // BUG
    CHECK_ERROR( status );

    allocateDeviceMemory( &tmpStorage , tmpStorageSize , __LINE__ );

    status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );            // Correct
    //status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );    // BUG
    CHECK_ERROR( status );

    copyDataToHost( h_keys , d_keys , dataSize , __LINE__ );
    copyDataToHost( h_values , d_values , dataSize , __LINE__ );
    
    print( h_keys , COUNT , "Post-sort Keys" );
    print( h_values , COUNT , "Post-sort Values" );

    return 0;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Helper Functions Impl

void print( short* data , unsigned size , const char* name )
{
    std::cout << "[" << name << "]\n[ " << data[0];
    for( unsigned i = 1; i < size; ++i )
    {
        std::cout << " , " << data[i];
    }
    std::cout << " ]" << std::endl;
}

void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber )
{
    cudaError_t error = cudaMalloc( (void**) devPtr , size );
    if( error != cudaSuccess )
    {
   
     std::cout << "[Line " << lineNumber << " -- Error
 " << error << " : Unable to allocate device memory] " 
<< cudaGetErrorString( error ) << std::endl;
        exit(-1);
    }
}

void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( hostPtr , devPtr , size , cudaMemcpyDeviceToHost );
    if( error != cudaSuccess )
    {
   
     std::cout << "[Line " << lineNumber << " -- Error
 " << error << " : Unable to copy device data to host 
memory] " << cudaGetErrorString( error ) << std::endl;
        exit(-1);
    }
}

void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( devPtr , hostPtr , size , cudaMemcpyHostToDevice );
    if( error != cudaSuccess )
    {
   
     std::cout << "[Line " << lineNumber << " -- Error
 " << error << " : Unable to copy host data to device 
memory] " << cudaGetErrorString( error ) << std::endl;
        exit(-1);
    }
}

////////////////////////////////////////////////////////////////////////////////////////////////

Robert Crovella

unread,

Nov 7, 2017, 8:02:37 PM11/7/17

to cub-users, graphicsRat

specify the MSB as 1 higher than the most significant bit position.

as mentioned here:

https://devtalk.nvidia.com/default/topic/1026095/cuda-programming-and-performance/setting-begin-end-bit-parameters-in-cub-deviceradixsort-sortpairs-/post/5218889/#5218889

On Tuesday, November 7, 2017, 3:54:37 PM PST, graphicsRat <50...@web.de> wrote:

I too am having the same problem specifying the begin/end bit. I get an incorrect result when I specify it.

Enter code here...

--
http://nvlabs.github.com/cub
---
You received this message because you are subscribed to the Google Groups "cub-users" group.
To post to this group, send email to cub-...@googlegroups.com.
Visit this group at https://groups.google.com/group/cub-users.

RaulPPelaez

unread,

Nov 8, 2017, 5:40:00 AM11/8/17

to cub-...@googlegroups.com

On Wednesday, November 8, 2017 at 2:02:37 AM UTC+1, txbob wrote:

specify the MSB as 1 higher than the most significant bit position.

as mentioned here:

https://devtalk.nvidia.com/default/topic/1026095/cuda-programming-and-performance/setting-begin-end-bit-parameters-in-cub-deviceradixsort-sortpairs-/post/5218889/#5218889

The most significant bit position of what?, the largest hash in the array?.

It appears the end_bit has to be the MSB of the maximum number between the number of elements to sort and the biggest key value.

But this rule doesnt seem to hold in every case... Even with the +1.

Thanks for your answer!

EDIT:

For example, setting:

N = 1<<16 (number of elements)

max_key = 1<<15 (keys will go from 0 to max_key)

Should need and minimum end_bit of 16, but the sort is incorrect until end_bit = 20. You can try changing it in the first lines of my code.

Unless there is some bug in my test code, this is a behavior I do not understand in RadixSort!

Robert Crovella

unread,

Nov 8, 2017, 11:13:35 AM11/8/17

to cub-users, RaulPPelaez

it has nothing to do with the number of elements to sort.

If the maximum key value occupies bits 0 to 21, then you would use 22 as the MSB or "end bit"

On Wednesday, November 8, 2017, 2:40:02 AM PST, RaulPPelaez <raul....@uam.es> wrote:

On Wednesday, November 8, 2017 at 2:02:37 AM UTC+1, txbob wrote:

specify the MSB as 1 higher than the most significant bit position.

as mentioned here:

https://devtalk.nvidia.com/ default/topic/1026095/cuda- programming-and-performance/ setting-begin-end-bit- parameters-in-cub- deviceradixsort-sortpairs-/ post/5218889/#5218889

The most significant bit position of what?, the largest hash in the array?.

It appears the end_bit has to be the MSB of the maximum number between the number of elements to sort and the biggest key value.

But this rule doesnt seem to hold in every case... Even with the +1.

Thanks for your answer!

RaulPPelaez

unread,

Nov 8, 2017, 1:54:35 PM11/8/17

to cub-...@googlegroups.com

On Wednesday, November 8, 2017 at 5:13:35 PM UTC+1, txbob wrote:

it has nothing to do with the number of elements to sort.

Ok, you are absolutely right, this is what got me (from the docs):

"Upon completion, the sorting operation will update the "current" indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture)"

I am using doubleBuffer and I was assuming that RadixSort would always swap the references of the DoubleBuffer, so I was always treating the alternate array as the result.

So the sort was incorrect not because end_bit has to be more than msb+1, but because RadixSort decides to swap the references of a DoubleBuffer as a function of end_bit.

Thank you for your time txbob!

Reply all

Reply to author

Forward

cub::DeviceRadixSort::SortPairs end_bit usage?

RaulPPelaez

graphicsRat

Robert Crovella

RaulPPelaez

Robert Crovella

RaulPPelaez