I don't understand exactly what you mean...
I think an input/output iterator does not need to allocate any additional memory, so using one should not be the same as doing a memory copy.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh>
// Declare, allocate, and initialize device pointers for input and output
int num_items; // e.g., 7
int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_sum; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_sum, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_sum, num_items);
I would like to allocate d_in as 2D pitched memory, and I think it should be possible to implement an iterator that only "translates addresses" (skipping the padding at the end of each aligned row) and pass it to cub::DeviceReduce::Sum.
Then there is no need for a copy and no need for any additional memory allocation!
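Here is a rough sketch of the address translation I have in mind. The names PitchedLoad and sum_pitched_host are made up for illustration and are not part of CUB. I am assuming int elements and a pitch given in bytes, as returned by cudaMallocPitch. The host-side loop is only there to check the index math on the CPU.

```cpp
#include <cassert>
#include <cstddef>

// Lets the same functor compile with nvcc (device) or a plain C++ compiler (host).
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical functor (not a CUB type): maps a linear element index to the
// element stored in a pitched 2D allocation, skipping the per-row padding.
struct PitchedLoad {
    const char *base;   // start of the allocation (e.g. from cudaMallocPitch)
    std::size_t pitch;  // row stride in bytes, padding included
    int width;          // number of valid int elements per row

    HOST_DEVICE int operator()(int i) const {
        int row = i / width;  // which row the linear index falls in
        int col = i % width;  // position inside that row
        const int *row_ptr =
            reinterpret_cast<const int *>(base + row * pitch);
        return row_ptr[col];
    }
};

// Host-only helper to sanity-check the translation; the real reduction
// would be done on the device by cub::DeviceReduce::Sum.
inline long sum_pitched_host(const PitchedLoad &load, int height) {
    assert(load.pitch >= load.width * sizeof(int));  // rows must fit in the pitch
    long total = 0;
    for (int i = 0; i < load.width * height; ++i) {
        total += load(i);
    }
    return total;
}
```

If I read the CUB docs correctly, such a functor could be wrapped in a cub::TransformInputIterator over a cub::CountingInputIterator<int> and passed as d_in, with num_items = width * height. I have not tested that part, so treat it as an assumption.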
On Monday, January 5, 2015 at 14:29:51 UTC+1, Apostolis Glenis wrote: