execution on GPU targets


Jie Zhao

Oct 6, 2016, 2:03:10 PM
to cusp-users
Hello everyone,

I'm new to CUSP.

I want to run a CUSP example, e.g., multiply.cu, on GPU accelerators.

The kernel computation of this example is

// compute y = A * x
cusp::multiply(A, x, y);

So, is the multiply function executed on the CPU or on the GPU?

And if it is executed on the GPU, is it possible to specify dimGrid and dimBlock myself?

Thanks a million.

Jie

Steven Dalton

Oct 6, 2016, 3:52:21 PM
to cusp-...@googlegroups.com
Hello Jie,

  The multiply is executed on the GPU if the memory space of A maps to the CUDA backend; that is the default, but it can be changed to whatever backend the user wants.
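
For example, here is a minimal sketch of how the memory-space template parameter selects the backend (the poisson5pt test matrix and its size are just placeholders):

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/gallery/poisson.h>
#include <cusp/multiply.h>

int main(void)
{
    // containers in device_memory dispatch multiply to the CUDA backend
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::gallery::poisson5pt(A, 16, 16);
    cusp::array1d<float, cusp::device_memory> x(A.num_cols, 1);
    cusp::array1d<float, cusp::device_memory> y(A.num_rows, 0);
    cusp::multiply(A, x, y);   // runs on the GPU

    // copies of the same data in host_memory make the same call run on the CPU
    cusp::csr_matrix<int, float, cusp::host_memory> B(A);
    cusp::array1d<float, cusp::host_memory> xh(x);
    cusp::array1d<float, cusp::host_memory> yh(B.num_rows, 0);
    cusp::multiply(B, xh, yh); // runs on the host
    return 0;
}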


  The current design of CUSP does not allow the user to explicitly set the kernel launch dimensions. This simplifies the implementation and is consistent with the design of Thrust, the core library of primitives that CUSP uses in many of its algorithms. You could always introduce your own implementation of an algorithm and inject this information through the use of execution policies, as in the example below.

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/functional.h>
#include <cusp/multiply.h>
#include <cusp/gallery/poisson.h>
#include <algorithm>
#include <iostream>

struct my_policy : public cusp::cuda::execution_policy<my_policy>{};

template<typename IndexType, typename ValueType, typename MemorySpace>
void multiply(my_policy& exec,
              const cusp::csr_matrix<IndexType,ValueType,MemorySpace> & A,
              const cusp::array1d<ValueType,MemorySpace> & x,
                    cusp::array1d<ValueType,MemorySpace> & y)
{   
    typedef cusp::csr_matrix<IndexType,ValueType,MemorySpace> MatrixType;
    typedef cusp::array1d<ValueType,MemorySpace> VectorType1;
    typedef cusp::array1d<ValueType,MemorySpace> VectorType2;

    typedef cusp::constant_functor<ValueType> UnaryFunction;
    typedef thrust::multiplies<ValueType> BinaryFunction1;
    typedef thrust::plus<ValueType> BinaryFunction2;
    
    using namespace cusp::system::cuda;
    using namespace cusp::system::cuda::detail;
    
    UnaryFunction   initialize(0);
    BinaryFunction1 combine;
    BinaryFunction2 reduce;
    
    typedef typename MatrixType::row_offsets_array_type::const_iterator     RowIterator;
    typedef typename MatrixType::column_indices_array_type::const_iterator  ColumnIterator;
    typedef typename MatrixType::values_array_type::const_iterator          ValueIterator1;
    
    typedef typename VectorType1::const_iterator                            ValueIterator2;
    typedef typename VectorType2::iterator                                  ValueIterator3;
    
    const size_t THREADS_PER_BLOCK  = 128;
    const size_t THREADS_PER_VECTOR = 8;
    const size_t VECTORS_PER_BLOCK  = THREADS_PER_BLOCK / THREADS_PER_VECTOR;
    
    const size_t MAX_BLOCKS = cusp::system::cuda::detail::max_active_blocks(
                                  spmv_csr_vector_kernel<RowIterator, ColumnIterator, ValueIterator1, ValueIterator2, ValueIterator3,
                                  UnaryFunction, BinaryFunction1, BinaryFunction2,
                                  VECTORS_PER_BLOCK, THREADS_PER_VECTOR>, THREADS_PER_BLOCK, (size_t) 0);
    const size_t NUM_BLOCKS = std::min<size_t>(MAX_BLOCKS, DIVIDE_INTO(A.num_rows, VECTORS_PER_BLOCK));
    
    std::cout << "Calling my multiply with " << THREADS_PER_BLOCK << " threads and "
              << THREADS_PER_VECTOR << " threads-per-vector" << std::endl;
    std::cout << "Mapping yielded " << NUM_BLOCKS << " blocks" << std::endl;
    
    cudaStream_t s = stream(thrust::detail::derived_cast(exec));
    
    spmv_csr_vector_kernel<RowIterator, ColumnIterator, ValueIterator1, ValueIterator2, ValueIterator3,
                           UnaryFunction, BinaryFunction1, BinaryFunction2,
                           VECTORS_PER_BLOCK, THREADS_PER_VECTOR> <<<NUM_BLOCKS, THREADS_PER_BLOCK, 0, s>>>
                           (A.num_rows, A.row_offsets.begin(), A.column_indices.begin(), A.values.begin(), x.begin(), y.begin(),
                            initialize, combine, reduce);
}

int main(int argc, char** argv)
{
    typedef int                 IndexType;
    typedef double              ValueType;
    typedef cusp::device_memory MemorySpace;

    cusp::csr_matrix<IndexType,ValueType,MemorySpace> A;

    if (argc == 1)
    {
        std::cout << "Using default matrix (5-pt Laplacian stencil)" << std::endl;
        cusp::gallery::poisson5pt(A, 1000, 1000);
    }

    size_t N = A.num_rows;

    cusp::array1d<ValueType, MemorySpace> x(N,0);
    cusp::array1d<ValueType, MemorySpace> b(N,1);

    my_policy exec;
    cusp::multiply(exec, A, x, b);

    return 0;
}
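
To build something like this, compile with nvcc against the CUSP headers; the include path, file name, and architecture flag below are placeholders for whatever matches your setup:

  nvcc -O2 -arch=sm_50 -I <cusp_dir> custom_multiply.cu -o custom_multiply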



Jie Zhao

Oct 7, 2016, 9:13:11 AM
to cusp-users
Thanks a lot Steve.

So how can I map the device memory space of A to the CUDA backend? Do I have to write CUDA code based on the multiply example myself?

I noticed that there are some examples and applications in the package.

In the "performance" directory, there are some benchmarks. Can these benchmarks be used to measure the performance of the CUSP implementation?

If so, is there a manual or tutorial I can refer to?

Actually, I want to perform a simple SpMV computation like y = A*x on GPU targets and measure the performance (in time or GFLOPS) of the CUSP implementation as a comparison for my work. Is there a simple way?

Thanks.

Jie

On Thursday, October 6, 2016 at 9:52:21 PM UTC+2, Steve wrote:

Steven Dalton

Oct 7, 2016, 12:46:17 PM
to cusp-...@googlegroups.com
By default, CUSP targets the CUDA backend for all functions whose inputs reside in device_memory, so you don't have to worry about that.

I think the best way to learn about CUSP is to look through some of the documentation in [1]. There is a long quick-start guide (a little out of date).
Another good place to look is the Thrust wiki [2]. Understanding the structure of Thrust and the idea of execution policies will help you out quite a bit with CUSP.
Yes, the performance directory has a few example applications to help benchmark the performance of the library on your system. You should take a good look at the SpMV benchmark folder [3].

Hope that helps.

Steve

[1] http://cusplibrary.github.io/


Jie Zhao

Oct 8, 2016, 5:26:03 AM
to cusp-users
Thanks a lot Steve.

I will look further into what you mentioned.

Jie

On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:

Jie Zhao

Oct 12, 2016, 11:44:48 AM
to cusp-users
Dear Steve,


On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:
By default, CUSP targets the CUDA backend for all functions whose inputs reside in device_memory, so you don't have to worry about that.

Yes, the performance directory has a few example applications to help benchmark the performance of the library on your system. You should take a good look at the SpMV benchmark folder [3].



Is there a description of how to run the examples in the performance directory? I looked into the folder [3] but didn't find one :-( Maybe I missed some important information.

Thanks.

Jie 

Jie Zhao

Oct 12, 2016, 12:11:30 PM
to cusp-users
Hi Steve,


On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:
By default, CUSP targets the CUDA backend for all functions whose inputs reside in device_memory, so you don't have to worry about that.

After further study, I wrote the simple example below.

#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cstdio>
int main(void)
{
    // initialize matrix
    // allocate storage for (4,3) matrix with 6 nonzeros
    cusp::csr_matrix<int,float,cusp::device_memory> A(4,3,6);
    // initialize matrix entries (each element write to device_memory is a separate host-to-device transfer)
    A.row_offsets[0] = 0;  // first offset is always zero
    A.row_offsets[1] = 2;
    A.row_offsets[2] = 2;
    A.row_offsets[3] = 3;
    A.row_offsets[4] = 6; // last offset is always num_entries
    A.column_indices[0] = 0; A.values[0] = 10;
    A.column_indices[1] = 2; A.values[1] = 20;
    A.column_indices[2] = 2; A.values[2] = 30;
    A.column_indices[3] = 0; A.values[3] = 40;
    A.column_indices[4] = 1; A.values[4] = 50;
    A.column_indices[5] = 2; A.values[5] = 60;
    // A now represents the following matrix
    //    [10  0 20]
    //    [ 0  0  0]
    //    [ 0  0 30]
    //    [40 50 60]

    // initialize input vector
    cusp::array1d<float, cusp::device_memory> x(3);
    x[0] = 1;
    x[1] = 2;
    x[2] = 3;
    // allocate output vector
    cusp::array1d<float, cusp::device_memory> y(4);

    cudaEvent_t start, stop;
    float elapsedtime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // compute y = A * x
    cusp::multiply(A, x, y);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedtime, start, stop);
    printf("Elapsed time (without data transfer): %f ms\n", elapsedtime);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // print y
    cusp::print(y);
    return 0;
}

I compiled it with nvcc, and the result is correct.

So, does this mean the example runs on the GPU automatically?

And if so, the elapsedtime I measured is the time elapsed on the GPU device, right?

Thanks.

Jie

Steven Dalton

Oct 12, 2016, 5:35:00 PM
to cusp-...@googlegroups.com
Hello Jie,

  Your example appears to be correct. You can build the spmv performance tests using the following commands on Linux.

  # change to the spmv directory
  cd <cusp_dir>/performance/spmv
  # build the spmv tester
  scons arch=sm_50
  # change to the scripts directory
  cd scripts
  # get the test matrices
  # extract matrices.zip into matrices/
  unzip matrices.zip -d matrices
  # change line 14 of benchmark.py to: unstructured_path = './matrices/'
  # change line 40 of benchmark.py to: trials = unstructured_mats
  # run the benchmark.py tester
  python benchmark.py
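
By the way, if you want a GFLOPS figure from your timed example, the usual convention counts two floating-point operations (a multiply and an add) per stored nonzero. A minimal sketch, reusing the A and elapsedtime variables from your code:

  // 2 flops per nonzero; cudaEventElapsedTime reports milliseconds
  double seconds = elapsedtime / 1000.0;
  double gflops  = 2.0 * (double) A.num_entries / seconds / 1.0e9;
  printf("SpMV throughput: %f GFLOPS\n", gflops);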

Steve



Jie Zhao

Oct 13, 2016, 3:52:31 AM
to cusp-users
Dear Steve,

Thanks for confirming the correctness of the example, and thank you so much for the explanation of the performance tests.

Jie

On Wednesday, October 12, 2016 at 11:35:00 PM UTC+2, Steve wrote: