execution on GPU targets


Jie Zhao

Oct 6, 2016, 2:03:10 PM
to cusp-users
Hello everyone,

I'm a beginner with CUSP.

I want to run a CUSP example, e.g., multiply.cu, on GPU accelerators.

The kernel computation of this example is

// compute y = A * x
cusp::multiply(A, x, y);

So, is the multiply function executed on the CPU or on GPUs?

And is it possible to specify dimGrid and dimBlock myself if it is executed on GPUs?

Thanks a million.

Jie

Steven Dalton

Oct 6, 2016, 3:52:21 PM
to cusp-...@googlegroups.com
Hello Jie,

The multiply is executed on the GPU if the memory space of A maps to the CUDA backend. That is the default case, but it can be changed to whatever the user wants.


The current design of CUSP does not allow the user to explicitly set the dimensions of the kernels. This simplifies the implementation and is in agreement with the design of Thrust, the core library of primitives that CUSP uses in many algorithms. You could always introduce your own implementation of the algorithms to add this information through the use of execution policies.

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/multiply.h>
#include <cusp/gallery/poisson.h>

struct my_policy : public cusp::cuda::execution_policy<my_policy>{};

template<typename IndexType, typename ValueType, typename MemorySpace>
void multiply(my_policy& exec,
              const cusp::csr_matrix<IndexType,ValueType,MemorySpace>& A,
              const cusp::array1d<ValueType,MemorySpace>& x,
                    cusp::array1d<ValueType,MemorySpace>& y)
{
    typedef cusp::csr_matrix<IndexType,ValueType,MemorySpace> MatrixType;
    typedef cusp::array1d<ValueType,MemorySpace> VectorType1;
    typedef cusp::array1d<ValueType,MemorySpace> VectorType2;

    typedef cusp::constant_functor<ValueType> UnaryFunction;
    typedef thrust::multiplies<ValueType>     BinaryFunction1;
    typedef thrust::plus<ValueType>           BinaryFunction2;

    using namespace cusp::system::cuda;
    using namespace cusp::system::cuda::detail;

    UnaryFunction   initialize(0);
    BinaryFunction1 combine;
    BinaryFunction2 reduce;

    typedef typename MatrixType::row_offsets_array_type::const_iterator    RowIterator;
    typedef typename MatrixType::column_indices_array_type::const_iterator ColumnIterator;
    typedef typename MatrixType::values_array_type::const_iterator         ValueIterator1;

    typedef typename VectorType1::const_iterator ValueIterator2;
    typedef typename VectorType2::iterator       ValueIterator3;

    const size_t THREADS_PER_BLOCK  = 128;
    const size_t THREADS_PER_VECTOR = 8;
    const size_t VECTORS_PER_BLOCK  = THREADS_PER_BLOCK / THREADS_PER_VECTOR;

    const size_t MAX_BLOCKS = cusp::system::cuda::detail::max_active_blocks(
        spmv_csr_vector_kernel<RowIterator, ColumnIterator, ValueIterator1, ValueIterator2, ValueIterator3,
                               UnaryFunction, BinaryFunction1, BinaryFunction2,
                               VECTORS_PER_BLOCK, THREADS_PER_VECTOR>, THREADS_PER_BLOCK, (size_t) 0);
    const size_t NUM_BLOCKS = std::min<size_t>(MAX_BLOCKS, DIVIDE_INTO(A.num_rows, VECTORS_PER_BLOCK));

    std::cout << "Calling my multiply with " << THREADS_PER_BLOCK << " threads and "
              << THREADS_PER_VECTOR << " threads-per-vector" << std::endl;
    std::cout << "Mapping yielded " << NUM_BLOCKS << " blocks" << std::endl;

    cudaStream_t s = stream(thrust::detail::derived_cast(exec));

    spmv_csr_vector_kernel<RowIterator, ColumnIterator, ValueIterator1, ValueIterator2, ValueIterator3,
                           UnaryFunction, BinaryFunction1, BinaryFunction2,
                           VECTORS_PER_BLOCK, THREADS_PER_VECTOR> <<<NUM_BLOCKS, THREADS_PER_BLOCK, 0, s>>>
        (A.num_rows, A.row_offsets.begin(), A.column_indices.begin(), A.values.begin(), x.begin(), y.begin(),
         initialize, combine, reduce);
}

int main(int argc, char** argv)
{
    typedef int                 IndexType;
    typedef double              ValueType;
    typedef cusp::device_memory MemorySpace;

    cusp::csr_matrix<IndexType,ValueType,MemorySpace> A;

    if (argc == 1)
    {
        std::cout << "Using default matrix (5-pt Laplacian stencil)" << std::endl;
        cusp::gallery::poisson5pt(A, 1000, 1000);
    }

    size_t N = A.num_rows;

    cusp::array1d<ValueType, MemorySpace> x(N,0);
    cusp::array1d<ValueType, MemorySpace> b(N,1);

    my_policy exec;
    cusp::multiply(exec, A, x, b);

    return 0;
}


--
You received this message because you are subscribed to the Google Groups "cusp-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cusp-users+unsubscribe@googlegroups.com.
To post to this group, send email to cusp-...@googlegroups.com.
Visit this group at https://groups.google.com/group/cusp-users.
For more options, visit https://groups.google.com/d/optout.

Jie Zhao

Oct 7, 2016, 9:13:11 AM
to cusp-users
Thanks a lot, Steve.

So how can I map the device memory space of A to the CUDA backend? Do I have to write CUDA code based on the multiply example myself?

I noticed that there are some examples and applications in the package.

In the "performance" directory, there are some benchmarks. Can these benchmarks be used to measure the performance of the CUSP implementation?

If so, is there a manual or tutorial I can refer to?

Actually, I want to perform a simple SpMV computation like y = A*x on GPU targets and measure the performance (in time or GFLOPS) of the CUSP implementation as a baseline for comparison with my own work. Is there a simple way?

Thanks.

Jie

On Thursday, October 6, 2016 at 9:52:21 PM UTC+2, Steve wrote:

Steven Dalton

Oct 7, 2016, 12:46:17 PM
to cusp-...@googlegroups.com
By default CUSP will target the CUDA backend for all functions where the input resides in device_memory, so you don't have to worry about that.

I think the best way to learn about CUSP is to look through some of the documentation in [1]. There is a long quick start guide (a little out of date).
Another good place to look is the Thrust wiki [2]. Understanding the structure of Thrust and the idea of execution policies will help you out quite a bit with CUSP.
Yes, the performance directory has a few example applications to help benchmark the performance of the library on your system. You should take a good look at the SpMV benchmark folder [3].

Hope that helps.

Steve

[1] http://cusplibrary.github.io/


Jie Zhao

Oct 8, 2016, 5:26:03 AM
to cusp-users
Thanks a lot, Steve.

I will look further into what you mentioned.

Jie

On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:

Jie Zhao

Oct 12, 2016, 11:44:48 AM
to cusp-users
Dear Steve,


On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:
By default CUSP will target the CUDA backend for all functions where the input resides in device_memory so you don't have to worry about that.

Yes the performance directory has a few example applications to help benchmark the performance of the library on your system. You should take a good look at the SpMV benchmark folder [3].



Is there a description of how to run the examples in the performance directory? I looked into the folder [3] but didn't find one :-( Maybe I missed something important.

Thanks.

Jie

Jie Zhao

Oct 12, 2016, 12:11:30 PM
to cusp-users
Hi Steve,


On Friday, October 7, 2016 at 6:46:17 PM UTC+2, Steve wrote:
By default CUSP will target the CUDA backend for all functions where the input resides in device_memory so you don't have to worry about that.

After some further study, I wrote a simple example, shown below.

#include <cusp/multiply.h>
#include <cusp/array2d.h>
#include <cusp/print.h>
int main(void)
{
    // initialize matrix
    // allocate storage for a (4,3) matrix with 6 nonzeros
    cusp::csr_matrix<int,float,cusp::device_memory> A(4,3,6);
    // initialize matrix entries on host
    A.row_offsets[0] = 0;  // first offset is always zero
    A.row_offsets[1] = 2;
    A.row_offsets[2] = 2;
    A.row_offsets[3] = 3;
    A.row_offsets[4] = 6;  // last offset is always num_entries
    A.column_indices[0] = 0; A.values[0] = 10;
    A.column_indices[1] = 2; A.values[1] = 20;
    A.column_indices[2] = 2; A.values[2] = 30;
    A.column_indices[3] = 0; A.values[3] = 40;
    A.column_indices[4] = 1; A.values[4] = 50;
    A.column_indices[5] = 2; A.values[5] = 60;
    // A now represents the following matrix
    //    [10  0 20]
    //    [ 0  0  0]
    //    [ 0  0 30]
    //    [40 50 60]

    // initialize input vector
    cusp::array1d<float, cusp::device_memory> x(3);
    x[0] = 1;
    x[1] = 2;
    x[2] = 3;
    // allocate output vector
    cusp::array1d<float, cusp::device_memory> y(4);

    cudaEvent_t start, stop;
    float elapsedtime;
    cudaEventCreate(&start);
    cudaEventRecord(start, 0);

    // compute y = A * x
    cusp::multiply(A, x, y);

    cudaEventCreate(&stop);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedtime, start, stop);
    printf("Elapsed time (without data transfer) : %f ms\n", elapsedtime);

    // print y
    cusp::print(y);
    return 0;
}

I compiled it with nvcc, and the result is correct.

So, does this mean the example runs on the GPU automatically?

And if so, the elapsedtime I measured is the time elapsed on the GPU device, right?

Thanks.

Jie

Steven Dalton

Oct 12, 2016, 5:35:00 PM
to cusp-...@googlegroups.com
Hello Jie,

Your example appears to be correct. You can build the spmv performance tests using the following commands on Linux.

  // change to the spmv directory
  cd <cusp_dir>/performance/spmv
  // build the spmv tester
  scons arch=sm_50
  // change to the scripts directory
  cd scripts
  // get the test matrices
  // extract matrices.zip to matrices
  unzip matrices.zip -d matrices
  // change line 14 of benchmark.py to : unstructured_path = './matrices/'
  // change line 40 of benchmark.py to : trials = unstructured_mats
  // run the benchmark.py tester
  python benchmark.py

Steve



Jie Zhao

Oct 13, 2016, 3:52:31 AM
to cusp-users
Dear Steve,

Thanks for confirming the correctness of the example, and thank you so much for the explanation of the performance tests.

Jie

On Wednesday, October 12, 2016 at 11:35:00 PM UTC+2, Steve wrote: