cuSPARSE csrmv overhead from cudaFree in each call

1 view
Skip to first unread message

afedy...@gmail.com

unread,
Jun 8, 2016, 6:23:41 AM6/8/16
to Numba Public Discussion - Public
Hello,

for a solver of an ODE system I need to use sparse matrix-vector multiplication. In the real application the matrices are
of the order of 8000x8000 with 1 - 5 % nnz. Using numbapro/accelerate in combination with nvprof, it turns out
that the cuSPARSE csrmv function spends all time in cudaFree (see nvprof output below). This doesn't happen
in the equivalent dense cuBlas calls where cudaLaunch is on top.

For me it seems like a bug, but maybe I'm mistaken...

Further down is a working minimal example which produces gibberish numbers, but equivalent performance profile. If someone
has an idea how to trace this issue, I will be very happy about an advice.

Thanks!
Anatoli


NVPROF output:

Time spent on CUDA stuff 42.7958910465
==16945== Profiling application: python minexample.py
==16945== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.97%  33.0468s      2000  16.523ms  11.713ms  23.977ms  void csrMvT_hyb_kernel<float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float>, int*)
  0.01%  3.5220ms      2000  1.7610us  1.5040us  3.3600us  [CUDA memset]
  0.01%  2.3839ms      2000  1.1910us     970ns  1.9920us  void scal_kernel<float>(float*, int, float const *, float, bool)
  0.01%  2.1056ms      1000  2.1050us  1.7690us  2.7680us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
  0.00%  841.92us         8  105.24us  1.7600us  207.87us  [CUDA memcpy HtoD]

==16945== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.24%  33.6282s      2005  16.772ms  8.0420us  362.18ms  cudaFree
  0.26%  87.647ms         1  87.647ms  87.647ms  87.647ms  cuCtxCreate
  0.18%  59.783ms      5000  11.956us  7.4830us  68.203us  cudaLaunch
  0.16%  52.984ms       143  370.52us  2.3740us  2.9184ms  cuModuleUnload
  0.07%  23.415ms      2003  11.690us  10.488us  174.26us  cudaMalloc
  0.06%  18.830ms      2000  9.4140us  8.8650us  34.089us  cudaMemsetAsync
  0.01%  4.1967ms     10000     419ns     165ns  162.58us  cudaGetLastError
  0.01%  4.1307ms     15000     275ns     173ns  9.9920us  cudaSetupArgument
  0.01%  2.0417ms      5000     408ns     306ns  157.73us  cudaConfigureCall

import numpy as np
from timeit import default_timer as timer

def kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi):
   
    calc_precision = np.float32
    from numbapro.cudalib import cusparse
    from numbapro.cudalib.cublas import Blas
    from numbapro import cuda

    cusp = cusparse.Sparse()
    cubl = Blas()
    m, n = int_m.shape
    int_m_nnz = int_m.nnz
    int_m_csrValA = cuda.to_device(int_m.data.astype(calc_precision))
    int_m_csrRowPtrA = cuda.to_device(int_m.indptr)
    int_m_csrColIndA = cuda.to_device(int_m.indices)
   
    dec_m_nnz = dec_m.nnz
    dec_m_csrValA = cuda.to_device(dec_m.data.astype(calc_precision))
    dec_m_csrRowPtrA = cuda.to_device(dec_m.indptr)
    dec_m_csrColIndA = cuda.to_device(dec_m.indices)
   
    cu_curr_phi = cuda.to_device(phi.astype(calc_precision))
    cu_delta_phi = cuda.device_array(phi.shape, dtype=calc_precision)

    descr = cusp.matdescr()
    descr.indexbase = cusparse.CUSPARSE_INDEX_BASE_ZERO
   
    for step in xrange(nsteps):
        if not step % 500:
            print 'Solving for step', step
        cusp.csrmv(trans='T', m=m, n=n, nnz=int_m_nnz,
                   descr=descr,
                   alpha=calc_precision(1.0),
                   csrVal=int_m_csrValA,
                   csrRowPtr=int_m_csrRowPtrA,
                   csrColInd=int_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(0.0), y=cu_delta_phi)
        cusp.csrmv(trans='T', m=m, n=n, nnz=dec_m_nnz,
                   descr=descr,
                   alpha=calc_precision(rho_inv[step]),
                   csrVal=dec_m_csrValA,
                   csrRowPtr=dec_m_csrRowPtrA,
                   csrColInd=dec_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(1.0), y=cu_delta_phi)
        cubl.axpy(alpha=calc_precision(dX[step]), x=cu_delta_phi, y=cu_curr_phi)

if __name__ == '__main__':
    from scipy.sparse import rand as srand
    from scipy.sparse import csr_matrix

    nsteps = 1000
    dX = np.linspace(1,1000,nsteps)
    rho_inv = np.ones_like(dX)
    msize = 1024*5
    int_m = srand(msize, msize, density=0.01, dtype=np.float32)
    dec_m = srand(msize, msize, density=0.01, dtype=np.float32)
    int_m = csr_matrix(int_m)
    dec_m = csr_matrix(dec_m)

    phi = np.random.rand(msize)

    start = timer()
    kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi)
    print 'Time spent on CUDA stuff', timer() - start



stuart

unread,
Jun 8, 2016, 12:31:08 PM6/8/16
to Numba Public Discussion - Public, afedy...@gmail.com

Hi,

Thanks for the report. It would seem that nvprof reports 2005 calls to `cudaFree`, and the loop size is 1000 with 2 calls to `csrmv`, this is suspiciously close to the number of frees reported. It seems that if you set `trans` as "N" in the first call the number of frees drops to 1005 and then if you set the same in the second this drops down to 5. I think that cuSparse is probably allocating array temporaries for a transposed CSR format `csrmv` operation.

Further evidence to support this comes from compiling this example:
http://docs.nvidia.com/cuda/cusparse/#appendix-b-cusparse-library-c---example
line 234 can be set to have `CUSPARSE_OPERATION_NON_TRANSPOSE` or `CUSPARSE_OPERATION_TRANSPOSE`
running both options with nvprof indicates and additional allocation and free in the case of `CUSPARSE_OPERATION_TRANSPOSE`.

To get around this, perhaps store and address the sparse matrix in transposed form?

Hope this helps?

Kind Regards,

--
stuart

afedy...@gmail.com

unread,
Jun 9, 2016, 3:58:58 AM6/9/16
to Numba Public Discussion - Public, afedy...@gmail.com, sarch...@continuum.io
Dear Stuart,

thanks a lot for your quick and helpful reply.

Indeed, feeding trasposed matrices in csrmv and avoiding the transposed flag solves all performance problems. (The documentation actually tries to
suggest not use it). I get a huge speed-up now and can easily exceed Dual Xeon E5 performance.

Thanks,
Anatoli

среда, 8 июня 2016 г., 18:31:08 UTC+2 пользователь stuart написал:
Reply all
Reply to author
Forward
0 new messages