cuSPARSE csrmv overhead from cudaFree in each call

afedy...@gmail.com

unread,

Jun 8, 2016, 6:23:41 AM6/8/16

to Numba Public Discussion - Public

Hello,

for a solver of an ODE system I need to use sparse matrix-vector multiplication. In the real application the matrices are
of the order of 8000x8000 with 1 - 5 % nnz. Using numbapro/accelerate in combination with nvprof, it turns out
that the cuSPARSE csrmv function spends all time in cudaFree (see nvprof output below). This doesn't happen
in the equivalent dense cuBlas calls where cudaLaunch is on top.

For me it seems like a bug, but maybe I'm mistaken...

Further down is a working minimal example which produces gibberish numbers, but equivalent performance profile. If someone
has an idea how to trace this issue, I will be very happy about an advice.

Thanks!
Anatoli

NVPROF output:

Time spent on CUDA stuff 42.7958910465
==16945== Profiling application: python minexample.py
==16945== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max Name
99.97% 33.0468s      2000 16.523ms 11.713ms 23.977ms void csrMvT_hyb_kernel<float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float>, int*)
0.01% 3.5220ms      2000 1.7610us 1.5040us 3.3600us [CUDA memset]
0.01% 2.3839ms      2000 1.1910us     970ns 1.9920us void scal_kernel<float>(float*, int, float const *, float, bool)
0.01% 2.1056ms      1000 2.1050us 1.7690us 2.7680us void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
0.00% 841.92us         8 105.24us 1.7600us 207.87us [CUDA memcpy HtoD]

==16945== API calls:
Time(%)      Time     Calls       Avg       Min       Max Name
99.24% 33.6282s      2005 16.772ms 8.0420us 362.18ms cudaFree
0.26% 87.647ms         1 87.647ms 87.647ms 87.647ms cuCtxCreate
0.18% 59.783ms      5000 11.956us 7.4830us 68.203us cudaLaunch
0.16% 52.984ms       143 370.52us 2.3740us 2.9184ms cuModuleUnload
0.07% 23.415ms      2003 11.690us 10.488us 174.26us cudaMalloc
0.06% 18.830ms      2000 9.4140us 8.8650us 34.089us cudaMemsetAsync
0.01% 4.1967ms     10000     419ns     165ns 162.58us cudaGetLastError
0.01% 4.1307ms     15000     275ns     173ns 9.9920us cudaSetupArgument
0.01% 2.0417ms      5000     408ns     306ns 157.73us cudaConfigureCall

import numpy as np
from timeit import default_timer as timer

def kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi):

    calc_precision = np.float32
    from numbapro.cudalib import cusparse
    from numbapro.cudalib.cublas import Blas
    from numbapro import cuda

    cusp = cusparse.Sparse()
    cubl = Blas()
    m, n = int_m.shape
    int_m_nnz = int_m.nnz
    int_m_csrValA = cuda.to_device(int_m.data.astype(calc_precision))
    int_m_csrRowPtrA = cuda.to_device(int_m.indptr)
    int_m_csrColIndA = cuda.to_device(int_m.indices)

    dec_m_nnz = dec_m.nnz
    dec_m_csrValA = cuda.to_device(dec_m.data.astype(calc_precision))
    dec_m_csrRowPtrA = cuda.to_device(dec_m.indptr)
    dec_m_csrColIndA = cuda.to_device(dec_m.indices)

    cu_curr_phi = cuda.to_device(phi.astype(calc_precision))
    cu_delta_phi = cuda.device_array(phi.shape, dtype=calc_precision)

    descr = cusp.matdescr()
    descr.indexbase = cusparse.CUSPARSE_INDEX_BASE_ZERO

    for step in xrange(nsteps):
        if not step % 500:
            print 'Solving for step', step
        cusp.csrmv(trans='T', m=m, n=n, nnz=int_m_nnz,
                   descr=descr,
                   alpha=calc_precision(1.0),
                   csrVal=int_m_csrValA,
                   csrRowPtr=int_m_csrRowPtrA,
                   csrColInd=int_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(0.0), y=cu_delta_phi)
        cusp.csrmv(trans='T', m=m, n=n, nnz=dec_m_nnz,
                   descr=descr,
                   alpha=calc_precision(rho_inv[step]),
                   csrVal=dec_m_csrValA,
                   csrRowPtr=dec_m_csrRowPtrA,
                   csrColInd=dec_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(1.0), y=cu_delta_phi)
        cubl.axpy(alpha=calc_precision(dX[step]), x=cu_delta_phi, y=cu_curr_phi)

if __name__ == '__main__':
    from scipy.sparse import rand as srand
    from scipy.sparse import csr_matrix

    nsteps = 1000
    dX = np.linspace(1,1000,nsteps)
    rho_inv = np.ones_like(dX)
    msize = 1024*5
    int_m = srand(msize, msize, density=0.01, dtype=np.float32)
    dec_m = srand(msize, msize, density=0.01, dtype=np.float32)
    int_m = csr_matrix(int_m)
    dec_m = csr_matrix(dec_m)

    phi = np.random.rand(msize)

    start = timer()
    kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi)
    print 'Time spent on CUDA stuff', timer() - start

stuart

unread,

Jun 8, 2016, 12:31:08 PM6/8/16

to Numba Public Discussion - Public, afedy...@gmail.com

Hi,

Thanks for the report. It would seem that nvprof reports 2005 calls to `cudaFree`, and the loop size is 1000 with 2 calls to `csrmv`, this is suspiciously close to the number of frees reported. It seems that if you set `trans` as "N" in the first call the number of frees drops to 1005 and then if you set the same in the second this drops down to 5. I think that cuSparse is probably allocating array temporaries for a transposed CSR format `csrmv` operation.

Further evidence to support this comes from compiling this example:
http://docs.nvidia.com/cuda/cusparse/#appendix-b-cusparse-library-c---example

line 234 can be set to have `CUSPARSE_OPERATION_NON_TRANSPOSE` or `CUSPARSE_OPERATION_TRANSPOSE`

running both options with nvprof indicates and additional allocation and free in the case of `CUSPARSE_OPERATION_TRANSPOSE`.

To get around this, perhaps store and address the sparse matrix in transposed form?

Hope this helps?

Kind Regards,

--

stuart

afedy...@gmail.com

unread,

Jun 9, 2016, 3:58:58 AM6/9/16

to Numba Public Discussion - Public, afedy...@gmail.com, sarch...@continuum.io

Dear Stuart,

thanks a lot for your quick and helpful reply.

Indeed, feeding trasposed matrices in csrmv and avoiding the transposed flag solves all performance problems. (The documentation actually tries to
suggest not use it). I get a huge speed-up now and can easily exceed Dual Xeon E5 performance.

Thanks,
Anatoli

среда, 8 июня 2016 г., 18:31:08 UTC+2 пользователь stuart написал:

Reply all

Reply to author

Forward