Hello,
for a solver of an ODE system I need sparse matrix-vector multiplication. In the real application the matrices are on the order of 8000x8000 with 1-5% nnz. Profiling numbapro/accelerate with nvprof, it turns out that the cuSPARSE csrmv calls spend practically all of their time in cudaFree (see the nvprof output below); the 2005 cudaFree calls in the API summary match the 2000 csrmv launches almost one-to-one, as if every call allocated and freed a temporary buffer. This doesn't happen with the equivalent dense cuBLAS calls, where cudaLaunch is at the top.
To me this looks like a bug, but maybe I'm mistaken...
Further down is a minimal working example; it produces gibberish numbers but shows an equivalent performance profile. If someone has an idea how to trace this issue, I would be very grateful for advice.
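For reference, each step of the solver performs the following update (written here as a scipy CPU reference, matching the two csrmv calls and the axpy in the example below):

    # per-step update in scipy notation: essentially a forward-Euler step
    delta_phi = int_m.T.dot(phi) + rho_inv[step] * dec_m.T.dot(phi)
    phi = phi + dX[step] * delta_phi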
Thanks!
Anatoli
nvprof output (collected with: nvprof python minexample.py):
Time spent on CUDA stuff 42.7958910465
==16945== Profiling application: python minexample.py
==16945== Profiling result:
Time(%) Time Calls Avg Min Max Name
99.97% 33.0468s 2000 16.523ms 11.713ms 23.977ms void csrMvT_hyb_kernel<float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float>, int*)
0.01% 3.5220ms 2000 1.7610us 1.5040us 3.3600us [CUDA memset]
0.01% 2.3839ms 2000 1.1910us 970ns 1.9920us void scal_kernel<float>(float*, int, float const *, float, bool)
0.01% 2.1056ms 1000 2.1050us 1.7690us 2.7680us void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
0.00% 841.92us 8 105.24us 1.7600us 207.87us [CUDA memcpy HtoD]
==16945== API calls:
Time(%) Time Calls Avg Min Max Name
99.24% 33.6282s 2005 16.772ms 8.0420us 362.18ms cudaFree
0.26% 87.647ms 1 87.647ms 87.647ms 87.647ms cuCtxCreate
0.18% 59.783ms 5000 11.956us 7.4830us 68.203us cudaLaunch
0.16% 52.984ms 143 370.52us 2.3740us 2.9184ms cuModuleUnload
0.07% 23.415ms 2003 11.690us 10.488us 174.26us cudaMalloc
0.06% 18.830ms 2000 9.4140us 8.8650us 34.089us cudaMemsetAsync
0.01% 4.1967ms 10000 419ns 165ns 162.58us cudaGetLastError
0.01% 4.1307ms 15000 275ns 173ns 9.9920us cudaSetupArgument
0.01% 2.0417ms 5000 408ns 306ns 157.73us cudaConfigureCall
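To dig deeper, I could probably also run the script with the API trace enabled and check how the cudaMalloc/cudaFree calls interleave with the kernel launches:

    nvprof --print-api-trace python minexample.py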
Minimal example (minexample.py):

import numpy as np
from timeit import default_timer as timer


def kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi):
    calc_precision = np.float32

    from numbapro.cudalib import cusparse
    from numbapro.cudalib.cublas import Blas
    from numbapro import cuda

    cusp = cusparse.Sparse()
    cubl = Blas()

    # Copy the CSR components of both matrices to the device
    m, n = int_m.shape
    int_m_nnz = int_m.nnz
    int_m_csrValA = cuda.to_device(int_m.data.astype(calc_precision))
    int_m_csrRowPtrA = cuda.to_device(int_m.indptr)
    int_m_csrColIndA = cuda.to_device(int_m.indices)

    dec_m_nnz = dec_m.nnz
    dec_m_csrValA = cuda.to_device(dec_m.data.astype(calc_precision))
    dec_m_csrRowPtrA = cuda.to_device(dec_m.indptr)
    dec_m_csrColIndA = cuda.to_device(dec_m.indices)

    cu_curr_phi = cuda.to_device(phi.astype(calc_precision))
    cu_delta_phi = cuda.device_array(phi.shape, dtype=calc_precision)

    descr = cusp.matdescr()
    descr.indexbase = cusparse.CUSPARSE_INDEX_BASE_ZERO

    for step in xrange(nsteps):
        if not step % 500:
            print 'Solving for step', step
        # delta_phi = int_m^T * curr_phi
        cusp.csrmv(trans='T', m=m, n=n, nnz=int_m_nnz,
                   descr=descr,
                   alpha=calc_precision(1.0),
                   csrVal=int_m_csrValA,
                   csrRowPtr=int_m_csrRowPtrA,
                   csrColInd=int_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(0.0), y=cu_delta_phi)
        # delta_phi += rho_inv[step] * dec_m^T * curr_phi
        cusp.csrmv(trans='T', m=m, n=n, nnz=dec_m_nnz,
                   descr=descr,
                   alpha=calc_precision(rho_inv[step]),
                   csrVal=dec_m_csrValA,
                   csrRowPtr=dec_m_csrRowPtrA,
                   csrColInd=dec_m_csrColIndA,
                   x=cu_curr_phi, beta=calc_precision(1.0), y=cu_delta_phi)
        # curr_phi += dX[step] * delta_phi
        cubl.axpy(alpha=calc_precision(dX[step]), x=cu_delta_phi, y=cu_curr_phi)


if __name__ == '__main__':
    from scipy.sparse import rand as srand
    from scipy.sparse import csr_matrix

    nsteps = 1000
    dX = np.linspace(1, 1000, nsteps)
    rho_inv = np.ones_like(dX)
    msize = 1024 * 5
    # Random CSR matrices with ~1% nnz, similar to the real application
    int_m = csr_matrix(srand(msize, msize, density=0.01, dtype=np.float32))
    dec_m = csr_matrix(srand(msize, msize, density=0.01, dtype=np.float32))
    phi = np.random.rand(msize)

    start = timer()
    kern_CUDA_sparse(nsteps, dX, rho_inv, int_m, dec_m, phi)
    print 'Time spent on CUDA stuff', timer() - start
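One idea for narrowing it down, in case the transpose path (the csrMvT_hyb_kernel above) converts the matrix to a temporary format on every call: build the transposed matrices once on the host with scipy and call csrmv with trans='N'. Below is an untested sketch under that assumption, reusing the names from the example above:

    # Untested sketch: pre-transpose once on the host, then csrmv with trans='N'.
    # Assumes the same cusp.csrmv keyword arguments as above; the matrices are
    # square, so m and n stay the same.
    int_m_T = int_m.T.tocsr()
    int_mT_csrValA = cuda.to_device(int_m_T.data.astype(calc_precision))
    int_mT_csrRowPtrA = cuda.to_device(int_m_T.indptr)
    int_mT_csrColIndA = cuda.to_device(int_m_T.indices)
    # (dec_m prepared the same way)
    # inside the step loop, replacing the first csrmv call:
    cusp.csrmv(trans='N', m=m, n=n, nnz=int_m_T.nnz,
               descr=descr,
               alpha=calc_precision(1.0),
               csrVal=int_mT_csrValA,
               csrRowPtr=int_mT_csrRowPtrA,
               csrColInd=int_mT_csrColIndA,
               x=cu_curr_phi, beta=calc_precision(0.0), y=cu_delta_phi)

If the cudaFree time disappears with trans='N', that would point to the transposed code path doing per-call allocations rather than to anything in my setup.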