Hi Mark,
Thanks for your response. No, I didn't mean that. As I measured it, data allocation and transfer to the device are quite fast, so I don't expect much improvement from keeping the data resident on the device rather than transferring it from the host on each iteration when the MAGMA LAPACK routines are called. What I meant was to allocate data on the device and then access and modify individual cells. The background is that we are working on a frequency-domain EM solver where only parts of the coefficient matrix need to be updated as the frequency changes in each iteration, so transferring the whole coefficient matrix from host to device every time the equation is solved is unnecessary. But as I said, the transfer time is negligible, so we can live with that.
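To illustrate what I have in mind, something like the following sketch: refresh only the frequency-dependent columns on the device instead of re-sending the whole matrix. The names (nCells, h_block, d_A, ldda, the column range) are placeholders from my setup, and I assume a valid, already-created queue:

```cpp
#include "magma_v2.h"

// Sketch: copy only the ncols frequency-dependent columns starting at column
// j0 from the host into the device copy of the coefficient matrix.
// h_block holds the updated columns (column-major, leading dimension nCells);
// d_A was allocated with the padded leading dimension ldda.
void update_columns(magma_int_t nCells, magma_int_t j0, magma_int_t ncols,
                    const magmaFloatComplex *h_block,
                    magmaFloatComplex *d_A, magma_int_t ldda,
                    magma_queue_t queue)
{
    magma_csetmatrix(nCells, ncols,
                     h_block, nCells,           // host source
                     d_A + (size_t)j0 * ldda,   // device destination, offset to column j0
                     ldda, queue);
}
```

This transfers nCells*ncols elements per iteration instead of nCells*nCells.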
There are a few more issues, which I'll list here if that's OK.
1. magma_xsetmatrix always crashes in my case, although I believe I have set all parameters correctly and allocated the needed memory in advance. This is how I call it:
magma_queue_t q;
magma_queue_create(0, &q);  // create the queue on device 0; a declared-but-uncreated queue is invalid
magma_int_t ldda = magma_roundup(nCells, 32);
magmaFloatComplex* d_A;
magma_cmalloc(&d_A, (size_t)ldda * nCells);  // device buffer sized with the padded leading dimension
cuFloatComplex* h_A = reinterpret_cast<cuFloatComplex*>(MKL_RL);
magma_csetmatrix(nCells, nCells, h_A, nCells, d_A, ldda, q);
2. MAGMA's xgetri for matrix inversion only covers the case where the matrix is in device memory, not in host memory. To use it, I need to call magma_xgetmatrix, which crashes every time. Is there a specific reason for that? It would be more convenient to call MAGMA routines with the matrix in host memory and let MAGMA do the job.
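For reference, this is the GPU-interface pattern I am trying, as far as I understand the intended usage (a rough sketch; error checking omitted, and d_A is assumed to already hold the matrix on the device with leading dimension ldda):

```cpp
#include "magma_v2.h"

// Sketch: in-place inversion of an n x n matrix resident on the GPU.
// magma_cgetri_gpu only has a GPU interface, so both the LU factorization and
// the inversion operate on d_A; the inverse is fetched back to h_A afterwards.
void invert_on_device(magma_int_t n, magmaFloatComplex *d_A, magma_int_t ldda,
                      magmaFloatComplex *h_A, magma_int_t lda,
                      magma_queue_t queue)
{
    magma_int_t info;
    magma_int_t *ipiv;
    magma_imalloc_cpu(&ipiv, n);                      // pivot indices on the host

    magma_cgetrf_gpu(n, n, d_A, ldda, ipiv, &info);   // LU factorization on the GPU

    magma_int_t lwork = n * magma_get_cgetri_nb(n);   // workspace size for getri
    magmaFloatComplex *dwork;
    magma_cmalloc(&dwork, lwork);

    magma_cgetri_gpu(n, d_A, ldda, ipiv, dwork, lwork, &info);  // inverse on the GPU

    magma_cgetmatrix(n, n, d_A, ldda, h_A, lda, queue);  // copy the inverse back
    magma_free(dwork);
    magma_free_cpu(ipiv);
}
```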
3. Whenever MAGMA routines are running, I see that most of the CPUs are idle most of the time. I am running on a machine with 24 cores / 32 threads, and when MAGMA is running only 1-2 cores are busy. I am not sure whether the batched routines would split the load between host and device, because as far as I remember the batched routines were optimized for many small matrices. I think MAGMA already uses blocked LU factorization to achieve parallelism, but is that limited to device memory only, and is it implemented for all xgetrf routines? Or are there routines that use both the GPU and the CPU at the same time (tiling, I guess)?
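On this point, what I was hoping for is something like the CPU-interface factorization, which, if I understand the documentation correctly, keeps the matrix in host memory while MAGMA stages panels to the GPU internally, so both CPU and GPU do work. A sketch under that assumption:

```cpp
#include "magma_v2.h"

// Sketch: hybrid LU factorization via the CPU interface. The matrix stays in
// host memory; MAGMA moves panels to the GPU internally, so the factorization
// should keep both the CPU and the GPU busy.
void factor_hybrid(magma_int_t n, magmaFloatComplex *h_A, magma_int_t lda)
{
    magma_int_t info;
    magma_int_t *ipiv;
    magma_imalloc_cpu(&ipiv, n);

    magma_cgetrf(n, n, h_A, lda, ipiv, &info);  // hybrid CPU+GPU LU
    // info == 0 on success; info > 0 means U(i,i) is exactly zero (singular)

    magma_free_cpu(ipiv);
}
```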
4. The last issue is that, according to my tests, MAGMA is ~4 times slower than CUDA LAPACK. Is this already known to the MAGMA team?
Regards,
Dan