Hi,
I have a question about the dgetf2_nopiv_internal_batched kernels. I do not understand why ntcol and the shared memory size are calculated like this:
const magma_int_t ntcol = (m1 > 32) ? 1 : (2 * (32/m1));
magma_int_t shmem = ntcol * magma_ceilpow2(n) * sizeof(double);
magma_int_t gridx = magma_ceildiv(batchCount, ntcol);
dim3 threads(m1, ntcol, 1);
dim3 grid(gridx, 1, 1);
Why is ntcol = 1 when m1 > 32, and why is it different for small m1?
I also have a question about the template instantiation:
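To make the question concrete, here is a small host-side sketch of how I read the launch configuration. The helpers ceildiv and ceilpow2, the LaunchConfig struct, and launch_config are my own hypothetical stand-ins, not MAGMA code; I am assuming magma_ceildiv rounds the quotient up and magma_ceilpow2 returns the smallest power of two >= n.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the MAGMA helpers (assumed behaviour only).
inline int ceildiv(int x, int y) { return (x + y - 1) / y; }  // assumed: x / y rounded up

inline int ceilpow2(int n) {  // assumed: smallest power of two >= n
    int p = 1;
    while (p < n) p *= 2;
    return p;
}

// Hypothetical container for the values computed in the snippet above.
struct LaunchConfig {
    int ntcol;               // y-dimension of the thread block
    int threads_per_block;   // m1 * ntcol
    std::size_t shmem_bytes; // dynamic shared memory per block
    int gridx;               // number of thread blocks
};

// Mirrors the arithmetic from the snippet, nothing more.
inline LaunchConfig launch_config(int m1, int n, int batchCount) {
    LaunchConfig c;
    c.ntcol = (m1 > 32) ? 1 : (2 * (32 / m1));
    c.threads_per_block = m1 * c.ntcol;
    c.shmem_bytes = static_cast<std::size_t>(c.ntcol) * ceilpow2(n) * sizeof(double);
    c.gridx = ceildiv(batchCount, c.ntcol);
    return c;
}
```

With this, launch_config(8, 8, 1000) gives ntcol = 8, 64 threads per block and 125 blocks, while launch_config(64, 10, 1000) gives ntcol = 1, 64 threads per block and 1000 blocks. So for m1 <= 32 the block seems to stay at roughly 64 threads, with what looks like ntcol matrices handled per block. Is that the intent?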
case n: dgetf2_nopiv_batched_kernel< n, magma_ceilpow2( n)><<<grid, threads, shmem, queue->cuda_stream()>>>(m1, dA_array, ai, aj, ldda, info_array, gbstep, batchCount); break;
Are the template arguments <n, magma_ceilpow2(n)> used to size a register buffer? If so, why are both n and magma_ceilpow2(n) needed?
Aran