Hi,
I am reading a paper that describes the design and implementation of the linear solvers in MAGMA. One passage says:
"The kernel driver makes sure that all thread blocks are simultaneously live in order to avoid deadlocks. This is done by launching a number of thread blocks that is less than or equal to the number of multiprocessors on the GPU. We also force the runtime to schedule exactly one thread block per multiprocessor by allocating more than half the shared memory available. The driver reads, at run time, the number of multiprocessors of the GPU, and tunes the value of rnb accordingly."
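To check that I follow the passage, here is a minimal CUDA sketch of how I imagine such a driver could work. It is not MAGMA code: persistent_kernel, choose_grid_size and more_than_half are hypothetical names I made up, and the 128 threads per block and the request for 1000 blocks are arbitrary. It reads the multiprocessor count at run time, caps the grid at that count, requests just over half of one multiprocessor's shared memory, and lets each block report itself so the number of launched blocks can be checked.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the paper's kernel (NOT taken from MAGMA).
__global__ void persistent_kernel(int *block_count)
{
    extern __shared__ char smem[];      // dynamic shared memory, sized at launch
    if (threadIdx.x == 0) {
        smem[0] = 1;                    // touch the allocation
        atomicAdd(block_count, 1);      // every block counts itself once
    }
}

// Never launch more blocks than there are multiprocessors (SMs).
int choose_grid_size(int wanted_blocks, int num_sms)
{
    return std::min(wanted_blocks, num_sms);
}

// Smallest byte count that is strictly more than half of one SM's shared memory.
size_t more_than_half(size_t smem_per_sm)
{
    return smem_per_sm / 2 + 1;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }

    const int threads = 128;            // arbitrary block size for this sketch
    int grid = choose_grid_size(1000, prop.multiProcessorCount);
    size_t smem = more_than_half(prop.sharedMemPerMultiprocessor);

    // Some GPUs do not allow a single block to take this much shared memory.
    if (smem > prop.sharedMemPerBlockOptin) {
        printf("one block cannot request more than half of an SM's shared memory here\n");
        return 1;
    }
    // Requests above the default per-block limit need an explicit opt-in.
    cudaFuncSetAttribute(persistent_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smem);

    // Ask the runtime how many such blocks can be resident on one SM.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, persistent_kernel,
                                                  threads, smem);

    int *d_count = nullptr;
    int h_count = 0;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_count, 0, sizeof(int));

    persistent_kernel<<<grid, threads, smem>>>(d_count);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_count);
        return 1;
    }

    // The copy waits for the kernel to finish before reading the counter.
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);

    printf("SMs: %d, blocks requested: %d, blocks that ran: %d, resident blocks per SM: %d\n",
           prop.multiProcessorCount, grid, h_count, blocks_per_sm);
    return 0;
}
```

If this reading is right, the host already knows the number of thread blocks because it passes it in the launch configuration, and the counter only confirms it. Inside a kernel the same value is available as gridDim.x.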
I have some questions about this. Can I find the driver in dgetf2.cu, or is it "magma_dscal_dger"? How can I check how many thread blocks have been launched?
Based on that, do I understand correctly that just one thread is working while most of the GPU memory (registers and shared memory) is occupied?
Is it possible to use fewer multiprocessors and let the operation take longer instead?
Best regards,
Aran