thread blocks in dscal_dger


aran nokan

Sep 16, 2021, 1:22:09 PM
to MAGMA User
Hi,

I am reading a paper that describes the idea and implementation of the linear solvers in MAGMA. In one part it says:

The kernel driver makes sure that all thread blocks are
simultaneously live in order to avoid deadlocks. This is
done by launching a number of thread blocks that is less
than or equal to the number of multiprocessors on the GPU.
We also force the runtime to schedule exactly one thread
block per multiprocessor by allocating more than half the
shared memory available. The driver reads, at run time, the
number of multiprocessors of the GPU, and tunes the value
of rnb accordingly.

I have some questions here. Can I find the driver in dgetf2.cu, or is it "magma_dscal_dger"? How can I check how many thread blocks have been launched?
Based on that, is my understanding correct that just one thread is working, while most of the GPU memory (registers and shared memory) is occupied?

Is it possible to reduce the number of multiprocessors used, even if the operation takes longer?

Best regards,
Aran

Ahmad Abdelfattah

Sep 16, 2021, 1:36:38 PM
to aran nokan, MAGMA User
This kernel is under magmablas/zgetf2_native_kernel.cu. The high-level driver mentioned in the paper is under src/zgetf2_native.cpp.

The kernel is usually launched with 32 thread blocks, each caching one column of the LU panel. If there are fewer than 32 multiprocessors on the GPU, the high-level driver goes for thinner panels.
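
Roughly, that run-time decision could look like the following sketch (not the actual MAGMA driver code; the variable names are illustrative):

    // Sketch only: read the SM count at run time and cap the panel width
    // at the default of 32.
    int device, num_sms;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    int panel_width = (num_sms < 32) ? num_sms : 32;   // thinner panel on smaller GPUs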

Based on that, is my understanding correct that just one thread is working, while most of the GPU memory (registers and shared memory) is occupied?

No, each thread block is typically configured with 512 threads. Each thread block works on one column of the panel.

Is it possible to reduce the number of multiprocessors used, even if the operation takes longer?

I think you mean thread blocks rather than “multiprocessors”, since the latter refers to the physical processing units on the GPU. As I mentioned above, the high-level driver goes for thinner panels when the number of multiprocessors on the GPU is less than the default value (which is 32).

Ahmad



aran nokan

Sep 16, 2021, 1:53:54 PM
to Ahmad Abdelfattah, MAGMA User
Thanks, Ahmad, for the fast reply.

Actually, I need to run some other operation in parallel with dscal_dger, so I am trying to find a way to keep some parts of the GPU free for other tasks. My problem is that even with small dimensions I am not able to run other kernels in parallel. So, in that case, by reducing 32 to something like 16, can I keep some SMs free?

Ahmad Abdelfattah

Sep 16, 2021, 2:25:29 PM
to aran nokan, MAGMA User
Reducing the width of the panel might help, but keep in mind that the fused kernel requires special care before launching it. For example, you must make sure that there are 16 free multiprocessors when you launch it on a panel of width 16; otherwise, you could run into deadlocks or launch failures.

Ahmad

aran nokan

Sep 16, 2021, 2:38:58 PM
to Ahmad Abdelfattah, MAGMA User
Thanks Ahmad.

Now, for width 32, are you checking for free multiprocessors? If so, in which part of the code? Maybe I can read and modify that part to handle a width of 16.

Do you have any other ideas for keeping some multiprocessors free and providing a lighter, slower version of dscal_dger?

Regards,
Aran

Ahmad Abdelfattah

Sep 16, 2021, 2:56:10 PM
to aran nokan, MAGMA User
I don't think there is a way to check for free multiprocessors. I think the best approach is to synchronize with respect to all streams before launching the kernel; only after that should you launch any other kernels.
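
Something like this (a minimal sketch; the kernel and stream names are placeholders, not MAGMA identifiers):

    // Drain all outstanding work so every SM is free, launch the fused panel
    // kernel first, and only afterwards enqueue the other kernels.
    cudaDeviceSynchronize();   // synchronize with respect to all streams
    fused_panel_kernel<<<grid, block, shmem, panel_stream>>>(/* args */);   // placeholder
    other_kernel<<<grid2, block2, 0, other_stream>>>(/* args */);           // launched after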

Ahmad

aran nokan

Sep 24, 2021, 6:21:52 AM
to Ahmad Abdelfattah, MAGMA User
Our previous conversation got to the point that "reducing the width of the panel might help". But changing the width of the panel will also change the sizes of the other kernels (TRSM and GEMM). How can I reduce the grid size of dscal_dger without modifying the panel width?

Regards,
Aran



Ahmad Abdelfattah

Sep 24, 2021, 9:08:53 AM
to aran nokan, MAGMA User
The dscal_dger kernel has a grid configuration that is proportional to the height of the panel. If you want to use fewer thread-blocks, you will need to write a new kernel or modify the existing one so that every thread block is assigned to a larger part of the panel. 
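
As a minimal sketch of that idea (not the actual dscal_dger kernel; no pivoting, simplified indexing, and made-up names), a grid-stride version lets a fixed, smaller grid walk over all the panel rows:

    // Unblocked scal+ger step on an m-by-n panel, launched with a fixed
    // number of blocks so that each block covers several row chunks.
    __global__ void scal_ger_strided(int m, int n, double* dA, int ldda)
    {
        double rp = 1.0 / dA[0];   // reciprocal of the pivot (assumed nonzero)
        for (int row = blockIdx.x * blockDim.x + threadIdx.x + 1; row < m;
             row += gridDim.x * blockDim.x)
        {
            double l = dA[row] * rp;              // dscal: scale the pivot column
            dA[row] = l;
            for (int col = 1; col < n; col++)     // dger: rank-1 update of trailing columns
                dA[row + col * ldda] -= l * dA[col * ldda];
        }
    }
    // e.g. scal_ger_strided<<<16, 512>>>(m, n, dA, ldda);  // 16 blocks regardless of m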

Ahmad

aran nokan

Oct 15, 2021, 1:13:43 PM
to Ahmad Abdelfattah, MAGMA User
I have some doubts about this paper. I would be happy if someone could help me understand.


Page 7:
The kernel driver makes sure that all thread blocks are simultaneously live in order to avoid deadlocks.

How could deadlocks happen here, and why? How does the kernel make sure that thread blocks are live?

We also force the runtime to schedule exactly one thread block per multiprocessor by allocating more than half the shared memory available.

Why not allocate more than half (e.g. 3/4)?

Are both implementations mentioned in that paper available in the MAGMA code? For example, I see sgetf2_native_kernel; is this D2 or D1?

Regards,
Aran

Ahmad Abdelfattah

Oct 16, 2021, 5:35:11 PM
to aran nokan, MAGMA User


Page 7:
The kernel driver makes sure that all thread blocks are simultaneously live in order to avoid deadlocks.

How could deadlocks happen here, and why? How does the kernel make sure that thread blocks are live?


The kernel uses inter-block communication among all thread blocks. Therefore, if one thread block is not active, other thread blocks may wait for some information from that particular thread block, hence the deadlock. Remember that each block will do the pivot search step for its column. All other blocks must wait for the pivot to be found before progressing. 

The D2 kernel was developed before CUDA cooperative groups became mature. If you want to make sure that all thread blocks are live simultaneously, please check out CUDA cooperative groups. We might switch to them in the future.
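
For reference, a grid-wide synchronization with cooperative groups looks roughly like this (a sketch, not MAGMA code; it needs compute capability 6.0+, compilation with -rdc=true, and a cooperative launch):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void coop_step(double* data)
    {
        cg::grid_group grid = cg::this_grid();
        // ... per-block work, e.g. each block searches its own column ...
        grid.sync();   // all blocks are guaranteed resident, so this cannot deadlock
        // ... work that needs every block to have finished the step above ...
    }

    // The cooperative launch fails up front (instead of deadlocking) if the
    // whole grid cannot be resident on the device at once:
    // void* args[] = { &d_data };
    // cudaLaunchCooperativeKernel((void*)coop_step, grid_dim, block_dim, args, 0, 0);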

We also force the runtime to schedule exactly one thread block per multiprocessor by allocating more than half the shared memory available.

Why not allocate more than half (e.g. 3/4)?


Well, 3/4 is more than half. I don’t see any contradiction here :)
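
The point of "more than half" is simply that two blocks can no longer fit on the same multiprocessor, so the scheduler is forced to place at most one block per SM. A sketch of the mechanism (not MAGMA's code; the kernel name is a placeholder):

    // Request just over half of the per-SM shared memory per block, so that
    // no two blocks can co-reside on one multiprocessor.
    int device, smem_per_sm;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smem_per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
    size_t smem = (size_t)smem_per_sm / 2 + 1;   // "more than half"
    // Note: more than 48 KB of dynamic shared memory per block may require opting in
    // via cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, ...).
    // some_kernel<<<num_blocks, 512, smem>>>( /* args */ );   // placeholder kernel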

Are both implementations mentioned in that paper available in the MAGMA code? For example, I see sgetf2_native_kernel; is this D2 or D1?


Both implementations are available. The routine you mentioned is for D2. 

Ahmad 