npages & shared memory in getf2_native_fused


Aran Nokan

Nov 9, 2021, 6:11:43 PM
to MAGMA User
Hi,

I have two questions about the getf2_native_fused kernel.

1) As I understand it, the panel is split by rows into pages:
npages = magma_ceildiv(m, ntx);
Can I think of a page as a sub-matrix?
Here ntx is, e.g., 512, so if the panel has 12000 rows and 32 columns, we get ceil(12000/512) = 24 pages (512x32). Is this deduction correct?

 _______              _______
| xxxxx |            | xxxxx |
|       |            |-------|
| xxxxx |            | xxxxx |
|       |            |-------|
| xxxxx |   ---->    | xxxxx |
|       |            |-------|
| xxxxx |            | xxxxx |
|       |            |-------|
| xxxxx |            | xxxxx |
|_______|            |_______|

Here I have used "-------" to show where the panel is split by rows.

Or do we have 32 grids, each containing 24 pages? I think this one is correct; if so, why are we not using the previous configuration of 24 pages (512x32)?

I think in the original paper on this algorithm the number of columns is 8 or 16, not 32.


2) As I see it, to work on the elements of a page (sub-matrix?) you use a register array named "rA". The elements are read from global memory like this:

    for(int i = 0; i < NPAGES-1; i++){
        rA[i] = dA[ i * TX ];
    }


What is the warp and thread-block configuration?

Why don't we read the page elements into shared memory instead of increasing the pressure on the registers?

Best regards,
Aran

Ahmad Abdelfattah

Nov 9, 2021, 9:46:08 PM
to Aran Nokan, MAGMA User
The kernel assigns one column per thread block. The number of columns can be 8, 16, or 32. We can modify our software after publishing the paper :)

If #rows = 12000 and we use 512 threads, then we need ceiling(12000/512) = 24 pages. The panel is distributed by column across thread blocks, and each column is held in the registers of its block's 512 threads, with every thread keeping its 24 elements in the register variable rA[24].
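
As a rough illustration of the layout described above, the following host-side C++ sketch works through the numbers from this thread (m = 12000, ntx = 512). It is not MAGMA code: `ceildiv` re-implements the rounding-up division that `magma_ceildiv` is used for, `Location`, `locate`, and `rows_in_last_page` are hypothetical names, and the row-to-thread mapping assumes that `dA` in the load loop has already been offset by the thread index.

```cpp
#include <cassert>

// Rounding-up integer division, standing in for magma_ceildiv
// (re-implemented here so the sketch is self-contained).
constexpr int ceildiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical description of where one panel element is cached.
struct Location {
    int block;   // thread-block index: one block per panel column
    int thread;  // thread index inside the block (0 .. ntx-1)
    int page;    // index into that thread's register array rA
};

// Assumed mapping: thread t of a block holds rows t, t + ntx, t + 2*ntx, ...
// of its column, which matches a load of the form rA[i] = dA[i * TX].
inline Location locate(int row, int col, int ntx) {
    assert(row >= 0 && col >= 0 && ntx > 0);
    return Location{col, row % ntx, row / ntx};
}

// Number of valid rows in the last (possibly partial) page of a column.
constexpr int rows_in_last_page(int m, int ntx) {
    return m - (ceildiv(m, ntx) - 1) * ntx;
}

// Numbers from the thread: 12000 rows and 512 threads per block.
static_assert(ceildiv(12000, 512) == 24, "24 pages per column");
static_assert(rows_in_last_page(12000, 512) == 224, "last page is partial");
```

For example, `locate(11999, 31, 512)` places the last row of the last column in block 31, thread 223, page 23. Only 224 threads hold a valid element in the last page, since 12000 - 23*512 = 224.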

The register file is larger than shared memory, so we can use it to cache more data. 
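
To put rough numbers on this, the sketch below compares the storage one column needs with the on-chip capacities of a single SM. The capacities are assumptions taken from the published figures for an NVIDIA V100 (65536 32-bit registers and up to 96 KB of shared memory per SM); other GPUs differ, and this thread does not say which device is targeted. The element type is assumed to be double precision, and the helper names are hypothetical.

```cpp
#include <cstddef>

// Assumed per-SM capacities (NVIDIA V100 figures, for illustration only).
constexpr std::size_t kRegistersPerSM    = 65536;                // 32-bit registers
constexpr std::size_t kRegisterFileBytes = kRegistersPerSM * 4;  // 256 KB
constexpr std::size_t kSharedBytesPerSM  = 96 * 1024;            // 96 KB maximum

// 32-bit registers one thread needs to hold its rA[npages] array.
constexpr std::size_t registers_per_thread(std::size_t npages,
                                           std::size_t elem_bytes) {
    return npages * (elem_bytes / 4);
}

// Bytes cached by one thread block (one column, padded to whole pages).
constexpr std::size_t cached_bytes(std::size_t npages, std::size_t ntx,
                                   std::size_t elem_bytes) {
    return npages * ntx * elem_bytes;
}

// Example from the thread: 24 pages, 512 threads, 8-byte (double) elements.
static_assert(registers_per_thread(24, 8) == 48,
              "48 registers per thread for rA");
static_assert(cached_bytes(24, 512, 8) == 98304,
              "one padded column is 96 KB");
static_assert(cached_bytes(24, 512, 8) <= kRegisterFileBytes,
              "column fits in the assumed register file");
static_assert(cached_bytes(24, 512, 8) >= kSharedBytesPerSM,
              "column would fill the assumed shared memory");
```

Under these assumptions, one padded column (24 pages x 512 threads x 8 bytes = 98304 bytes) would take all 96 KB of shared memory on an SM but well under half of its 256 KB register file, at 48 registers per thread for rA.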

Ahmad



