Hi,
I have two questions about the getf2_native_fused kernel.
1) As I understand it, the panel is divided by rows into pages:
npages = magma_ceildiv(m, ntx);
Can I think of a page as a sub-matrix?
Here ntx is e.g. 512, so if the panel has 12000 rows and 32 columns, we get 24 pages of size 512x32 (the last one only partially filled). Is this deduction correct?
 _______           _______
| xxxxx |         | xxxxx |
                  ---------
| xxxxx |         | xxxxx |
          ---->   ---------
| xxxxx |         | xxxxx |
          ---->   ---------
| xxxxx |         | xxxxx |
                  ---------
| xxxxx |         | xxxxx |
 _______           _______
Here I have used "---------" to show where the panel is split by rows.
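To make my arithmetic concrete, here is a small sketch of how I am counting the pages. The helper ceildiv is my own stand-in for magma_ceildiv (I am assuming it is an integer division rounded up), and rows_in_page is a hypothetical helper that is not part of MAGMA.

```cpp
#include <cassert>

// Hypothetical stand-in for magma_ceildiv: integer division rounded up.
inline int ceildiv(int a, int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of rows in page `page` (0-based) when m rows are split
// into pages of ntx rows each. The last page may be only partly filled.
inline int rows_in_page(int m, int ntx, int page)
{
    const int left = m - page * ntx;   // rows remaining from the start of this page
    if (left <= 0) return 0;           // page lies beyond the end of the panel
    return left < ntx ? left : ntx;
}
```

With m = 12000 and ntx = 512 this gives ceildiv(m, ntx) = 24, with 23 full pages of 512 rows and a last page of 224 rows.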
Or do we have 32 grids, each containing 24 pages? I think this second one is correct. If so, why not use the previous configuration of 24 pages of size 512x32?
I think that in the original paper on this algorithm the number of columns is 8 or 16, not 32.
2) As I see it, to work on the elements of a page (sub-matrix?), you use a register array named "rA". It is filled from global memory like this:
for(int i = 0; i < NPAGES-1; i++){
    rA[i] = dA[ i * TX ];
}
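This is how I read that loop, written as a host-side emulation in plain C++ rather than as the actual kernel. I am assuming that dA has already been offset by the thread index tx, that TX is the number of threads along the rows, and that the last (possibly partial) page is loaded separately with a bounds check. The function name load_thread_column is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Emulates the loads of one thread `tx` in a block of TX threads.
// The thread walks down one column in steps of TX rows, so it ends up
// holding one element per page in its private array rA.
inline std::vector<double> load_thread_column(const std::vector<double>& col,
                                              int tx, int TX)
{
    assert(TX > 0 && tx >= 0 && tx < TX);
    const int m = static_cast<int>(col.size());
    const int npages = (m + TX - 1) / TX;      // same count as ceildiv(m, TX)
    std::vector<double> rA(npages, 0.0);
    for (int i = 0; i < npages; ++i) {
        const int row = tx + i * TX;           // dA[i * TX] with dA offset by tx
        if (row < m) rA[i] = col[row];         // guard for the partial last page
    }
    return rA;
}
```

If this reading is right, each thread holds NPAGES values per column, which is why the register usage grows with the number of rows in the panel.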
What are the warp and thread block configurations?
Why don't we read the page elements into shared memory, instead of increasing the pressure on the register file?
Best regards,
Aran