Hello,
I'm looking for some feedback to see if I'm thinking about this the right way.
- I have two large matrices (N×N with N = 125,000) that I need to multiply. In double precision, each matrix takes up about 120 GB.
- I have four Nvidia A100s available, each with 40 GB of memory, so 160 GB of GPU memory combined.
- To get C = AB, I need to hold A, B, and C in host memory, so about 360 GB on the CPU side.
- If I cut each matrix into three ~40 GB panels, then at any one time the GPUs only need one panel of A, one panel of B, and one result block, just under 100 GB, which fits in the combined 160 GB (quick sizing check below).
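For reference, here's the quick arithmetic behind those numbers, as a little sanity-check program (assuming 8-byte doubles and GB = 10^9 bytes):

#include <stdio.h>

int main(void)
{
    long long N = 125000;                                 /* matrix dimension        */
    double GB = 1e9;
    double full  = (double)N * N * 8 / GB;                /* one full N x N matrix   */
    double panel = (double)(N / 3) * N * 8 / GB;          /* one 1/3 panel of A or B */
    double block = (double)(N / 3) * (N / 3) * 8 / GB;    /* one result block AiBj   */

    printf("full matrix : %.1f GB\n", full);              /* ~125.0 GB */
    printf("panel       : %.1f GB\n", panel);             /* ~ 41.7 GB */
    printf("result block: %.1f GB\n", block);             /* ~ 13.9 GB */
    printf("in flight   : %.1f GB\n", 2 * panel + block); /* Ai + Bj + AiBj, ~97 GB */
    return 0;
}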
To compute AB, I can break A into three N/3 × N row panels and B into three N × N/3 column panels:
[A1]                [A1B1 A1B2 A1B3]
[A2] [B1 B2 B3]  =  [A2B1 A2B2 A2B3]
[A3]                [A3B1 A3B2 A3B3]
Thus, I can do 9 smaller matrix multiplications, one per block AiBj, and collect the results on the CPU.
To code this up, can I use something like magma_dsetmatrix_1D_col_bcyclic to copy each piece of A and B over to the GPUs and then use magma_dgemm to multiply?
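To make the question concrete, below is a toy version of the loop I have in mind. It uses plain magma_dsetmatrix / magma_dgetmatrix per tile and one magma_dgemm per tile (magma_dgemm works on a single device at a time), rather than magma_dsetmatrix_1D_col_bcyclic; whether the block-cyclic distribution is the better way to stage the pieces is exactly what I'm unsure about. The toy N, the hardcoded four GPUs, and the round-robin tile-to-GPU assignment are just my own assumptions, not anything MAGMA prescribes:

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void)
{
    magma_init();

    /* Toy sizes so this actually runs; the real problem would use N = 125000
       and a finer split (see note below). Assumes the four GPUs from above. */
    const magma_int_t N = 3000, T = 3, ngpu = 4;
    const magma_int_t nb = N / T;                 /* panel width = block size */

    /* Host matrices, column-major, leading dimension N. */
    double *A = malloc((size_t)N * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));
    double *C = malloc((size_t)N * N * sizeof(double));
    for (size_t k = 0; k < (size_t)N * N; ++k) { A[k] = 1.0; B[k] = 2.0; }

    /* One queue and one set of tile buffers per GPU. */
    magma_queue_t queue[4];
    magmaDouble_ptr dA[4], dB[4], dC[4];
    for (magma_int_t d = 0; d < ngpu; ++d) {
        magma_setdevice(d);
        magma_queue_create(d, &queue[d]);
        magma_dmalloc(&dA[d], (size_t)nb * N);    /* Ai  : nb x N  */
        magma_dmalloc(&dB[d], (size_t)N * nb);    /* Bj  : N  x nb */
        magma_dmalloc(&dC[d], (size_t)nb * nb);   /* AiBj: nb x nb */
    }

    /* The 9 tile products AiBj, round-robined over the GPUs. The copy
       routines used here are blocking, so reusing each GPU's buffers for
       its next tile is safe. */
    for (magma_int_t bi = 0; bi < T; ++bi) {
        for (magma_int_t bj = 0; bj < T; ++bj) {
            magma_int_t d = (bi * T + bj) % ngpu;
            magma_setdevice(d);

            /* Row panel Ai (rows bi*nb ...) and column panel Bj (cols bj*nb ...). */
            magma_dsetmatrix(nb, N, A + (size_t)bi * nb, N, dA[d], nb, queue[d]);
            magma_dsetmatrix(N, nb, B + (size_t)bj * nb * N, N, dB[d], N, queue[d]);

            /* AiBj = Ai (nb x N) * Bj (N x nb). */
            magma_dgemm(MagmaNoTrans, MagmaNoTrans, nb, nb, N,
                        1.0, dA[d], nb, dB[d], N, 0.0, dC[d], nb, queue[d]);

            /* Copy the finished block back into its slot in C. */
            magma_dgetmatrix(nb, nb, dC[d], nb,
                             C + (size_t)bi * nb + (size_t)bj * nb * N, N, queue[d]);
        }
    }

    for (magma_int_t d = 0; d < ngpu; ++d) {
        magma_setdevice(d);
        magma_queue_sync(queue[d]);
        magma_free(dA[d]); magma_free(dB[d]); magma_free(dC[d]);
        magma_queue_destroy(queue[d]);
    }

    printf("C[0] = %g (expect %g)\n", C[0], 2.0 * N);  /* 1.0 * 2.0 summed over N */

    free(A); free(B); free(C);
    magma_finalize();
    return 0;
}

One thing I notice writing it out: at the real N = 125,000, a 3-way split means roughly 42 + 42 + 14 = 97 GB on one card per tile product, which doesn't fit in 40 GB. So I'd either need a finer split (T = 8 gives roughly 16 + 16 + 2 = 34 GB per GPU, if my arithmetic is right), or the panels themselves would have to be spread across the GPUs, which is where the block-cyclic routine might come in.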
Thanks for the help!
Cheers,
tom