Multinode Eigendecomposition


Robert Langefeld

Jul 18, 2024, 12:20:23 PM
to SLATE User
I wanted to ask: is there a reason that the symmetric eigendecomposition algorithm performs the hb2st operation (and possibly sterf, based on comments, though it looks different in the implementation) on the rank 0 process rather than as a distributed operation? Right now, I'm working through running eigendecomposition on a 350k x 350k matrix. Things work correctly for smaller matrices. I wanted to check (1) whether running that operation only on rank 0 is responsible for slower performance as I scale, and (2) how that may affect my memory usage as I scale to a matrix of this size. As it stands, I've been running out of memory during the run, which I plan to troubleshoot, but I wanted to make sure it wasn't due to aggregation to rank 0. Also, are there plans to scale this to multiple nodes in the future? I believe I saw a reference to this in one of the other posts. I am not currently using GPUs for this software. Happy to provide more detail if it's helpful.
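
[For reference, a minimal sketch of the kind of distributed call being described, assuming SLATE's simplified C++ driver names (slate::HermitianMatrix, slate::eig_vals) roughly as in the SLATE documentation; the exact routine name, constructor overload, and options may differ between SLATE versions, so treat this as illustrative only.]

    // Sketch only: check routine and constructor names against your SLATE version.
    #include <slate/slate.hh>
    #include <mpi.h>
    #include <vector>
    #include <cstdint>

    int main( int argc, char** argv )
    {
        MPI_Init( &argc, &argv );

        int64_t n  = 350000;   // global matrix dimension (as in this thread)
        int64_t nb = 256;      // tile size (assumed value, for illustration)
        int p = 4, q = 4;      // 2D process grid; p*q should equal the number of MPI ranks

        // Distributed symmetric (Hermitian) matrix, lower triangle stored in tiles.
        slate::HermitianMatrix<double> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
        A.insertLocalTiles();
        // ... fill local tiles here ...

        // Eigenvalues only; internally this reduces to band form (he2hb),
        // then band to tridiagonal (hb2st), then solves the tridiagonal problem.
        std::vector<double> Lambda( n );
        slate::eig_vals( A, Lambda );

        MPI_Finalize();
        return 0;
    }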

Thanks,
Robert

Mark Gates

Jul 18, 2024, 1:37:26 PM
to Robert Langefeld, SLATE User
Yes, the band to tri/bi-diagonal (hb2st / tb2bd) is the likely bottleneck at this point.

It's future work to implement those on multiple nodes. It's not a trivial algorithm to distribute because updates are not aligned with tile boundaries. Each operation updates portions of 4 tiles, which can span 2 nodes, requiring more fine-grained communication than in other distributed linear algebra algorithms.
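
[To make the tile-footprint point concrete, here is a small standalone illustration, not SLATE code: an update window of nb rows/columns that starts at an arbitrary offset inside the band generally straddles a tile boundary in each dimension, so it can touch up to 2 x 2 = 4 tiles, which may belong to 2 different nodes.]

    // Illustration only: count how many nb x nb tiles an nb-wide update window
    // overlaps, for a few example window start offsets.
    #include <cstdio>
    #include <cstdint>

    int main()
    {
        int64_t nb = 256;                               // tile size (assumed)
        int64_t offsets[] = { 0, 100, 255, 256, 300 };  // example window start offsets

        for (int64_t j0 : offsets) {
            int64_t j1 = j0 + nb - 1;                    // last row/column of the window
            int64_t first_tile = j0 / nb;
            int64_t last_tile  = j1 / nb;
            int64_t tiles_1d   = last_tile - first_tile + 1;  // tiles touched per dimension
            std::printf( "window [%lld, %lld]: %lld tile(s) per dimension -> up to %lld tiles total\n",
                         (long long) j0, (long long) j1,
                         (long long) tiles_1d, (long long) (tiles_1d * tiles_1d) );
        }
        return 0;
    }

[Only windows that happen to start exactly on a multiple of nb stay within a single tile per dimension; all other offsets touch two tiles per dimension, hence up to four tiles total.]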

Even on a single node, I think there's room for better multi-threading.

Mark

Interim Director, Innovative Computing Laboratory (ICL)
Research Assistant Professor, University of Tennessee, Knoxville

Robert Langefeld

Jul 18, 2024, 2:16:36 PM
to SLATE User, mga...@icl.utk.edu, Robert Langefeld
As a follow-up:
1) Is there a timeline for that kind of implementation, even a ballpark estimate like a couple of months versus a year or two or more? I understand if there isn't an estimate yet, of course. I'm trying to determine whether what I've been trying to implement is feasible now or in the near future, or whether I need to shift focus to something else or a different approach to this problem.
2) Is there a sense of how the memory grows with the matrix size and with the number of OMP threads? My guess was that the band to tri/bi-diagonal takes significantly less memory than step 1, even though it's on rank 0. I had also used 4 OMP threads per process in the run that crashed. I'm not sure, based on how it's implemented, whether decreasing OMP threads while keeping memory per MPI process constant will help this.
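
[For a rough sense of scale, a back-of-the-envelope estimate of the main data structures, assuming double precision, n = 350000 as in this thread, and a band width around an assumed tile size of 256; SLATE's internal workspace and any eigenvector matrix are not counted here.]

    // Illustration only: rough sizes of the dense matrix, the band matrix,
    // and the tridiagonal for n = 350000 in double precision.
    #include <cstdio>

    int main()
    {
        double n  = 350000.0;   // matrix dimension
        double nb = 256.0;      // tile size / band width (assumed)
        double B  = 8.0;        // bytes per double
        double GB = 1e9;

        double full_matrix = n * n * B / GB;            // dense n x n storage
        double lower_half  = full_matrix / 2.0;         // roughly, if only the lower triangle is tiled
        double band        = n * (nb + 1.0) * B / GB;   // band matrix after reduction to band form
        double tridiag     = 2.0 * n * B / GB;          // diagonal + off-diagonal of the tridiagonal

        std::printf( "full dense matrix : %8.1f GB (distributed over all ranks)\n", full_matrix );
        std::printf( "lower triangle    : %8.1f GB (roughly)\n", lower_half );
        std::printf( "band matrix       : %8.3f GB\n", band );
        std::printf( "tridiagonal       : %8.4f GB\n", tridiag );
        return 0;
    }

[By this rough count the distributed dense matrix dominates (on the order of a terabyte in total), while the band and tridiagonal stages are comparatively tiny, consistent with the guess above; again, workspace for the reduction steps is not included.]
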

Thanks for the information and help.

Robert
