Hello everyone,
Please find our minutes from meeting number 8 below.
Participants (in no particular order):
- Josef Weidendorfer
- Ralph Castain
- Norbert Eicker
- Simon Pickartz
- Isaias Compres
Dagstuhl Seminar on Dynamic Resource Management
- Martin Schulz' presentation about malleability with MPI Sessions
- Proposal covers the resource adaptation of a running MPI application
- 3 states: current allocation, transition, new allocation:
1.- Application runs as normal with its current Session
2.- A new session is opened to peek at allocation changes
- Low overhead operation; a local handle to the MPI runtime
- If there are no changes to allocation metadata, the Session is closed and the application continues as normal
- If allocation metadata changes, the application parses the data and makes a decision
3.a - To reject the changes, the new Session is closed; the original Session is kept and the application continues as normal
3.b - To accept the allocation update, the application holds both Sessions momentarily, repartitions its domain, and then closes the original Session while keeping the new one to continue its progress.
- No current proposal covers negotiation, whether RM-driven or application-driven
- We have to work with this proposal while the standardization efforts are ongoing.
- No clear timeline for approval of the new malleability API.
- Flux resource manager: deeply hierarchical design with graph-based job requirements
- Co-scheduling discussions: need to utilize ever larger, more parallel single nodes
- Cloud computing RMs (e.g. Kubernetes) versus traditional schedulers for supercomputing: an open question
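
To make the three-state Session workflow from Martin Schulz's presentation easier to discuss, here is a minimal Python sketch of one adaptation pass. It only simulates the control flow; it is not the MPI Sessions API and not the proposal's interface. All names (Allocation, Runtime, Session, adapt, and the accept/repartition callbacks) are hypothetical and were chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Allocation:
    """Hypothetical allocation metadata: a version tag and a node count."""
    version: int
    nodes: int


@dataclass
class Session:
    """Stand-in for an MPI Session handle (assumption, not the real API)."""
    allocation: Allocation
    is_open: bool = True

    def close(self) -> None:
        self.is_open = False


class Runtime:
    """Stand-in for the MPI runtime that knows the latest allocation."""

    def __init__(self, allocation: Allocation) -> None:
        self.allocation = allocation

    def open_session(self) -> Session:
        # Assumed to be a cheap, local operation (see state 2 above).
        return Session(self.allocation)


def adapt(
    runtime: Runtime,
    current: Session,
    accept: Callable[[Allocation, Allocation], bool],
    repartition: Callable[[Allocation, Allocation], None],
) -> Session:
    """Run one peek/decide pass and return the Session to continue with."""
    probe = runtime.open_session()  # state 2: peek at allocation changes

    if probe.allocation == current.allocation:
        probe.close()  # nothing changed: continue as normal
        return current

    if not accept(current.allocation, probe.allocation):
        probe.close()  # state 3.a: reject, keep the original Session
        return current

    # State 3.b: both Sessions are held while the domain is repartitioned.
    repartition(current.allocation, probe.allocation)
    current.close()
    return probe
```

In this sketch the application owns the decision through the accept callback. Since the proposal does not yet cover negotiation, how an RM-driven variant would look is left open here.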
Slurm fork for PMIx, Open PMIx integration and testing:
- Ralph has created a new fork of Slurm for rapid development
- Decouples our experimentation from upstream approval of patches
- Ralph brings Open PMIx expertise, while others help with Slurm internals
- Aligned with DEEP-SEA activities
- Ralph integrated previous work and early testing is done
- Need to help with testing first
- Some issues are known and marked
- Need to sort out some PMIx standard violations on caller rules
- May need to do large refactorings to make the existing code more manageable
- For malleability: need to revise threading and copying of allocation metadata
- Isaias will look at threading and 'agent' use in Slurm
- Isaias: will identify which plugins can have PMIx versions, such as the 'launch' plugin
Organization:
- Upcoming holiday season: next meeting on 12th of January, 2022
- In that meeting: reschedule on a monthly basis
* I wish you all happy holidays, and we will continue our meetings after the new year! *