The time slot for the calls remains unchanged: the first Wednesday of the month, at 17:00 German time. Our next meeting is therefore on Wednesday, February 7th.
Today we discussed preemption and how it is relevant to dynamic workflows:
- Compute centers may define preemption policies, and PMIx needs to support them when defined.
- Can be combined with checkpointing, to ensure a preempted job preserves its progress.
- The time allowed for a checkpoint may be defined by the system, or negotiated with the job.
- We need to define how the job is notified, e.g. by an event from PMIx, a signal, etc.
- How can the application respond? An event back?
- Which attributes are there? Do we need new ones?
- Consider use cases, such as urgent computing, "power corridor"/band and/or billing situations (e.g. SuperMUC), etc.
- Some research has been done on the scheduling front.
- Power controls can introduce load imbalances, which can cause side effects such as overloaded message queues and/or memory.
- Homogeneous performance at different power per unit, or
- Same power per unit, but heterogeneous performance.
- Can benefit mixed-model workloads, e.g. MPI+X where X tends to be dynamic, such as map-reduce or AI.
- Use resources for epochs instead of holding them for the entirety of the execution.
- Compute centers have an interest in ensuring higher utilization.
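To make the notification and checkpoint points above more concrete, here is a minimal sketch of how a job might react to a preemption notice. It is only an illustration for the discussion, not a proposal for the PMIx interface: it assumes the notice arrives as a POSIX signal (SIGTERM here, chosen arbitrarily), and the names `PreemptibleJob`, `grace_seconds` and `install_handler` are hypothetical. An event-based notification would replace the signal handler but could keep the same structure.

```python
import signal
import time


class PreemptibleJob:
    """Toy job that checkpoints once it has been told it will be preempted."""

    def __init__(self, grace_seconds):
        # Hypothetical checkpoint window, whether set by the system
        # or negotiated with the job.
        self.grace_seconds = grace_seconds
        self.step = 0
        self.checkpoint = None
        self.deadline = None

    def on_preempt(self, signum=None, frame=None):
        # Only record the deadline here; the actual checkpoint is taken
        # from the main loop, at a point where the state is consistent.
        self.deadline = time.monotonic() + self.grace_seconds

    def save_checkpoint(self):
        # Stand-in for writing real application state to storage.
        self.checkpoint = {"step": self.step}
        # Report whether the checkpoint finished inside the allowed window.
        return time.monotonic() <= self.deadline

    def run(self, total_steps):
        # Returns True if the work completed, False if it was preempted.
        while self.step < total_steps:
            if self.deadline is not None:
                self.save_checkpoint()
                return False
            self.step += 1
        return True


def install_handler(job, signum=signal.SIGTERM):
    # Assumption: the resource manager delivers the notice as this signal.
    signal.signal(signum, job.on_preempt)
```

A restarted job would read the saved step and continue from there. How the job reports back that its checkpoint is done (the "event back" question above) is left open in this sketch.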
We will prepare a small survey on preemption in supercomputing and related fields (Isaias volunteered to start it).
Participants: