Batches ‘R Us
A university wants to take full advantage of its various research clusters. It currently manages roughly 2,000 nodes across 50 clusters at an average utilization of 30%, yet regularly receives complaints that no resources are available in a given cluster. The goal is to decrease the time customers spend waiting for resources, increase overall utilization, and reduce management overhead.
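As a rough back-of-the-envelope check on these figures, the sketch below works out the average cluster size and the average idle capacity. It assumes that "~2K" means exactly 2,000 nodes, that nodes are roughly interchangeable, and that the 30% utilization is averaged over nodes; the scenario states none of these.

```python
# Figures from the scenario; "~2K" is treated as exactly 2000 (assumption).
TOTAL_NODES = 2000
CLUSTERS = 50
AVG_UTILIZATION_PCT = 30

# Integer arithmetic keeps the results exact.
nodes_per_cluster = TOTAL_NODES // CLUSTERS              # average cluster size
busy_nodes = TOTAL_NODES * AVG_UTILIZATION_PCT // 100    # node-equivalents in use on average
idle_nodes = TOTAL_NODES - busy_nodes                    # node-equivalents unused on average

print(f"average cluster size: {nodes_per_cluster} nodes")
print(f"busy on average:      {busy_nodes} nodes")
print(f"idle on average:      {idle_nodes} nodes")
```

Under these assumptions an average cluster has 40 nodes, and about 1,400 nodes' worth of capacity sits idle at any time. That is consistent with the complaint pattern: one small cluster can be full while capacity is unused elsewhere.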
Users: ~50 teams with 5-15 people each
Requirements:
Isolation from noisy neighbors
Minimum throughput needs to be guaranteed
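One possible reading of the two requirements above is a per-team guaranteed share of nodes, with unused share lent to teams that want more. The sketch below illustrates that reading only; the function name `allocate`, the team names, and the node counts are all hypothetical and do not come from the scenario.

```python
def allocate(capacity: int, guaranteed: dict[str, int], demand: dict[str, int]) -> dict[str, int]:
    """Give each team min(demand, guarantee), then lend spare nodes to teams wanting more."""
    if sum(guaranteed.values()) > capacity:
        raise ValueError("guarantees exceed total capacity")

    # Step 1: every team gets what it asks for, up to its guaranteed minimum.
    alloc = {team: min(demand.get(team, 0), g) for team, g in guaranteed.items()}
    spare = capacity - sum(alloc.values())

    # Step 2: lend spare nodes one at a time, round-robin, to teams with unmet demand.
    while spare > 0:
        wanting = [t for t in alloc if demand.get(t, 0) > alloc[t]]
        if not wanting:
            break
        for team in wanting:
            if spare == 0:
                break
            alloc[team] += 1
            spare -= 1
    return alloc


# Hypothetical teams and numbers, for illustration only.
result = allocate(
    capacity=100,
    guaranteed={"genomics": 40, "physics": 40, "nlp": 20},
    demand={"genomics": 10, "physics": 90, "nlp": 30},
)
print(result)
```

In this example the genomics team uses only 10 of its 40 guaranteed nodes, so the remaining capacity is lent to the other two teams. When a lender's demand returns, the allocation would have to be recomputed and borrowed nodes reclaimed, which means stopping some running tasks.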
Workloads are:
Distributed/parallelized (10-10,000 tasks)
Tasks take 30 minutes to 30 hours
Prioritized: higher-priority work should preempt lower-priority work
Run tens to hundreds of times with different datasets
Lots of research work, so rapid iteration on code changes is imperative
Users need to be able to differentiate and aggregate these workloads
Workloads can’t install dependencies on nodes
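The prioritization item above can be sketched as a toy scheduler in which a higher-priority task evicts the lowest-priority running task when no slot is free. This is a minimal illustration, not a proposed design: the `Scheduler` class, the slot model, and the task names are hypothetical, and requeueing an evicted task from scratch is only reasonable if work is idempotent.

```python
import heapq
from itertools import count


class Scheduler:
    """Toy scheduler: a fixed number of slots; a larger priority number wins."""

    def __init__(self, slots: int):
        self.slots = slots
        self.running: dict[str, int] = {}  # task name -> priority
        self.pending: list = []            # heap of (-priority, seq, name)
        self._seq = count()                # tie-breaker: FIFO within one priority

    def submit(self, name: str, priority: int) -> None:
        heapq.heappush(self.pending, (-priority, next(self._seq), name))
        self._schedule()

    def finish(self, name: str) -> None:
        del self.running[name]
        self._schedule()

    def _schedule(self) -> None:
        while self.pending:
            neg_prio, _, name = self.pending[0]
            prio = -neg_prio
            if len(self.running) < self.slots:
                # Free slot: start the highest-priority pending task.
                heapq.heappop(self.pending)
                self.running[name] = prio
                continue
            # No free slot: preempt only if the candidate strictly outranks
            # the lowest-priority running task.
            victim = min(self.running, key=self.running.get)
            if self.running[victim] >= prio:
                break
            victim_prio = self.running.pop(victim)
            heapq.heappop(self.pending)
            self.running[name] = prio
            # Requeue the evicted task so it restarts later (assumes idempotent work).
            heapq.heappush(self.pending, (-victim_prio, next(self._seq), victim))


# Hypothetical task names, for illustration only.
s = Scheduler(slots=2)
s.submit("nightly-sweep-1", priority=1)
s.submit("nightly-sweep-2", priority=1)
s.submit("paper-deadline", priority=10)  # evicts one of the sweeps
print(sorted(s.running))
s.finish("paper-deadline")               # the evicted sweep gets its slot back
print(sorted(s.running))
```

Because tasks can run for up to 30 hours, restarting an evicted task from the beginning can waste a lot of work; a real design would likely weigh that cost, for example by checkpointing to the network storage.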
Additional Context:
Assume work is idempotent
There is plenty of network storage available
GPU support is a future use case