Batch processing kata

Leon Rosenshein

Aug 23, 2019, 12:56:18 PM
to ArchitecturalKatas

Batches ‘R Us

A university wants to take full advantage of its various research clusters. It currently manages ~2K nodes across 50 clusters with an average utilization of 30%, yet regularly gets complaints that no resources are available in a given cluster. The goal is to decrease the time customers spend waiting for resources, increase overall utilization, and reduce management overhead.
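To put the numbers above in perspective, here is a quick back-of-the-envelope sketch. It assumes "~2K" means exactly 2,000 nodes and that clusters are roughly equal in size, neither of which the kata states; the function name is just for illustration.

```python
# Figures from the problem statement; "~2K" is taken as exactly 2000 (assumption).
TOTAL_NODES = 2000
CLUSTERS = 50
AVG_UTILIZATION = 0.30


def fleet_summary(total_nodes, clusters, utilization):
    """Return average cluster size plus busy/idle node counts for the fleet."""
    busy = total_nodes * utilization
    return {
        # Assumption: clusters are evenly sized; real clusters likely vary.
        "nodes_per_cluster": total_nodes / clusters,
        "busy_nodes": busy,
        "idle_nodes": total_nodes - busy,
    }


summary = fleet_summary(TOTAL_NODES, CLUSTERS, AVG_UTILIZATION)
for name, value in summary.items():
    print(f"{name}: {value:.0f}")
```

Under those assumptions an average cluster has about 40 nodes, and on average roughly 1,400 of the 2,000 nodes sit idle while teams on a full cluster wait. That gap is what the kata asks you to close.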

  • Users: ~50 teams with 5-15 people each

  • Requirements:

    • Isolation from noisy neighbors

    • Minimum throughput needs to be guaranteed

    • Workloads are:

      • Distributed/parallelized (10-10,000 tasks)

      • Tasks take 30 min - 30 hours

      • Prioritized. Higher priority work should push aside lower priority work

      • Run tens to hundreds of times with different datasets

    • Lots of research work, so rapid iteration on code changes is imperative

    • Users need to be able to differentiate and aggregate these workloads

    • Workloads can’t install dependencies on nodes

  • Additional Context:

    • Assume work is idempotent

    • There is plenty of network storage available

    • GPU support is a future use-case
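For sizing discussions, the workload bounds in the requirements can be turned into node-hours. This is only a rough sketch: it assumes each task occupies one whole node for its full duration, which the kata does not specify, and again treats "~2K" as 2,000 nodes.

```python
# Bounds taken from the workload requirements.
MIN_TASKS, MAX_TASKS = 10, 10_000
MIN_TASK_HOURS, MAX_TASK_HOURS = 0.5, 30.0
TOTAL_NODES = 2000  # "~2K" taken as exactly 2000 (assumption)


def node_hours(tasks, hours_per_task):
    """Node-hours for one run, assuming one task per node (assumption)."""
    return tasks * hours_per_task


smallest_run = node_hours(MIN_TASKS, MIN_TASK_HOURS)
largest_run = node_hours(MAX_TASKS, MAX_TASK_HOURS)

# Best-case wall-clock time for the largest run if it had every node to itself.
largest_run_wall_clock_hours = largest_run / TOTAL_NODES

print(f"smallest run: {smallest_run:.0f} node-hours")
print(f"largest run: {largest_run:.0f} node-hours")
print(f"largest run on the whole fleet: {largest_run_wall_clock_hours:.0f} hours")
```

With these assumptions a single run ranges from 5 to 300,000 node-hours, and the largest run would need at least 150 hours of wall-clock time even with the entire fleet. Since runs repeat tens to hundreds of times, the spread is worth keeping in mind when thinking about priorities and guaranteed minimum throughput.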

