I think the idea of having a generous default timelimit is the
wrong way to go. In fact, I think any defaults for jobs are a bad
way to go. The majority of your users will just use that default
time limit, and backfill scheduling will remain useless to you.
Instead, I recommend you use your job_submit.lua to reject all
jobs that don't have a wallclock time and print out a helpful
error message to inform users they now need to specify a wallclock
time, and provide a link to documentation on how to do that.
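A minimal job_submit.lua sketch of that idea might look like the following. It's only a sketch: the message text and the documentation URL are placeholders, and the exact return code you use on rejection may vary by site and Slurm version.

```lua
-- job_submit.lua (sketch): reject jobs submitted without an explicit time limit.

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- time_limit is left at NO_VAL when the user did not pass -t/--time
    if job_desc.time_limit == nil or job_desc.time_limit == slurm.NO_VAL then
        slurm.log_user("Please specify a wallclock limit with -t/--time. " ..
                       "See https://docs.example.org/slurm/timelimits (placeholder URL).")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```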
Requiring users to specify a time limit themselves does two
things:
1. It reminds them that it's important to be conscious of timelimits when submitting jobs
2. If a job is killed before it's done and all the progress is
lost because the job wasn't checkpointing, they can't blame you as
the admin.
If you do this, it's easy to get the users on board by first
providing useful and usable documentation on why timelimits are
needed and how to set them. Be sure to hammer home the point that
accurate timelimits can get their jobs running sooner and can
increase cluster efficiency/utilization, giving them a better return
on their investment (if they contribute to the cluster's cost) or
simply letting them get more science done. I like to frame it as
accurate wallclock times giving them a competitive edge over other
cluster users in getting their jobs running. Everyone likes to think
what they're doing will give them an advantage!
My 4 cents (adjusted for inflation).
Prentice
--
Prentice Bisbal
HPC Systems Engineer III
Computational & Information Systems Laboratory (CISL)
NSF National Center for Atmospheric Research (NSF NCAR)
https://www.cisl.ucar.edu
https://ncar.ucar.edu
Another 4 cents:
I think automatically increasing job time limits, or otherwise disabling job termination due to time, will cause you headaches down the road. Coupling Slurm's behavior to the load of the cluster, or other state, will be difficult to communicate to users,
because the behavior of their jobs becomes non-deterministic. You'll answer a lot of questions that start, "this job completed the last time I ran it…". And you'll have to evaluate the state of the system near the job's time limit to understand what happened
(write good logs!!). I'd avoid playing detective at that recurring crime scene…
I forced my users to specify time limits and they quickly adapted:
`JobSubmitPlugins=require_timelimit`
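That's the job_submit/require_timelimit plugin, enabled in slurm.conf. As a sketch (the lua entry below is just an example of stacking plugins, not a requirement):

```
# slurm.conf (sketch): reject jobs that do not specify a time limit.
# Multiple job submit plugins are given as a comma-separated list.
JobSubmitPlugins=lua,require_timelimit
# slurmctld needs to pick up this change (restart/reconfigure).
```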
Good luck!
Sebastian
I'm the one to blame for taking this conversation off target.
Sorry about that!
Unfortunately, people are hard, and I don't think any amount of
technology will ever fix that. 😉
Thanks for explaining your situation; it's certainly different
from what most of us see. I would say you need to plan for growth
(i.e., a busy cluster), and it sounds like you're already heading
that way now that you've fixed the usability issue(s). Every
time you make a policy change, it takes effort to get the word out
and retrain/recondition your users to adapt to the change, so my
advice is: whatever policy you go with now, try to choose one that
would survive at least a couple of increases in cluster usage at
the rate you're seeing now.
I take the "if you build it, they will come" attitude - if you
design your cluster to handle a lot of traffic, it will!
Imagine a user with a weeklong job who estimated a 7-day wallclock limit and "for good measure" requested 8 days, but whose job would actually take 9 days.
As much as I advocate for accurate timelimits, I always tell my users to specify a bit more, aiming for 10-15% over their estimate. If they're not sure how long the job will run, or they have low confidence in predicting run time, they shouldn't be afraid to be more generous with their estimates; as they run more jobs, they'll get a better feel for predicting run time. Having a job get killed 5 minutes before it finishes, after running for 72 hours, is also a waste of compute time. It's a balancing act.
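As a rough illustration of that margin (the numbers here are hypothetical): a job expected to run about 72 hours might be submitted with a limit of about 80 hours, roughly 10% over.

```
#!/bin/bash
# Expected runtime ~72 h; request ~80 h for a ~10% safety margin.
#SBATCH --time=3-08:00:00   # format: days-hours:minutes:seconds
```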
Yes, there's checkpointing, but that's way outside the scope of
this conversation.
Prentice