[slurm-users] Job dispatching policy


Mahmood Naderan

Apr 23, 2019, 2:47:55 AM
to Slurm User Community List
Hi,
How can I change the job distribution policy? Some nodes are running non-Slurm jobs, and the dispatcher does not seem to be aware of the system load, so it assumes that those nodes are free.

I want to change the policy based on the system load.

Regards,
Mahmood



Richard Randriatoamanana

Apr 23, 2019, 3:05:10 AM
to Slurm User Community List
Hi Mahmood,

Try the LBNL Node Health Check tool. Nodes which are determined to be "unhealthy" can be marked as down or offline so as to prevent jobs from being scheduled or run on them.

https://github.com/mej/nhc/blob/master/README.md#lbnl-node-health-check-nhc


Regards,

Richard

@cnscfr

--
Sent from my mobile
Apologies for the typos

Prentice Bisbal

Apr 23, 2019, 10:11:58 AM
to slurm...@lists.schedmd.com

This is not a good practice. Allowing users to run jobs on Slurm-controlled nodes outside of the Slurm mechanism defeats the purpose of using Slurm in the first place.

--
Prentice

Mahmood Naderan

Apr 24, 2019, 2:16:02 AM
to Slurm User Community List
Thanks for the info.
The thing is that I don't want to mark the whole node as unhealthy. Consider the following scenarios:

compute-0-0 running slurm jobs and system load is 15 (32 cores)
compute-0-1 running non-slurm jobs and system load is 25 (32 cores)
Then a new slurm job should be dispatched to compute-0-0


compute-0-0 running slurm jobs and system load is 25 (32 cores)
compute-0-1 running non-slurm jobs and system load is 10 (32 cores)
Then a new slurm job should be run on compute-0-1 (assuming that it needs about 10 cores and not 30 cores).
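
The selection rule in these two scenarios can be sketched in a few lines of shell. This is only an illustration of the rule, not something Slurm does by itself: the function name `pick_node` and the input format (`<node> <load> <cores>` per line) are assumptions made for the example, and "spare capacity" is taken to be cores minus load.

```shell
#!/bin/sh
# Hypothetical helper: choose the node with the most spare capacity.
# stdin: one line per node, "<node> <load> <cores>"
# $1:    number of cores the new job needs
pick_node() {
    awk -v need="$1" '
        { spare = $3 - $2 }                      # cores not accounted for by the load
        spare >= need && (best == "" || spare > best_spare) {
            best = $1; best_spare = spare        # keep the roomiest node that fits
        }
        END { if (best != "") print best; else exit 1 }
    '
}

# First scenario: prints compute-0-0
printf '%s\n' 'compute-0-0 15 32' 'compute-0-1 25 32' | pick_node 10

# Second scenario: prints compute-0-1
printf '%s\n' 'compute-0-0 25 32' 'compute-0-1 10 32' | pick_node 10
```

If no node has enough spare capacity (for example a 30-core request in the second scenario), the function prints nothing and returns a non-zero status. On a live cluster the input could probably be taken from `sinfo`, whose `%O` output field reports a node's CPU load, but how that would be fed back into scheduling is not covered in this thread.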


I know that running non-Slurm jobs sounds ugly, but some X11 applications are not Slurm-friendly.
The number of nodes running non-Slurm jobs is small, though.


John Hearns

Apr 24, 2019, 4:01:20 AM
to Slurm User Community List
If those applications really cannot be run under Slurm, I would suggest reserving a set of nodes for interactive use and disabling the Slurm daemon on them.
Direct users to those nodes.

More constructively - maybe the list can help you get the X11 applications to run using Slurm.
Could you give some details please?

Mahmood Naderan

Apr 27, 2019, 5:21:55 AM
to Slurm User Community List
>More constructively - maybe the list can help you get the X11 applications to run using Slurm.
>Could you give some details please?



For example, I cannot run this GUI program with salloc:


[mahmood@rocks7 ~]$ cat workbench.sh
#!/bin/bash
unset SLURM_GTIDS
/state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
[mahmood@rocks7 ~]$ rocks run host compute-0-1 "ls /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2"
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
/state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
[mahmood@rocks7 ~]$ salloc -w compute-0-1 -c 2 --mem=4G -p RUBY -A y4 ./workbench.sh
salloc: Granted job allocation 938
./workbench.sh: line 4: /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2: No such file or directory
salloc: Relinquishing job allocation 938



Regards,
Mahmood




Chris Samuel

Apr 27, 2019, 11:46:59 AM
to slurm...@lists.schedmd.com
On 27/4/19 2:20 am, Mahmood Naderan wrote:

> ./workbench.sh: line 4:
> /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2: No such file
> or directory

That doesn't look like it's related to Slurm to me. If the file itself
exists, then my suspicion is that it's a script and the interpreter named
in its first #! line does not exist.

What does this command say on that node?

file /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
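
The interpreter hypothesis can also be checked directly. The sketch below reads a script's first line and tests whether the interpreter named there exists and is executable; `check_shebang` is a hypothetical helper name, not an existing tool, and it only looks at the interpreter path, ignoring any interpreter arguments.

```shell
#!/bin/sh
# Hypothetical helper: report whether the interpreter on a script's "#!" line exists.
check_shebang() {
    first_line=$(head -n 1 "$1") || return 2     # unreadable or missing file
    case $first_line in
        '#!'*) ;;
        *) echo "no shebang line"; return 0 ;;
    esac
    # Drop "#!", any spaces after it, and everything after the interpreter path.
    interp=$(printf '%s\n' "$first_line" |
        sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')
    if [ -x "$interp" ]; then
        echo "interpreter ok: $interp"
    else
        echo "interpreter missing: $interp"
        return 1
    fi
}
```

It would have to be run on the node in question, for example as `check_shebang /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2` on compute-0-1, since the interpreter may exist on one host and not on another.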

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Mahmood Naderan

Apr 29, 2019, 8:20:16 AM
to Slurm User Community List
[mahmood@rocks7 ~]$ rocks run host compute-0-1 "file /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2"

Warning: untrusted X11 forwarding setup failed: xauth key data not generated
/state/partition1/ans190/v190/Framework/bin/Linux64/runwb2: POSIX shell script, ASCII text executable
[mahmood@rocks7 ~]$ ssh compute-0-1 -Y
Last login: Mon Apr 29 08:12:07 2019 from rocks7.local
Rocks Compute Node
Rocks 7.0 (Manzanita)
Profile built 17:50 24-Dec-2018

Kickstarted 09:35 24-Dec-2018
[mahmood@compute-0-1 ~]$ /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
[mahmood@compute-0-1 ~]$


On the node, the program opened and I saw the GUI. Then I closed it.


This is not the only problem. I also have problems with qemu runs.


Regards,
Mahmood




Prentice Bisbal

Apr 29, 2019, 9:22:50 AM
to slurm...@lists.schedmd.com

I see two separate, unrelated problems here:

Problem 1:

Warning: untrusted X11 forwarding setup failed: xauth key data not generated

What have you done to investigate this xauth problem further?

I know there have been discussions about this problem in the past on this mailing list. Did you search to see if the previous discussions contained a fix for this?

Problem 2:

./workbench.sh: line 4: /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2: No such file or directory

Something is wrong with your job specification. You are referencing an incorrect path somewhere.

Prentice

Chris Samuel

Apr 29, 2019, 8:59:25 PM
to slurm...@lists.schedmd.com
On Monday, 29 April 2019 5:18:56 AM PDT Mahmood Naderan wrote:

> [mahmood@rocks7 ~]$ rocks run host compute-0-1 "file
> /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2"

Given that file says it's a shell script, try and run it with this to see what doesn't work:

rocks run host compute-0-1 /bin/bash -x /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2

Also why aren't you using the Slurm commands to run things?
Does this "rocks" command use them under the covers?

Mahmood Naderan

Apr 30, 2019, 1:49:50 PM
to Slurm User Community List
>Also why aren't you using the Slurm commands to run things?

Which command?

Regards,
Mahmood




Mark Hahn

Apr 30, 2019, 1:56:22 PM
to Slurm User Community List
>> Also why aren't you using the Slurm commands to run things?
>
> Which command?

srun or sbatch
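
As a hedged illustration of what that could look like for the workbench.sh case earlier in the thread: by default salloc runs the given command on the host where salloc itself was invoked, which would explain a "No such file or directory" for a path that exists only on compute-0-1, whereas srun launches the command on the allocated node. The sketch reuses the resource options from the original salloc line; `launch_on_node` and the `DRY_RUN` variable are hypothetical names introduced for the example, and by default the command is only printed.

```shell
#!/bin/sh
# Hypothetical wrapper: launch a script on a chosen compute node with srun.
# Resource options are copied from the salloc line quoted earlier in the thread.
# DRY_RUN defaults to 1, so the command is printed rather than executed.
launch_on_node() {
    node=$1
    script=$2
    set -- srun -w "$node" -c 2 --mem=4G -p RUBY -A y4 "$script"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

launch_on_node compute-0-1 ./workbench.sh
```

Setting `DRY_RUN=0` runs the printed command for real. A GUI program also needs X11 forwarding to the compute node; srun has an `--x11` option in Slurm builds with X11 support, but whether that is enabled on this cluster cannot be told from the thread.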
