[slurm-users] Detecting non-MPI jobs running on multiple nodes


Loris Bennett

Sep 29, 2022, 3:27:18 AM
to Slurm Users Mailing List
Hi,

Has anyone already come up with a good way to identify non-MPI jobs which
request multiple cores but don't restrict themselves to a single node,
leaving cores idle on all but the first node?

I can see that this is potentially not easy, since an MPI job might
still have phases where only one core is actually being used.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Email: loris....@fu-berlin.de

Davide DelVento

Sep 29, 2022, 8:28:56 AM
to Slurm User Community List
At my previous job, cron jobs ran on every node measuring possibly idle
cores; the measurements were averaged over the duration of the job and
reported (the day after) via email to the user support team.
I believe they stopped doing so when compute became (relatively) cheap
and memory and I/O became the expensive resources.
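
From memory, the per-node logic was roughly the sketch below (a
reconstruction, not the actual tool; it assumes Slurm's cgroup v1
layout under /sys/fs/cgroup/cpuset/slurm, and the paths would differ
on a cgroup v2 system):

#!/bin/bash
# Hypothetical cron sampler: for every Slurm job cgroup on this node,
# compare allocated cores with the number of runnable threads.
for jobdir in /sys/fs/cgroup/cpuset/slurm/uid_*/job_*; do
    [ -d "$jobdir" ] || continue
    jobid=${jobdir##*job_}
    # Cores allocated to the job on this node, e.g. "0-3,8" -> 5
    ncores=$(tr ',' '\n' < "$jobdir/cpuset.cpus" |
             awk -F- '{ n += (NF == 2) ? $2 - $1 + 1 : 1 } END { print n }')
    # Threads currently runnable (state R) inside the job's cgroup
    nrun=0
    for pid in $(cat "$jobdir"/step_*/cgroup.procs 2>/dev/null); do
        for task in /proc/"$pid"/task/*/status; do
            state=$(awk '/^State:/ { print $2; exit }' "$task" 2>/dev/null)
            [ "$state" = R ] && nrun=$((nrun + 1))
        done
    done
    # Log one sample; the averaging over the job's lifetime happened later
    logger -t core-sampler "job=$jobid alloc=$ncores running=$nrun"
done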

I know it does not help you much, but perhaps it is something to think about.

Ole Holm Nielsen

Sep 29, 2022, 8:52:12 AM
to slurm...@lists.schedmd.com
Hi Loris,

On 9/29/22 09:26, Loris Bennett wrote:
> Has anyone already come up with a good way to identify non-MPI jobs which
> request multiple cores but don't restrict themselves to a single node,
> leaving cores idle on all but the first node?
>
> I can see that this is potentially not easy, since an MPI job might
> still have phases where only one core is actually being used.

Just an idea: the "pestat -F" tool [1] will tell you if any nodes have an
"unexpected" CPU load. If you see the same JobID running on multiple nodes
with too low a CPU load, that might point to a job such as you describe.
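
Untested, but as a rough post-processing step something like this would
list JobIDs that appear on more than one flagged node (the field
positions are assumed from pestat's default layout, so check them
against your version):

pestat -F | awk '
    # Fields 9, 11, ... are assumed to hold "JobID User" pairs
    NR > 2 { for (i = 9; i <= NF; i += 2)
                 if ($i ~ /^[0-9]/) nodes[$i]++ }
    END    { for (j in nodes) if (nodes[j] > 1)
                 print "job", j, "flagged on", nodes[j], "nodes" }'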

/Ole

[1] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

Loris Bennett

Sep 29, 2022, 9:21:41 AM
to Slurm User Community List
Hi Davide,

That is an interesting idea. We already do some averaging, but over the
whole of the past month. For each user we use the output of seff to
generate two scatterplots: CPU-efficiency vs. CPU-hours and
memory-efficiency vs. GB-hours. See

https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
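
The collection step is essentially a loop over seff output, roughly
like this sketch (the field positions are assumed from seff's
human-readable output, and $user is just illustrative):

sacct -n -X -u "$user" -S "$(date -d '-1 month' +%F)" -s CD -o JobID |
while read -r jobid; do
    # "CPU Efficiency: 99.17% of 04:00:00 core-walltime" -> id, %, hours
    seff "$jobid" | awk -v j="$jobid" '/^CPU Efficiency/ { print j, $3, $5 }'
done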

However, I am mainly interested in being able to cancel some of the inefficient
jobs before they have run for too long.

Cheers,

Loris

Loris Bennett

Sep 29, 2022, 9:41:27 AM
to Slurm User Community List
Hi Ole,

I do already use 'pestat -F', although this flags over 100 of our 170
nodes, so it results in a bit of information overload. I guess it would
be nice if the sensitivity of the flagging could be tweaked on the
command line, so that only the worst nodes are shown.

I also use some wrappers around 'sueff' from

https://github.com/ubccr/stubl

to generate part of an ASCII dashboard (a 'dasciiboard'?), which looks
like

Username  Mem_Request  Max_Mem_Use  CPU_Efficiency  Number_of_CPUs_In_Use
alpha     42000M       0.03Gn       48.80%          (0.98 of 2)
beta      10500M       11.01Gn      99.55%          (3.98 of 4)
gamma     8000M        8.39Gn       99.64%          (63.77 of 64)
...
chi       varied       3.96Gn       83.65%          (248.44 of 297)
phi       1800M        1.01Gn       98.79%          (248.95 of 252)
omega     16G          4.61Gn       99.69%          (127.60 of 128)

== Above data from: Thu 29 Sep 15:26:29 CEST 2022 =============================

and just loops every 30 seconds. This is what I use to spot users with
badly configured jobs.

However, I'd really like to be able to identify non-MPI jobs on multiple
nodes automatically.

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Sep 29, 2022, 10:04:58 AM
to Slurm User Community List
Can you check Slurm for a job that requests multiple nodes but doesn't have mpirun (or srun, or mpiexec) running on its head node?
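
Untested, but something along these lines might work as a first pass
(it assumes password-less ssh to the compute nodes, and the launcher
list is only illustrative):

squeue -h -t R -o '%A %D %B %u' |
while read -r jobid nnodes bhost user; do
    [ "$nnodes" -gt 1 ] || continue
    # %B is the batch host; look there for an MPI launcher of this user
    if ! ssh "$bhost" "pgrep -u $user -f 'mpirun|mpiexec|srun'" >/dev/null
    then
        echo "job $jobid ($user): $nnodes nodes, no launcher on $bhost"
    fi
done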

Ward Poelmans

Sep 29, 2022, 10:13:44 AM
to slurm...@lists.schedmd.com
Hi Loris,

On 29/09/2022 09:26, Loris Bennett wrote:

> I can see that this is potentially not easy, since an MPI job might
> still have phases where only one core is actually being used.

Slurm will create the needed cgroups on all the nodes that are part of the job when the job starts. So you could use a cron job to check whether there are any cgroups on a node with no processes in them?
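
A sketch of such a check, assuming Slurm's cgroup v1 hierarchy under
/sys/fs/cgroup/cpuset/slurm (the paths differ with cgroup v2):

for jobdir in /sys/fs/cgroup/cpuset/slurm/uid_*/job_*; do
    [ -d "$jobdir" ] || continue
    # Flag job cgroups whose step cgroups contain no tasks at all
    if ! grep -qs . "$jobdir"/cgroup.procs "$jobdir"/step_*/cgroup.procs
    then
        echo "$(hostname -s): empty cgroup $jobdir"
    fi
done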

Ward

Steffen Grunewald

Sep 29, 2022, 10:35:13 AM
to Slurm User Community List
On Thu, 2022-09-29 at 14:03:58 +0000, Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) wrote:
> Can you check Slurm for a job that requests multiple nodes but doesn't have mpirun (or srun, or mpiexec) running on its head node?

Hi Noam,

I'm wondering why one would want to know that - given that there are
approaches to multi-node operation beyond MPI (Charm++ comes to mind)?

Best,
Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Sep 29, 2022, 10:52:14 AM
to Slurm User Community List
On Sep 29, 2022, at 10:34 AM, Steffen Grunewald <steffen....@aei.mpg.de> wrote:

> Hi Noam,
>
> I'm wondering why one would want to know that - given that there are
> approaches to multi-node operation beyond MPI (Charm++ comes to mind)?

The thread title requested a way of detecting non-MPI jobs running on multiple nodes. I assumed that the requester knows, maybe based on their users' software, that there are no legitimate ways for them to run on multiple nodes without MPI. Actually, we have users who run embarrassingly parallel jobs which just ssh to the other nodes and gather files, so clearly it can be done in a useful way with very low-tech approaches, but that's an oddball (and just plain old) software package.

Loris Bennett

Sep 30, 2022, 1:52:27 AM
to Slurm User Community List
"Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)"
There may indeed be legitimate ways for non-MPI jobs to be running on
multiple nodes, but that's a bit of an edge case. However, such cases
would be fine, as long as the resources requested are being used
efficiently. Thus, Ward's suggestion about checking for cgroups seems
the most general solution. Having said that, it would also be useful to
then check the head node for 'mpirun' or similar.

Cheers,

Loris
