[slurm-dev] srun to existing allocation, but just a specific node


Thompson, Matt[SCIENCE SYSTEMS AND APPLICATIONS INC]

Aug 26, 2015, 11:06:54 AM
to slurm-dev

SLURM Dev,

I'm hoping you can help me with something. I recently needed to figure
out what was going on inside a running batch job. I suspected that a
job that should spread, say, 48 cores across a few nodes had somehow
packed them all onto a single node due to my idiocy with an mpirun
command.

So, I have a job-id and I know I can do, say:

srun --jobid=<JOBID> ps -ef

and I'll get a ps from every node in that allocation. But if that
allocation has, say, 14 nodes, I get 14 nodes' worth of output that is
hard to attribute, since ps doesn't prepend the hostname[1].

I thought maybe there is a way to run the srun command on just one of
the nodes in the allocation and I tried:

srun --jobid=<JOBID> --nodelist=node1 ps -ef

where node1 is one of the nodes in the allocation. But no, that doesn't
do what I'd hoped: I still get every node running ps.

Now, I'm sure I could whip up a bash script that tests the hostname and
runs the command only if it matches the one I want, but I was hoping
srun itself had a simple way to do this.
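
A wrapper like that can be just a few lines. A minimal sketch, assuming the target is compared against the node's short hostname (the helper name `only_on` and the node name `node1` are invented here for illustration):

```shell
# only_on: hypothetical helper that runs a command only when this
# node's short hostname matches the first argument; a silent no-op
# on every other node.
only_on() {
    target="$1"; shift
    if [ "$(hostname -s)" = "$target" ]; then
        "$@"
    fi
}

# When srun launches this on every node, only node1 produces output.
only_on node1 ps -ef
```

Saved as a small script and launched with `srun --jobid=<JOBID>`, every task would exit silently except the one on the matching node.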

Matt

[1] That I know of. I didn't see "hostname" in the ps manpage.
--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246

Aaron Knister

Aug 26, 2015, 11:24:59 AM
to slurm-dev

Try srun -w $NODENAME

Sent from my iPhone

Thompson, Matt[SCIENCE SYSTEMS AND APPLICATIONS INC]

Aug 26, 2015, 12:33:53 PM
to slurm-dev

Nope. That doesn't seem to work:

> (4607) $ squeue -l -u mathomp4
> Wed Aug 26 12:03:44 2015
> JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
> 5096279 compute EnADAS-I mathomp4 RUNNING 3:49 1:00:00 8 borgo[046-053]
> 5095474 compute Interact mathomp4 RUNNING 3:11:09 9:00:00 8 borgo[037-041,043-045]

So I can try borgo046, which is in the batch job:

> (4608) $ srun -w borgo046 ps
> srun.slurm: Required node not available (down or drained)
> srun.slurm: job 5096288 queued and waiting for resources
>

Looks like it tries to create a brand-new allocation, so it queues
until the existing job is done.

And since -w is just an alias for --nodelist, it doesn't help with
--jobid either.

Matt

Skouson, Gary B

Aug 26, 2015, 7:27:06 PM
to slurm-dev
I'm not sure what you're looking for. Here's what works for me.

$ salloc -A mscfops -N 4 -n 64 -t 1800
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 10056660
salloc: job 10056660 queued and waiting for resources
salloc: job 10056660 has been allocated resources
salloc: Granted job allocation 10056660
$ squeue -au me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10056660 small bash me R 0:07 4 g[510-513]
$ srun -n 1 -N 1 -w g511 hostname
g511
$ srun -n 1 -N 1 -w g513 hostname
g513
$ srun -n 1 -N 1 -w g510 hostname
g510

You can use the -l arg for srun to prepend output with the rank number:

$ srun -N 4 -n 4 -l hostname
2: g512
0: g510
3: g513
1: g511
$ srun -l hostname | dshbak -c
----------------
[00-09,10-15]
----------------
g510
----------------
[16-31]
----------------
g511
----------------
[32-47]
----------------
g512
----------------
[48-63]
----------------
g513

Running from outside the allocation you could do:

$ srun --jobid=10056660 -l -w g510 hostname
0: g510
2: g512
1: g511
3: g513
$ srun --jobid=10056660 -l -N 1 -w g510 hostname
0: g510

-----
Gary Skouson

Danny Auble

Aug 26, 2015, 7:55:56 PM
to slurm-dev

If you don't request a node count (-N1), the srun will grab all the
nodes in the allocation. --nodelist says "give me at least these nodes,
plus anything else needed to fulfill my request".

Try -N1 -n1 and you should get a step that runs on only one of the
nodes in the allocation. As long as you are inside the allocation you
shouldn't need --jobid=.

srun -N1 -n1 --nodelist=node1 ps -ef

should get you what you want.


Thompson, Matt[SCIENCE SYSTEMS AND APPLICATIONS INC]

Aug 27, 2015, 8:42:59 AM
to slurm-dev

All,

The folks here at NCCS helped me find a possible way. I do something
like this:

> srun --output=srunps.%N-job%j-%t.out --jobid=5102027 'ps -ef'

That way each node's ps output lands in a separate file named after its
hostname.

Oddly, I can't figure out how to use a pipe inside that call: issuing
'ps -ef | grep GEOS' gives the same output as plain 'ps -ef'.

Still, it works pretty well!
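
The pipe is likely lost because srun executes its argument vector directly on each node and never involves a shell, so the '|' is never interpreted. A hedged sketch of one workaround, handing the whole pipeline to sh -c (the srun line is illustrative and <JOBID> is a placeholder; the runnable part only demonstrates the quoting behaviour locally):

```shell
# srun exec()s its arguments directly; no shell ever sees the '|'.
# Wrapping the pipeline in sh -c gives each task a shell to run it
# (illustrative line, <JOBID> is a placeholder):
#
#   srun --output=srunps.%N-job%j-%t.out --jobid=<JOBID> \
#        sh -c 'ps -ef | grep GEOS'
#
# The quoting behaviour itself, demonstrated locally:
filtered=$(sh -c 'printf "GEOS_agcm\nsshd\n" | grep GEOS')
echo "$filtered"    # prints GEOS_agcm
```

The single quotes keep the pipeline intact until sh on the remote node expands it, which is the piece a bare srun invocation is missing.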

Matt



John Hearns

Aug 27, 2015, 9:11:09 AM
to slurm-dev
> srun --output=srunps.%N-job%j-%t.out --jobid=5102027 'ps -ef'
>
> So that it puts each node's ps in a separate file delineated by the hostname.
>
> Oddly, I can't seem to figure out how to pipe inside that call: issuing 'ps -ef' does the same as 'ps -ef | grep GEOS'.


Matt, try using the pgrep command: pgrep -l GEOS
Or something like that!
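
Through srun that might look something like the sketch below (the job id is a placeholder; since pgrep does its own filtering, no pipe and no shell quoting are needed, and the runnable part just demonstrates pgrep -f against a throwaway process):

```shell
# Illustrative srun line; <JOBID> is a placeholder:
#
#   srun --jobid=<JOBID> -l pgrep -l GEOS
#
# pgrep filters by process name itself, so the pipe problem from the
# thread never arises. Demonstrated locally with a throwaway process:
sleep 300 &
pid=$!
pgrep -f "sleep 300"    # prints the pid of the sleep started above
kill "$pid"
```

The -l flag on srun prefixes each output line with the task rank, which also helps attribute the matches to nodes.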



Also, wasn't there discussion on this list recently of a slurm utility
that runs top on all nodes? I looked back through my emails but cannot
find it.

