Send slurm-users mailing list submissions to
To subscribe or unsubscribe via the World Wide Web, visit
or, via email, send a message with subject or body 'help' to
You can reach the person managing the list at
When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."
Today's Topics:
1. Re: Ubuntu Cluster with Slurm (Renfro, Michael)
2. Re: sacct returns nothing after reboot (Roger Mason)
3. Re: Reset TMPDIR for All Jobs (Ellestad, Erik)
4. Re: additional jobs killed by scancel. (Alastair Neil)
----------------------------------------------------------------------
Message: 1
Date: Wed, 13 May 2020 14:05:21 +0000
Subject: Re: [slurm-users] Ubuntu Cluster with Slurm
Content-Type: text/plain; charset="utf-8"
I'd compare the RealMemory part of "scontrol show node abhi-HP-EliteBook-840-G2" to the RealMemory part of your slurm.conf:
> Nodes which register to the system with less than the configured resources (e.g. too little memory), will be placed in the "DOWN" state to avoid scheduling jobs on them.
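A quick way to see what slurmd will actually register is to ask it directly (a sketch; run on each compute node):

    # prints the node's real hardware as a ready-made slurm.conf line,
    # including the RealMemory value slurmd will report at registration
    slurmd -C

    # once slurm.conf matches (or its RealMemory is lowered to match),
    # clear the DOWN state from the controller:
    scontrol update NodeName=abhi-Lenovo-ideapad-330-15IKB State=RESUME

If slurmd -C reports less than the 12000/14000 you configured, lower the configured values accordingly.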
As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be able to make something work with OpenCL. No idea if that would give performance improvements over the CPUs, though.
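If you still want Slurm to track those GPUs as schedulable resources, the usual route is GRES. A minimal sketch (the device path below is an assumption; check /dev/dri on the node):

    # slurm.conf additions
    GresTypes=gpu
    NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2 Gres=gpu:1

    # gres.conf on that node (hypothetical device file)
    Name=gpu File=/dev/dri/card0

Jobs would then request it with something like "sbatch --gres=gpu:1". Note this only handles the bookkeeping: your Java/Python code still has to drive the GPU through OpenCL itself.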
--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
>
> Dear All,
>
> Preamble
> ----------
> I want to form a simple cluster with three laptops:
> abhi-Latitude-E6430 //This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB //Compute Node
> abhi-HP-EliteBook-840-G2 //Compute Node
>
>
> Aim
> -------------
> I want to make use of the CPU, GPU, and RAM on all the machines when I execute Java or Python programs.
>
>
> Implementation
> ------------------------
> Now let us look at the slurm.conf
>
> On Machine abhi-Latitude-E6430
>
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> The same slurm.conf is copied to all the machines.
>
>
> Observations
> --------------------------------------
> Now when I run "systemctl status" for slurmd on the two compute nodes and
> for slurmctld on the controller, followed by "sinfo", I see:
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
> Docs: man:slurmd(8)
> Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 98253 (slurmd)
> Tasks: 2
> Memory: 2.2M
> CGroup: /system.slice/slurmd.service
> └─98253 /usr/sbin/slurmd
>
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
> Docs: man:slurmd(8)
> Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 71734 (slurmd)
> Tasks: 2
> Memory: 2.0M
> CGroup: /system.slice/slurmd.service
> └─71734 /usr/sbin/slurmd
>
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
> Docs: man:slurmctld(8)
> Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 97116 (slurmctld)
> Tasks: 7
> Memory: 2.6M
> CGroup: /system.slice/slurmctld.service
> └─97116 /usr/sbin/slurmctld
>
>
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 down* abhi-Lenovo-ideapad-330-15IKB
>
>
> Advice needed
> ------------------------
> Please let me know why I am seeing only one node.
> Also, how is the total memory calculated? And can Slurm make use of GPU
> processing power as well?
> Please let me know if I have missed something in the configuration or the
> explanation.
>
> Thank you all
>
> Best Regards,
>
>
------------------------------
Message: 2
Date: Wed, 13 May 2020 12:20:11 -0230
Subject: Re: [slurm-users] sacct returns nothing after reboot
Content-Type: text/plain
Hello,
> the default time window starts at 00:00:00 of the current day:
> -S, --starttime
> Select jobs in any state after the specified time. Default
> is 00:00:00 of the current day, unless the '-s' or '-j'
> options are used. If the '-s' option is used, then the
> default is 'now'. If states are given with the '-s' option
> then only jobs in this state at this time will be returned.
> If the '-j' option is used, then the default time is Unix
> Epoch 0. See the DEFAULT TIME WINDOW for more details.
Thank you! Obviously I did not read far enough down the man page.
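For the archives: widening the window does bring the pre-reboot records back, e.g.

    # start the window at the Unix epoch instead of 00:00 today
    sacct -S 1970-01-01 -E now

or simply pass -j with explicit job IDs, which (per the excerpt above) defaults the start time to epoch 0.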
Roger
------------------------------
Message: 3
Date: Wed, 13 May 2020 15:18:09 +0000
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Message-ID:
Content-Type: text/plain; charset="utf-8"
Woo!
Thanks Marcus, that works.
Though, ahem, SLURM/SchedMD, if you're listening, would it hurt to cover this in the documentation regarding prolog/epilog, and maybe give an example?
Just a thought,
Erik
--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF
-----Original Message-----
Sent: Tuesday, May 12, 2020 10:08 PM
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Hi Erik,
the output of task-prolog is sourced/evaluated (not really sure how) in
the job environment.
Thus you don't have to export a variable in task-prolog; instead, echo the
export, e.g.
echo export TMPDIR=/scratch/$SLURM_JOB_ID
The variable will then be set in the job environment.
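Putting it together, a minimal sketch (the echo mechanism is the documented TaskProlog behaviour; creating and owning the directory in the root-run Prolog is an assumption to fit your setup):

    #!/bin/bash
    # TaskProlog: lines printed to stdout in the form "export NAME=value"
    # are applied to the environment of the task being spawned
    echo "export TMPDIR=/scratch/$SLURM_JOB_ID"

with the directory created beforehand, e.g. in the Prolog:

    #!/bin/bash
    # Prolog runs as root on the node before the job starts;
    # SLURM_JOB_ID and SLURM_JOB_UID are set in its environment
    mkdir -p "/scratch/$SLURM_JOB_ID"
    chown "$SLURM_JOB_UID" "/scratch/$SLURM_JOB_ID"

(plus an Epilog to remove the directory when the job ends).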
Best
Marcus
Am 12.05.2020 um 17:40 schrieb Ellestad, Erik:
> I wanted to set TMPDIR from /tmp to a per-job directory I create in
> local /scratch/$SLURM_JOB_ID (for example)
>
> This bug suggests I should be able to do this in a task-prolog.
>
>
> However adding the following to task-prolog doesn't seem to affect the
> variables the job script is running with.
>
> unset TMPDIR
>
> export TMPDIR=/scratch/$SLURM_JOB_ID
>
> It does work if it is done in the job script, rather than the task-prolog.
>
> Am I missing something?
>
> Erik
>
> --
>
> Erik Ellestad
>
> Wynton Cluster SysAdmin
>
> UCSF
>
------------------------------
Message: 4
Date: Wed, 13 May 2020 17:08:55 -0400
Subject: Re: [slurm-users] additional jobs killed by scancel.
Message-ID:
Content-Type: text/plain; charset="utf-8"
invalid field requested: "reason"
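Presumably this sacct is too old to know a "reason" field. Swapping in fields it does know:

    sacct -o jobid,elapsed,state,exitcode -j 533900,533902

should show the same FAILED / 0:9 pairs that the show-job dumps below report.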
> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
> wrote:
> >
> > The log is continuous and has all the messages logged by slurmd on the
> node for all the jobs mentioned, below are the entries from the slurmctld
> log:
> >
> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=533898 uid 1224431221
> >>
> Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED,
> ExitCode 0
> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898
> successful 0x8004
> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
> Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
> Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
> >
> >
> > it is curious that all the jobs were running on the same processor;
> perhaps this is a cgroup-related failure?
> >
> >>
> >> I see one job cancelled and two jobs failed.
> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
> >> exiting/failing, so the real error is not here.
> >>
> >> It might also be helpful to look through slurmctld's log starting from
> >> when the first job was canceled, looking at any messages mentioning
> >> the node or the two failed jobs.
> >>
> >> I've had nodes do strange things on job cancel. The last one I tracked
> >> down to the job epilog failing because it was NFS-mounted and NFS was
> >> being slower than Slurm liked, so it took the node offline and killed
> >> everything on it.
> >>
> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > We are using Slurm 18.08 and had a weird occurrence over the
> weekend. A user canceled one of his jobs using scancel, and two additional
> jobs of the user running on the same node were killed concurrently. The
> jobs had no dependency, but they were all allocated 1 gpu. I am curious to
> know why this happened, and if this is a known bug, is there a workaround
> to prevent it from happening? Any suggestions gratefully received.
> >> >
> >> > -Alastair
> >> >
> >> > FYI
> >> > The cancelled job (533898) has this at the end of the .err file:
> >> >
> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT
> 2020-05-10T00:26:03 ***
> >> >
> >> >
> >> > both of the killed jobs (533900 and 533902) have this:
> >> >
> >> >> slurmstepd: error: get_exit_code task 0 died by signal
> >> >
> >> >
> >> > here is the slurmd log from the node and the show-job output for each
> job:
> >> >
> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job
> 533898 ran for 0 seconds
> >> >> [2020-05-09T19:49:46.754] ====================
> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> >> >> [2020-05-09T19:49:46.758] ====================
> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID
> 1224431221
> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job
> 533900 ran for 0 seconds
> >> >> [2020-05-09T19:53:14.080] ====================
> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> >> >> [2020-05-09T19:53:14.084] ====================
> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID
> 1224431221
> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job
> 533902 ran for 0 seconds
> >> >> [2020-05-09T19:55:26.304] ====================
> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> >> >> [2020-05-09T19:55:26.307] ====================
> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID
> 1224431221
> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON
> NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
> >> >
> >> >
> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> >> >> JobId=533898 JobName=r18-relu-ent
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=CANCELLED Reason=None Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
> >> >> RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
> >> >> AccrueTime=2020-05-09T19:49:45
> >> >> StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:49:46
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >>
> >> >> JobId=533900 JobName=r18-soft
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >> RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
> >> >> AccrueTime=2020-05-09T19:53:13
> >> >> StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:53:14
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >>
> >> >> JobId=533902 JobName=r18-soft-ent
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >> RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
> >> >> AccrueTime=2020-05-09T19:55:26
> >> >> StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:55:26
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >
> >> >
> >> >
> >>
>
>
End of slurm-users Digest, Vol 31, Issue 50
*******************************************