[slurm-users] Fw: slurm-users Digest, Vol 31, Issue 50


Abhinandan Patil

May 13, 2020, 9:16:31 PM
to slurm...@schedmd.com
Thank you Michael for pitching in to help troubleshoot the config file.

Now my config file looks like:

ClusterName=linux
ControlMachine=abhi-Latitude-E6430
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
SwitchType=switch/none
MpiDefault=none
ProctrackType=proctrack/pgid
Epilog=/usr/local/slurm/sbin/epilog
Prolog=/usr/local/slurm/sbin/prolog
SlurmdSpoolDir=/var/tmp/slurmd.spool
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
NodeName=abhi-Lenovo-ideapad-330-15IKB CPUS=4
NodeName=abhi-HP-EliteBook-840-G2 CPUS=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
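Since Slurm expects an identical slurm.conf on every machine, one quick sanity check (a sketch; the path shown is the common default and may differ, e.g. /etc/slurm-llnl on older Ubuntu packages) is to compare checksums across all three hosts:

```shell
# Compare slurm.conf checksums on the controller and both compute nodes;
# the hostnames are the machines from this thread.
for h in abhi-Latitude-E6430 abhi-Lenovo-ideapad-330-15IKB abhi-HP-EliteBook-840-G2; do
  ssh "$h" md5sum /etc/slurm/slurm.conf
done
```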

abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:11:32 IST; 2h 28min ago
       Docs: man:slurmd(8)
    Process: 977 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1028 (slurmd)
      Tasks: 2
     Memory: 3.9M
     CGroup: /system.slice/slurmd.service
             └─1028 /usr/sbin/slurmd

abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:18:51 IST; 2h 24min ago
       Docs: man:slurmd(8)
    Process: 1313 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1372 (slurmd)
      Tasks: 2
     Memory: 3.8M
     CGroup: /system.slice/slurmd.service
             └─1372 /usr/sbin/slurmd

abhi@abhi-Latitude-E6430:~$ service slurmctld status
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:11:21 IST; 2h 32min ago
       Docs: man:slurmctld(8)
    Process: 1208 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1306 (slurmctld)
      Tasks: 7
     Memory: 6.7M
     CGroup: /system.slice/slurmctld.service
             └─1306 /usr/sbin/slurmctld

However still:
 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  down* abhi-Lenovo-ideapad-330-15IKB

My investigation is still inconclusive.

Best Regards,


----- Forwarded message -----
Sent: Thursday, 14 May 2020, 2:39:40 am GMT+5:30
Subject: slurm-users Digest, Vol 31, Issue 50

Send slurm-users mailing list submissions to

To subscribe or unsubscribe via the World Wide Web, visit
or, via email, send a message with subject or body 'help' to

You can reach the person managing the list at

When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."


Today's Topics:

  1. Re: Ubuntu Cluster with Slurm (Renfro, Michael)
  2. Re: sacct returns nothing after reboot (Roger Mason)
  3. Re: Reset TMPDIR for All Jobs (Ellestad, Erik)
  4. Re: additional jobs killed by scancel. (Alastair Neil)


----------------------------------------------------------------------

Message: 1
Date: Wed, 13 May 2020 14:05:21 +0000
From: "Renfro, Michael" <Ren...@tntech.edu>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Ubuntu Cluster with Slurm
Content-Type: text/plain; charset="utf-8"

I'd compare the RealMemory part of "scontrol show node abhi-HP-EliteBook-840-G2" to the RealMemory part of your slurm.conf:

> Nodes which register to the system with less than the configured resources (e.g. too little memory), will be placed in the "DOWN" state to avoid scheduling jobs on them.
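One way to check for that mismatch (a sketch; the hostname is from this thread, and the slurm.conf path is the usual default, which may differ on your install):

```shell
# On the compute node: print the hardware slurmd actually detects,
# including a RealMemory value suitable for pasting into slurm.conf.
slurmd -C

# On the controller: what the node registered with vs. what is configured.
scontrol show node abhi-HP-EliteBook-840-G2 | grep -o 'RealMemory=[0-9]*'
grep RealMemory /etc/slurm/slurm.conf
```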


As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be able to make something work with OpenCL. No idea if that would give performance improvements over the CPUs, though.

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601    / Tennessee Tech University

> On May 13, 2020, at 8:42 AM, Abhinandan Patil <abhinandan...@yahoo.com> wrote:
>
> Dear All,
>
> Preamble
> ----------
> I want to form simple cluster with three laptops:
> abhi-Latitude-E6430  //This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB //Compute Node
> abhi-HP-EliteBook-840-G2 //Compute Node
>
>
> Aim
> -------------
> I want to make use of CPU+GPU+RAM on all the machines when I execute JAVA programs or Python programs.
>
>
> Implementation
> ------------------------
> Now let us look at the slurm.conf
>
> On Machine abhi-Latitude-E6430
>
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Same slurm.conf is copied to all the Machines.
>
>
> Observations
> --------------------------------------
> Now when I do
> abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
>        Docs: man:slurmd(8)
>    Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 98253 (slurmd)
>      Tasks: 2
>      Memory: 2.2M
>      CGroup: /system.slice/slurmd.service
>              └─98253 /usr/sbin/slurmd
>
> abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
>        Docs: man:slurmd(8)
>    Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 71734 (slurmd)
>      Tasks: 2
>      Memory: 2.0M
>      CGroup: /system.slice/slurmd.service
>              └─71734 /usr/sbin/slurmd
>
> abhi@abhi-Latitude-E6430:~$ service slurmctld status
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
>        Docs: man:slurmctld(8)
>    Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 97116 (slurmctld)
>      Tasks: 7
>      Memory: 2.6M
>      CGroup: /system.slice/slurmctld.service
>              └─97116 /usr/sbin/slurmctld
>
>             
> However  abhi@abhi-Latitude-E6430:~$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*      up  infinite      1  down* abhi-Lenovo-ideapad-330-15IKB
>
>
> Advice needed
> ------------------------
> Please let me know why I am seeing only one node.
> Further, how is the total memory calculated? Can Slurm make use of GPU processing power as well?
> Please let me know if I have missed something in the configuration or explanation.
>
> Thank you all
>
> Best Regards,
> Abhinandan H. Patil, +919886406214
>
>


------------------------------

Message: 2
Date: Wed, 13 May 2020 12:20:11 -0230
From: Roger Mason <rma...@mun.ca>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] sacct returns nothing after reboot
Message-ID: <y65sgg3...@mun.ca>
Content-Type: text/plain

Hello,

Marcus Boden <mbo...@gwdg.de> writes:

> the default time window starts at 00:00:00 of the current day:
> -S, --starttime
>          Select jobs in any state after the specified  time.  Default
>          is  00:00:00  of  the  current  day, unless the '-s' or '-j'
>          options are used. If the  '-s'  option  is  used,  then  the
>          default  is  'now'. If states are given with the '-s' option
>          then only jobs in this state at this time will be  returned.
>          If  the  '-j'  option is used, then the default time is Unix
>          Epoch 0. See the DEFAULT TIME WINDOW for more details.
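In practice that looks like the following (a sketch; the date and job ID are placeholders, not from this thread):

```shell
# Widen the window explicitly instead of relying on the midnight default:
sacct -S 2020-05-01 -E now -o JobID,JobName,State,Elapsed

# With -j, the default window starts at Unix epoch 0, so older jobs appear:
sacct -j 12345 -o JobID,State,ExitCode
```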

Thank you!  Obviously I did not read far enough down the man page.

Roger



------------------------------

Message: 3
Date: Wed, 13 May 2020 15:18:09 +0000
From: "Ellestad, Erik" <Erik.E...@ucsf.edu>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Message-ID:
   
Content-Type: text/plain; charset="utf-8"

Woo!

Thanks Marcus, that works.

Though, ahem, SLURM/SchedMD, if you're listening, would it hurt to cover this in the documentation regarding prolog/epilog, and maybe give an example?


Just a thought,

Erik

--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF


-----Original Message-----
From: slurm-users <slurm-use...@lists.schedmd.com> On Behalf Of Marcus Wagner
Sent: Tuesday, May 12, 2020 10:08 PM
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs

Hi Erik,

the output of task-prolog is sourced/evaluated (not really sure how) in
the job environment.

Thus you don't have to export a variable in task-prolog, but echo the
export, e.g.

echo export TMPDIR=/scratch/$SLURM_JOB_ID

The variable will then be set in job environment.
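A minimal TaskProlog following this pattern might look like the sketch below (the /scratch layout is an assumption for illustration):

```shell
#!/bin/bash
# TaskProlog sketch: slurmd applies "export NAME=value" (and "unset NAME")
# lines printed to stdout to the job's environment.
job_id=${SLURM_JOB_ID:-demo}      # Slurm sets SLURM_JOB_ID for real jobs
echo "export TMPDIR=/scratch/$job_id"
```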


Best
Marcus

Am 12.05.2020 um 17:40 schrieb Ellestad, Erik:
> I wanted to set TMPDIR from /tmp to a per-job directory I create in
> local /scratch/$SLURM_JOB_ID (for example)
>
> This bug suggests I should be able to do this in a task-prolog.
>
>
> However adding the following to task-prolog doesn't seem to affect the
> variables the job script is running with.
>
> unset TMPDIR
>
> export TMPDIR=/scratch/$SLURM_JOB_ID
>
> It does work if it is done in the job script, rather than the task-prolog.
>
> Am I missing something?
>
> Erik
>
> --
>
> Erik Ellestad
>
> Wynton Cluster SysAdmin
>
> UCSF
>


------------------------------

Message: 4
Date: Wed, 13 May 2020 17:08:55 -0400
From: Alastair Neil <ajnei...@gmail.com>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] additional jobs killed by scancel.
Message-ID:
Content-Type: text/plain; charset="utf-8"

invalid field requested: "reason"

On Tue, 12 May 2020 at 16:47, Steven Dick <kg4...@gmail.com> wrote:

> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
> On Tue, May 12, 2020 at 4:12 PM Alastair Neil <ajnei...@gmail.com>
> wrote:
> >
> >  The log is continuous and has all the messages logged by slurmd on the
> node for all the jobs mentioned, below are the entries from the slurmctld
> log:
> >
> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=533898 uid 1224431221
> >>
> >> [2020-05-10T00:26:03.098] email msg to ssh...@masonlive.gmu.edu:
> Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED,
> ExitCode 0
> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898
> successful 0x8004
> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
> >> [2020-05-10T00:26:05.204] email msg to ssh...@masonlive.gmu.edu:
> Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
> >> [2020-05-10T00:26:05.210] email msg to ssh...@masonlive.gmu.edu:
> Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
> >
> >
> > it is curious, that all the jobs were running on the same processor,
> perhaps this is a cgroup related failure?
> >
> > On Tue, 12 May 2020 at 10:10, Steven Dick <kg4...@gmail.com> wrote:
> >>
> >> I see one job cancelled and two jobs failed.
> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
> >> exiting/failing, so the real error is not here.
> >>
> >> It might also be helpful to look through slurmctld's log starting from
> >> when the first job was canceled, looking at any messages mentioning
> >> the node or the two failed jobs.
> >>
> >> I've had nodes do strange things on job cancel.  Last one I tracked
> >> down to the job epilog failing because it was NFS mounted and nfs was
> >> being slower than slurm liked, so it took the node offline and killed
> >> everything on it.
> >>
> >> On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajnei...@gmail.com>
> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > We are using slurm 18.08 and had a weird occurrence over the
> weekend.  A user canceled one of his jobs using scancel, and two additional
> jobs of the user running on the same node were killed concurrently.  The
> jobs had no dependency, but they were all allocated 1 gpu. I am curious to
> know why this happened,  and if this is a known bug is there a workaround
> to prevent it happening?  Any suggestions gratefully received.
> >> >
> >> > -Alastair
> >> >
> >> > FYI
> >> > The cancelled job (533898) has this at the end of the .err file:
> >> >
> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT
> 2020-05-10T00:26:03 ***
> >> >
> >> >
> >> > both of the killed jobs (533900 and 533902)  have this:
> >> >
> >> >> slurmstepd: error: get_exit_code task 0 died by signal
> >> >
> >> >
> >> > here is the slurmd log from the node and the show-job output for each
> job:
> >> >
> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job
> 533898 ran for 0 seconds
> >> >> [2020-05-09T19:49:46.754] ====================
> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> >> >> [2020-05-09T19:49:46.758] ====================
> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID
> 1224431221
> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job
> 533900 ran for 0 seconds
> >> >> [2020-05-09T19:53:14.080] ====================
> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> >> >> [2020-05-09T19:53:14.084] ====================
> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID
> 1224431221
> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job
> 533902 ran for 0 seconds
> >> >> [2020-05-09T19:55:26.304] ====================
> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> >> >> [2020-05-09T19:55:26.307] ====================
> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID
> 1224431221
> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON
> NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
> >> >
> >> >
> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> >> >> JobId=533898 JobName=r18-relu-ent
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=CANCELLED Reason=None Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
> >> >>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
> >> >>  AccrueTime=2020-05-09T19:49:45
> >> >>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:49:46
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >>
> >> >> JobId=533900 JobName=r18-soft
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
> >> >>  AccrueTime=2020-05-09T19:53:13
> >> >>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:53:14
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >>
> >> >> JobId=533902 JobName=r18-soft-ent
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
> >> >>  AccrueTime=2020-05-09T19:55:26
> >> >>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:55:26
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >
> >> >
> >> >
> >>
>
>

End of slurm-users Digest, Vol 31, Issue 50
*******************************************

Chris Samuel

May 14, 2020, 2:25:10 AM
to slurm...@lists.schedmd.com
On Wednesday, 13 May 2020 6:15:53 PM PDT Abhinandan Patil wrote:

> However still:
> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 down* abhi-Lenovo-ideapad-330-15IKB

What does "sinfo -R" say?

If the node was down at some point you may need to resume it.
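A sketch of that check-and-resume sequence, using the node name from this thread:

```shell
# Show which nodes are down/drained and the recorded reason:
sinfo -R

# Inspect the node's state, reason, and registered resources:
scontrol show node abhi-Lenovo-ideapad-330-15IKB

# Once the underlying cause is fixed, return the node to service:
scontrol update NodeName=abhi-Lenovo-ideapad-330-15IKB State=RESUME
```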

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA



