[slurm-users] seff Not Calculating

Jason Simms

Sep 11, 2020, 11:08:48 AM
to Slurm User Community List
Hello all,

I've found that when I run seff, it fails to report calculated values, e.g.:

Nodes: 1
Cores per node: 20
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 1-11:49:40 core-walltime
Job Wall-clock time: 01:47:29
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 180.00 GB (180.00 GB/node)

Is there some plugin or setting that I need to enable?

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632

Diego Zuccato

Sep 15, 2020, 4:15:07 AM
to Slurm User Community List, Jason Simms
On 10/09/20 22:19, Jason Simms wrote:

> I've found that when I run seff, it fails to report calculated values, e.g.:
I didn't know about seff. Quite interesting. I see the same problem here.
I'm neither a Perl nor a Slurm expert, so I'm quite sure there's a better way
to do it, but I "fixed" it by changing, at line 66:
my $ncpus = $job->{'alloc_cpus'};
to
my $ncpus = $job->{'req_cpus'};

And at ~ line 106:
my $lmem = $step->{'stats'}{'rss_max'};
to:
my %hash = split /[,=]/, $step->{'stats'}{'tres_usage_in_max'};
my $lmem=$hash{'2'}/1024;

It seems to give meaningful results, for some value of "meaningful" :)
Corrections accepted.
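
For readers who do not read Perl, the memory change above splits a "id=value,id=value" string and takes the entry with key 2, then divides by 1024. A minimal Python sketch of the same parsing follows. It is not part of seff; the function name `parse_tres` is hypothetical, and the sample string is the `tres_usage_in_max` value posted later in this thread. That key 2 means memory is an assumption, although it is consistent with `req_mem` 13993 appearing as `2=13993` in `tres_alloc_str` in that same dump.

```python
def parse_tres(tres: str) -> dict:
    """Parse a 'id=value,id=value' TRES string into {id: value} with int keys."""
    result = {}
    for field in tres.split(","):
        if not field:
            continue  # tolerate empty strings and trailing commas
        key, _, value = field.partition("=")
        result[int(key)] = int(value)
    return result


TRES_MEM = 2  # assumption: TRES id 2 is memory, as the patch's $hash{'2'} implies

# Sample value copied from the Dumper output posted later in the thread.
usage = parse_tres("1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0")
mem_kb = usage[TRES_MEM] / 1024  # same division as the Perl patch
print(mem_kb)  # 208604.0
```

The result, 208604, matches the MaxRSS of 208604K that sacct reports for the same job step later in the thread, which suggests the value is in bytes and the division yields KB.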

HIH

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Diego Zuccato

Nov 9, 2020, 6:54:02 AM
to Slurm User Community List, Jason Simms
On 15/09/20 10:14, Diego Zuccato wrote:

It seems my corrections actually work only for single-node jobs.
For multi-node jobs, it only considers the memory used on one
node, and hence underestimates the real efficiency.
Can someone more knowledgeable than me spot the error?

TIA!

> I'm neither Perl nor Slurm expert so I'm quite sure there's a better way
> to do it, but I "fixed" it by changing at line 66:
> my $ncpus = $job->{'alloc_cpus'};
> to
> my $ncpus = $job->{'req_cpus'};
>
> And at ~ line 106:
> my $lmem = $step->{'stats'}{'rss_max'};
> to:
> my %hash = split /[,=]/, $step->{'stats'}{'tres_usage_in_max'};
> my $lmem=$hash{'2'}/1024;
>
> It seems to give meaningful results, for some value of "meaningful" :)
> Corrections accepted.

Diego Zuccato

Nov 17, 2020, 6:39:35 AM
to Slurm User Community List
On 09/11/20 12:53, Diego Zuccato wrote:

> Seems my corrections actually work only for single-node jobs.
> In case of multi-node jobs, it only considers the memory used on one
> node, hence underestimates the real efficiency.
> Someone more knowledgeable than me can spot the error?

It seems I managed to have it account for the memory on all the nodes.
See attached file.
The results seem quite meaningful and match the ones done by hand.

[Attachment: seff]

Jason Simms

Nov 18, 2020, 9:16:25 AM
to Slurm User Community List
Dear Diego,

A while back, I attempted to make some edits locally to see whether I could produce "better" results. Here is a comparison of the output of your latest version, and then mine:

[root@hpc bin]# seff 24567
Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line 108, <DATA> line 602.
Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line 108, <DATA> line 602.
Job ID: 24567
Cluster: hpc
User/Group: chuat/users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4

CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:40 core-walltime
Memory Utilized: 1.24 MB (estimated maximum)
Memory Efficiency: 0.06% of 2.00 GB (2.00 GB/node)

[root@hpc bin]# seff.bak 24567
Job ID: 24567
Cluster: hpc
User/Group: chuat/users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:20
CPU Efficiency: 50.00% of 00:00:40 core-walltime
Job Wall-clock time: 00:00:10
Memory Utilized: 1.24 MB
Memory Efficiency: 0.06% of 2.00 GB

Yours doesn't seem to report anything for CPU Utilized or CPU Efficiency. At the same time, however, the changes I made to my code to produce those results may not even be "correct." Moreover, my version may or may not work for multi-node jobs; I have no way to test those, since at the moment no user is running them.
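
The figures in the seff.bak output can be checked by hand. The following is a minimal Python sketch, not code from seff, assuming that CPU Efficiency is CPU Utilized divided by core-walltime, and that core-walltime is the core count multiplied by the wall-clock time. That assumption is consistent with the numbers posted in this thread. The helper names are hypothetical.

```python
def hms_to_seconds(text: str) -> int:
    """Convert a '[D-]HH:MM:SS' duration, as printed by seff, to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """Percentage of the allocated core-walltime that was spent on the CPU."""
    core_walltime = cores * hms_to_seconds(wall_clock)
    if core_walltime == 0:
        return 0.0  # avoid dividing by zero for jobs with no elapsed time
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


# Job 24567 as reported by seff.bak: 20 s of CPU on 4 cores over 10 s.
print(cpu_efficiency("00:00:20", "00:00:10", 4))  # 50.0

# First message of the thread: 20 cores for 01:47:29 gives 1-11:49:40.
print(20 * hms_to_seconds("01:47:29") == hms_to_seconds("1-11:49:40"))  # True
```

So a CPU Utilized of 00:00:00 on a job that clearly ran points at the input data (the CPU time seff reads from accounting) rather than at the arithmetic.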

For what it's worth, here is a diff of your script vs. mine, in case that's helpful. That said, while I used to code Perl scripts all the time, I haven't in, oh, about 20 years, so again, my edits could be entirely the wrong approach:


Warmest regards,
Jason

Peter Kjellström

Nov 18, 2020, 12:09:45 PM
to Jason Simms, Slurm User Community List
On Wed, 18 Nov 2020 09:15:59 -0500
Jason Simms <sim...@lafayette.edu> wrote:

> Dear Diego,
>
> A while back, I attempted to make some edits locally to see whether I
> could produce "better" results. Here is a comparison of the output of
> your latest version, and then mine:

I'm not sure what bug or behavior you're seeing, but seff has always
reported correct numbers for me (no local patch, afaik). I re-checked just
now on two different clusters (different Slurm versions):

slurm-18.08.8
slurm-contribs-18.08.8

And:

slurm-20.02.4
slurm-contribs-20.02.4

Are you running ProctrackType=proctrack/cgroup?

/Peter

Jason Simms

Nov 18, 2020, 12:35:43 PM
to Peter Kjellström, Slurm User Community List
Dear Peter,

Thanks for your response. Yes, I am running ProctrackType=proctrack/cgroup.

The behavior that I was seeing with the default seff, and that Diego saw as well, was simply that seff was reporting almost no information for a given job. I'm glad it's working for you, but it doesn't for everyone, and I can't figure out why.

Warmest regards,
Jason

Diego Zuccato

Nov 19, 2020, 2:23:43 AM
to Jason Simms, Slurm User Community List
On 18/11/20 15:15, Jason Simms wrote:

> Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line
> 108, <DATA> line 602.
> Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line
> 108, <DATA> line 602.
It seems some setups report the data in a different format, hence the
uninitialized value. In my case, there's no $step->{'stats'}{'rss_max'}
and I've had to use $step->{'stats'}{'tres_usage_in_max'}. Maybe your
install uses the former?
I'm using Debian stable (Buster) with the default packages (currently
18.08.5.2-1+deb10u1).
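
If both formats exist in the wild, one way to avoid the "uninitialized value" warning is to try each field in turn instead of assuming one of them. The following is a minimal Python sketch of that fallback, not a patch for seff. The function name is hypothetical, and two things are assumptions: that `rss_max` is already in KB, and that key 2 of `tres_usage_in_max` is memory in bytes.

```python
def step_max_mem_kb(stats: dict):
    """Return a step's peak memory in KB, or None when neither field has it."""
    rss_max = stats.get("rss_max")
    if rss_max is not None:
        return float(rss_max)  # assumption: rss_max is already expressed in KB

    # Fall back to the TRES string, e.g. '1=88010,2=213610496,3=0'.
    tres = stats.get("tres_usage_in_max", "")
    fields = dict(f.split("=", 1) for f in tres.split(",") if "=" in f)
    if "2" in fields:  # assumption: TRES id 2 is memory, in bytes
        return int(fields["2"]) / 1024
    return None  # the caller decides how to report missing data


print(step_max_mem_kb({"rss_max": 1270}))                            # 1270.0
print(step_max_mem_kb({"tres_usage_in_max": "1=88010,2=213610496"}))  # 208604.0
print(step_max_mem_kb({}))                                           # None
```

Returning None rather than 0 keeps "no data recorded" distinguishable from "the job used no memory".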

What I get:
-8<--
$ /home/software/utils/seff 9604
Job ID: 9604
Cluster: oph
User/Group: name.surname/domain^users
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:40
CPU Efficiency: 98.04% of 00:01:42 core-walltime
Memory Utilized: 203.71 MB
Memory Efficiency: 1.46% of 13.67 GB

$ sacct -a --format JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NTasks -j 9604
JobID        User      Group     State      Cluster    AllocCPUS  ReqMem     TotalCPU   Elapsed    MaxRSS     ExitCode NNodes   NTasks
------------ --------- --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- -------- --------
9604         name.sur+ domain^u+ COMPLETED  oph        1          13993Mn    01:39.703  00:01:42              0:0      1
9604.batch                       COMPLETED  oph        1          13993Mn    01:39.703  00:01:42   208604K    0:0      1        1
-8<--

You can try enabling Dumper by uncommenting lines 11 and 60. My result:
-8<--
$ bin/seff.debug 9604
$VAR1 = {
  'eligible' => 1605621478,
  'timelimit' => 300,
  'derived_ec' => 0,
  'resvid' => 0,
  'user' => 'name.surname',
  'nodes' => 'str957-bl0-17',
  'uid' => 0,
  'account' => 'astro',
  'sys_cpu_usec' => 0,
  'jobname' => 'job-blade-serial.sh',
  'show_full' => 1,
  'start' => 1605621479,
  'user_cpu_sec' => 0,
  'priority' => 1,
  'req_cpus' => 1,
  'tot_cpu_sec' => 0,
  'end' => 1605621581,
  'qosid' => 1,
  'suspended' => 0,
  'state' => 3,
  'array_max_tasks' => 0,
  'exitcode' => 0,
  'wckeyid' => 0,
  'tres_alloc_str' => '1=1,2=13993,3=18446744073709551614,4=1,5=150',
  'wckey' => '',
  'elapsed' => 102,
  'sys_cpu_sec' => 0,
  'lft' => 250,
  'requid' => 4294967295,
  'req_mem' => 13993,
  'submit' => 1605621478,
  'track_steps' => 1,
  'partition' => 'b5',
  'cluster' => 'oph',
  'array_job_id' => 0,
  'user_cpu_usec' => 0,
  'jobid' => 9604,
  'stats' => {
    'consumed_energy' => 0,
    'act_cpufreq' => '0'
  },
  'alloc_gres' => '',
  'associd' => 20,
  'array_task_id' => 4294967294,
  'req_gres' => '',
  'alloc_nodes' => 1,
  'gid' => 2125988353,
  'tot_cpu_usec' => 0,
  'steps' => [
    {
      'user_cpu_sec' => 99,
      'tot_cpu_sec' => 99,
      'stats' => {
        'tres_usage_in_ave' => '1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0',
        'tres_usage_in_min_nodeid' => '1=0,2=0,3=0,6=0,7=0,8=0',
        'tres_usage_out_tot' => '3=0,6=231676',
        'tres_usage_in_min_taskid' => '1=0,2=0,6=0,7=0,8=0',
        'tres_usage_out_max' => '3=0,6=231676',
        'tres_usage_in_tot' => '1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0',
        'tres_usage_in_min' => '1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0',
        'consumed_energy' => 0,
        'tres_usage_out_max_nodeid' => '3=0,6=0',
        'tres_usage_out_max_taskid' => '6=0',
        'act_cpufreq' => '8755',
        'tres_usage_in_max_taskid' => '1=0,2=0,6=0,7=0,8=0',
        'tres_usage_out_ave' => '3=0,6=231676',
        'tres_usage_out_min' => '3=0,6=231676',
        'tres_usage_in_max' => '1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0',
        'tres_usage_out_min_taskid' => '6=0',
        'tres_usage_in_max_nodeid' => '1=0,2=0,3=0,6=0,7=0,8=0',
        'tres_usage_out_min_nodeid' => '3=0,6=0'
      },
      'nnodes' => 1,
      'end' => 1605621581,
      'stepid' => 4294967294,
      'suspended' => 0,
      'state' => 3,
      'tres_alloc_str' => '1=1,2=13993,4=1',
      'exitcode' => 0,
      'tot_cpu_usec' => 703732,
      'stepname' => 'batch',
      'requid' => 4294967295,
      'req_cpufreq_gov' => 0,
      'elapsed' => 102,
      'sys_cpu_sec' => 0,
      'task_dist' => 0,
      'req_cpufreq_min' => 0,
      'sys_cpu_usec' => 587602,
      'ntasks' => 1,
      'nodes' => 'str957-bl0-17',
      'start' => 1605621479,
      'user_cpu_usec' => 116130,
      'req_cpufreq_max' => 0
    }
  ]
};
Job ID: 9604
Cluster: oph
User/Group: name.surname/domain^users
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:40
CPU Efficiency: 98.04% of 00:01:42 core-walltime
Memory Utilized: 203.71 MB
Memory Efficiency: 1.46% of 13.67 GB
-8<--
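
As a cross-check, the report above can be reproduced from the step-level values in the dump. The following is a minimal Python sketch under stated assumptions, not the logic of seff itself: the CPU time is taken as `tot_cpu_sec` plus `tot_cpu_usec` and rounded to whole seconds (rounding first is what reproduces the 98.04% figure), key 2 of `tres_usage_in_max` is treated as memory in bytes, and `req_mem` is treated as MB.

```python
# Values copied from the 'steps' entry and the job-level fields of the dump.
step = {
    "tot_cpu_sec": 99,
    "tot_cpu_usec": 703732,
    "elapsed": 102,
    "tres_usage_in_max": "1=88010,2=213610496,3=0,6=95527596,7=514859008,8=0",
}
req_mem_mb = 13993  # job-level 'req_mem', assumed to be in MB
cores = 1           # job-level 'req_cpus'

# CPU: 99 s + 703732 us = 99.70 s, shown by seff as 00:01:40.
cpu_seconds = round(step["tot_cpu_sec"] + step["tot_cpu_usec"] / 1e6)
cpu_eff = 100 * cpu_seconds / (cores * step["elapsed"])

# Memory: pick TRES id 2 (assumed to be memory, in bytes) and convert to MB.
tres = dict(f.split("=", 1) for f in step["tres_usage_in_max"].split(","))
mem_mb = int(tres["2"]) / 1024 ** 2
mem_eff = 100 * mem_mb / req_mem_mb

print(f"CPU Efficiency: {cpu_eff:.2f}%")                                      # 98.04%
print(f"Memory Utilized: {mem_mb:.2f} MB")                                    # 203.71 MB
print(f"Memory Efficiency: {mem_eff:.2f}% of {req_mem_mb / 1024:.2f} GB")     # 1.46% of 13.67 GB
```

Note that the job-level `tot_cpu_sec` and `tot_cpu_usec` in the same dump are both 0, while the step-level values are not.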