[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

Alessandro Federico

Jan 12, 2018, 5:33:53 AM
to slurm...@lists.schedmd.com, Isabella Baccarelli, hpc-sys...@cineca.it
Hi all,

we are setting up SLURM 17.11.2 on a small test cluster of about 100 nodes.
Sometimes we get the error in the subject when running any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)

Do we have to apply any particular setting to avoid incurring the problem?

We found this bug report https://bugs.schedmd.com/show_bug.cgi?id=4002 but it concerns the previous SLURM version,
and we do not have debug3 logging set on slurmctld.

thanks in advance
ale

--
Alessandro Federico
HPC System Management Group
System & Technology Department
CINECA www.cineca.it
Via dei Tizii 6, 00185 Rome - Italy
phone: +39 06 44486708

All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack...

John DeSantis

Jan 12, 2018, 1:59:53 PM
to Alessandro Federico, hpc-sys...@cineca.it, slurm...@lists.schedmd.com, Isabella Baccarelli
Ciao Alessandro,

> Do we have to apply any particular setting to avoid incurring the
> problem?

What is your "MessageTimeout" value in slurm.conf? If it's at the
default of 10, try changing it to 20.
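
For reference, a minimal way to check and change it (assuming a standard setup; 20 is just the suggested starting point):

# value currently in effect
scontrol show config | grep -i MessageTimeout
# then set MessageTimeout=20 in slurm.conf on the controller and push it out
scontrol reconfigure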

I'd also check and see if the slurmctld log is reporting anything
pertaining to the server thread count being over its limit.
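
Something like this should show both (sdiag prints the live count; the log path is an assumption, adjust it to your SlurmctldLogFile):

sdiag | grep -i 'server thread'
grep -i 'server thread' /var/log/slurm/slurmctld.log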

HTH,
John DeSantis

Alessandro Federico

Jan 15, 2018, 7:33:54 AM
to John DeSantis, hpc-sys...@cineca.it, slurm...@lists.schedmd.com, Isabella Baccarelli
Hi John

thanks for the info.
slurmctld doesn't report anything about the server thread count in the logs,
and sdiag shows only 3 server threads.

We changed the MessageTimeout value to 20.

I'll let you know if it solves the problem.

Thanks
ale

Cooper, Trevor

Jan 15, 2018, 6:11:16 PM
to Slurm User Community List
Alessandro,

You might want to consider tracking your Slurm scheduler diagnostics output with some type of time-series monitoring system. The time-based history has proven more helpful at times than log contents by themselves.

See Giovanni Torres' post on setting this up...

http://giovannitorres.me/graphing-sdiag-with-graphite.html
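
As a very rough sketch of the idea (not Giovanni's script; the Graphite host, the plaintext listener on port 2003, and the metric name are assumptions):

# run from cron every minute, for example
GRAPHITE_HOST=graphite.example.com
THREADS=$(sdiag | awk '/Server thread count/ {print $4}')
echo "slurm.sdiag.server_thread_count ${THREADS} $(date +%s)" | nc -w 1 "${GRAPHITE_HOST}" 2003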

-- Trevor

Alessandro Federico

Jan 16, 2018, 4:27:20 AM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
Hi Trevor

thank you very much

we'll give it a try

Alessandro Federico

Jan 16, 2018, 5:23:13 AM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
Hi,

setting MessageTimeout to 20 didn't solve it :(

looking at the slurmctld logs I noticed many warnings like these

Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257

they are generated in many functions:

[root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
4 Note very large processing time from dump_all_job_state:
67 Note very large processing time from load_part_uid_allow_list:
67 Note very large processing time from _slurmctld_background:
7 Note very large processing time from _slurm_rpc_complete_batch_script:
4 Note very large processing time from _slurm_rpc_dump_jobs:
3 Note very large processing time from _slurm_rpc_dump_job_user:
271 Note very large processing time from _slurm_rpc_dump_partitions:
5 Note very large processing time from _slurm_rpc_epilog_complete:
1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
3 Note very large processing time from _slurm_rpc_step_complete:

The processing times are always in the tens of seconds.

I'm attaching sdiag output and slurm.conf.

thanks
ale

sdiag.out
slurm.conf

John DeSantis

Jan 16, 2018, 12:56:04 PM
to Alessandro Federico, hpc-sys...@cineca.it, Slurm User Community List, Isabella Baccarelli
Ciao Alessandro,

> setting MessageTimeout to 20 didn't solve it :(
>
> looking at slurmctld logs I noticed many warning like these
>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurm_rpc_dump_partitions: usec=42850604
> began=05:10:17.289 Jan 16 05:20:58 r000u17l01 slurmctld[22307]:
> Warning: Note very large processing time from
> load_part_uid_allow_list: usec=44861325 began=05:20:13.257 Jan 16
> 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurmctld_background: usec=44861653
> began=05:20:13.257

And:

> 271 Note very large processing time from _slurm_rpc_dump_partitions:
> 67 Note very large processing time from load_part_uid_allow_list:

I believe these values are in microseconds, so an average of about 44 seconds
per call, mostly related to partition information. Given that our
configuration has the maximum value of 90 seconds set, I'd again
recommend another adjustment, perhaps to 60 seconds.

I'm not sure if redefining your partitions will help, but you do have
several partitions containing the same set of nodes that could be
condensed, reducing the number of partitions. For example, the
partitions bdw_all_serial & bdw_all_rcm could be consolidated into a
single partition (see the sketch after this list) by:

1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;
2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS
(since one partition has 04:00:00 and the other 03:00:00).

The same could be done for the partitions skl_fua_{prod,bprod,lprod} as
well.
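
A rough sketch of what that could look like (the merged partition name, node list, and the assignment of MaxWall values to each QOS are assumptions; the QOS names are the ones from your config):

# slurm.conf: one partition instead of two, gated by QOS
PartitionName=bdw_all Nodes=<bdw nodes> Default=NO MaxTime=04:00:00 State=UP AllowQOS=bdw_all_serial,bdw_all_rcm

# per-QOS wall clock limits via the accounting database
sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00
sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00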

HTH,
John DeSantis



Matthieu Hautreux

Jan 16, 2018, 1:13:44 PM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
Hi,

For this kind of issue, one good thing to do is to get a backtrace of slurmctld during the slowdown. That way you should easily identify the subcomponent responsible for the issue.
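
For example, something like this while a command is hanging (assumes gdb is installed; attaching pauses slurmctld for a moment):

gdb -batch -ex 'thread apply all bt' -p $(pidof slurmctld) > /tmp/slurmctld-bt.txt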

I would bet on something like LDAP requests taking too much time because of a missing sssd cache.

Regards
Matthieu

John DeSantis

Jan 16, 2018, 2:21:31 PM
to Matthieu Hautreux, hpc-sys...@cineca.it, Slurm User Community List, Isabella Baccarelli
Matthieu,

> I would bet on something like LDAP requests taking too much time
> because of a missing sssd cache.

Good point! It's easy to forget to check something as "simple" as user
look-up when something is taking "too long".
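
A quick sanity check from the controller node (substitute a real user and one of the groups in your AllowGroups ACLs; a cold sssd cache will show up as a large gap between the first and second run):

time getent group <allowgroups-group>
time getent passwd <some-member-of-that-group>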

John DeSantis

Alessandro Federico

Jan 17, 2018, 7:21:50 AM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
Hi Matthieu & John

this is the backtrace of slurmctld during the slowdown

(gdb) bt
#0 0x00007fb0e8b1e69d in poll () from /lib64/libc.so.6
#1 0x00007fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
#2 0x00007fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
#3 0x00007fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
#4 0x00007fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5 0x00007fb0e9360986 in _getpwnam_r (result=<optimized out>, bufsiz=<optimized out>, buf=<optimized out>, pwd=<optimized out>, name=<optimized out>) at uid.c:73
#6 uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
#7 0x000000000043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
#8 0x000000000047525a in _get_groups_members (group_names=<optimized out>) at partition_mgr.c:2006
#9 0x0000000000475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
#10 0x00007fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
#11 0x000000000047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
#12 0x0000000000428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
#13 main (argc=<optimized out>, argv=<optimized out>) at controller.c:601

As Matthieu said, it seems to be something related to the SSS daemon (sssd).
However, we don't notice any slowdown due to sssd elsewhere in our environment.
As I told you before, we are just testing SLURM on a small 100-node cluster before going into production with about 6000 nodes next Wednesday.
At present the other nodes are managed by PBSPro, and the two PBS servers run on the same nodes as the SLURM controllers.
The PBS queues are also configured with user/group ACLs and we have never noticed any similar slowdown.

Moreover, only 3 SLURM partitions have the AllowGroups ACL

[root@mgmt01 slurm]# grep AllowGroups slurm.conf
PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00 State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00 State=UP QOS=bdw_fua_gw DenyQos=bdw_qos_special AllowGroups=g2
PartitionName=bdw_fua_gwg2 Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2 DenyQos=bdw_qos_special AllowGroups=g2

So why does the UID-GID mapping take so long?

@John: we defined many partitions on the same nodes but in the production cluster they will be more or less split across the 6K nodes.

thank you very much

John DeSantis

Jan 17, 2018, 9:31:30 AM
to Alessandro Federico, hpc-sys...@cineca.it, Slurm User Community List, Isabella Baccarelli
Ale,

> As Matthieu said it seems something related to SSS daemon.

That was a great catch by Matthieu.

> Moreover, only 3 SLURM partitions have the AllowGroups ACL

Correct, which may seem negligible, but after each `scontrol
reconfigure`, slurmctld restart, and/or AllowGroups= partition update,
the mapping of UIDs for each group will be refreshed.

> So why does the UID-GID mapping take so long?

We attempted to use "AllowGroups" previously, but we found (even with
sssd.conf tuning regarding enumeration) that unless the group was local
(/etc/group), we were experiencing delays before the AllowGroups
parameter was respected. This is why we opted to use SLURM's
AllowQOS/AllowAccounts instead.

You would have to enable debugging on your remote authentication
software to see where the hang-up is occurring (if it is that at all,
and not just a delay with the slurmctld).
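
If it is sssd, the usual knob is debug_level in sssd.conf (section names and log location assumed; restart sssd afterwards and watch the logs):

[nss]
debug_level = 6

[domain/<your-domain>]
debug_level = 6

# systemctl restart sssd ; tail -f /var/log/sssd/*.log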

Given the direction that this is going - why not replace the
"AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?

> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.

Ok, that makes sense. Looking at your partition definitions initially,
I immediately thought of keeping them DRY, especially since the "finer" tuning
between the partitions could easily be controlled via the QOSes allowed
to access the resources.

John DeSantis


Alessandro Federico

Jan 17, 2018, 11:42:39 AM
to John DeSantis, hpc-sys...@cineca.it, Slurm User Community List, Isabella Baccarelli
Hi John

thanks for the info.
We are investigating the sssd slowdown, and I found some bug reports about slow sssd queries
with almost the same backtrace. Hopefully an update of sssd will solve this issue.

We'll let you know if we find a solution.

Alessandro Federico

Jan 22, 2018, 5:39:30 AM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
Hi John,

just an update...
we do not have a solution for the SSSD issue yet, but we changed the ACL
on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the
slowdown is gone.

Thanks for the help

James A. Peltier

Jan 23, 2018, 5:24:19 PM
to Slurm User Community List, Isabella Baccarelli, hpc-sys...@cineca.it
We put the SSSD caches on a RAM disk, which helped a little bit with performance.
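
A minimal sketch of that setup, assuming the default cache location under /var/lib/sss/db and a size that fits your directory data:

# /etc/fstab
tmpfs  /var/lib/sss/db  tmpfs  defaults,size=256M  0 0

# the cache is rebuilt after mounting (and lost on reboot)
systemctl stop sssd
mount /var/lib/sss/db
systemctl start sssd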
--
James A. Peltier
Manager - Research Computing Group | IT Services
Simon Fraser University | ASB10973
8888 University Dr., Burnaby, B.C. V5A 1S6
T: 778.782.6573 | M: 604.365.6432 | sfu.ca/itservices
Twitter: @sfu_it
