[slurm-users] Is this a known error?

Andreas Davour

Sep 17, 2021, 4:33:39 AM
to slurm...@lists.schedmd.com

Hi

I get some weird errors in the slurmd.log on my nodes. Maybe it's not
possible to comment upon without seeing my slurm.conf but I wanted to
check if it's a known problem or something simple. Jobs run just fine,
and slurmd and slurmctld start without errors. I run
slurm-20.11.8-1.el7.x86_64 and use the
SlurmctldParameters=enable_configless setup.

[2021-09-17T08:53:49.166] error: unpack_header: protocol_version 8448 not supported
[2021-09-17T08:53:49.166] error: unpacking header
[2021-09-17T08:53:49.166] error: destroy_forward: no init
[2021-09-17T08:53:49.166] error: slurm_receive_msg_and_forward: Message receive failure
[2021-09-17T08:53:49.176] error: service_connection: slurm_receive_msg: Message receive failure

Anyone seen that before, or immediately see that I did something wrong?

/andreas

Bjørn-Helge Mevik

Sep 17, 2021, 5:54:40 AM
to slurm...@schedmd.com
Andreas Davour <andreas...@conoa.se> writes:

> [2021-09-17T08:53:49.166] error: unpack_header: protocol_version 8448 not supported
> [2021-09-17T08:53:49.166] error: unpacking header
> [2021-09-17T08:53:49.166] error: destroy_forward: no init
> [2021-09-17T08:53:49.166] error: slurm_receive_msg_and_forward: Message receive failure
> [2021-09-17T08:53:49.176] error: service_connection: slurm_receive_msg: Message receive failure
>
> Anyone seen that before, or immediately see that I did something wrong?

Sounds a lot like you have a different version of Slurm installed on some
compute node(s).

--
B/H

Andreas Davour

Sep 17, 2021, 9:21:27 AM
to slurm...@lists.schedmd.com
That's the kind of impression I was hoping for.

Yeah, I thought that as well, but I cannot find any packages that differ,
and as far as I know they have all been restarted.

I'll see if there is anything like a version mismatch somewhere.

/andreas




Sean McGrath

Dec 7, 2021, 12:20:37 PM
to Slurm User Community List
Hi,

I'm seeing something similar.

slurmdbd version is 21.08.4

All the slurmd's & slurmctld's are version 20.11.8

This is what is in the slurmdbd.log

[2021-12-07T17:16:50.001] error: unpack_header: protocol_version 8704 not supported
[2021-12-07T17:16:50.001] error: unpacking header
[2021-12-07T17:16:50.001] error: destroy_forward: no init
[2021-12-07T17:16:50.001] error: slurm_unpack_received_msg: Message receive failure
[2021-12-07T17:16:50.011] error: CONN:17 Failed to unpack SLURM_PERSIST_INIT message
[2021-12-07T17:17:09.001] error: unpack_header: protocol_version 8704 not supported
[2021-12-07T17:17:09.001] error: unpacking header
[2021-12-07T17:17:09.001] error: destroy_forward: no init
[2021-12-07T17:17:09.001] error: slurm_unpack_received_msg: Message receive failure
[2021-12-07T17:17:09.011] error: CONN:35 Failed to unpack SLURM_PERSIST_INIT message

I've looked through our clusters but don't see any that aren't 20.11.8.

Can anyone advise how to identify the clients that are generating those
errors please?

Thanks

Sean
--
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.m...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725


Bjørn-Helge Mevik

Dec 8, 2021, 3:04:32 AM
to slurm...@schedmd.com
Sean McGrath <smc...@tchpc.tcd.ie> writes:

> I'm seeing something similar.
>
> slurmdbd version is 21.08.4
>
> All the slurmd's & slurmctld's are version 20.11.8
>
> This is what is in the slurmdbd.log
>
> [2021-12-07T17:16:50.001] error: unpack_header: protocol_version 8704 not supported

I believe 8704 corresponds to 19.05.x, which is no longer accepted in
21.08.x.
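
For anyone wanting to check other numbers from these logs, here is a minimal sketch of the decoding. It assumes the protocol version is stored as (release index << 8) | minor, and that the release table below matches the constants in Slurm's slurm_protocol_common.h; both are my understanding of the source rather than something stated in this thread, and decode_protocol_version is a hypothetical helper name.

```python
# Assumption: Slurm packs the protocol version as (release_index << 8) | minor.
# Assumption: this table mirrors the constants in slurm_protocol_common.h.
KNOWN_RELEASES = {
    33: "18.08",
    34: "19.05",
    35: "20.02",
    36: "20.11",
    37: "21.08",
}


def decode_protocol_version(value: int) -> str:
    """Return the Slurm release series for a logged protocol_version number."""
    release_index, minor = value >> 8, value & 0xFF
    release = KNOWN_RELEASES.get(release_index)
    if release is None:
        return f"unknown (high byte {release_index}, low byte {minor})"
    return release


# Values that appear in this thread, plus 9216 for comparison.
for logged in (8448, 8704, 9216, 9472):
    print(logged, "->", decode_protocol_version(logged))
```

Under those assumptions, 8704 decodes to 19.05 as stated above, and the 8448 from the first message in the thread would decode to 18.08.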

> Can anyone advise how to identify the clients that are generating those
> errors please?

I don't think slurmd connects directly to slurmdbd, so perhaps it is
some frontend node or machine outside the cluster itself which has the
slurm commands installed and is doing requests to slurmdbd (sacct,
sacctmgr, etc.)?

With DebugLevel in slurmdbd.conf set to debug or higher, new client
connections will be logged with

[2021-12-08T09:00:07.992] debug: REQUEST_PERSIST_INIT: CLUSTER:saga VERSION:9472 UID:51568 IP:10.2.3.185 CONN:8

in slurmdbd.log. But perhaps that will not happen if slurmdbd fails to
unpack the header?
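
If those debug lines do show up, a small script can summarize which hosts are connecting and with which protocol version. This is only a sketch: the regular expression is derived solely from the one sample line above, so other Slurm versions may format the message differently, and summarize_clients is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived only from the sample REQUEST_PERSIST_INIT line in this
# message; it is an assumption that other lines follow the same layout.
PERSIST_INIT = re.compile(
    r"REQUEST_PERSIST_INIT: CLUSTER:(?P<cluster>\S+) VERSION:(?P<version>\d+) "
    r"UID:(?P<uid>\d+) IP:(?P<ip>\S+) CONN:(?P<conn>\d+)"
)


def summarize_clients(lines):
    """Count connections per (ip, protocol version, uid) found in log lines."""
    counts = Counter()
    for line in lines:
        match = PERSIST_INIT.search(line)
        if match:
            key = (match["ip"], int(match["version"]), int(match["uid"]))
            counts[key] += 1
    return counts


sample = [
    "[2021-12-08T09:00:07.992] debug: REQUEST_PERSIST_INIT: CLUSTER:saga "
    "VERSION:9472 UID:51568 IP:10.2.3.185 CONN:8",
    "[2021-12-07T17:16:50.001] error: unpacking header",
]
for (ip, version, uid), n in summarize_clients(sample).items():
    print(f"{ip} version={version} uid={uid} connections={n}")
```

To run it against a real log, pass an open file object instead of the sample list. It can only report clients whose header was unpacked successfully, so the caveat above still applies.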

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


Sean McGrath

Dec 8, 2021, 12:04:49 PM
to Slurm User Community List
Hi Bjørn-Helge,

Thanks for that.

On Wed, Dec 08, 2021 at 09:03:36AM +0100, Bjørn-Helge Mevik wrote:

> Sean McGrath <smc...@tchpc.tcd.ie> writes:
>
> > I'm seeing something similar.
> >
> > slurmdbd version is 21.08.4
> >
> > All the slurmd's & slurmctld's are version 20.11.8
> >
> > This is what is in the slurmdbd.log
> >
> > [2021-12-07T17:16:50.001] error: unpack_header: protocol_version 8704 not supported
>
> I believe 8704 corresponds to 19.05.x, which is no longer accepted in
> 21.08.x.
>
> > Can anyone advise how to identify the clients that are generating those
> > errors please?
>
> I don't think slurmd connects directly to slurmdbd, so perhaps it is
> some frontend node or machine outside the cluster itself which has the
> slurm commands installed and is doing requests to slurmdbd (sacct,
> sacctmgr, etc.)?

Yes, I think that is it. I haven't been able to track it down and will
just have to live with the messages in the logs.

>
> With SlurmdbdDebug set to debug or higher, new client connections will
> be logged with
>
> [2021-12-08T09:00:07.992] debug: REQUEST_PERSIST_INIT: CLUSTER:saga VERSION:9472 UID:51568 IP:10.2.3.185 CONN:8
>
> in slurmdbd.log. But perhaps that will not happen if slurmdbd fails to
> unpack the header?

Unfortunately it doesn't: slurmdbd can't unpack the headers, so I don't
get a more informative error.

Thanks for your help all the same.

Sean


>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>



Nicolas Greneche

Feb 8, 2022, 6:20:54 AM
to slurm...@lists.schedmd.com
Hi,

I had the same issue. It was just because I had an older slurmctld
somewhere with the node set to drain. Even if the node is drained in the
old slurmctld, it still tries to connect to slurmd.

Nicolas Greneche
USPN
Research support / CISO
https://www-magi.univ-paris13.fr
