[slurm-users] controller node in Ubuntu 22.04.5 LTS does not work with compute node in Debian GNU/Linux 12

allspace--- via slurm-users

Jun 16, 2026, 1:32:02 AM
to slurm...@lists.schedmd.com
I downloaded slurm-26.05.1.tar.bz2 and built it on Ubuntu 22.04.5 LTS, to use as the controller node.
I downloaded slurm-26.05.1.tar.bz2 and built it on Debian GNU/Linux 12, to use as a compute node (node-3).

Running the command "srun -w node-3 hostname" on the controller node gives the following error message:

srun: error: Task launch for StepId=10.0 failed on node node-3: Header lengths are longer than data received
srun: error: Application launch failed: Header lengths are longer than data received
srun: Job step aborted

Here are the logs on node-3:

[2026-06-16T12:45:20.645] error: _verify_signature: failed decode
[2026-06-16T12:45:20.645] error: Malformed RPC of type REQUEST_LAUNCH_TASKS(6001) received
[2026-06-16T12:45:20.653] error: slurm_unpack_msg_and_forward: [192.168.245.1:59308] failed: Header lengths are longer than data received
[2026-06-16T12:45:20.663] error: wrap_on_data: [192.168.245.133:6818(fd:13)] on_data returned rc: Header lengths are longer than data received

How can I fix this problem?
Thanks

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Brian Andrus via slurm-users

Jun 17, 2026, 1:06:29 AM
to slurm...@lists.schedmd.com
Double-check the version that is actually running. I got that error when I
tried to start a too-new slurmd/sackd against an older slurmctld that was
still running.

Brian Andrus
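
A minimal sketch of that version check, assuming the running slurmctld reports a `SLURM_VERSION = ...` line in `scontrol show config` and that the node's registered slurmd version appears as `Version=...` in `scontrol show node node-3`. The function names are hypothetical helpers; they only parse text you paste or capture from those commands.

```python
import re

def controller_version(show_config_text):
    """Return the SLURM_VERSION value from 'scontrol show config' output, or None."""
    match = re.search(r"^\s*SLURM_VERSION\s*=\s*(\S+)", show_config_text, re.MULTILINE)
    return match.group(1) if match else None

def node_version(show_node_text):
    """Return the Version= value from 'scontrol show node <name>' output, or None."""
    match = re.search(r"\bVersion=(\S+)", show_node_text)
    return match.group(1) if match else None

def same_release(version_a, version_b):
    """True if both versions share the same major.minor release (e.g. 26.05)."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]
```

Comparing the two values this way shows what the daemons themselves report, which can differ from what is installed on disk if an old daemon was never restarted.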

John Hearns via slurm-users

Jun 17, 2026, 3:10:25 AM
to Brian Andrus, Slurm User Community List
This is not the answer here.
However, check that the controller node and the worker nodes are in time sync.
As I remember, if they are out of sync a connection is refused.

Bjørn-Helge Mevik via slurm-users

Jun 17, 2026, 3:45:47 AM
to slurm...@schedmd.com
allspace--- via slurm-users <slurm...@lists.schedmd.com> writes:

> srun: error: Task launch for StepId=10.0 failed on node node-3: Header lengths are longer than data received
> srun: error: Application launch failed: Header lengths are longer than data received
> srun: Job step aborted
>
> Here are the logs on node-3:
>
> [2026-06-16T12:45:20.645] error: _verify_signature: failed decode
> [2026-06-16T12:45:20.645] error: Malformed RPC of type REQUEST_LAUNCH_TASKS(6001) received
> [2026-06-16T12:45:20.653] error: slurm_unpack_msg_and_forward: [192.168.245.1:59308] failed: Header lengths are longer than data received
> [2026-06-16T12:45:20.663] error: wrap_on_data: [192.168.245.133:6818(fd:13)] on_data returned rc: Header lengths are longer than data received

My suggestion: Double check that all hosts have the same slurm.key.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
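
One way to compare the key without copying it between hosts is to compare a digest of the file. This is a minimal sketch: `file_fingerprint` and `fingerprints_match` are hypothetical helper names, and the key path (for example `/etc/slurm/slurm.key`) is an assumption that depends on how Slurm was configured at build time.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def fingerprints_match(digest_a, digest_b):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return digest_a.strip().lower() == digest_b.strip().lower()
```

Run `file_fingerprint` on each host (as a user that can read the key) and compare the printed digests; even a trailing newline added by an editor or a copy-paste will change the digest.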

allspace--- via slurm-users

Jun 17, 2026, 4:59:23 AM
to slurm...@lists.schedmd.com
All nodes run the same version of Slurm. I built it on both from the same source tarball, slurm-26.05.1.tar.bz2.
Time was in sync, and the slurm keys were the same.

Bjørn-Helge Mevik via slurm-users

Jun 17, 2026, 7:08:45 AM
to slurm...@schedmd.com
Then I'd check that the s2n-tls versions are compatible on the two OSes,
and whether there are any *other* machines running slurmd/slurmctld that
could be talking to your machines (I've had that happen, yes :) ).

--
B/H

Brian Andrus via slurm-users

Jun 17, 2026, 12:14:56 PM
to slurm...@lists.schedmd.com
Yes, to check for other systems, ensure the daemons are not started with
options that point at a different system, and that there is no _SRV record
in DNS that points elsewhere.

Brian

Christopher Samuel via slurm-users

Jun 17, 2026, 2:34:15 PM
to slurm...@lists.schedmd.com
On 6/16/26 1:11 am, allspace--- via slurm-users wrote:

> [2026-06-16T12:45:20.653] error: slurm_unpack_msg_and_forward: [192.168.245.1:59308] failed: Header lengths are longer than data received

When I've seen that reported before on the list it's looked like issues
around mismatching configs or binaries - that text comes from the error
number ESLURM_PROTOCOL_INCOMPLETE_PACKET.

As others have mentioned, make sure that all daemons have been
correctly restarted after packages were upgraded, and on top of that I'd
suggest making sure that your configurations are consistent across the
cluster.

--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
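
A rough sketch of the configuration consistency check, assuming you have the text of slurm.conf from the controller and from node-3 in hand. The helper names are hypothetical, and the comparison is deliberately simple: it strips `#` comments and extra whitespace and compares whole lines, so it does not account for case-insensitive keywords or included files.

```python
def normalized_lines(conf_text):
    """Return the set of non-empty config lines with comments and extra spaces removed."""
    lines = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            lines.add(" ".join(line.split()))  # collapse runs of whitespace
    return lines

def config_differences(conf_a, conf_b):
    """Return (lines only in conf_a, lines only in conf_b), each sorted."""
    a, b = normalized_lines(conf_a), normalized_lines(conf_b)
    return sorted(a - b), sorted(b - a)
```

Two empty lists mean the files are equivalent under this normalization; anything listed is a line present on one host but not the other.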