Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error


Craig Stark

Jan 8, 2024, 1:52:15 PM
to slurm...@lists.schedmd.com
3rd time trying to get this to come through to the list - hopefully this time works.

I've been running SLURM for several years now, but in setting it up on a new cluster, I'm hitting a recurring issue.  I'm using MariaDB, configured just as in my setup from several years ago and as described in the docs.  There's a "slurm" user (UID 59999) on the OS (Rocky 9) that exists on all the nodes, and I've added slurm@localhost as instructed (grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'PASSWORD').  But I keep getting errors like this:

```
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: DBD_SEND_MULT_MSG message from invalid uid 59999
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: Processing last message from connection 7(192.168.1.2) uid(59999)
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: CONN:7 DBD_REGISTER_CTLD message from invalid uid 59999
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: CONN:7 Security violation, DBD_REGISTER_CTLD
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: Processing last message from connection 7(192.168.1.2) uid(59999)
```

I'm a total SQL noob, but can at least verify that the user is in there:
MariaDB [(none)]> SELECT User, Host, Password FROM mysql.user;
+-------------+-----------+-------------------------------------------+
| User        | Host      | Password                                  |
+-------------+-----------+-------------------------------------------+
| mariadb.sys | localhost |                                           |
| root        | localhost | invalid                                   |
| mysql       | localhost | invalid                                   |
| slurm       | localhost | *D6665ECF4F3CB12BCA836117F7727B6D0B78D644 |
+-------------+-----------+-------------------------------------------+
4 rows in set (0.002 sec)

Any thoughts as to where I might look to fix this?

Craig

Timony, Mick

Jan 8, 2024, 2:49:13 PM
to Slurm User Community List
This ticket with SchedMD implies it's a munged issue:


Is the munge daemon running on all systems? If it is, are all servers running a network time daemon such as chronyd or ntpd, and is the time in sync on all hosts?
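
One way to test this beyond `systemctl status` is to round-trip a credential: encode it on one host and decode it on another. The sketch below assumes the `munge` and `unmunge` command-line tools are installed and that passwordless ssh between hosts works; the function name `check_munge` is made up for illustration. If the keys differ or the clocks are too far apart, `unmunge` should exit non-zero.

```shell
#!/usr/bin/env bash
# Sketch only: round-trip a munge credential.
#   check_munge        -> encode and decode on this host
#   check_munge HOST   -> encode here, decode on HOST over ssh (assumes ssh access)
check_munge() {
    local host="${1:-}"
    if [ -z "$host" ]; then
        # Local round trip: success means munged is up and the key is readable.
        munge -n | unmunge > /dev/null
    else
        # Cross-host round trip: success means both hosts share the same key
        # and their clocks are close enough for the credential to be accepted.
        munge -n | ssh "$host" unmunge > /dev/null
    fi
}

# Usage (not run automatically; replace HOST with one of your nodes):
#   check_munge && echo "local munge OK"
#   check_munge HOST && echo "HOST can decode our credentials"
```

Running it in both directions (controller to database host and back) is probably worthwhile, since the credential is created on one side and validated on the other.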

Regards
--Mick


Craig Stark

Jan 8, 2024, 5:47:14 PM
to slurm...@lists.schedmd.com
Thanks Mick,

munge is seemingly running on all systems (systemctl status munge).  I do get a warning about the munge file changing on disk, but I'm pretty sure that's from warewulf sync'ing files every minute.  A sha256sum on the munge.key file on the compute nodes and host node says they're the same, so I think I can put that aside.
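
For anyone repeating the checksum comparison across many nodes, here is a minimal sketch. It assumes you collect one `<host> <sha256>` line per node; the helper name `keys_match` and the key path `/etc/munge/munge.key` in the usage comment are assumptions (that is the usual default location, but it is not stated in this thread).

```shell
#!/usr/bin/env bash
# Sketch only: reads "<host> <sha256>" lines on stdin and succeeds
# only if every line carries the same hash. Empty input counts as a failure.
keys_match() {
    awk 'NF >= 2 && !seen[$2]++ { n++ } END { exit (n == 1 ? 0 : 1) }'
}

# Collecting the input (not run automatically; assumes root ssh to each host):
#   for h in kirby sonic01; do
#       printf '%s %s\n' "$h" "$(ssh "$h" sha256sum /etc/munge/munge.key | cut -d' ' -f1)"
#   done | keys_match && echo "munge.key identical on all hosts"
```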

The management node runs chrony and the compute nodes sync to the management node. 
[root@kirby uber]# chronyc tracking
Reference ID    : 4A06A849 (t2.time.gq1.yahoo.com)
Stratum         : 3
Ref time (UTC)  : Mon Jan 08 22:26:44 2024
System time     : 0.000032525 seconds slow of NTP time
Last offset     : -0.000021390 seconds
RMS offset      : 0.000055729 seconds
Frequency       : 38.797 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.018 ppm
Root delay      : 0.033342984 seconds
Root dispersion : 0.000524800 seconds
Update interval : 256.8 seconds
Leap status     : Normal

vs
[root@sonic01 ~]# chronyc tracking
Reference ID    : C0A80102 (warewulf)
Stratum         : 4
Ref time (UTC)  : Mon Jan 08 22:31:02 2024
System time     : 0.000000120 seconds slow of NTP time
Last offset     : -0.000000092 seconds
RMS offset      : 0.000014737 seconds
Frequency       : 47.495 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.066 ppm
Root delay      : 0.033458963 seconds
Root dispersion : 0.000283949 seconds
Update interval : 64.2 seconds
Leap status     : Normal

So, the compute node is talking to the host and the host is talking to generic NTP sources.  "date" shows the same time on the compute nodes.
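
To turn the eyeball comparison into a repeatable check, the sketch below reads `chronyc tracking` output and tests the "System time" offset against a limit. The function name `chrony_offset_ok` and the 1-second default are arbitrary choices for illustration. As far as I know munge credentials have a default lifetime of 300 seconds, so microsecond offsets like the ones above should be nowhere near a problem.

```shell
#!/usr/bin/env bash
# Sketch only: reads `chronyc tracking` output on stdin.
# Exit status: 0 = offset within MAX seconds, 1 = offset too large,
#              2 = no "System time" line found in the input.
chrony_offset_ok() {
    awk -v max="${1:-1}" '
        /^System time/ { off = $4; found = 1 }   # 4th field is the offset in seconds
        END {
            if (!found) exit 2
            exit ((off + 0 <= max + 0) ? 0 : 1)
        }'
}

# Usage (not run automatically):
#   chronyc tracking | chrony_offset_ok 0.5 || echo "clock offset too large on $(hostname)"
```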

Timony, Mick

Jan 9, 2024, 9:55:28 AM
to slurm...@lists.schedmd.com
You could enable debug logging on your slurm controllers to see if that provides some more useful info. I'd also check your firewall settings to make sure you're not blocking some traffic that you shouldn't. iptables -F will clear your local Linux firewall rules.

I'd also triple check the UID on all the systems and run this on all your compute nodes, slurm controllers, and slurmdb to make sure it is the same! 🙂

id 59999
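
If you would rather check by name than by number, a small sketch like the following could be run on each host. The helper name `uid_matches` is hypothetical; it only wraps `id -u`.

```shell
#!/usr/bin/env bash
# Sketch only: succeeds if USER resolves to EXPECTED_UID on this host.
# Exit status: 0 = match, 1 = user exists with a different UID,
#              2 = user does not exist on this host.
uid_matches() {
    local user="$1" expected="$2" actual
    actual=$(id -u "$user" 2>/dev/null) || return 2
    [ "$actual" = "$expected" ]
}

# Usage (not run automatically):
#   uid_matches slurm 59999 || echo "slurm UID problem on $(hostname)"
```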

I'd also restart all the slurm daemons on all the systems to make sure that you don't have systems that are running a daemon from before you created UID 59999, as running processes often don't pick up changes like that unless they're restarted.


Cheers
-- 
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--

