Good day.
Is there some command that I can use in Slurm to see a node’s history?
Not the job history, but the state history.
Something like:
Jul 5 13:11:01 node01 taken offline by slurmctld because node01 not responding
And/Or:
Jul 5 13:11:01 node01 taken offline by USER1 state=DRAIN reason="System acting up, going to reboot"
And/Or:
Jul 5 13:11:01 node01 online by USER1
My goal/idea is to see if a node has been having problems according to Slurm itself.
Or if someone DOWNed a node for some reason.
Or to see if a node was down and just returned to service recently.
Does anything like that already exist in Slurm?
Thanks!
- Bill
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto bbene...@goodyear.com The Goodyear Tire & Rubber Co.
I don't speak for Goodyear and they don't speak for me. We're both happy.
Hi Bill,
Your best bet is probably the slurmctld log (/var/log/slurmctld here; the actual path is whatever SlurmctldLogFile is set to in slurm.conf) on the server that is acting as the active controller.
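
A minimal sketch of that approach, assuming the log is readable by the user running the script. The path below is only the one mentioned above; `scontrol show config | grep SlurmctldLogFile` should report the real one for a given cluster. The function name `lines_mentioning` is made up for illustration, and the exact wording of slurmctld's log messages varies between versions, so this filters on the node name alone rather than on any particular message text.

```python
import re


def lines_mentioning(log_path, node):
    """Return every line of the log file that mentions the node name.

    Word boundaries keep "node01" from also matching "node010".
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b")
    # errors="replace" avoids a crash on stray non-UTF-8 bytes in the log.
    with open(log_path, errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if pattern.search(line)]


if __name__ == "__main__":
    # Assumed path; substitute the value of SlurmctldLogFile.
    for line in lines_mentioning("/var/log/slurmctld", "node01"):
        print(line)
```

Note that this only covers whatever the log still holds after rotation, and only on the controller that was active at the time.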
Best,
--
Roberto P. Monti
DevOps Engineer I
The Jackson Laboratory
United States | China | Japan
Hi Bill,
I think the command you’re looking for is `sacctmgr show event`.
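
For example, the query can be narrowed to one node and a start date. The sketch below wraps that in Python so the result can be filtered or sorted afterwards. It assumes accounting goes through slurmdbd (which `sacctmgr` talks to) and that `sacctmgr` is on the PATH. The helper names (`build_event_query`, `parse_events`, `node_events`) are invented for illustration, and the field list is one reasonable choice, not the only one. As far as I recall, without `Start=` the default window only reaches back to the start of the previous day, so older events need an explicit start time.

```python
import csv
import io
import subprocess

# Event fields to request; adjust to taste.
FIELDS = "NodeName,Start,End,State,Reason,User"


def build_event_query(node, start):
    """Build the sacctmgr argv for one node's events since `start`."""
    return [
        "sacctmgr", "--parsable2", "show", "event",
        f"Nodes={node}", f"Start={start}", f"Format={FIELDS}",
    ]


def parse_events(text):
    """Turn pipe-delimited --parsable2 output (header row first) into dicts.

    QUOTE_NONE keeps quote characters inside a reason string untouched.
    A reason that itself contains "|" would still confuse this parser.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="|",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def node_events(node, start):
    """Run the query and return one dict per event row."""
    result = subprocess.run(build_event_query(node, start),
                            check=True, capture_output=True, text=True)
    return parse_events(result.stdout)


if __name__ == "__main__":
    for event in node_events("node01", "2023-07-01"):
        print(event)
```

The equivalent one-liner at the shell is the joined argument list, e.g. `sacctmgr show event Nodes=node01 Start=2023-07-01 Format=NodeName,Start,End,State,Reason,User`.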
Best,
Steve