[slurm-users] slurmctld up and running but not really working

Julien Rey

Jul 19, 2022, 10:29:51 AM
to slurm...@lists.schedmd.com
Hello,

I am currently facing an issue with an old install of Slurm (17.02.11).
I cannot upgrade this version because I ran into trouble with the
database migration in the past (when upgrading to 17.11), and this
install is due to be replaced in the coming months. For the time being
I have to keep it running, because some of our services still rely
on it.

This issue occurred after a power outage.

slurmctld is up and running; however, when I run "sinfo", I end up
with this message after a few minutes:

slurm_load_partitions: Unable to contact slurm controller (connect failure)

I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf;
however, the logs don't show any specific error that would explain why
the slurm controller isn't working.
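
For reference, here is a sketch of the relevant lines as I have them
(the log file paths are the ones from my install, shown below):

# in /etc/slurm-llnl/slurm.conf
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

# in /etc/slurm-llnl/slurmdbd.conf
DebugLevel=7
LogFile=/var/log/slurm-llnl/slurmdbd.log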

Any help would be greatly appreciated.

/var/log/slurm-llnl/slurmctld.log:

[2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
[2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
[2022-07-19T15:17:58.345] debug:  Reading slurm.conf file:
/etc/slurm-llnl/slurm.conf
[2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort option.
[2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
[2022-07-19T15:17:58.347] layouts: no layout to initialize
[2022-07-19T15:17:58.347] debug3: Trying to load plugin
/usr/local/lib/slurm/topology_none.so
[2022-07-19T15:17:58.347] topology NONE plugin loaded
[2022-07-19T15:17:58.347] debug3: Success.
[2022-07-19T15:17:58.348] debug:  No DownNodes
[2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
[2022-07-19T15:17:58.349] debug3: Trying to load plugin
/usr/local/lib/slurm/jobcomp_none.so
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.349] debug3: Trying to load plugin
/usr/local/lib/slurm/sched_backfill.so
[2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.350] debug3: Trying to load plugin
/usr/local/lib/slurm/route_default.so
[2022-07-19T15:17:58.350] route default plugin loaded
[2022-07-19T15:17:58.350] debug3: Success.
[2022-07-19T15:17:58.355] layouts: loading entities/relations information
[2022-07-19T15:17:58.355] debug3: layouts: loading node node0
[2022-07-19T15:17:58.356] debug3: layouts: loading node node1
[2022-07-19T15:17:58.356] debug3: layouts: loading node node2
[2022-07-19T15:17:58.356] debug3: layouts: loading node node3
[2022-07-19T15:17:58.356] debug3: layouts: loading node node4
[2022-07-19T15:17:58.356] debug3: layouts: loading node node5
[2022-07-19T15:17:58.356] debug3: layouts: loading node node6
[2022-07-19T15:17:58.356] debug3: layouts: loading node node7
[2022-07-19T15:17:58.356] debug3: layouts: loading node node8
[2022-07-19T15:17:58.356] debug3: layouts: loading node node9
[2022-07-19T15:17:58.356] debug3: layouts: loading node node10
[2022-07-19T15:17:58.356] debug3: layouts: loading node node11
[2022-07-19T15:17:58.356] debug3: layouts: loading node node12
[2022-07-19T15:17:58.356] debug3: layouts: loading node node13
[2022-07-19T15:17:58.356] debug3: layouts: loading node node14
[2022-07-19T15:17:58.356] debug3: layouts: loading node node15
[2022-07-19T15:17:58.356] debug3: layouts: loading node node16
[2022-07-19T15:17:58.356] debug3: layouts: loading node node17
[2022-07-19T15:17:58.356] debug3: layouts: loading node node18
[2022-07-19T15:17:58.356] debug3: layouts: loading node node19
[2022-07-19T15:17:58.356] debug3: layouts: loading node node20
[2022-07-19T15:17:58.356] debug3: layouts: loading node node21
[2022-07-19T15:17:58.356] debug3: layouts: loading node node22
[2022-07-19T15:17:58.356] debug3: layouts: loading node node23
[2022-07-19T15:17:58.356] debug3: layouts: loading node node24
[2022-07-19T15:17:58.356] debug3: layouts: loading node node25
[2022-07-19T15:17:58.356] debug3: layouts: loading node node26
[2022-07-19T15:17:58.356] debug3: layouts: loading node node27
[2022-07-19T15:17:58.356] debug3: layouts: loading node node28
[2022-07-19T15:17:58.356] debug3: layouts: loading node node29
[2022-07-19T15:17:58.356] debug3: layouts: loading node node30
[2022-07-19T15:17:58.356] debug3: layouts: loading node node31
[2022-07-19T15:17:58.356] debug3: layouts: loading node node42
[2022-07-19T15:17:58.356] debug3: layouts: loading node node43
[2022-07-19T15:17:58.356] debug3: layouts: loading node node44
[2022-07-19T15:17:58.356] debug3: layouts: loading node node45
[2022-07-19T15:17:58.356] debug3: layouts: loading node node46
[2022-07-19T15:17:58.356] debug3: layouts: loading node node47
[2022-07-19T15:17:58.356] debug3: layouts: loading node node49
[2022-07-19T15:17:58.356] debug3: layouts: loading node node50
[2022-07-19T15:17:58.356] debug3: layouts: loading node node51
[2022-07-19T15:17:58.356] debug3: layouts: loading node node52
[2022-07-19T15:17:58.356] debug3: layouts: loading node node53
[2022-07-19T15:17:58.356] debug3: layouts: loading node node54
[2022-07-19T15:17:58.356] debug3: layouts: loading node node55
[2022-07-19T15:17:58.356] debug3: layouts: loading node node56
[2022-07-19T15:17:58.356] debug3: layouts: loading node node60
[2022-07-19T15:17:58.356] debug3: layouts: loading node node61
[2022-07-19T15:17:58.356] debug3: layouts: loading node node62
[2022-07-19T15:17:58.356] debug3: layouts: loading node node63
[2022-07-19T15:17:58.356] debug3: layouts: loading node node64
[2022-07-19T15:17:58.356] debug3: layouts: loading node node65
[2022-07-19T15:17:58.356] debug3: layouts: loading node node66
[2022-07-19T15:17:58.356] debug3: layouts: loading node node67
[2022-07-19T15:17:58.356] debug3: layouts: loading node node68
[2022-07-19T15:17:58.356] debug3: layouts: loading node node73
[2022-07-19T15:17:58.356] debug3: layouts: loading node node74
[2022-07-19T15:17:58.356] debug3: layouts: loading node node75
[2022-07-19T15:17:58.356] debug3: layouts: loading node node76
[2022-07-19T15:17:58.356] debug3: layouts: loading node node77
[2022-07-19T15:17:58.356] debug3: layouts: loading node node78
[2022-07-19T15:17:58.356] debug3: layouts: loading node node100
[2022-07-19T15:17:58.356] debug3: layouts: loading node node101
[2022-07-19T15:17:58.356] debug3: layouts: loading node node102
[2022-07-19T15:17:58.356] debug3: layouts: loading node node103
[2022-07-19T15:17:58.356] debug3: layouts: loading node node104
[2022-07-19T15:17:58.356] debug3: layouts: loading node node105
[2022-07-19T15:17:58.356] debug3: layouts: loading node node106
[2022-07-19T15:17:58.356] debug3: layouts: loading node node107
[2022-07-19T15:17:58.356] debug3: layouts: loading node node108
[2022-07-19T15:17:58.356] debug3: layouts: loading node node109
[2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash table, rc=0
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 (restore state)
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
[2022-07-19T15:17:58.356] error: Node state file
/var/lib/slurm-llnl/slurmctld/node_state too small
[2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
Information may be lost!
[2022-07-19T15:17:58.356] debug3: Version string in node_state header is
PROTOCOL_VERSION
[2022-07-19T15:17:58.357] Recovered state of 71 nodes
[2022-07-19T15:17:58.357] error: Job state file
/var/lib/slurm-llnl/slurmctld/job_state too small
[2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
Jobs may be lost!
[2022-07-19T15:17:58.357] error: Incomplete job state save file
[2022-07-19T15:17:58.357] Recovered information about 0 jobs
[2022-07-19T15:17:58.357] cons_res: select_p_node_init
[2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
[2022-07-19T15:17:58.357] debug:  Ports available for reservation
10000-30000
[2022-07-19T15:17:58.359] debug2: init_requeue_policy:
kill_invalid_depend is set to 0
[2022-07-19T15:17:58.359] debug:  Updating partition uid access list
[2022-07-19T15:17:58.359] debug3: Version string in resv_state header is
PROTOCOL_VERSION
[2022-07-19T15:17:58.359] Recovered state of 0 reservations
[2022-07-19T15:17:58.359] State of 0 triggers recovered


/var/log/slurm-llnl/slurmdbd.log:

[2022-07-19T15:00:45.265] debug3: Trying to load plugin
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:00:45.265] debug:  Munge authentication plugin loaded
[2022-07-19T15:00:45.265] debug3: Success.
[2022-07-19T15:00:45.265] debug3: Trying to load plugin
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:00:45.268] debug2: mysql_connect() called for db
slurm_acct_db
[2022-07-19T15:00:45.402] debug2: It appears the table conversions have
already taken place, hooray!
[2022-07-19T15:00:48.146] Accounting storage MYSQL plugin loaded
[2022-07-19T15:00:48.147] debug3: Success.
[2022-07-19T15:00:48.153] debug2: ArchiveDir        = /home/slurm
[2022-07-19T15:00:48.153] debug2: ArchiveScript     = (null)
[2022-07-19T15:00:48.153] debug2: AuthInfo          =
/var/run/munge/munge.socket.2
[2022-07-19T15:00:48.154] debug2: AuthType          = auth/munge
[2022-07-19T15:00:48.154] debug2: CommitDelay       = 0
[2022-07-19T15:00:48.154] debug2: DbdAddr           = localhost
[2022-07-19T15:00:48.154] debug2: DbdBackupHost     = (null)
[2022-07-19T15:00:48.154] debug2: DbdHost           = localhost
[2022-07-19T15:00:48.154] debug2: DbdPort           = 6819
[2022-07-19T15:00:48.154] debug2: DebugFlags        = (null)
[2022-07-19T15:00:48.154] debug2: DebugLevel        = 7
[2022-07-19T15:00:48.154] debug2: DefaultQOS        = (null)
[2022-07-19T15:00:48.154] debug2: LogFile           =
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:00:48.154] debug2: MessageTimeout    = 10
[2022-07-19T15:00:48.154] debug2: PidFile           =
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:00:48.154] debug2: PluginDir         = /usr/local/lib/slurm
[2022-07-19T15:00:48.154] debug2: PrivateData       = none
[2022-07-19T15:00:48.154] debug2: PurgeEventAfter   = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeJobAfter     = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeResvAfter    = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeStepAfter    = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:00:48.154] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:00:48.154] debug2: SlurmUser         = slurm(64030)
[2022-07-19T15:00:48.154] debug2: StorageBackupHost = (null)
[2022-07-19T15:00:48.154] debug2: StorageHost       = localhost
[2022-07-19T15:00:48.154] debug2: StorageLoc        = slurm_acct_db
[2022-07-19T15:00:48.154] debug2: StoragePort       = 3306
[2022-07-19T15:00:48.154] debug2: StorageType       =
accounting_storage/mysql
[2022-07-19T15:00:48.154] debug2: StorageUser       = slurm
[2022-07-19T15:00:48.154] debug2: TCPTimeout        = 2
[2022-07-19T15:00:48.154] debug2: TrackWCKey        = 0
[2022-07-19T15:00:48.154] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:00:48.154] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:00:48.430] slurmdbd version 17.02.11 started
[2022-07-19T15:00:48.431] debug2: running rollup at Tue Jul 19 15:00:48 2022
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T15:00:48.436] debug2: Got 1 of 2 rolled up
[2022-07-19T15:00:48.454] error: We have more time than is possible
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from
2022-07-19T14:00:00 - 2022-07-19T15:00:00 tres 1
[2022-07-19T15:00:48.456] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T15:00:48.457] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T15:00:48.458] debug2: Got 2 of 2 rolled up
[2022-07-19T15:00:48.458] debug2: Everything rolled up
[2022-07-19T15:01:05.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:01:05.001] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:01:05.001] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:06:52.329] Terminate signal (SIGINT or SIGTERM) received
[2022-07-19T15:06:52.330] debug:  rpc_mgr shutting down
[2022-07-19T15:06:52.331] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:06:52.332] debug3: starting mysql cleaning up
[2022-07-19T15:06:52.332] debug3: finished mysql cleaning up
[2022-07-19T15:11:13.288] debug3: Trying to load plugin
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:11:13.301] debug:  Munge authentication plugin loaded
[2022-07-19T15:11:13.301] debug3: Success.
[2022-07-19T15:11:13.301] debug3: Trying to load plugin
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:11:13.362] debug2: mysql_connect() called for db
slurm_acct_db
[2022-07-19T15:11:15.447] debug2: It appears the table conversions have
already taken place, hooray!
[2022-07-19T15:11:40.975] Accounting storage MYSQL plugin loaded
[2022-07-19T15:11:40.975] debug3: Success.
[2022-07-19T15:11:40.978] debug2: ArchiveDir        = /home/slurm
[2022-07-19T15:11:40.978] debug2: ArchiveScript     = (null)
[2022-07-19T15:11:40.978] debug2: AuthInfo          =
/var/run/munge/munge.socket.2
[2022-07-19T15:11:40.978] debug2: AuthType          = auth/munge
[2022-07-19T15:11:40.978] debug2: CommitDelay       = 0
[2022-07-19T15:11:40.978] debug2: DbdAddr           = localhost
[2022-07-19T15:11:40.978] debug2: DbdBackupHost     = (null)
[2022-07-19T15:11:40.978] debug2: DbdHost           = localhost
[2022-07-19T15:11:40.978] debug2: DbdPort           = 6819
[2022-07-19T15:11:40.978] debug2: DebugFlags        = (null)
[2022-07-19T15:11:40.978] debug2: DebugLevel        = 7
[2022-07-19T15:11:40.978] debug2: DefaultQOS        = (null)
[2022-07-19T15:11:40.978] debug2: LogFile           =
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:11:40.978] debug2: MessageTimeout    = 10
[2022-07-19T15:11:40.978] debug2: PidFile           =
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:11:40.978] debug2: PluginDir         = /usr/local/lib/slurm
[2022-07-19T15:11:40.978] debug2: PrivateData       = none
[2022-07-19T15:11:40.978] debug2: PurgeEventAfter   = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeJobAfter     = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeResvAfter    = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeStepAfter    = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:11:40.978] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:11:40.979] debug2: SlurmUser         = slurm(64030)
[2022-07-19T15:11:40.979] debug2: StorageBackupHost = (null)
[2022-07-19T15:11:40.979] debug2: StorageHost       = localhost
[2022-07-19T15:11:40.979] debug2: StorageLoc        = slurm_acct_db
[2022-07-19T15:11:40.979] debug2: StoragePort       = 3306
[2022-07-19T15:11:40.979] debug2: StorageType       =
accounting_storage/mysql
[2022-07-19T15:11:40.979] debug2: StorageUser       = slurm
[2022-07-19T15:11:40.979] debug2: TCPTimeout        = 2
[2022-07-19T15:11:40.979] debug2: TrackWCKey        = 0
[2022-07-19T15:11:40.979] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:11:40.979] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:11:41.168] slurmdbd version 17.02.11 started
[2022-07-19T15:11:41.168] debug2: running rollup at Tue Jul 19 15:11:41 2022
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: Got 2 of 2 rolled up
[2022-07-19T15:11:41.170] debug2: Everything rolled up
[2022-07-19T15:11:58.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:11:58.003] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:11:58.003] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:15:28.671] debug2: Opened connection 10 from 10.0.1.51
[2022-07-19T15:15:28.672] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:0 IP:10.0.1.51 CONN:10
[2022-07-19T15:15:28.672] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:15:28.710] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2022-07-19T15:15:28.790] debug2: DBD_GET_USERS: called
[2022-07-19T15:15:28.847] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2022-07-19T15:15:28.847] debug2: persistant connection is closed
[2022-07-19T15:15:28.847] debug2: Closed connection 10 uid(0)
[2022-07-19T15:17:19.421] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:17:53.635] debug2: Opened connection 12 from 10.0.1.51
[2022-07-19T15:17:53.636] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:12
[2022-07-19T15:17:53.636] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:17:53.674] debug2: DBD_GET_TRES: called
[2022-07-19T15:17:53.754] debug2: DBD_GET_QOS: called
[2022-07-19T15:17:53.834] debug2: DBD_GET_USERS: called
[2022-07-19T15:17:54.150] debug2: DBD_GET_ASSOCS: called
[2022-07-19T15:17:58.304] debug2: DBD_GET_RES: called
[2022-07-19T16:00:00.171] debug2: running rollup at Tue Jul 19 16:00:00 2022
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T16:00:00.320] debug2: Got 1 of 2 rolled up
[2022-07-19T16:00:01.603] error: We have more time than is possible
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from
2022-07-19T15:00:00 - 2022-07-19T16:00:00 tres 1
[2022-07-19T16:00:01.693] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T16:00:01.694] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T16:00:01.711] debug2: Got 2 of 2 rolled up
[2022-07-19T16:00:01.711] debug2: Everything rolled up

--
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95


Ole Holm Nielsen

Jul 19, 2022, 2:30:28 PM
to slurm...@lists.schedmd.com
Hi Julien,

Apparently your slurmdbd is quite happy, but it seems that your
slurmctld StateSaveLocation has been corrupted:

> [2022-07-19T15:17:58.356] error: Node state file /var/lib/slurm-llnl/slurmctld/node_state too small
> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. Information may be lost!
> [2022-07-19T15:17:58.356] debug3: Version string in node_state header is PROTOCOL_VERSION
> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
> [2022-07-19T15:17:58.357] error: Job state file /var/lib/slurm-llnl/slurmctld/job_state too small
> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. Jobs may be lost!
> [2022-07-19T15:17:58.357] error: Incomplete job state save file

Did something bad happen to the storage of
/var/lib/slurm-llnl/slurmctld/? Could you possibly restore this folder
from your last backup?
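
If it helps, a rough sketch of what I would try (the backup path is
just a placeholder, and I'm assuming slurmctld runs under systemd on
your install):

systemctl stop slurmctld
ls -l /var/lib/slurm-llnl/slurmctld/   # node_state, job_state etc. should not be zero-size
cp -a /path/to/backup/slurmctld/. /var/lib/slurm-llnl/slurmctld/
chown -R slurm:slurm /var/lib/slurm-llnl/slurmctld/
systemctl start slurmctld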

I don't know whether it's possible to recover from a corrupted
slurmctld StateSaveLocation; perhaps others here have experience with this?

Even if you could restore it, the Slurm database probably needs to be
consistent with your slurmctld StateSaveLocation, and I don't know if
this is feasible...

Could you reinitialize your Slurm 17.02.11 and start it from scratch?

Regarding an upgrade from 17.02 or 17.11, you may find some useful notes
in my Wiki page
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

/Ole

Julien Rey

Jul 20, 2022, 8:20:40 AM
to slurm...@lists.schedmd.com
Hello,

Thanks for your quick reply.

I don't mind losing job information, but I certainly don't want to clear
the Slurm database altogether.

The /var/lib/slurm-llnl/slurmctld/node_state and
/var/lib/slurm-llnl/slurmctld/node_state.old files do indeed look
empty. I then ran the following command:

sacct | grep RUNNING

and found about 253 jobs.
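
A slightly more targeted query along these lines should also work (the
start time is arbitrary, and the format fields are just the ones I find
useful):

sacct -X -s RUNNING --starttime=2022-01-01 --format=JobID,User,Partition,State,Start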

Is there any elegant way to remove these jobs from the database?

J.

Ole Holm Nielsen

Jul 20, 2022, 8:45:34 AM
to slurm...@lists.schedmd.com
Hi Julien,

You could make a dump of the current database so that you can load it
on another server outside the cluster, while you reinitialize Slurm
with a fresh database.
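
A minimal sketch, using the database name, user and host that appear in
your slurmdbd log (adjust if your setup differs):

mysqldump -u slurm -p -h localhost slurm_acct_db > slurm_acct_db_backup.sql

On the other server you would first create an empty database and then
load the dump into it with the mysql client.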

So the database thinks that you have 253 running jobs? I guess that
slurmctld is not working, otherwise you could do: squeue -t running

This command can report current jobs that have been orphaned on the local
cluster and are now runaway:

sacctmgr show runawayjobs

Read the sacctmgr manual page.

I hope this helps.

/Ole

Julien Rey

Jul 20, 2022, 11:06:49 AM
to slurm...@lists.schedmd.com
Hello,

Unfortunately, sacctmgr show runawayjobs returns the following
error:

sacctmgr: error: Slurmctld running on cluster cluster is not up, can't
check running jobs

J.

Julien Rey

Jul 20, 2022, 11:17:26 AM
to slurm...@lists.schedmd.com
Actually, I was able to fix the problem by starting slurmctld with the
-c option and then clearing the runaway jobs with sacctmgr.
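
Roughly what I did, for the record (a sketch; how the service is
normally started on this old install is glossed over, and -c tells
slurmctld to ignore its saved state, so previously queued/running jobs
are discarded):

systemctl stop slurmctld
slurmctld -c
sacctmgr show runawayjobs   # answer "y" when asked to fix the runaway jobs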

Thanks for your help.

J.