I am running into a strange scheduling issue. If ANY nodes are available on the first island (s0), the topology plugin always tries to place the job there, even when that island does not have enough free nodes to satisfy the full request. As a result, jobs get stuck in PD (Resources) even though they could run on the second island.
I'd be extremely thankful for any help in understanding what is happening. Thanks!
Below is some additional debugging info.
slurmctld.log seems to show that even nodes which are currently allocated are being considered for the job. Here is the current allocation state:
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
normal*    up     5-00:00:00  76     alloc  c1-a[1-10],c1-b[1-10],c2-a[1-10],c2-b[1-6,9-10],c3-a[1-5,10],c3-b[1-2],c4-a[1-10],c5-a[5-10],c5-b[1-2],c7-a[1-10],c7-b[8-9]
normal*    up     5-00:00:00  64     idle   c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]
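To make the idle counts per switch easier to see, the idle NODELIST can be tallied with a short script. This is only a rough sketch: `expand_hostlist`, `idle_per_switch`, `idle_per_island` and `ISLAND_OF` are names made up for illustration, the leaf switch is assumed to be the part of the node name before the first hyphen, and the switch-to-island mapping is an assumption inferred from the switch names in the log (c1 and c2 under s0, everything else under s1), not copied from topology.conf.

```python
import re
from collections import Counter

# Idle NODELIST copied from the sinfo output.
IDLE = ("c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],"
        "c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]")

# ASSUMPTION: c1 and c2 hang off island s0, c3..c7 off island s1.
# This mapping is inferred from the log, not taken from topology.conf.
ISLAND_OF = {"c1": "s0", "c2": "s0", "c3": "s1", "c4": "s1",
             "c5": "s1", "c6": "s1", "c7": "s1"}


def expand_hostlist(hostlist):
    """Expand a Slurm-style hostlist like 'c2-b[7-8],c2-a1' into node names."""
    nodes = []
    # Each match is a prefix, optionally followed by one [...] range group,
    # so commas inside brackets do not split an entry.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", hostlist):
        m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", part)
        if not m:
            nodes.append(part)  # plain node name, no range to expand
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            hi = hi or lo  # single index such as '10'
            nodes.extend(f"{prefix}{i}" for i in range(int(lo), int(hi) + 1))
    return nodes


def idle_per_switch(hostlist):
    """Count nodes per leaf switch (assumed: name part before the first '-')."""
    return Counter(node.split("-")[0] for node in expand_hostlist(hostlist))


def idle_per_island(hostlist):
    """Sum the per-switch counts into islands using the assumed mapping."""
    per_island = Counter()
    for switch, count in idle_per_switch(hostlist).items():
        per_island[ISLAND_OF[switch]] += count
    return per_island


if __name__ == "__main__":
    print(dict(idle_per_switch(IDLE)))
    print(dict(idle_per_island(IDLE)))
```

If the assumed mapping is right, this gives 2 idle nodes on s0 and 62 on s1, so a 10-node job cannot fit on s0 but would fit on the second island.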
The stuck job is 82663:
JOBID  PARTITION  ST  TIME     NODES  NODELIST(REASON)
82663  normal     PD  0:00     10     (Priority)
82660  normal     PD  0:00     4      (Resources)
82661  normal     PD  0:00     4      (Priority)
82658  normal     R   38:29    4      c2-a[2-5]
82659  normal     R   38:29    4      c1-a[5-8]
82641  normal     R   1:26:19  10     c1-b[6-10],c2-b[2-5,9]
82640  normal     R   1:37:37  10     c1-b[4-5],c2-a[6-10],c2-b[1,6,10]
82637  normal     R   1:42:29  4      c3-a[5,10],c3-b[1-2]
82638  normal     R   1:42:29  4      c5-a[9-10],c5-b[1-2]
82635  normal     R   1:42:32  4      c3-a[1-4]
82636  normal     R   1:42:32  4      c5-a[5-8]
82342  normal     R   3:34:32  10     c7-a[1-10]
82339  normal     R   3:42:02  10     c4-a[1-10]
82337  normal     R   3:46:12  10     c1-a[1-4,9-10],c1-b[1-3],c2-a1
82662  test       R   6:33     2      c7-b[8-9]
And here is part of slurmctld.log:
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/enter
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: _topo_weight_log: Topo:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10] weight:511
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: Best nodes:c1-a[5-8],c2-b[7-8] node_cnt:6 cpu_cnt:384
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1 level=0 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2 level=0 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c3 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c4 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c5 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c6 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c7 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1_dummy level=1 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2_dummy level=1 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s0 level=2 nodes=6:c1-a[5-8],c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s1 level=1 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: common_job_test: no job_resources info for JobId=82661 rc=0