[slurm-dev] Insane message length

Paul Edmon

Sep 29, 2013, 3:16:19 PM
to slurm-dev

[root@holy-slurm01 ~]# squeue
squeue: error: slurm_receive_msg: Insane message length
slurm_load_jobs error: Insane message length

[root@holy-slurm01 ~]# sdiag
*******************************************************
sdiag output at Sun Sep 29 15:12:13 2013
Data since Sat Sep 28 20:00:01 2013
*******************************************************
Server thread count: 3
Agent queue size: 0

Jobs submitted: 21797
Jobs started: 12030
Jobs completed: 12209
Jobs canceled: 70
Jobs failed: 5

Main schedule statistics (microseconds):
Last cycle: 9207042
Max cycle: 10088674
Total cycles: 1563
Mean cycle: 17859
Mean depth cycle: 12138
Cycles per minute: 1
Last queue length: 496816

Backfilling stats
Total backfilled jobs (since last slurm start): 9325
Total backfilled jobs (since last stats cycle start): 4952
Total cycles: 84
Last cycle when: Sun Sep 29 15:06:15 2013
Last cycle: 2555321
Max cycle: 27633565
Mean cycle: 6115033
Last depth cycle: 3
Last depth cycle (try sched): 2
Depth Mean: 278
Depth Mean (try depth): 62
Last queue length: 496814
Queue length mean: 100807

I'm guessing this is because there are roughly 500,000 jobs in the
queue, which is right at our upper limit of 500,000 (MaxJobCount). Is
there anything that can be done about this? Commands that query jobs,
such as squeue and scancel, are not working, so I can't tell who
submitted this many jobs.

-Paul Edmon-

Morris Jette

Sep 29, 2013, 3:26:59 PM
to slurm-dev
Here are some options:
1. Use scontrol to set the queue state to drain and prevent more jobs from being submitted
2. Lower the job limit to block new job submissions
3. Increase the max message size limit and rebuild Slurm
4. Check accounting records for the rogue user
5. Long term, set user job limits and train users to run multiple steps in fewer jobs
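
As a rough sketch of options 1 and 4 above: the partition name passed to `drain_partition` is whatever your site uses, and `top_submitters` is a hypothetical helper (not a Slurm tool) that simply counts user names read from standard input. The `sacct` pipeline in the comment assumes the queued jobs are already recorded in accounting.

```shell
#!/bin/sh
# Option 1: stop new submissions to one partition.
# SCONTROL can be overridden (e.g. "echo scontrol") to preview the command.
drain_partition() {
    ${SCONTROL:-scontrol} update PartitionName="$1" State=DRAIN
}

# Option 4: count how often each user name appears on stdin, busiest first.
# Prints "user count" per line.
top_submitters() {
    sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Intended use on a live system (not run here; the start date is an example):
#   sacct --allusers --noheader --allocations --starttime=2013-09-28 \
#       --format=User | top_submitters | head
```

Keeping the counting step separate from `sacct` means the same helper works on any list of user names, for example one extracted from the slurmctld log.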

Paul Edmon

Sep 29, 2013, 3:34:00 PM
to slurm-dev
Where is the max message size limit set in SLURM?  That's probably the best route at this point.

-Paul Edmon-

Moe Jette

Sep 29, 2013, 3:40:57 PM
to slurm-dev

See MAX_MSG_SIZE in src/common/slurm_protocol_socket_implementation.c

That should get you going again, but setting per-user job limits is
strongly recommended longer term. That should prevent a rogue script
from bringing the system to its knees.

Moe
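
Before editing, it can help to see how the constant is currently defined. A minimal sketch, assuming an unpacked Slurm source tree whose top-level directory is passed as the first argument; `show_max_msg_size` is an illustrative name.

```shell
#!/bin/sh
# Print every line (with line number) that mentions MAX_MSG_SIZE in the
# file named above. $1 is the top of the Slurm source tree.
show_max_msg_size() {
    grep -n 'MAX_MSG_SIZE' "$1/src/common/slurm_protocol_socket_implementation.c"
}
```

After changing the value, Slurm has to be rebuilt and reinstalled for the new limit to take effect.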

Paul Edmon

Sep 29, 2013, 5:39:06 PM
to slurm-dev

Yeah, that's why we set the 500,000 job limit. Though I didn't
anticipate the insane message length issue.

If I drop the MaxJobCount will it purge jobs to get down to that? Or
will it just prohibit new jobs?

I'm assuming that this rebuild would need to be pushed out everywhere as
well? Both clients and master?

-Paul Edmon-

Moe Jette

Sep 29, 2013, 5:43:01 PM
to slurm-dev

Quoting Paul Edmon <ped...@cfa.harvard.edu>:

>
> Yeah, that's why we set the 500,000 job limit. Though I didn't
> anticipate the insane message length issue.

I'd recommend per-user job limits too.


> If I drop the MaxJobCount will it purge jobs to get down to that? Or
> will it just prohibit new jobs?

It will only prohibit new jobs.


> I'm assuming that this rebuild would need to be pushed out
> everywhere as well? Both clients and master?

Only needed on the clients.

Paul Edmon

Sep 29, 2013, 5:53:50 PM
to slurm-dev

That's good to hear. Is there an option to do it per user? I didn't
see one in the slurm.conf. I may have missed it.

-Paul Edmon-

Morris Jette

Sep 29, 2013, 5:58:11 PM
to slurm-dev
That goes into the Slurm database. There are about 20 different limits available by user or group. See the resource limits web page.
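
One way to see which users currently have no submit limit in the database is to filter an association listing. This is only a sketch: it assumes pipe-separated `Account|User|MaxSubmitJobs` lines, which is the layout I would expect from `sacctmgr --parsable2` with that format string, and `users_without_submit_limit` is a hypothetical helper name.

```shell
#!/bin/sh
# Print users whose association carries no MaxSubmitJobs value.
# Expects "Account|User|MaxSubmitJobs" lines on stdin (assumed layout).
# Lines with an empty User field are account-level rows and are skipped.
users_without_submit_limit() {
    awk -F'|' '$2 != "" && $3 == "" { print $2 }' | sort -u
}

# Intended use on a live system (not run here):
#   sacctmgr --noheader --parsable2 show associations \
#       format=Account,User,MaxSubmitJobs | users_without_submit_limit
```

A user listed here may still inherit a limit from a parent account, so treat the output as a list of associations with no explicit value.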

Paul Edmon

Sep 29, 2013, 6:00:16 PM
to slurm-dev
Ah, okay.  I figured that might be the case.

-Paul Edmon-

Paul Edmon

Sep 29, 2013, 6:39:14 PM
to slurm-dev
Increasing MAX_MSG_SIZE to 1024*1024*1024 worked. Is there any reason this couldn't be pushed back into the main tree? Or do you guys want to keep the smaller message size?

-Paul Edmon-
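
For scale, the value reported above works out to 1 GiB. A small shell sketch of the arithmetic; `bytes_to_mib` is just an illustrative helper.

```shell
#!/bin/sh
# Convert a byte count to whole MiB using shell integer arithmetic.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

new_limit=$((1024 * 1024 * 1024))   # the value reported in the message above
echo "MAX_MSG_SIZE = $new_limit bytes = $(bytes_to_mib "$new_limit") MiB"
# prints: MAX_MSG_SIZE = 1073741824 bytes = 1024 MiB
```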

Paul Edmon

Sep 29, 2013, 6:44:15 PM
to slurm-dev
By the way, it looks like this was caused by a user submitting 47 job arrays, each with 10,000 tasks, which produced 470,000 jobs. Is there a quick way to cancel job arrays? My guess is that canceling the primary job ID would take care of all the tasks in that array?

-Paul Edmon-

Morris Jette

Sep 29, 2013, 8:20:15 PM
to slurm-dev
Just cancel the primary job ID.
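
With 47 arrays to remove, a small loop saves typing. A minimal sketch, assuming the primary job IDs have been collected one per line; `cancel_arrays` is a hypothetical helper, and the IDs in the usage note are made up.

```shell
#!/bin/sh
# Read one primary (array) job ID per line and cancel each one.
# SCANCEL can be overridden (e.g. "echo scancel") to preview the commands.
cancel_arrays() {
    while read -r jobid; do
        [ -n "$jobid" ] || continue   # skip blank lines
        ${SCANCEL:-scancel} "$jobid"
    done
}
```

For example, `printf '%s\n' 1234 1250 | cancel_arrays` would cancel both arrays and all of their tasks. If every job belonging to the user should go, `scancel --user=NAME` is the blunter alternative.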

Paul Edmon

Sep 29, 2013, 9:03:22 PM
to slurm-dev
Thanks.  That's what I suspected.

-Paul Edmon-

Paul Edmon

Sep 30, 2013, 11:25:32 AM
to slurm-dev
Quick question: is there an easy one-line command to set MaxSubmitJobs for every user? Let's say I want

MaxSubmitJobs=50,000

and want to apply it everywhere in the DB. Is there a way to do that, or do you have to walk every single user in the DB?

-Paul Edmon-

Moe Jette

Sep 30, 2013, 11:30:16 AM
to slurm-dev

Just set it on the root bank and it will automatically apply to every
child bank and user (unless another value is explicitly set for some
sub-tree).
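
A sketch of what that could look like with sacctmgr, assuming the "root bank" corresponds to the account named `root` and that limit enforcement is switched on in slurm.conf (AccountingStorageEnforce including `limits`). The function name is illustrative, and the limit is written without a thousands separator.

```shell
#!/bin/sh
# Set MaxSubmitJobs on the top-level account so that child accounts and
# users inherit it unless they carry their own value. $1 is the limit.
# SACCTMGR can be overridden (e.g. "echo sacctmgr") to preview the command.
set_root_submit_limit() {
    ${SACCTMGR:-sacctmgr} --immediate modify account where name=root \
        set MaxSubmitJobs="$1"
}
```

Calling `set_root_submit_limit 50000` would apply the 50,000-job cap discussed above. Running it first with `SACCTMGR="echo sacctmgr"` shows the exact command without touching the database.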