apparent SWIF2 resource throttling


andrew...@gmail.com

Feb 28, 2022, 2:42:00 PM
to GlueX Software Help
Dear software help,

My MCWrapper jobs are taking an inordinate amount of time to complete under SWIF2. A workflow that would take at most 8 hours on SWIF has now taken over a week. When I check individual jobs' status with 'squeue -u acschick', I find that only 5 of my jobs are ever running at a time, while the rest sit in a pending state; under the NODELIST (REASON) column it says (QOSMaxJobsPerUserLimit). However, when I run 'swif2 status -workflow <workflow>' it reports that I am allowed up to 500 concurrent jobs. Does this imply that I have 495 other jobs running somewhere? I looked on the scicomp page and it didn't seem like I did. I can also run additional batch jobs alongside the 5 that SWIF2 is running if I submit them directly with sbatch.
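For reference, the checks described above are roughly the following (the workflow name is a placeholder):

    squeue -u acschick                    # only 5 jobs RUNNING; the rest PENDING with reason (QOSMaxJobsPerUserLimit)
    swif2 status -workflow <workflow>     # reports a limit of 500 max concurrent jobs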

I have kept my job parameters the same as when I successfully ran with original SWIF. My MC.config file is here: 
/w/halld-scshelf2101/home/acschick/channels/epemmissprot/ee_MCWrapper/MC.config

Is there anything in my MC.config file that would make SWIF2 throttle my jobs like this?

Thanks,

-Andrew Schick
 

Sean Dobbs

Feb 28, 2022, 2:53:39 PM
to andrew...@gmail.com, GlueX Software Help
Hi Andrew,

My first guess would be that your PARTITION should be set to
"production" not "ifarm". My understanding is that the "ifarm"
partition is for interactive jobs, so these might be handled
differently.
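
If the variable in your MC.config is literally named PARTITION (an assumption on my part, going by your description), the change would be a one-liner:

    # MC.config -- batch partition used for SWIF2/Slurm submission (key name assumed)
    PARTITION=production    # was: ifarm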

Cheers,
Sean

Peter Pauli

Feb 28, 2022, 3:12:57 PM
to Sean Dobbs, andrew...@gmail.com, GlueX Software Help
The only documentation I could find regarding the partition variable is on


Pointing to

https://scicomp.jlab.org/scicomp/slurmJob/slurmInfo

I think I put ifarm into the example because I saw it in another script. I wasn't sure whether "production" referred to data production and was used for accounting as such.

I am happy to change that if anyone knows more about this.

Cheers,
Peter

Sent from phone


Sean Dobbs

Feb 28, 2022, 3:21:26 PM
to Peter Pauli, andrew...@gmail.com, GlueX Software Help
Looking at https://scicomp.jlab.org/docs/farm_slurm_account_partitions -

"The production partition, which is the default, should be used for
most jobs. The priority queue is for quick turnaround of short running
jobs. The ifarm partition is used for interactive access for one or
more cores (Note: an interactive session may not be available
instantly because an interactive slurm job competes with other
interactive jobs for available interactive computing resources)."

With the standard caveat that the documentation might not reflect the
current farm configuration.
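
For what it's worth, at the Slurm level the partition is selected with the standard option, e.g. (a generic sketch, not specific to MCWrapper):

    #SBATCH --partition=production           # as a batch-script directive
    sbatch --partition=production job.sh     # or directly on the command line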

---Sean

andrew...@gmail.com

Feb 28, 2022, 5:37:42 PM
to GlueX Software Help
Thanks Sean and Peter! Switching to the production partition has fixed my problems. 

Related:
There are 11 failed jobs (out of 200) on the current workflow that should complete if I simply retry them. Is there a way to use 'swif2 modify-jobs' to change the partition, so I don't have to keep waiting for them to finish on the ifarm partition? I'm guessing not, since I don't see partition listed as a parameter here: https://halldweb.jlab.org/wiki/index.php/HOWTO_Execute_a_Launch_using_NERSC
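
What I had in mind is roughly the following (just a sketch on my part; the retry-jobs verb and option names are assumptions, and the problem type is a placeholder):

    swif2 retry-jobs -workflow <workflow> -problems <problem-type>   # resubmit jobs that failed with the given problem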

-Andrew



Sean Dobbs

Feb 28, 2022, 5:40:38 PM
to andrew...@gmail.com, GlueX Software Help

Mark Ito

Mar 1, 2022, 8:52:19 AM
to gluex-s...@googlegroups.com
Sorry I am late with this. Ex post facto, there may have been a hint here:

https://halldweb.jlab.org/wiki/index.php/Transition_from_SWIF_to_SWIF2

Jon Zarling

Mar 1, 2022, 1:17:23 PM
to GlueX Software Help
Hi all,

Just wanted to chime in and mention one thing I ran into: the (default) 500 max concurrent jobs includes jobs that quit with any type of error. Until you retry/bless/cancel those jobs, they count towards the 500 job limit.

So if you submit, say, 20,000 jobs, the swif2 system will stop dispatching new jobs once you rack up 500 jobs that exit with problems. With 450 problem jobs you won't stop completely, but you'll definitely feel a squeeze. If you're confident enough in your jobs, you may want to start your workflow with an increased limit via `swif2 create [workflow] -max-concurrent [num]`. After a workflow is created, there is apparently no way to increase this maximum number of jobs.
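
For example, following the syntax above (the workflow name and the limit are just placeholders):

    swif2 create my_workflow -max-concurrent 2000   # start the workflow with a higher concurrency cap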


Cheers,
Jon

Alexander Austregesilo

Mar 1, 2022, 2:50:27 PM
to gluex-s...@googlegroups.com

Chris Larrieu (developer of swif2) told me this:

(1) At present, max-concurrent is really the only way to constrain the amount of data stored in /cache for your workflows. If you set it too high, you run the risk of flushing files from disk that will be needed again later.
(2) You can set max-concurrent for an extant workflow via 'swif2 run -max-concurrent <n>'. You don't need to pause the workflow to make this happen; 'swif2 run' can be used to tweak this and several other settings for a workflow that is already running.
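
For example (a sketch; the -workflow flag and the value are placeholders for your own workflow and limit):

    swif2 run -workflow <workflow> -max-concurrent 1000   # raise the cap on a workflow that is already running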
-- 
Alexander Austregesilo

Staff Scientist - Experimental Nuclear Physics
Thomas Jefferson National Accelerator Facility
Newport News, VA
aaus...@jlab.org
(757) 269-6982