Hello FireWorks Users,
Background about my workflow
I am using FireWorks on LBNL's NERSC Cori system. I have a workflow that contains ~50,000 Fireworks with no interdependencies. Each Firework consists of two Firetasks:
1) A ScriptTask that executes a parallel program (currently ~3 min CPU time per execution)
2) A ScriptTask that runs a small Python script that processes some output
I am running on the 68-core KNL nodes, so I am using ‘rocket_launch: rlaunch multi 34’ in my qadapter script, with 2 threads in my ScriptTasks for the parallel program.
Hundreds of thousands of these Fireworks will eventually need to be run with an even more expensive version of the Firetask #1 software.
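For reference, this is how the node layout above works out. It is only a small sketch in plain Python: the variable names are mine, and the numbers are the ones from my setup.

```python
# Node layout described above: 68-core KNL node, 2 threads per ScriptTask,
# so one rocket occupies two cores.
CORES_PER_NODE = 68
THREADS_PER_TASK = 2

# Number of rockets that fit on one node; this is the value I pass
# to `rlaunch multi`.
rockets_per_node = CORES_PER_NODE // THREADS_PER_TASK
print(rockets_per_node)
```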
Issue
When I have only a few thousand of these Fireworks in my workflow, everything runs perfectly well. However, when I scale up to ~50,000, I begin to get many errors of the following type in the FW_job%.out file after ~10 Fireworks have completed successfully:
2018-07-05 15:13:52,532 INFO fw_id 63526 locked. Can't refresh!
I believe the Fireworks’ lock attempts are expiring because they wait longer than the WFLOCK_EXPIRATION_SECS limit from the config file without getting a lock on the DB: too many Fireworks finish around the same time, and the database updates are apparently taking too long. I have extended WFLOCK_EXPIRATION_SECS from 5 min to 10 min, but this did not solve the problem. In any case, I can’t afford even 5 minutes of downtime between Firework completions.
Setting WFLOCK_EXPIRATION_KILL to True does not solve my problem either, because then too many Fireworks force locks on the database.
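To put a rough number on how often Fireworks finish, here is a back-of-the-envelope estimate. It rests on assumptions of mine: a single node, ~3 min of wall time per Firework, and completions spread evenly over time (in practice they are probably bunched together, which would be worse). The function name is just for illustration.

```python
def mean_completion_gap_secs(concurrent_rockets, runtime_secs):
    """Average time between Firework completions when `concurrent_rockets`
    run side by side and each Firework takes `runtime_secs` to finish."""
    return runtime_secs / concurrent_rockets

# 34 rockets per node, ~3 minutes per Firework (assumed to be wall time).
gap = mean_completion_gap_secs(concurrent_rockets=34, runtime_secs=3 * 60)
print(f"one completion roughly every {gap:.1f} s per node")
```

If a database update that holds the lock takes longer than this gap on average, I would expect the waiting Fireworks to pile up, which is consistent with what I am seeing.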
I have tried the ‘lpad admin maintain --infinite --maintain_interval 60’ command to no avail. I have also poked around with the database profiler but I am not sure what to look at.
I have heard that adding indices to the LaunchPad may improve update speed. I know I can do this in the form LaunchPad(..., user_indices=['spec.parameter1'], ...), but I am not sure what ‘parameter1’ should be.
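To make the question concrete, this is the form I mean. It is only a sketch: the host, port, and database name are placeholders, ‘parameter1’ is a stand-in for whichever spec key should actually be indexed, and the helper function is my own, not part of FireWorks.

```python
def launchpad_kwargs(spec_keys):
    """Build LaunchPad keyword arguments that request an index on each
    of the given spec keys (hypothetical helper, not a FireWorks API)."""
    return {
        "host": "localhost",   # placeholder connection details
        "port": 27017,
        "name": "fireworks",
        "user_indices": [f"spec.{key}" for key in spec_keys],
    }

# 'parameter1' is a placeholder; I do not know which key belongs here.
kwargs = launchpad_kwargs(["parameter1"])
print(kwargs["user_indices"])

# With FireWorks installed and MongoDB reachable, this would be used as:
#   from fireworks import LaunchPad
#   lpad = LaunchPad(**kwargs)
```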
Is a fix possible or is my workflow already too large?
Please let me know if more information would be helpful.
I would greatly appreciate any advice.
Thank you.
--
You received this message because you are subscribed to the Google Groups "fireworkflows" group.