Hello Anubhav,

thanks for your time looking at these issues. In parallel with your debugging, I looked at the recover_offline call by just running the commands step by step for a particular launch that had been marked as RUNNING again after being FIZZLED. Here I will illustrate with database screenshots what I noticed.

As described before, "updated_on" is set to the current date every time "lpad recover_offline -w PATH_TO_THE_APPROPRIATE_WORKER_FILE" is called.

Running the first few lines of the recovery code,
https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1677-L1684

    m_launch = self.get_launch_by_id(launch_id)
    try:
        self.m_logger.debug("RECOVERING fw_id: {}".format(m_launch.fw_id))
        # look for ping file - update the Firework if this is the case
        ping_loc = os.path.join(m_launch.launch_dir, "FW_ping.json")
        if os.path.exists(ping_loc):
            ping_dict = loadfn(ping_loc)
            self.ping_launch(launch_id, ptime=ping_dict['ping_time'])

on the ping file with content

    {"ping_time": "2019-07-28T12:54:43.213215"}

modifies the database entry as expected (see screenshot).

After the first part of the few lines pointed out by you,
https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1690-L1697

    offline_data = loadfn(offline_loc)
    if 'started_on' in offline_data:
        m_launch.state = 'RUNNING'
        for s in m_launch.state_history:
            if s['state'] == 'RUNNING':
                s['created_on'] = reconstitute_dates(offline_data['started_on'])
        l = self.launches.find_one_and_replace({'launch_id': m_launch.launch_id},
                                               m_launch.to_db_dict(), upsert=True)

the state history is still consistent. The Firework itself has not been touched and still looks the same as before.

After
https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1698-L1704

    fw_id = l['fw_id']
    f = self.fireworks.find_one_and_update({'fw_id': fw_id},
                                           {'$set': {'state': 'RUNNING',
                                                     'updated_on': datetime.datetime.utcnow()}})

the Firework's "updated_on" is set to the current time. That is what you described. However, I do not yet understand where the state setter you mention comes into play; I will have to look at that tomorrow. The launch's state_history is still consistent up to this point.

A few lines below,
https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1708-L1711

    if 'checkpoint' in offline_data:
        m_launch.touch_history(checkpoint=offline_data['checkpoint'])
        self.launches.find_one_and_replace({'launch_id': m_launch.launch_id},
                                           m_launch.to_db_dict(), upsert=True)

"touch_history" is called again, this time, however, without any ptime argument, and thus overwrites the previous change with the current time. Since the FW_offline.json contains a non-empty "checkpoint" entry,

    {"launch_id": 12392, "started_on": "2019-07-24T12:54:41.031150", "checkpoint": {"_task_n": 0, "_all_stored_data": {}, "_all_update_spec": {}, "_all_mod_spec": []}}

these lines are executed. That is how the current time enters the state history.

What is the actual purpose of a "checkpoint"? There is not much documentation on it.

Find the test protocol attached (Jupyter notebook and HTML). In the next few days, I will address the other points in your post.

Best,
Johannes
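PS: If that unconditional "touch_history" call is indeed what stamps the current time, the change I have in mind would be to pass the recovered ping time along with the checkpoint. A minimal sketch (untested; I am assuming that the first positional argument of "touch_history" is the update time, analogous to the ptime that "ping_launch" forwards, and the import locations below are my guess):

    import os
    from monty.serialization import loadfn  # assumed import location
    from fireworks.utilities.fw_serializers import reconstitute_dates  # assumed import location

    def touch_history_with_ping_time(m_launch, offline_data):
        """Sketch: attach the recovered checkpoint to the launch's state history
        without overwriting 'updated_on' with the current time."""
        ptime = None
        ping_loc = os.path.join(m_launch.launch_dir, "FW_ping.json")
        if os.path.exists(ping_loc):
            # reuse the last ping time recorded in FW_ping.json
            ptime = reconstitute_dates(loadfn(ping_loc)["ping_time"])
        if "checkpoint" in offline_data:
            # with ptime=None, touch_history would fall back to the current
            # time, i.e. the present behavior
            m_launch.touch_history(ptime, checkpoint=offline_data["checkpoint"])

recover_offline could then replace the launch document exactly as before; repeated recovery runs would simply no longer push state_history[*].updated_on forward for a dead job, so detect_lostruns would see the true last ping.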
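PPS: Regarding issue 2, until "detect_lostruns --fizzle / --rerun" forgets the offline run by itself, as proposed in my earlier message quoted below, one could script the fizzle-and-forget manually. Again only a sketch, assuming detect_lostruns() returns the triple (lost_launch_ids, lost_fw_ids, inconsistent_fw_ids) and that forget_offline() accepts a Firework ID with launch_mode=False, as in the interactive workaround described at the bottom of this thread:

    from fireworks.core.launchpad import LaunchPad

    lp = LaunchPad.from_file("fireworks_mongodb_auth.yaml")

    # mark runs that have not pinged for more than four days as FIZZLED ...
    lost_launch_ids, lost_fw_ids, inconsistent_fw_ids = lp.detect_lostruns(
        expiration_secs=4 * 24 * 3600, fizzle=True)

    # ... and immediately "forget" their FW_offline.json bookkeeping so that
    # the recover_offline loop on the login node cannot flip them back to
    # RUNNING again
    for fw_id in lost_fw_ids:
        lp.forget_offline(fw_id, launch_mode=False)

Running this instead of the plain CLI call would keep the continuously running recover_offline loop from reviving the freshly fizzled runs.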
On Tuesday, 6 August 2019 at 19:19:27 UTC+2, Anubhav Jain wrote:

> Hi Johannes,
>
> To follow up again, for issue #1 above I think I found the offending line:
>
> https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1692
>
> This line updates the state of the launch to "RUNNING". However, the "setter" of the state in the Launch object automatically touches the history with the current time any time the state is modified:
>
> https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/firework.py#L259
>
> I think that is what is causing the problem.
>
> It's been a while (i.e., years) since I've wrapped my head around the offline code. However, perhaps based on this you can suggest a solution? Let me know if not. If that's the case, I might ask you for some more information to help design something.
>
> Best,
> Anubhav
>
> On Tuesday, August 6, 2019 at 10:12:23 AM UTC-7, Anubhav Jain wrote:
>>
>> Hi Johannes,
>>
>> Going back to two messages up.
>>
>> For issue #1:
>> - It is good / correct that the type of the updated_on is String.
>> - The line you indicated as problematic should be OK, I think. This line is updating the "updated_on" field of the *root* Launch document. This should be different from the "updated_on" in the state_history[1] field. The key is to make sure that "state_history[{x}].updated_on" contains the correct timestamp (where state_history[{x}] corresponds to the entry for the "RUNNING" state).
>> - I am actually quite confused as to where the origin of the problem is. I would think that state_history[1] would be updated in this line of the code: https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1684
>> - But the line of code above seems to respect updating the state_history.updated_on as the "ping_time" of FW_ping.json, which looks correct.
>>
>> So, unfortunately, I think some more debugging is needed, e.g. to dig into the recover_offline() code and see where in the process the "state_history[{x}].updated_on" field gets corrupted to be the current time and not the ping time.
>>
>> Issue 2:
>> Your suggestion at least seems better than the current situation. Do you want to try it out and submit a pull request if it works?
>>
>> I have not been able to read the most recent message (about LAMMPS, allow_fizzled_parents, etc.) in detail. However, if you were to fix issue #2 as per above, would it also fix this issue? Or is it separate?
>>
>> Thanks for your help in reporting / debugging this.
>>
>> On Tuesday, August 6, 2019 at 7:04:01 AM UTC-7, Johannes Hörmann wrote:
>>>
>>> A related issue:
>>>
>>> In this blurry workflow snippet the following happens:
>>>
>>> An initial Firework (1a) runs LAMMPS until the walltime expires on an HPC resource. It is then marked as "fizzled" with a suitable "lpad detect_lostruns --fizzle", as described in the first post in this thread. A subsequent recovery Firework (1b) with {"spec._allow_fizzled_parents": true} recovers the necessary restart files and automatically appends a suitable restart run (2a) with another subsequent recovery Firework (2b) as well as some post-processing Fireworks (1c).
>>>
>>> This recovery loop then repeats (2c, 3a, 3b, ...) until the LAMMPS run finishes successfully.
>>>
>>> What happened in the above example is that, due to *issue 2* described in the previous posts here, Fireworks 1a and 2a were marked as "running" again after they had been marked as "fizzled" with "detect_lostruns" and their "allow_fizzled_parents" children 1b and 2b had started to run. The dangerous point here is that if another "lpad detect_lostruns --fizzle" is applied without carefully discriminating between the "generations" of Fireworks in the tree, 1a will be marked as fizzled again, *and all its children, grandchildren, etc. will lose the information on their current state and be marked as "waiting" again*, with expensive computations already finished (i.e. 2a), currently running (i.e. 3a), or queued on the HPC resource 'dropping out' of the workflow management framework, without simple means to recover them.
>>>
>>> Here, a way to fizzle these lost runs 1a, 2a again properly *without affecting the state of their children* is necessary to keep the workflow information in the database coherent with what is actually present on the computing resources and file systems.
>>>
>>> Best regards,
>>>
>>> Johannes
>>>
>>> On Wednesday, 24 July 2019 at 14:26:34 UTC+2, Johannes Hörmann wrote:
>>>>
>>>> Hello Anubhav,
>>>>
>>>> thanks for the answer. Finally, I found some opportunity & time to do as suggested on a job that actually got killed a few days ago after exceeding the maximum walltime of 4 days.
>>>>
>>>> *Issue 1:*
>>>>
>>>> Here is the MOAB job log (/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/NEMO_AU_111_r__25_An.e6012657):
>>>>
>>>> + cd /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683
>>>> + rlaunch -w /home/fr/fr_fr/fr_jh1130/.fireworks/nemo_queue_worker.yaml -l /home/fr/fr_fr/fr_jh1130/.fireworks/fireworks_mongodb_auth.yaml singleshot --offline --fw_id 15514
>>>> =>> PBS: job killed: walltime 345642 exceeded limit 345600
>>>>
>>>> The FW ID is 15514, and the content of /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/FW_ping.json is
>>>>
>>>> {"ping_time": "2019-07-17T22:54:51.000760"}
>>>>
>>>> That being the last update agrees very well with the maximum walltime. /work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683/FW_offline.json shows that the run started exactly four days earlier:
>>>>
>>>> {"launch_id": 11789, "started_on": "2019-07-13T22:54:49.124427", "checkpoint": {"_task_n": 0, "_all_stored_data": {}, "_all_update_spec": {}, "_all_mod_spec": []}}
>>>>
>>>> A manual check shows that no other files in this launch dir have been touched afterwards:
>>>>
>>>> $ ls -lht
>>>> total 8,0G
>>>> -rw------- 1 fr_jh1130 fr_fr 28K 18. Jul 00:55 NEMO_AU_111_r__25_An.e6012657
>>>> -rw------- 1 fr_jh1130 fr_fr 43 18. Jul 00:54 FW_ping.json
>>>> -rw------- 1 fr_jh1130 fr_fr 663K 18. Jul 00:52 log.lammps
>>>> -rw------- 1 fr_jh1130 fr_fr 83M 18. Jul 00:52 default.mpiio.restart1
>>>> ...
>>>>
>>>> However, the updated state in the "launches" collection just corresponds to the current time (see state_history[1].updated_on).
>>>>
>>>> Am I correct in assuming that the repeatedly running lpad recover_offline updates this time after reading FW_offline.json?
>>>> That is what I read from the recover_offline code:
>>>> https://github.com/materialsproject/fireworks/blob/df8374bc3358a826eaa258de333ff6a46d4f54fa/fireworks/core/launchpad.py#L1728-L1730
>>>>
>>>> As you see, the type is "String", not a datetime type.
>>>>
>>>> Would that be the expected behavior? Or should lpad recover_offline leave the updated_on key untouched if no update has been recorded in FW_ping.json?
>>>>
>>>> *Issue 2*
>>>>
>>>> Wouldn't the quick solution be to always "forget" the offline run via the already existing "lpad.forget_offline" method internally when calling "lpad detect_lostruns --fizzle / --rerun"? I don't see any situation where one would want to keep an offline run already explicitly identified as "dead" available to the "recover_offline" functionality.
>>>>
>>>> Best regards,
>>>>
>>>> Johannes
>>>>
>>>> For completeness, the corresponding lpad get_fws output:
>>>>
>>>> $ lpad get_fws -i 15514 -d all
>>>> {
>>>>   "spec": {
>>>>     "_category": "nemo_queue_offline",
>>>>     "_files_in": {
>>>>       "coeff_file": "coeff.input",
>>>>       "data_file": "datafile.lammps",
>>>>       "input_header": "lmp_header.input",
>>>>       "input_production": "lmp_production.input"
>>>>     },
>>>>     "_files_out": {
>>>>       "ave_file": "thermo_ave.out",
>>>>       "data_file": "default.lammps",
>>>>       "log_file": "log.lammps",
>>>>       "ndx_file": "groups.ndx",
>>>>       "traj_file": "default.nc"
>>>>     },
>>>>     "_queueadapter": {
>>>>       "nodes": 16,
>>>>       "ppn": 20,
>>>>       "queue": null,
>>>>       "walltime": "96:00:00"
>>>>     },
>>>>     "_tasks": [
>>>>       {
>>>>         "_fw_name": "CmdTask",
>>>>         "cmd": "lmp",
>>>>         "fizzle_bad_rc": true,
>>>>         "opt": [
>>>>           "-in lmp_production.input",
>>>>           "-v coeffInfile coeff.input",
>>>>           "-v coeffOutfile coeff.input.transient",
>>>>           "-v compute_group_properties 1",
>>>>           "-v compute_interactions 0",
>>>>           "-v dataFile datafile.lammps",
>>>>           "-v dilate_solution_only 1",
>>>>           "-v freeze_substrate 0",
>>>>           "-v freeze_substrate_layer 14.0",
>>>>           "-v has_indenter 1",
>>>>           "-v rigid_indenter_core_radius 12.0",
>>>>           "-v constant_indenter_velocity -1e-06",
>>>>           "-v mpiio 1",
>>>>           "-v netcdf_frequency 50000",
>>>>           "-v productionSteps 17500000",
>>>>           "-v pressureP 1.0",
>>>>           "-v pressurize_z_only 1",
>>>>           "-v pressurize_solution_only 0",
>>>>           "-v reinitialize_velocities 0",
>>>>           "-v read_groups_from_file 0",
>>>>           "-v rigid_indenter 0",
>>>>           "-v restrained_indenter 0",
>>>>           "-v restart_frequency 50000",
>>>>           "-v store_forces 1",
>>>>           "-v surfactant_name SDS",
>>>>           "-v temperatureT 298.0",
>>>>           "-v temper_solid_only 1",
>>>>           "-v temper_substrate_only 0",
>>>>           "-v thermo_frequency 5000",
>>>>           "-v thermo_average_frequency 5000",
>>>>           "-v use_barostat 0",
>>>>           "-v use_berendsen_bstat 0",
>>>>           "-v use_dpd_tstat 1",
>>>>           "-v use_eam 1",
>>>>           "-v use_ewald 1",
>>>>           "-v write_coeff 1",
>>>>           "-v write_coeff_to_datafile 0",
>>>>           "-v write_groups_to_file 1",
>>>>           "-v coulomb_cutoff 8.0",
>>>>           "-v ewald_accuracy 0.0001",
>>>>           "-v neigh_delay 2",
>>>>           "-v neigh_every 1",
>>>>           "-v neigh_check 1",
>>>>           "-v skin_distance 3.0"
>>>>         ],
>>>>         "stderr_file": "std.err",
>>>>         "stdout_file": "std.out",
>>>>         "store_stderr": true,
>>>>         "store_stdout": true,
>>>>         "use_shell": true
>>>>       }
>>>>     ],
>>>>     "_trackers": [
>>>>       {
>>>>         "filename": "log.lammps",
>>>>         "nlines": 25
>>>>       }
>>>>     ],
>>>>     "metadata": {
>>>>       "barostat_damping": 10000.0,
>>>>       "ci_preassembly": "at polar heads",
>>>>       "compute_group_properties": 1,
>>>>       "constant_indenter_velocity": -1e-06,
>>>>       "constant_indenter_velocity_unit": "Ang_per_fs",
>>>>       "coulomb_cutoff": 8.0,
>>>>       "coulomb_cutoff_unit": "Ang",
>>>>       "counterion": "NA",
>>>>       "ewald_accuracy": 0.0001,
>>>>       "force_field": {
>>>>         "solution_solution": "charmm36-jul2017",
>>>>         "substrate_solution": "interface_ff_1_5",
>>>>         "substrate_substrate": "Au-Grochola-JCP05-units-real.eam.alloy"
>>>>       },
>>>>       "frozen_sb_layer_thickness": 14.0,
>>>>       "frozen_sb_layer_thickness_unit": "Ang",
>>>>       "indenter": {
>>>>         "crystal_plane": 111,
>>>>         "equilibration_time_span": 50,
>>>>         "equilibration_time_span_unit": "ps",
>>>>         "initial_radius": 25,
>>>>         "initial_radius_unit": "Ang",
>>>>         "initial_shape": "sphere",
>>>>         "lammps_units": "real",
>>>>         "melting_final_temperature": 1800,
>>>>         "melting_time_span": 10,
>>>>         "melting_time_span_unit": "ns",
>>>>         "minimization_ftol": 1e-05,
>>>>         "minimization_ftol_unit": "kcal",
>>>>         "natoms": 3873,
>>>>         "orientation": "111 facet facing negative z",
>>>>         "potential": "Au-Grochola-JCP05-units-real.eam.alloy",
>>>>         "quenching_time_span": 100,
>>>>         "quenching_time_span_unit": "ns",
>>>>         "quenching_time_step": 5,
>>>>         "quenching_time_step_unit": "fs",
>>>>         "substrate": "AU",
>>>>         "temperature": 298,
>>>>         "temperature_unit": "K",
>>>>         "time_step": 2,
>>>>         "time_step_unit": "fs",
>>>>         "type": "AFM tip"
>>>>       },
>>>>       "langevin_damping": 1000.0,
>>>>       "machine": "NEMO",
>>>>       "mode": "TRIAL",
>>>>       "mpiio": 1,
>>>>       "neigh_check": 1,
>>>>       "neigh_delay": 2,
>>>>       "neigh_every": 1,
>>>>       "netcdf_frequency": 50000,
>>>>       "pbc": 111,
>>>>       "pressure": 1,
>>>>       "pressure_unit": "atm",
>>>>       "production_steps": 17500000,
>>>>       "restrained_sb_layer_thickness": null,
>>>>       "restrained_sb_layer_thickness_unit": null,
>>>>       "sb_area": 2.25e-16,
>>>>       "sb_area_unit": "m^2",
>>>>       "sb_base_length": 150,
>>>>       "sb_base_length_unit": "Ang",
>>>>       "sb_crystal_plane": 111,
>>>>       "sb_crystal_plane_multiples": [
>>>>         52,
>>>>         90,
>>>>         63
>>>>       ],
>>>>       "sb_in_dist": 30.0,
>>>>       "sb_in_dist_unit": "Ang",
>>>>       "sb_lattice_constant": 4.075,
>>>>       "sb_lattice_constant_unit": "Ang",
>>>>       "sb_measures": [
>>>>         1.49836e-08,
>>>>         1.49725e-08,
>>>>         1.47828e-08
>>>>       ],
>>>>       "sb_measures_unit": "m",
>>>>       "sb_multiples": [
>>>>         52,
>>>>         30,
>>>>         21
>>>>       ],
>>>>       "sb_name": "AU_111_150Ang_cube",
>>>>       "sb_natoms": 196560,
>>>>       "sb_normal": 2,
>>>>       "sb_shape": "cube",
>>>>       "sb_thickness": 1.5e-08,
>>>>       "sb_thickness_unit": "m",
>>>>       "sb_volume": 3.375e-23,
>>>>       "sb_volume_unit": "m^3",
>>>>       "sf_concentration": 0.0068,
>>>>       "sf_concentration_unit": "M",
>>>>       "sf_nmolecules": 646,
>>>>       "sf_preassembly": "monolayer",
>>>>       "skin_distance": 3.0,
>>>>       "skin_distance_unit": "Ang",
>>>>       "solvent": "H2O",
>>>>       "state": "production",
>>>>       "step": "production_nemo_trial_with_dpd_tstat",
>>>>       "substrate": "AU",
>>>>       "surfactant": "SDS",
>>>>       "sv_density": 997,
>>>>       "sv_density_unit": "kg m^-3",
>>>>       "sv_preassembly": "random",
>>>>       "system_name": "646_SDS_monolayer_on_AU_111_150Ang_cube_with_AU_111_r_25Ang_indenter_at_-1e-06_Ang_per_fs_approach_velocity",
>>>>       "temperature": 298,
>>>>       "temperature_unit": "K",
>>>>       "thermo_average_frequency": 5000,
>>>>       "thermo_frequency": 5000,
>>>>       "type": "AFM",
>>>>       "use_barostat": 0,
>>>>       "use_dpd_tstat": 1,
>>>>       "use_eam": 1,
>>>>       "use_ewald": 1,
>>>>       "workflow_creation_date": "2019-07-13-22:53"
>>>>     },
>>>>     "_files_prev": {
>>>>       "coeff_file": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/coeff_hybrid.input",
>>>>       "input_header": "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/lmp_header.input",
"input_production": >>>> "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-53-59-844042/lmp_production.input", >>>> "data_file": >>>> "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-54-00-115840/default.lammps", >>>> "ndx_file": >>>> "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/launcher_2019-07-13-22-54-00-115840/groups.ndx" >>>> } >>>> }, >>>> "fw_id": 15514, >>>> "created_on": "2019-07-13T22:53:09.213733", >>>> "updated_on": "2019-07-24T12:01:26.321000", >>>> "launches": [ >>>> { >>>> "fworker": { >>>> "name": "nemo_queue_worker", >>>> "category": [ >>>> "nemo_queue_offline" >>>> ], >>>> "query": "{}", >>>> "env": { >>>> "lmp": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> lammps/16Mar18-gnu-7.3-openmpi-3.1-colvars-09Feb19; mpirun >>>> ${MPIRUN_OPTIONS} lmp", >>>> "exchange_substrate.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools >>>> ovitos; exchange_substrate.py", >>>> "extract_bb.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/12Mar19-python-2.7; extract_bb.py", >>>> >>>> "extract_indenter_nonindenter_forces_from_netcdf.py": "module purge; module >>>> use /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/11Jul19; extract_indenter_nonindenter_forces_from_netcdf.py", >>>> "extract_property.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load mdtools >>>> ovitos; extract_property.py", >>>> "extract_thermo.sh": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools; extract_thermo.sh", >>>> "join_thermo.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools; join_thermo.py", >>>> "merge.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/12Mar19-python-2.7; merge.py", >>>> "ncfilter.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/11Jul19; mpirun ${MPIRUN_OPTIONS} ncfilter.py", >>>> "ncjoin.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools; ncjoin.py", >>>> "pizza.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/12Mar19-python-2.7; pizza.py", >>>> "strip_comments.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools/12Mar19-python-2.7; strip_comments.py", >>>> "to_hybrid.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools; to_hybrid.py", >>>> "vmd": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> vmd/1.9.3-text; vmd", >>>> "smbsync.py": "module purge; module use >>>> /work/ws/nemo/fr_lp1029-IMTEK_SIMULATION-0/modulefiles; module load >>>> mdtools; smbsync.py" >>>> } >>>> }, >>>> "fw_id": 15514, >>>> "launch_dir": >>>> "/work/ws/nemo/fr_jh1130-fw_ws_20190311-0/launchpad/block_2019-06-30-13-07-21-802466/launcher_2019-07-13-22-54-14-628683", >>>> "host": "login2.nemo.privat", >>>> "ip": "10.16.44.2", >>>> "trackers": [ >>>> { >>>> "filename": "log.lammps", >>>> "nlines": 25, >>>> "allow_zipped": false 
>>>>         }
>>>>       ],
>>>>       "action": null,
>>>>       "state": "RUNNING",
>>>>       "state_history": [
>>>>         {
>>>>           "state": "RESERVED",
>>>>           "created_on": "2019-07-13T22:54:14.596648",
>>>>           "updated_on": "2019-07-13T22:54:14.596655",
>>>>           "reservation_id": "6012657"
>>>>         },
>>>>         {
>>>>           "state": "RUNNING",
>>>>           "created_on": "2019-07-13T22:54:49.124427",
>>>>           "updated_on": "2019-07-24T12:01:26.363237",
>>>>           "checkpoint": {
>>>>             "_task_n": 0,
>>>>             "_all_stored_data": {},
>>>>             "_all_update_spec": {},
>>>>             "_all_mod_spec": []
>>>>           }
>>>>         }
>>>>       ],
>>>>       "launch_id": 11789
>>>>     }
>>>>   ],
>>>>   "state": "RUNNING",
>>>>   "name": "NEMO, AU 111 r = 25 Ang indenter at -1e-06 Ang_per_fs approach velocity on 646 SDS monolayer on AU 111 150 Ang cube substrate, LAMMPS production"
>>>> }
>>>>
>>>> On Wednesday, 5 June 2019 at 03:02:51 UTC+2, Anubhav Jain wrote:
>>>>>
>>>>> Hi Johannes,
>>>>>
>>>>> Thanks for reporting these issues. We do not run offline mode ourselves, so sometimes there are issues that we are unaware of.
>>>>>
>>>>> Regarding issue 1:
>>>>>
>>>>> For jobs that are stuck in the RUNNING state, the crucial thing that needs to be correct in order for "detect_lostruns" to work properly is the timestamp on the last ping of the launch. Could you try to check the following (let me know if you need help with this process):
>>>>>
>>>>> 1. Identify a job that has this problem, and where you've already run the recover_offline() command on it.
>>>>> 2. Go to the directory where that job ran.
>>>>> 3. There should be a file called FW_ping.json. Look inside and note down the "ping_time" of that file.
>>>>> 4. There should also be a file called FW_offline.json. Look inside and note down the "launch_id" in that file.
>>>>> 5. Next, we want to check the database for consistency. You want to search your "launches" collection (either through MongoDB itself, or through pymongo, or through the "launches" collection in the LaunchPad object) for the launch id that you noted in #4. In the document for that launch id, you should see a key called "state_history". In there should be an entry where you see "updated_on". See screenshot for example ...
>>>>>
>>>>> [image: Screen Shot 2019-06-04 at 5.53.11 PM.png]
>>>>>
>>>>> 6. Now the two things for you to confirm:
>>>>>
>>>>> A: does the updated_on timestamp match the FW_ping.json "ping_time" that you noted earlier? If not, is the timestamp later or earlier?
>>>>> B: is the type of the updated_on timestamp a String type (as opposed to a datetime type)?
>>>>>
>>>>> Regarding issue 2:
>>>>>
>>>>> I think this is a separate issue. When you run "lpad detect_lostruns --fizzle", the *database* knows that the job is FIZZLED, but the filesystem information in FW_offline.json still thinks the job is running / completed / etc. Thus, when running recover_offline() again, the file system information overrides the DB information and you end up forgetting that you decided to fizzle the job.
>>>>>
>>>>> Unfortunately, this does mean that at the current stage you need to manually "forget" about the information on the filesystem any time you want to change the state of an offline Firework using one of the LaunchPad commands. I've added an issue about this on GitHub (https://github.com/materialsproject/fireworks/issues/326), but unfortunately don't have a quick fix at the moment.
>>>>>
>>>>> On Friday, May 31, 2019 at 5:30:07 AM UTC-7, Johannes Hörmann wrote:
>>>>>>
>>>>>> Dear Fireworks Team,
>>>>>>
>>>>>> In the course of my PhD, I have been using Fireworks for about a year now to manage workflows on different computing resources, most importantly on the supercomputer NEMO in Freiburg and the Jülich machine JUWELS. While NEMO uses the queueing system MOAB/Torque, JUWELS employs SLURM. On both machines, I submit jobs via Fireworks' offline mode in order to be independent of a stable connection between the computing nodes and MongoDB (which would have to be tunneled via the login nodes, not reliable). On the login nodes, I usually have an infinite loop running the command
>>>>>>
>>>>>> lpad -l "${FW_CONFIG_PREFIX}/fireworks_mongodb_auth.yaml" recover_offline -w "${QLAUNCH_FWORKER_FILE}"
>>>>>>
>>>>>> every couple of minutes, checking for job state updates.
>>>>>>
>>>>>> What I became aware of over time is that on the JUWELS/SLURM machine, offline jobs fizzle properly, even when they are cancelled due to the walltime running out. I assume that SLURM sends a proper signal to rlaunch and allows some clean-up work to be done before forcefully killing the job.
>>>>>>
>>>>>> On the NEMO/MOAB machine, however, it seems the job is killed immediately when the walltime expires, and it stays marked as "running" indefinitely. I have to manually use "lpad detect_lostruns" to fizzle the Firework, and here I want to point out two issues:
>>>>>>
>>>>>> The first issue is that selecting the "dead" runs via the "--time" option of "lpad detect_lostruns" oftentimes does not work as expected. Even if a run has been "dead" for days, it might happen that "detect_lostruns" does not recognize it as "lost", and I have to go down to a few seconds with the expiration time to have the lost run(s) show up. But then, of course, other, healthy runs appear in the list as well. Here I would like to ask whether this behavior might be related to the "recover" loop running continuously in the background, as described above?
>>>>>>
>>>>>> The second, related issue is that even if I mark a lost run on the NEMO/MOAB machine as "fizzled" by "lpad detect_lostruns --fizzle" (and maybe a suitable --query in order to narrow the selection), it will get marked as "running" again by the next call of "lpad recover_offline", as shown above. The only way I can avoid that behavior is to stop the automated recovery loop and execute the Python command "lp.forget_offline(accordingFireWorksID, launch_mode=False)". Only then will the next "recover_offline" leave the run in question marked as "fizzled".
>>>>>>
>>>>>> I have observed these issues mostly with Fireworks 1.8.7, but a few days ago I updated to 1.9.1 and I believe they still persist. Would you have an idea about the source of these two (probably related?) issues?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Johannes Hörmann