Today my workflow "gen_pi0kp_FMR_phasespace_2019-11" suddenly started failing en masse, even though 60% of its jobs had already finished successfully a few days ago.
The .err files look fine. The .out files all show that the jobs stopped after MCSMEAR completed, just as they started setting up a new environment for reconstruction with hd_root.
Here is one example (/u/scifarm/farm_out/haoli/gluex_simulations/gen_pi0kp_FMR_phasespace_2019-11/log/72036_stdout.72036_9.out).
Here are its last few lines:
JANA >>Merging event reader thread ...
JANA >> 10000 events processed (10000 events read) Average rate: 2.9Hz
JANA >>Closing shared object handle 0 ...
JANA >>Closing shared object handle 1 ...
JANA >>Closing shared object handle 2 ...
new env setup
EMULATING ANALYSIS LAUNCH
a
I don't remember which MCWrapper version I used to submit the jobs on ifarm, but it cannot be the most recent one, and I cannot find a record of it in the log (the log only shows the MCWrapper version from the version set). Still, it is weird that all 6000 jobs were successful a few days ago, yet the last 40% have started to fail for this reason.
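
In case it helps with triage, here is a minimal sketch for listing which .out files end the same way as the example above. It assumes the marker string "EMULATING ANALYSIS LAUNCH" appears within the last few non-empty lines of an affected log. The function names and the `tail_lines` cutoff are my own and are not part of MCWrapper.

```python
from pathlib import Path

# Marker taken from the log tail quoted above.
MARKER = "EMULATING ANALYSIS LAUNCH"


def stopped_after_marker(text, marker=MARKER, tail_lines=5):
    """Return True if `marker` is among the last few non-empty lines of `text`."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return any(marker in ln for ln in lines[-tail_lines:])


def find_stalled_logs(log_dir, pattern="*.out"):
    """Yield the .out files in `log_dir` whose tail contains the marker."""
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        if stopped_after_marker(path.read_text(errors="replace")):
            yield path
```

Calling `find_stalled_logs` on the workflow's log directory (the `log/` directory in the path above) should give the list of affected jobs, which can then be compared against the ones that finished a few days ago.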