Problem with DSelector analysis with Prooflite


Mariana Khachatryan

Oct 26, 2022, 12:50:45 PM
to gluex-s...@googlegroups.com
Dear all,

I have a DSelector that works fine when analyzing one or a few files using Prooflite,
but when I try to analyze more files, after some progress Prooflite displays a purple color and
shows the following message in the terminal:


Info in <TProofQueryResult::SetRunning>: nwrks: 8
Looking up for exact location of files: OK (1483 files) 
Info in <TPacketizer::TPacketizer>: Initial number of workers: 8
Validating files: OK (1483 files) 
0.2: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040856_000.root' - check logs for possible stacktrace - last event: 0
0.6: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040858_000.root' - check logs for possible stacktrace - last event: 0
0.4: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040860_000.root' - check logs for possible stacktrace - last event: 0
0.7: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040857_000.root' - check logs for possible stacktrace - last event: 0
0.5: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040859_000.root' - check logs for possible stacktrace - last event: 0
0.1: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040861_000.root' - check logs for possible stacktrace - last event: 0
0.3: caught exception triggered by signal '1' while processing dset:'TDSet:pi0etapr__B4_M35_Tree', file:'/lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2018_20221020113238pm/root/trees/tree_pi0etapr__B4_M35_genr8_040865_000.root' - check logs for possible stacktrace - last event: 0
Info in <TProofLite::MarkBad>: 
+++ Message from master at ifarm1901.jlab.org : marking ifarm1901.jlab.org:-1 (0.5) as bad
+++ Reason: undefined message in TProof::CollectInputFrom(...)



I have also attached the .log file of one of the workers.
The DSelector code is the following:
/w/halld-scshelf2101/Mariana/DSelector_analysis/gluex_root_analysis_massconst_Rupeshlatest/pi1estimate/DSelector_etaprpi0.C

and can be executed via Exe_DSelector_etaprpi0.csh in the same directory. The files being analyzed can be seen inside Run_DSelector_etaprpi0.C.
Inside the DSelector code I analyze etaprime pi0 events, apply basic cuts, and write a flat tree and some histograms to the output.


Please take a look and let me know what could be causing this.
Also please let me know if you need additional information.


With regards,
Mariana.


worker-0.0.log

Naomi Jarvis

Oct 26, 2022, 1:06:59 PM
to Mariana Khachatryan, gluex-s...@googlegroups.com
Hi Mariana,

Messages of this type are generated when the code crashes.
I don't know why each file would be OK individually but there's a problem when you run them all together. We have a similar problem with os8 anyway, so I process my files separately. If you are running on os8, then you should look at my os8 issues on GitHub.

Naomi. 

--
You received this message because you are subscribed to the Google Groups "GlueX Software Help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gluex-softwar...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gluex-software/31209CF1-058C-4BC2-9701-15AB748496D6%40gmail.com.

Justin Stevens

Oct 26, 2022, 1:17:27 PM
to Naomi Jarvis, Mariana Khachatryan, gluex-s...@googlegroups.com
I’ve found you can often get a more digestible stack trace to find the offending code by running without PROOF on a single ROOT file using just a single CPU.  You can do this using the following commands:

[ifarm] root.exe -l tree_<channel>.root
Attaching file tree_<channel>.root as _file0...
root [1] gROOT->ProcessLine(".x $(ROOT_ANALYSIS_HOME)/scripts/Load_DSelector.C")
root [2] <channel>_Tree->Process("DSelector_<channel>.C")

If that succeeds, then there may be some issue in PROOF, which can sometimes be resolved by clearing your ~/.proof/ folder (at least that’s where the default used to be). It keeps a cache of compiled code, etc., that sometimes gets into a bad state.

-Justin 

Mariana Khachatryan

Oct 26, 2022, 2:33:27 PM
to Justin Stevens, Naomi Jarvis, gluex-s...@googlegroups.com
Hi Justin,

Analyzing a single file as shown below works fine, and the output files with the tree and histograms look fine.

farm1901.jlab.org> root -l -b /lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2017_20221020113858pm/root/trees/tree_pi0etapr__B4_M35_genr8_030597_000.root 
root [0] 
Attaching file /lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2017_20221020113858pm/root/trees/tree_pi0etapr__B4_M35_genr8_030597_000.root as _file0...
(TFile *) 0x1fa1c10
root [1] .ls
TFile** /lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2017_20221020113858pm/root/trees/tree_pi0etapr__B4_M35_genr8_030597_000.root
 TFile* /lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/ExoticReview2022_PI1_EtaprimePi0_Spring2017_20221020113858pm/root/trees/tree_pi0etapr__B4_M35_genr8_030597_000.root
  KEY: TTree pi0etapr__B4_M35_Tree;1 pi0etapr__B4_M35_Tree
root [2] .x $ROOT_ANALYSIS_HOME/scripts/Load_DSelector.C
root [3] TTree* locTree=(TTree*)gDirectory->Get("pi0etapr__B4_M35_Tree")
(TTree *) 0x2b186b0
root [4] locTree->Process("DSelector_etaprpi0.C+")
INITIALIZE NEW TREE
DefaultFlatOff specified
DEFAULT FLAT TREE BRANCHES WILL NOT BE SAVED!
(this will reduce disk footprint of flat trees)
DecayingEtaPrime__X4 0
Tree reaction:
Photon, Proton -> Pi0, EtaPrime, Proton
Pi0 -> Photon, Photon
EtaPrime -> Pi-, Pi+, Eta
Eta -> Photon, Photon
(long long) 0



Analyzing a few tens of files with Prooflite also works fine.
But if I increase the number of files further, Prooflite makes some progress and then at some point the progress bar turns pink
and indicates the problem I have mentioned.

Emptying the ~/.proof folder doesn’t help.

Thank you,
Mariana.

Alexander Austregesilo

Oct 26, 2022, 3:41:02 PM
to Mariana Khachatryan, Justin Stevens, Naomi Jarvis, gluex-s...@googlegroups.com

Dear Mariana,

Thank you for the detailed problem report. I could reproduce the crash with your script. However, the crash does not happen if you run over the different run periods separately. The DSelector crashes only when you combine Spring 2017 with one of the 2018 run periods. I could track this down to a difference in the input trees. For example, the 2017 trees have the branch "NumUnusedShowers_Quality", while the 2018 trees do not. For that reason, the output tree is initialized differently, which is incompatible inside one process.
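
A mismatch like this can be checked before chaining files by comparing the branch names of one tree from each run period. The following is a minimal sketch, assuming the branch names have already been collected into a std::set (in ROOT, for instance, by looping over TTree::GetListOfBranches()). The function name MismatchedBranches is hypothetical and not part of the GlueX software.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Returns the branch names that appear in exactly one of the two trees.
// An empty result means the two branch lists match by name.
// (Hypothetical helper for illustration; it compares names only, not types.)
std::vector<std::string> MismatchedBranches(const std::set<std::string>& locBranchesA,
                                            const std::set<std::string>& locBranchesB)
{
    std::vector<std::string> locMismatch;
    // std::set iterates in sorted order, as set_symmetric_difference requires.
    std::set_symmetric_difference(locBranchesA.begin(), locBranchesA.end(),
                                  locBranchesB.begin(), locBranchesB.end(),
                                  std::back_inserter(locMismatch));
    return locMismatch;
}
```

For the trees discussed here, a comparison of a 2017 tree with a 2018 tree would be expected to list "NumUnusedShowers_Quality" among the mismatches.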

In the future, you should only combine trees that were created with the same or compatible software versions. Of course, you can just process the different run periods individually.
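
Processing the run periods individually requires splitting the file list first. The following is a rough sketch of one way to do that, assuming the run period is encoded in each path as a season plus a four-digit year (such as "Spring2017" or "Spring2018" in the paths quoted in this thread). The "Fall" alternative and the function name GroupByRunPeriod are assumptions made for illustration.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Groups file paths by a run-period token such as "Spring2017" found in the path.
// Assumption: the period appears as Spring/Fall followed by a four-digit year.
// Paths without such a token are collected under the key "unknown".
std::map<std::string, std::vector<std::string>> GroupByRunPeriod(const std::vector<std::string>& locPaths)
{
    static const std::regex locPattern("(Spring|Fall)[0-9]{4}");
    std::map<std::string, std::vector<std::string>> locGroups;
    for (const std::string& locPath : locPaths)
    {
        std::smatch locMatch;
        const std::string locKey = std::regex_search(locPath, locMatch, locPattern)
                                       ? locMatch.str(0)
                                       : std::string("unknown");
        locGroups[locKey].push_back(locPath);
    }
    return locGroups;
}
```

Each resulting group could then be added to its own chain and processed in a separate pass, so that no single process sees trees from more than one run period.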

Cheers,

Alex

-- 
Alexander Austregesilo

Staff Scientist - Experimental Nuclear Physics
Thomas Jefferson National Accelerator Facility
Newport News, VA
aaus...@jlab.org
(757) 269-6982

Mariana Khachatryan

Oct 26, 2022, 3:47:50 PM
to Alexander Austregesilo, Justin Stevens, Naomi Jarvis, gluex-s...@googlegroups.com
Hi Alex,

Thank you for finding the issue.
I believe this is why some of the workers would be fine while others would show a problem.
When I use Prooflite, I add files to a chain and then call the Process_chain function from Prooflite.
I’m curious: do you know how the data is distributed to be analyzed in parallel on different workers?

Alexander Austregesilo

Oct 26, 2022, 4:09:59 PM
to Mariana Khachatryan, Justin Stevens, Naomi Jarvis, gluex-s...@googlegroups.com

I believe the data is just distributed sequentially, in the order it was added to the TChain. As soon as the first worker is switched from 2017 to 2018 files, it reports a problem.
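
This would also explain why a short file list works while a longer one fails partway through. The following is a toy model of that behaviour, not the actual PROOF packetizer: it assumes files are handed out round-robin in chain order, which is a simplification. The function name FirstPeriodSwitch is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first file at which some worker receives a file
// from a different run period than the file it processed before, or -1 if
// no worker ever switches.
// Simplifying assumption: files are handed out round-robin in chain order,
// so with N workers, worker w gets files w, w+N, w+2N, ...
long FirstPeriodSwitch(const std::vector<std::string>& locPeriods, std::size_t locNumWorkers)
{
    if (locNumWorkers == 0)
        return -1;
    for (std::size_t loc_i = locNumWorkers; loc_i < locPeriods.size(); ++loc_i)
    {
        // The same worker previously handled file loc_i - locNumWorkers.
        if (locPeriods[loc_i] != locPeriods[loc_i - locNumWorkers])
            return static_cast<long>(loc_i);
    }
    return -1;
}
```

In this model, a chain that contains only one run period, or that has no more files than workers, never triggers a switch, while a chain that mixes periods triggers one as soon as a worker reaches its first file from the other period.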
