Partially corrupted MC reconstructed trees causing issues in histogram made by DSelector


Hao Li

Aug 16, 2024, 9:12:23 PM
to GlueX Software Help Email List, Hao Li
Hi all,

I am reporting a subtle issue that affected my cross section results and may have a wider impact on other similar analyses. Here is a brief description of the issue:
  • Problem: In April, I ran MCWrapper to simulate the F18 low-energy ppbar signal MC, requesting 10 million events. The job was split into ~1100 sub-jobs, each producing a thrown tree and a reconstructed tree. After merging the trees, I analyzed the merged file with a DSelector in multi-threaded mode and found that the histogram for the reconstructed MC was missing a portion of the total statistics. This made the efficiency 10% lower than expected and therefore inflated the acceptance-corrected cross section. Upon investigation, I found that one DSelector thread had run into problems, which caused the lost counts in the reconstructed MC histogram.
  • Reason: The issue stemmed from a partially corrupted reconstructed MC tree produced by one of the ~1100 MCWrapper jobs, which was then merged into the larger tree used as input to my DSelector. In this file, certain entries of certain branches are unreadable (see the "Proposed solution" item below for details and an example). It was not flagged as corrupted during merging because the file could be opened and looked like the other files (IsZombie() does not identify it). The partially corrupted entries caused one of the DSelector threads to malfunction without triggering a crash, leading to incorrect results in the final histograms.
  • Error message: The multi-threaded DSelector job finished and wrote its histograms normally, but the log file shows error messages like the following (note that the same messages sometimes appear when the memory leak happens, but in that case the job neither finishes nor writes histograms normally):

 +++ Message from master at qcdcomp-1-2.local : marking qcdcomp-1-2:-1 (0.2) as bad

 +++ Reason: undefined message in TProof::CollectInputFrom(...)

Then I checked the log of the bad thread (0.2), which shows hundreds of lines like:

R__unzip: error -5 in inflate (zlib)

Error in <TBasket::ReadBasketBuffers>: fNbytes = 2706, fKeylen = 105, fObjlen = 25544, noutot = 0, nout=0, nin=2601, nbuf=25544

Error in <TBranch::GetBasket>: File: input.root at byte:1698948, branch:Beam__X4_Measured, entry:0, badread=1, nerrors=1, basketnumber=0
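
Because the job still finishes and writes histograms, one way to notice the problem is to scan the worker logs for these messages before trusting the output. The following is a minimal sketch in Python that counts TBranch::GetBasket errors per file and branch. The regular expression is modeled only on the single log line quoted above, and the function name count_bad_reads is hypothetical; other ROOT versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the TBranch::GetBasket error line quoted in this thread.
# Assumption: other ROOT versions may word this message differently.
GETBASKET_RE = re.compile(
    r"Error in <TBranch::GetBasket>: File: (?P<file>\S+) at byte:(?P<byte>\d+), "
    r"branch:(?P<branch>[^,]+), entry:(?P<entry>\d+)"
)


def count_bad_reads(log_lines):
    """Return a Counter mapping (file, branch) to the number of GetBasket errors."""
    counts = Counter()
    for line in log_lines:
        match = GETBASKET_RE.search(line)
        if match:
            counts[(match.group("file"), match.group("branch"))] += 1
    return counts


# The three log lines quoted above, used here as sample input.
sample_log = [
    "R__unzip: error -5 in inflate (zlib)",
    "Error in <TBasket::ReadBasketBuffers>: fNbytes = 2706, fKeylen = 105, "
    "fObjlen = 25544, noutot = 0, nout=0, nin=2601, nbuf=25544",
    "Error in <TBranch::GetBasket>: File: input.root at byte:1698948, "
    "branch:Beam__X4_Measured, entry:0, badread=1, nerrors=1, basketnumber=0",
]

bad = count_bad_reads(sample_log)
for (fname, branch), n in sorted(bad.items()):
    print(f"{fname}: {branch}: {n} bad basket read(s)")
```

A non-empty result for any worker log would be a reason to distrust the histograms from that run, even if the job reported success.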

  • Potential damage:
    • Because the job produces a non-empty histogram without crashing, it can lead to wrong physics results that go unnoticed.
    • If the individual trees from each MCWrapper job are deleted after merging and one of them happens to be partially corrupted, the corrupted entries remain in the merged tree (see example #3 below, where the first corrupted entry shows up late), and the entire simulation sample is polluted and cannot be recovered.
  • Proposed solution (sort of): For now, I am just going to delete those two files before merging the trees. In the long run, a script is needed, either at the 'MCWrapper before finishing the job' stage or at the 'before merging trees using hadd' stage, that quickly checks whether a tree contains such partially corrupted branches, so that the merged tree is not polluted. Right now I have a scratch script (/w/halld-scshelf2101/home/haoli/public/problematic_trees/check.C) that does this check, but it is too slow for screening thousands of files quickly. It recursively checks each branch for the percentage of corrupted entries. I have put everything on ifarm, and there are three example files provided in the same directory:
    1. A good one (no corrupted entries): tree_antip__B4_mc_gen_051454_000.root
    2. A problematic tree (partially corrupted): tree_antip__B4_mc_gen_051454_014.root

      Branch Name                       Total Entries   Corrupted (%)   First Corrupted   Last Corrupted
      ==================================================================================================
      ThrownBeam__X4                    1854            19.53           0                 361
      Thrown__X4                        1854            12.51           348               1623
      Beam__X4_Measured                 1854            9.28            686               857
      NeutralHypo__X4_Measured          1854            21.74           1209              1611
      ChargedHypo__X4_Measured          1854            6.20            116               230
      ChargedHypo__dEdx_CDC_integral    1854            85.71           0                 1588
      ChargedHypo__SumU_FCAL            1854            85.71           0                 1588
      ChargedHypo__NumPhotons_DIRC      1854            85.71           0                 1588
      Proton1__P4_KinFit                1854            9.22            1532              1702
      Proton2__X4_KinFit                1854            9.01            345               511

    3. The tree merged from the two above: merged.root

      Branch Name                       Total Entries   Corrupted (%)   First Corrupted   Last Corrupted
      ==================================================================================================
      ThrownBeam__X4                    3831            9.45            1977              2338
      Thrown__X4                        3831            6.06            2325              3600
      Beam__X4_Measured                 3831            4.49            2663              2834
      NeutralHypo__X4_Measured          3831            10.52           3186              3588
      ChargedHypo__X4_Measured          3831            3.00            2093              2207
      ChargedHypo__dEdx_CDC_integral    3831            41.48           1977              3565
      ChargedHypo__SumU_FCAL            3831            41.48           1977              3565
      ChargedHypo__NumPhotons_DIRC      3831            41.48           1977              3565
      NDF_KinFit                        3831            48.39           1977              3830
      Proton1__P4_KinFit                3831            4.46            3509              3679
      Proton2__ChiSq_Timing_KinFit      3831            48.39           1977              3830
      Proton2__X4_KinFit                3831            4.36            2322              2488

This is a sample of what a polluted tree looks like after merging a good tree with a corrupted one. The corrupted entries only start to show up at entry 1977 and later.
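
The relation between the two tables can be checked with a little arithmetic. The sketch below assumes the good file was merged first and contributes 3831 - 1854 = 1977 entries (inferred from the two "Total Entries" columns, not stated separately), so every entry index from the corrupted file is shifted by 1977 and every percentage is scaled by 1854/3831. The helper name to_merged is hypothetical.

```python
# Entry counts taken from the "Total Entries" columns of the two tables.
N_MERGED = 3831
N_BAD_FILE = 1854
# Assumption: the good file comes first in the merge and supplies the rest.
N_GOOD_FILE = N_MERGED - N_BAD_FILE  # 1977


def to_merged(pct, first, last, n_before=N_GOOD_FILE, n_file=N_BAD_FILE):
    """Map per-file (corrupted %, first, last) to the values seen in the merged tree."""
    n_total = n_before + n_file
    merged_pct = round(pct * n_file / n_total, 2)
    return merged_pct, first + n_before, last + n_before


# A few rows from the per-file table: (corrupted %, first corrupted, last corrupted).
per_file = {
    "ThrownBeam__X4": (19.53, 0, 361),
    "Beam__X4_Measured": (9.28, 686, 857),
    "ChargedHypo__dEdx_CDC_integral": (85.71, 0, 1588),
}

for name, row in per_file.items():
    print(name, to_merged(*row))
```

For these three branches the computed values agree with the corresponding rows of the merged table, which is consistent with the corrupted entries being carried over unchanged by the merge.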
------------------------------------------------------------------------------------------------------------------------------
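
For readers without access to check.C, here is a rough Python sketch of the same idea; it is not the script itself. It assumes PyROOT is available and relies on TBranch::GetEntry returning a negative value when a basket cannot be read. It only looks at top-level branches, whereas check.C is described as recursive. The names summarize, scan_branch and scan_file are hypothetical, and the tree name has to be supplied by the caller.

```python
def summarize(n_entries, bad_entries):
    """Return (corrupted %, first bad entry, last bad entry) for one branch."""
    if not bad_entries:
        return 0.0, None, None
    pct = round(100.0 * len(bad_entries) / n_entries, 2)
    return pct, min(bad_entries), max(bad_entries)


def scan_branch(branch):
    """Read every entry of one branch and collect the entries that fail to read.

    Assumption: GetEntry returns a negative value on an I/O error, such as a
    basket that cannot be decompressed.
    """
    n_entries = int(branch.GetEntries())
    bad = [i for i in range(n_entries) if branch.GetEntry(i) < 0]
    return summarize(n_entries, bad)


def scan_file(path, tree_name):
    """Return {branch name: (corrupted %, first bad, last bad)} for one tree."""
    import ROOT  # PyROOT, imported lazily so the helpers above work without it

    root_file = ROOT.TFile.Open(path)
    if not root_file or root_file.IsZombie():
        raise IOError(f"cannot open {path}")
    tree = root_file.Get(tree_name)
    if not tree:
        root_file.Close()
        raise KeyError(f"no tree named {tree_name} in {path}")
    # Top-level branches only; sub-branches are not visited in this sketch.
    report = {b.GetName(): scan_branch(b) for b in tree.GetListOfBranches()}
    root_file.Close()
    return report
```

As a plausibility check on the bookkeeping, summarize(1854, list(range(0, 362))) gives (19.53, 0, 361), the same numbers as the ThrownBeam__X4 row of the per-file table, which is what one would expect if every entry from 0 to 361 of that branch were unreadable. Reading every entry of every branch is presumably as slow as the original script, so this does not address the speed concern.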

Cheers,
Hao

Shepherd, Matthew

May 21, 2025, 12:04:21 PM
to Hao Li, GlueX Software Help Email List

Is there a status update on this issue?

Matt



Peter Hurck

May 22, 2025, 3:19:30 AM
to Shepherd, Matthew, Hao Li, GlueX Software Help Email List
Hi,

This completely fell off my radar. I will make a note to revisit this issue and Hao’s proposed solution as soon as possible. We should be able to verify the files properly before the job is reported as successful on the OSG.

Cheers,
Peter
