generating bggen, crash in mcsmear


Mark Ito

Dec 4, 2020, 12:21:18 PM
to GlueX Software Help Email List

Folks,

I am getting an intermittent problem when running bggen events on the JLab farm. It doesn't happen every time, but about 40% of the jobs crash, and they crash in mcsmear. This is what shows up on standard error:

Run: [FATAL] Connection error
terminate called after throwing an instance of 'std::runtime_error'
  what():  hddm_s::istream::istream error - invalid hddm header
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/gluex_MCwrapper/gluex_MCwrapper-v2.5.1/MakeMC.sh: line 1481: 246733 Aborted                 (core dumped) mcsmear $MCSMEAR_Flags -PTHREAD_TIMEOUT_FIRST_EVENT=6400 -PTHREAD_TIMEOUT=6400 -o$STANDARD_NAME\_geant$GEANTVER\_smeared.hddm $STANDARD_NAME\_geant$GEANTVER.hddm ./run$formatted_runNumber\_random.hddm\:1\+$fold_skip_num

Standard output says:

RUNNING MCSMEAR
skipping: 43032
mcsmear  -PTHREAD_TIMEOUT_FIRST_EVENT=6400 -PTHREAD_TIMEOUT=6400 -obggen_bggen_030401_219\_geant4\_smeared.hddm bggen_bggen_030401_219\_geant4.hddm ./run030401\_random.hddm:1+43032
An hddm file was not created by mcsmear.  Terminating MC production.  Please consult logs to diagnose
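
The "invalid hddm header" makes me suspect the random-trigger input rather than the geant4 output. Two quick checks I can try by hand in a failed job's working directory are sketched below (file names are taken from the log above; whether they are still present on the worker node is an assumption):

# if I understand the format correctly, a valid hddm_s file begins with a
# plain-text XML preamble, so garbage or truncation should be obvious here
head -c 200 run030401_random.hddm

# rerun the exact mcsmear command from the log by hand and see whether the
# crash reproduces outside of MakeMC.sh
mcsmear -PTHREAD_TIMEOUT_FIRST_EVENT=6400 -PTHREAD_TIMEOUT=6400 \
    -obggen_bggen_030401_219_geant4_smeared.hddm \
    bggen_bggen_030401_219_geant4.hddm \
    ./run030401_random.hddm:1+43032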

Jobs are run via gluex_MC.py and are submitted to swif. The version set I am using is here. This version set has versions of halld_recon and halld_sim that were recently updated from their respective master branches (earlier this week). There are some NPP tweaks in halld_recon, but those should not affect the issue I am having. The gluex_MC.py config file is attached.

This problem seems to ring a bell, but I thought I would ask the group before diving into it more deeply.

  -- Mark



MC_bggen.config

Sean Dobbs

Dec 4, 2020, 5:07:56 PM
to Mark Ito, GlueX Software Help Email List, Thomas Britton
This looks like an error reading one of the input files to mcsmear. Maybe xrootd is throwing an error? Or are the random trigger files being copied in from somewhere else?

---Sean

tjbri...@gmail.com

Dec 4, 2020, 5:22:30 PM
to GlueX Software Help
Oh yes! This might be it. Dtn1902 (which hosts the files via xrootd) was crashing off and on all morning. It worked when it was up and not when it was down. I should look into why UConn wasn't getting the fall-backs. Also, maybe we need a way to disable xrootd...
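
If it helps, the xrootd side can be poked at directly from a farm node with the standard client tools, along the lines below (the host name is the one mentioned above; the full server address and the path to the randoms area are placeholders, so substitute the real ones):

# is the xrootd server up and does it see the file?
xrdfs root://dtn1902.jlab.org stat /path/to/random_triggers/run030401_random.hddm

# or try pulling a copy down locally
xrdcp root://dtn1902.jlab.org//path/to/random_triggers/run030401_random.hddm /tmp/run030401_random.hddm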

Mark Ito

Dec 4, 2020, 6:25:09 PM
to Thomas Britton, Sean Dobbs, GlueX Software Help Email List

So...try again?


Richard Jones

Dec 4, 2020, 9:44:02 PM
to Mark Ito, GlueX Software Help Email List
I just checked and verified that the particular file you were asking for is online and available from our randoms storage area here at UConn. -Richard



Mark Ito

Dec 5, 2020, 6:31:55 PM
to gluex-s...@googlegroups.com
Tried again, similar result:

[marki@ifarm1901 log]$ swif status workflow
workflow_id                   = 135866
workflow_name                 = workflow
workflow_user                 = marki
jobs                          = 1000
succeeded                     = 764
problems                      = 236
problem_types                 = SWIF-USER-NON-ZERO,AUGER-FAILED
problem_auger_failed          = 2
problem_swif_user_non_zero    = 234
attempts                      = 1000
create_ts                     = 2020-12-04 18:54:32.0
update_ts                     = 2020-12-04 23:17:47.0
current_ts                    = 2020-12-05 18:18:39.0

So roughly a 24% failure rate (236 problem jobs out of 1000).

But I noticed that there are at least two flavors of standard error output. I only gave an example of one in my original post. Find both examples attached to this post.

30401_stderr.bggen_30401_9.err
30401_stderr.bggen_30401_998.err

tjbri...@gmail.com

Dec 9, 2020, 10:34:08 AM
to GlueX Software Help
I tried to reproduce the error by running locally on scosg16 with that version set and run number 30401. I get past mcsmear and see:
Error [1270]: in [MySQLDataProvider::GetAssignmentShort(int, const string&, time_t, const string&)] No data was selected. Table '/FCAL/energy_dependence_correction_vs_ring' for run='30401', timestampt='0' and variation='default'  

and it crashes in hd_root. This was without skipping 43032 events in the random trigger file. I will try to have it skip about that many and see what happens.
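
The missing-table complaint can also be checked directly against CCDB; something along these lines should show whether the table exists for that run and variation (I am going from memory on the exact request syntax for the ccdb command-line tool, so treat this as a sketch and consult its help):

# list what assignments exist for the table
ccdb vers /FCAL/energy_dependence_correction_vs_ring

# try to dump it for run 30401 in the default variation
ccdb dump /FCAL/energy_dependence_correction_vs_ring:30401:default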

tjbri...@gmail.com

Dec 9, 2020, 10:55:58 AM
to GlueX Software Help

Are you sure that error is reproducible, and not that some of the jobs crashed in mcsmear and others crashed in hd_root? I am unable to reproduce an mcsmear crash but am repeatedly getting an hd_root crash....

Mark Ito

Dec 11, 2020, 1:24:16 PM
to GlueX Software Help
Did a swif retry-jobs on the workflow. Some of the re-submitted jobs succeeded, at a rate similar to that of the original submission. Zeno's paradox...

[marki@ifarm1901 trees]$ swif status workflow
workflow_id                   = 135866
workflow_name                 = workflow
workflow_user                 = marki
jobs                          = 1000
succeeded                     = 916
problems                      = 84
problem_types                 = SWIF-USER-NON-ZERO
problem_swif_user_non_zero    = 84
attempts                      = 1236
create_ts                     = 2020-12-04 18:54:32.0
update_ts                     = 2020-12-10 23:46:36.0
current_ts                    = 2020-12-11 13:11:53.0
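
For completeness, the retry itself was just the stock swif command, roughly the line below (I am going from memory on the option names, so check swif's help if it complains; repeat per problem type listed in swif status):

# resubmit only the problem jobs in the existing workflow
swif retry-jobs -workflow workflow -problems SWIF-USER-NON-ZERO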

Thomas Britton

Dec 11, 2020, 1:33:40 PM
to Mark Ito, GlueX Software Help
I have been unable to reproduce this issue (and unrepeatable results are the worst).

I don't believe it is a matter of missing or corrupted random trigger files. In this situation I think I have to blame networking, because 1) they aren't here to defend themselves and 2) network transients could manifest themselves this way.

But seriously....I am sadly at a loss. Have you tried the DEV branch?

Thomas Britton


Mark Ito

Dec 13, 2020, 7:36:29 PM
to GlueX Software Help
Just for closure... I did a swif retry-jobs twice more and got all 1000 jobs to succeed. I don't know why.