Performance, Ncore!=Nroot

David Collins

Sep 27, 2015, 7:57:42 AM
to enzo...@googlegroups.com
Hi, Everybody--

I'm having some trouble running the performance timers on jobs where the number of MPI tasks differs from the number of root grid tiles. The hang seems to come from the call to Reduce_Times in the write_out step. Has anyone else run into this?

I see it when restarting an old molecular cloud simulation of mine, which uses AMR and has 512 root grid tiles: it stalls on 256 cores but, oddly, works fine on 128. Things work fine if I compile without the performance counters, but they're exactly what I want to be using for this particular project.

Thanks!
d.

--
-- Sent from a computer.

Nathan Goldbaum

Sep 27, 2015, 4:08:05 PM
to enzo...@googlegroups.com
I've seen similar hangs in the past. See my response in this thread:


I never found the root cause and ended up working around the hangs by turning off the performance timers.

Sam Skillman

Sep 27, 2015, 11:37:52 PM
to enzo...@googlegroups.com
Hi Dave,

This is a symptom of one of the timed routines not being seen by one of the processors. The last time I ran into this, some grids weren't calling the root-level gravity solver, though I think I fixed that at some point. I'd say search for TIMER_START and see whether any timed routine might be conditionally skipped by one of the MPI tasks in your problem type. If you find one, you can put a TIMER_REGISTER(routine_name) somewhere to pre-register it as a timer.
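
You can reproduce the hang with a tiny standalone MPI program (this is not Enzo code; the setup is made up for illustration) in which the ranks disagree on how many reductions to perform -- the same situation Reduce_Times ends up in when the per-rank timer lists differ:

// mismatch_demo.cc -- illustrative only, not Enzo source.
// Build and run:  mpicxx mismatch_demo.cc -o mismatch_demo
//                 mpirun -np 2 ./mismatch_demo
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Pretend rank 0 "registered" two timers while everyone else
  // registered only one -- e.g. a rank that owns no grids never hit
  // TIMER_START("SolveHydroEquations").
  int ntimers = (rank == 0) ? 2 : 1;

  // Reduce_Times-style loop: one collective reduction per known timer.
  for (int i = 0; i < ntimers; i++) {
    double local = 1.0, total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }

  // Rank 0 is now stuck in its second MPI_Reduce, waiting forever for
  // contributions the other ranks will never make.
  printf("rank %d done\n", rank);
  MPI_Finalize();
  return 0;
}

Pre-registering the timer on every rank keeps the lists identical, and the collective calls match up again.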

There's also a small chance that a proc isn't reaching the TIMER_START in EvolveLevel.C that times each level iteration. I thought I fixed that, though. You could try a fix that adds a loop early on to call TIMER_REGISTER on all the levels up to the maximum refinement level.

Maybe try this:
diff -r cb9712a1f763 src/enzo/ReadParameterFile.C
--- a/src/enzo/ReadParameterFile.C Tue Aug 11 14:11:49 2015 -0500
+++ b/src/enzo/ReadParameterFile.C Sun Sep 27 11:33:44 2015 -0700
@@ -29,6 +29,7 @@
#include <libconfig.h++>
#endif
 
+#include "EnzoTiming.h"
#include "macros_and_parameters.h"
#include "typedefs.h"
#include "global_data.h"
@@ -315,6 +316,15 @@
    ret += sscanf(line, "RefineBy               = %"ISYM, &RefineBy);
    ret += sscanf(line, "MaximumRefinementLevel = %"ISYM,
 &MaximumRefinementLevel);
+
+    // Register the per-level timers on all of the processors, so that
+    // every rank has an identical timer list before the first reduction.
+    for (int i = 0; i <= MaximumRefinementLevel; i++) {
+      char level_name[MAX_LINE_LENGTH];
+      sprintf(level_name, "Level_%02"ISYM, i);
+      TIMER_REGISTER(level_name);
+    }
+
    ret += sscanf(line, "MaximumGravityRefinementLevel = %"ISYM,
 &MaximumGravityRefinementLevel);
    ret += sscanf(line, "MaximumParticleRefinementLevel = %"ISYM,

Sam

Cameron Hummels

Sep 28, 2015, 3:13:25 PM
to enzo...@googlegroups.com
As a short follow-up, I have also been encountering this problem in my scaling tests on Stampede. The issue doesn't appear to be NGRIDS != NCORES; it occurs when you restart from an output that was generated with fewer processors than you are currently running on.

Example:

You run a simulation with 16 processors to z=3. If you try to restart that simulation with more than 16 processors, it hangs on the first cycle. If you restart with 16 or fewer processors, no problem. If you turn off enzo_performance, no problem. The hang appears to be in the SolveHydroEquations performance timer on the processors that don't inherit one of the output's root grids.

Collins and I are trying to track it down from here.

Cameron
Cameron Hummels
NSF Postdoctoral Fellow
Department of Astronomy
California Institute of Technology

David Collins

Sep 28, 2015, 6:05:21 PM
to enzo...@googlegroups.com
Another follow-up--

I tracked the issue down to a timer that isn't called on all cores. In my case, the original run used 128 cores and I restarted on 512, so SolveHydroEquations wasn't called on every core, because some cores had no grids. Adding TIMER_REGISTER("SolveHydroEquations") to the top of EvolveLevel did the trick. (Sam, I thought I tried this before without success, but I must have done something wrong.)
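
For reference, the stop-gap looks roughly like this (a sketch, not a full diff -- EvolveLevel's real argument list is longer; see EvolveLevel.C):

#include "EnzoTiming.h"

int EvolveLevel(/* ...existing argument list unchanged... */)
{
  // Register the timer unconditionally, so every rank has it even if
  // it owns no grids on this level and never calls the hydro solver.
  TIMER_REGISTER("SolveHydroEquations");

  /* ...rest of EvolveLevel unchanged... */
}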

So we have a stop-gap for now, but we should come up with a way to generalize this fix.

d.

Sam Skillman

Sep 28, 2015, 6:16:13 PM
to enzo...@googlegroups.com
We can probably register all of the timed methods (other than the level timers) right after the enzo_timer is instantiated in enzo.C -- no need to re-register them every time we enter EvolveLevel.
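
Something like this, say -- the routine names below are examples only; the real list would come from grepping for TIMER_START in src/enzo:

// In enzo.C, right after the enzo_timer is instantiated.
// Example section names; grep for TIMER_START to build the full list.
const char *timed_sections[] = {
  "SolveHydroEquations",
  "RebuildHierarchy",
  "SetBoundaryConditions",
};
for (unsigned int i = 0;
     i < sizeof(timed_sections) / sizeof(timed_sections[0]); i++)
  TIMER_REGISTER(timed_sections[i]);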

Cameron Hummels

Sep 29, 2015, 12:32:01 AM
to enzo...@googlegroups.com
Thanks for the tips, Sam, and for the testing, Collins. I've now tested this solution and it works well. I've updated the docs and code and made a PR:

David Collins

Sep 29, 2015, 11:53:28 AM
to enzo...@googlegroups.com
Fantastic, thanks a ton!

d.