Zapdos on HPC

Shane Keniley

Sep 28, 2020, 10:05:16 AM
to zapdos-users
Hey all,

I'm getting very frustrating behavior trying to run Zapdos on INL's HPC system. (Falcon1, specifically.) On my own computer running with 8 cores the simulations run fine, but when I try the exact same simulation on INL's HPC I consistently get this error after about an hour of runtime: 

ERROR: LU factorization failed with info=1

This consistently happens whenever the nonlinear residual gets very large, something like 10^51. On my own computer this error never occurs no matter what happens with the residual - if it suddenly gets very large, the solve simply fails to converge and the timestep is cut, as expected. On HPC the error is thrown and the simulation crashes. I have the following PETSc options (again, the input is identical to the one that runs fine on my PC):

  petsc_options_iname = '-pc_type -pc_factor_shift_type -pc_factor_shift_amount -snes_stol'
  petsc_options_value = 'lu NONZERO 1.e-10 0'

(I know LU isn't scalable, but I'm only running with 8-12 cores.) 

I can get the simulation to avoid the issue by using -pc_factor_mat_solver_type superlu_dist, but this is unbearably slow. A simulation that takes ~24 hours on my computer with no mat solver type selected has been running for several days on HPC with superlu_dist, and it's still nowhere near done. I've also tried swapping out LU for ASM as my preconditioner, but at the moment that isn't a viable option either: it avoids the factorization error, but the simulation gets stuck at even smaller timesteps than LU with superlu_dist.
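
(For reference, by "using superlu_dist" I just mean adding the solver-type pair to the same option lists, something along these lines:

  petsc_options_iname = '-pc_type -pc_factor_shift_type -pc_factor_shift_amount -pc_factor_mat_solver_type -snes_stol'
  petsc_options_value = 'lu NONZERO 1.e-10 superlu_dist 0'

in case the way I'm passing it matters.)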

I can't track down what causes the LU factorization error, and nobody on the MOOSE users group seems to have run into it. Any advice would be appreciated.

(On a side note, I tried using -pc_factor_mat_solver_type mumps on HPC, but I get the error that mumps isn't installed. I thought mumps was the default?)


Alexander Lindsay

Sep 28, 2020, 12:14:25 PM
to zapdos...@googlegroups.com
I started typing the message below and then realized we should already be catching this with the default divergence tolerance. (I'm guessing that your initial residual is less than 1e47.) Then I looked into the code and realized we're only going to catch it if your PETSc version is at least 3.8.0. I'm guessing that your workstation has PETSc >= 3.8.0, but the PETSc version you're using on Falcon1 is < 3.8.0. Can you confirm? (Or deny. :-) If it's less than 3.8.0, I'd recommend selecting a newer PETSc (and/or migrating to a newer INL HPC system like Lemhi that allows you to do that).

Fande could probably help you track down the root of the factorization error. But if you want to circumvent it entirely, I would suggest playing around with the petsc option: -snes_divergence_tolerance <default 1e4>


I'm thinking that when your residual is up around 1e51, you're probably ready to say that the solve didn't converge :-) And yeah, I'm not surprised that that turns out to be dangerous territory for the linear solver.


Shane Keniley

Sep 28, 2020, 12:45:29 PM
to zapdos-users
> (I'm guessing that your initial residual is less than 1e47.)

Yeah, I'm using automatic scaling so the initial residual is usually quite close to 1. 
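
(For reference, that's just the automatic scaling option turned on in the Executioner block, something like:

[Executioner]
  automatic_scaling = true
[]

in case the scaling is relevant here.)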

> Then I looked into the code and realized we're only going to catch it if your PETSc version is at least 3.8.0. I'm guessing that your workstation has PETSc >= 3.8.0, but the PETSc version you're using on Falcon1 is < 3.8.0. Can you confirm? (Or deny. :-) If it's less than 3.8.0, I'd recommend selecting a newer PETSc (and/or migrating to a newer INL HPC system like Lemhi that allows you to do that).

The version of PETSc loaded when I enter "module load use.moose PETSc" is apparently PETSc/3.10.5-GCC. I tried explicitly loading a newer version with "module load use.moose PETSc/3.11.4-GCC" in my job submission script, but I got the same factorization error as before.

Alexander Lindsay

Sep 28, 2020, 1:53:49 PM
to zapdos...@googlegroups.com
That is very odd. Do you have an earlier timestep that looks like a candidate for dtol but doesn't appear to trigger it? I am curious about running this in the debugger, but I don't want to spend an hour waiting to reach the point where the divergence check should fire.


Shane Keniley

Sep 28, 2020, 3:03:18 PM
to zapdos-users
I don't really understand when dtol is supposed to be triggered - when the residual is greater than 1e4? If so, yeah, I regularly get residuals of that magnitude that don't seem to trigger any kind of error. In this particular solve I see a few residuals on the order of 1e20 and 1e30, and one gets as high as 4e136. In all cases the solver actually tries to keep going before failing to converge with DIVERGED_FNORM_NAN. I've attached the terminal output file here.

Attachment: air_water_001_out

Alexander Lindsay

Sep 28, 2020, 3:14:12 PM
to zapdos...@googlegroups.com
Yes, it's supposed to trigger any time current_residual > initial_residual * divtol.
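
(So with the initial nonlinear residual of ~1.71e1 in your output and the default divtol of 1e4, the check should fire as soon as the residual climbs past roughly 1.7e5.)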


Alexander Lindsay

Sep 28, 2020, 3:18:36 PM
to zapdos...@googlegroups.com
Oh yeah, this should never happen:

 0 Nonlinear |R| = 1.710986e+01
      0 Linear |R| = 1.710986e+01
      1 Linear |R| = 1.473719e+01
      2 Linear |R| = 1.040520e-03
      3 Linear |R| = 1.381233e-07
  Linear solve converged due to CONVERGED_RTOL iterations 3
 1 Nonlinear |R| = 4.534498e+136
      0 Linear |R| = 4.534498e+136
      1 Linear |R| = 1.229078e+117
  Linear solve converged due to CONVERGED_RTOL iterations 1
Nonlinear solve did not converge due to DIVERGED_FNORM_NAN iterations 1
 Solve Did NOT Converge!
Aborting as solve did not converge

It should have quit after evaluation of the first nonlinear residual. Would it be possible for you to coarsen this problem to ~1000 dofs and send me the input file? Maybe that's not easy to do...

Shane Keniley

Sep 28, 2020, 3:30:13 PM
to zapdos-users
I can certainly try. I'm pretty sure I've seen this behavior before in smaller problems, but there's no guarantee it'll ever happen if I just reduce this model. I'll give it a shot! 

I'll also try playing around with -snes_divergence_tolerance. I wonder why it's never been triggered for me -- I've definitely seen this kind of behavior in the past! I remember thinking how useful it would be to set an upper limit on residual values, because I so frequently get huge residuals that are clearly wrong, yet the solver just keeps wasting time chugging away at them for a few more iterations before failing. Turns out there is supposed to be an upper limit. Go figure!

Alexander Lindsay

Sep 28, 2020, 3:31:44 PM
to zapdos...@googlegroups.com
Ok, in MOOSE we disable it by default. To enable it, in your Executioner block, put:

[Executioner]
  nl_div_tol = <something>
[]
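
So for your input it would look something like this (1e4 is just my guess at a reasonable value, alongside the PETSc options you already have):

[Executioner]
  nl_div_tol = 1e4
  petsc_options_iname = '-pc_type -pc_factor_shift_type -pc_factor_shift_amount -snes_stol'
  petsc_options_value = 'lu NONZERO 1.e-10 0'
[]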


Shane Keniley

Sep 28, 2020, 3:42:25 PM
to zapdos-users
...if this infuriating problem that has plagued me for months is fixable with a single line in the input file, I think I might actually start crying. I'm giving it a shot now. 

Incidentally, this issue is also why I include the -snes_stol PETSc option I posted above. It doesn't always happen, but sometimes when the residuals get that high the solve will converge due to STOL. I'm not sure what that means, but whenever it happens the variables end up with huge discontinuities and gradients that quickly cause the problem to fail. This could kill two birds with one stone.

Why is it turned off by default? I guess for some problems it might be beneficial to allow the residual to bounce around a bit? 

Alexander Lindsay

Sep 28, 2020, 3:53:34 PM
to zapdos...@googlegroups.com
On Mon, Sep 28, 2020 at 12:42 PM Shane Keniley <smkfi...@gmail.com> wrote:
> ...if this infuriating problem that has plagued me for months is fixable with a single line in the input file, I think I might actually start crying. I'm giving it a shot now.

This is actually a relatively new option, only added in December 2019. You can see the PR here: https://github.com/idaholab/moose/pull/14154. It actually explains why the check is off by default: tests in BISON fail when it's on. BISON does indeed often rely on the nonlinear residual increasing, somewhat substantially (4-5 orders of magnitude), before eventually converging.

Yes, if you set the divergence tolerance relatively tight (I'm guessing 1e4 would probably do it), I suspect you won't see those stol problems anymore. But we have had enough users complain about this stol issue that we are changing it as we speak! See https://github.com/idaholab/moose/pull/15842

Alex

Shane Keniley

Sep 28, 2020, 7:42:04 PM
to zapdos-users
Yeah, that seems to have done the trick. Thank you very much! It's still very strange to me that it never threw an error on my local machine, but I guess that's just due to a difference in CPU architecture.