Floating Point Exception


Shane Keniley

Apr 6, 2020, 9:05:37 PM
to zapdos-users
Hey all,

I'm seeing strange behavior with my AD kernels that I hadn't noticed before: MPI_Abort and floating point exception errors. I know these are generally caused by infinities/NaNs, and indeed, when I run a problematic input file through a debugger I get the following backtrace (only showing the 10 most recent frames):

#0  0x00007ffff75df249 in libMesh::TypeVector<double>::operator*<double> (this=0x7fffffff9aa0, factor=@0x7fffffff9ab8: inf)
    at /home/shane/projects/moose/scripts/../libmesh/installed/include/libmesh/type_vector.h:790
#1  0x00007ffff75def4e in libMesh::operator*<double, double> (factor=@0x7fffffff9ab8: inf, v=...)
    at /home/shane/projects/moose/scripts/../libmesh/installed/include/libmesh/type_vector.h:804
#2  0x00007ffff76dc6be in ADEFieldAdvection<(ComputeStage)0>::computeQpResidual (this=0x2dad410)
    at /home/shane/projects/zapdos/src/kernels/ADEFieldAdvection.C:44
#3  0x00007ffff557f195 in ADKernelTempl<double, (ComputeStage)0>::computeResidual (this=0x2dad410)
    at /home/shane/projects/moose/framework/src/kernels/ADKernel.C:126
#4  0x00007ffff533a775 in ComputeResidualThread::onElement (this=0x7fffffffa080, elem=0x1f67ef0)
    at /home/shane/projects/moose/framework/src/loops/ComputeResidualThread.C:120
#5  0x00007ffff53c824c in ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator() (this=0x7fffffffa080, range=..., bypass_threading=false)
    at /home/shane/projects/moose/framework/build/header_symlinks/ThreadedElementLoopBase.h:209
#6  0x00007ffff4cfa120 in libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeResidualThread> (range=..., body=...)
    at /home/shane/projects/moose/scripts/../libmesh/installed/include/libmesh/threads_pthread.h:380
#7  0x00007ffff4cbf640 in NonlinearSystemBase::computeResidualInternal (this=0xf47010, tags=...)
    at /home/shane/projects/moose/framework/src/systems/NonlinearSystemBase.C:1396
#8  0x00007ffff4cbb0ce in NonlinearSystemBase::computeResidualTags (this=0xf47010, tags=...)
    at /home/shane/projects/moose/framework/src/systems/NonlinearSystemBase.C:692
#9  0x00007ffff57f804f in FEProblemBase::computeResidualTags (this=0x1f96010, tags=...)
    at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:5251




I see that in ADEFieldAdvection my _u[_qp] value is somewhere around ~1000, and since we take the exponential of this value, we get a staggeringly large number. I'm assuming that's the source of the infinity, since I'm getting the FPE in the residual calculation. But why would this throw an error instead of just aborting the solve and cutting the timestep?

I only see this in AD kernels, and only occasionally - obviously the tests I submitted in my last PR ran without this issue, for example. I wouldn't have submitted that PR if I had found this before! Since this problem exists in something that's currently on the main branch, I'd like to figure this out and submit a fix ASAP. I'm assuming this is an issue with my AD implementation rather than a bug in MOOSE, but I don't know how to track down the core issue.

I've pushed the problematic input file to my drift_diffusion branch here. The input file is argon_csv_bc.i. This particular error occurred at Timestep 51. I tried both a finer and a coarser mesh, but the error occurred at the same timestep regardless of refinement. Interestingly, I also saw this yesterday on a test I submitted to Falcon1 with a completely different input file and mesh. That was odd because I ran that same test on both my Ubuntu workstation and my Mac laptop without any issues.

Any help or advice would be appreciated! 

-Shane

Alexander Lindsay

Apr 7, 2020, 1:24:42 PM
to zapdos...@googlegroups.com
I can give this a more detailed look soon, but in the meantime, what is the actual error printed on the screen when the simulation aborts? Is it a message from MOOSE, PETSc, or something else?


Shane Keniley

Apr 7, 2020, 2:23:17 PM
to zapdos-users
When I run through zapdos-dbg (just the executable, not with a debugger), I get: 

Floating point exception signaled (invalid floating point operation)!

To track this down, compile in debug mode, then in gdb do:
    break libmesh_handleFPE
    run
    ...
    bt

[1] /home/dev/projects/moose/scripts/../libmesh/src/base/libmesh.C, line 138, compiled Feb 13 2020 at 15:51:15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

Interestingly, I don't get this through zapdos-opt anymore... I don't recall changing anything. This is weird. Still, this is the same error I get on Falcon (which definitely occurs in opt mode, not just with dbg).

Alexander Lindsay

Apr 7, 2020, 5:23:22 PM
to zapdos...@googlegroups.com
In debug, this should indeed abort. In opt, it should not.


Shane Keniley

Apr 7, 2020, 10:57:29 PM
to zapdos-users
Alright, that makes sense. Looks like I'm having two completely separate issues here and I got them confused, so I do still have an issue that should be addressed! I started debugging because my simulation on Falcon threw an error, so I evidently got my wires crossed and thought this MPI_Abort error was the same error I was seeing on Falcon. They're different! I am definitely still having my simulation terminate (in opt mode) when run on Falcon. The error I'm getting there is: 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1075 RUNNING AT r2i2n0.ib0.ice.inl.gov
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================


I'm guessing that's not a memory issue since it's not EXIT CODE: 9. I've attached the PBS script I'm using to this message in case I'm just doing something incorrectly there. (I basically copied the syntax from an example Mastodon PBS script on the MOOSE website.) I tried debugging this (hence the stack trace above), but the MPI_Abort I was concerned about above breaks the simulation before I can get to this error. If I try continuing past the MPI_Abort I get a stack dump and this message:

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor
[Inferior 1 (process 19413) exited with code 01]


(attached: zapdos_test.sh)

Alexander Lindsay

Apr 8, 2020, 12:07:11 AM
to zapdos...@googlegroups.com
How many processes are you using? How many degrees of freedom? You say that you've only encountered this issue with AD...? Did/do you have an equivalent hand-coded Jacobian kernel? If your Jacobian is accurate, conceptually you should be running into the exact same kind of floating point exceptions when running with AD and with the hand-coded equivalent... If we're not seeing that, then there's a bug somewhere...


Shane Keniley

Apr 10, 2020, 12:58:47 PM
to zapdos-users
Sorry for the late response - I keep getting different errors, so I wanted to take some time to run these tests systematically and make sure I wasn't making much ado about nothing. 

So here's the summary. I have two different input files I've been working on, one with argon and one with dry air. The latter is obviously bigger: it has 36 species, ~600 reactions, and 9165 total degrees of freedom, and is 1D. (I realize that it's not quite large enough to require parallelization, since MOOSE generally recommends ~10k local DOFs per process. I've only been running it with 4-8 MPI processes.) Right now, my air discharge input file (air_plasma_dry_lu) fails at timestep 82 on Falcon1 with the error message:

ERROR: LU factorization failed with info=18

I have pc_factor_shift_type set to NONZERO, so this error surprised me. On my local machine the exact same input file runs perfectly fine so I don't think this is an issue with AD as I previously thought. Maybe it's just a parallelism issue? 

(I know LU is not scalable, but since I'm running with so few MPI processes I assumed it would be fine. I have not found any other combination of preconditioners that works with this problem - asm runs with tons of linear iterations and gets stuck at ~5-10 ps timesteps indefinitely, and hypre/boomeramg is terrible for advection-dominated problems in my experience. LU converges beautifully. Haven't tried field splitting this yet.)
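For reference, a minimal sketch of the kind of solver settings being described here (illustrative only, not the exact input file):

[Executioner]
  # parallel LU with a shift to guard against zero pivots
  petsc_options_iname = '-pc_type -pc_factor_shift_type'
  petsc_options_value = 'lu NONZERO'
[]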

The argon input file is much smaller. It has only 4 species and 1094 total DOFs. It fails with the same error, but with a different info number (not sure what that means):

ERROR: LU factorization failed with info=1

This input file also runs perfectly fine on my local workstation. 

Running with hand-coded Jacobians is a good idea. I'll do that and report back. I previously said that this only occurred with AD because I was getting errors on both the HPC machine and my local workstation, and I hadn't seen those errors locally before. But I realized that was an input file error on my part. I haven't run these particular input files without AD yet.

Shane Keniley

Apr 10, 2020, 6:28:01 PM
to zapdos-users
Update: I can confirm that this isn't an AD issue, so that's good at least. I ran the same argon input file with no AD and I still ran into the same error: 

ERROR: LU factorization failed with info=1

The input file again runs perfectly fine on my own machine. This only occurs on Falcon. Not really sure what to do about this.

Alexander Lindsay

Apr 10, 2020, 8:18:31 PM
to zapdos...@googlegroups.com
I hope this weekend I can do more than send emails from my phone, but something quick you could try is -pc_factor_mat_solver_type superlu_dist if you haven't tried it yet. I believe the default parallel LU type is MUMPS, which is relatively memory intensive (with the trade-off of being faster). Without digging deeper I don't know whether those errors you're seeing might be memory errors...
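Something along these lines in the input file (a sketch; keep whatever other options you already have set):

[Executioner]
  # switch the parallel direct solver backing LU to SuperLU_DIST
  petsc_options_iname = '-pc_type -pc_factor_mat_solver_type'
  petsc_options_value = 'lu superlu_dist'
[]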


Alexander Lindsay

Apr 10, 2020, 8:19:26 PM
to zapdos...@googlegroups.com
Also... do you know whether your Jacobian is accurate? Have you tried scaling, either manual or automatic? I ask because of your comment about how poorly asm behaves.
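(One quick way to spot-check a hand-coded Jacobian on a small mesh is PETSc's Jacobian test option; a minimal sketch, assuming a reasonably recent PETSc:)

[Executioner]
  # compares the coded Jacobian against a finite-difference approximation at each solve
  petsc_options = '-snes_test_jacobian'
[]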


Shane Keniley

Apr 11, 2020, 12:14:22 PM
to zapdos-users
I am reasonably certain my Jacobian was accurate, or at least as accurate as it can be. It's always worth a look, though. The only two points of contention are (a) the AD transport properties, which are linear interpolations, and I'm not 100% certain I applied the chain rule correctly (although it's clearly correct enough to run); and (b) the reaction network. I only converted EEDF-related reactions to AD. Again, the rest are clearly correct enough because I've been using them for well over a year now with no issues, but I can always convert those to AD just to make sure they're perfect. The asm issue is something I'll investigate more today.

> I hope this weekend I can do more than send emails from my phone, but something quick you could try is -pc_factor_mat_solver_type superlu_dist if you haven't tried it yet. I believe the default parallel LU type is MUMPS, which is relatively memory intensive (with the trade-off of being faster). Without digging deeper I don't know whether those errors you're seeing might be memory errors...
Huh, this worked! The argon input file ran perfectly. I noticed that the tests always fail whenever the nonlinear residual gets unreasonably large:

Time Step 101, time = 1.41303e-06, dt = 5.04911e-08
 0 Nonlinear |R| = 1.103421e+03
      0 Linear |R| = 1.103421e+03
      1 Linear |R| = 3.000056e-08
 1 Nonlinear |R| = 2.764981e+03
      0 Linear |R| = 2.764981e+03
      1 Linear |R| = 6.464790e-09
 2 Nonlinear |R| = 3.735332e+75

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14846 RUNNING AT r2i3n0.ib0.ice.inl.gov
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================


On my own computer this just triggers the DIVERGED_FNORM_NAN error and cuts the timestep without issue, but maybe that overflow causes a memory issue somewhere. 

Alexander Lindsay

Apr 11, 2020, 1:29:46 PM
to zapdos...@googlegroups.com
PJFNK or NEWTON? If the latter, then the accuracy of your Jacobian wouldn't really affect the convergence of the linear solve. In both cases scaling will be important for the linear solve.


Shane Keniley

Apr 11, 2020, 1:43:25 PM
to zapdos-users
Newton. I've been using automatic scaling along with compute_scaling_once = false. 
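For reference, that corresponds to roughly this in the Executioner block (a sketch, not the exact input file):

[Executioner]
  type = Transient
  solve_type = NEWTON
  automatic_scaling = true
  compute_scaling_once = false
[]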

Shane Keniley

Apr 11, 2020, 3:01:10 PM
to zapdos-users
Also, I recently ran a problem with the following conditions: 

petsc_options_iname = '-pc_type -pc_asm_overlap -sub_pc_type -sub_pc_factor_shift_type -sub_pc_factor_shift_amount'
petsc_options_value = 'asm 4 ilu NONZERO 1e-10'

And it ran much better than previously, so I believe there were other issues at work when it got stuck at tiny timesteps. It still struggled a bit, but it got very close to steady state before the clock time I requested ran out.

How many linear iterations do people typically allow? I have it set to 20 by default, which doesn't seem to be enough because it always goes to 20 linear iterations and the relative tolerance only goes down by ~1-2 orders of magnitude. Near the end of the simulation it was struggling and took over 20 nonlinear iterations to converge every timestep. I can help that a bit by relaxing the absolute tolerance since I think it's just struggling to get the residual down near steady state, but I don't know if that's the only problem. 

Alexander Lindsay

Apr 11, 2020, 3:10:37 PM
to zapdos...@googlegroups.com
20 linear iterations is probably too few unless you're running a really small problem. I'd try a maximum linear iteration count of 200, set "-ksp_gmres_restart 200", and see how it does.
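In input-file terms that would be something like this (a sketch; merge with whatever PETSc options are already set):

[Executioner]
  # allow up to 200 linear iterations and restart GMRES only at 200
  l_max_its = 200
  petsc_options_iname = '-ksp_gmres_restart'
  petsc_options_value = '200'
[]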


Shane Keniley

Apr 18, 2020, 11:16:49 PM
to zapdos-users
Okay, I apologize for the wild goose chase here. I don't know if all of these errors I've been getting are completely separate or if they all have the same root cause, but at the very least I now have an input file that repeatably crashes on the cluster, my desktop, and my Mac with the error:

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2


This is in opt mode, not dbg, and it is only in parallel. It runs fine serially. 

Running with the dbg executable returns nothing, which was odd. Usually that at least provides me with a little more information about the error, but no; just the same exact MPI_Abort message with no further information. I can't figure out how to run gdb in parallel, so that was no use either. After some tedious backtracking and endless print statements I finally found a culprit: ADGasElectronMoments. This error is thrown when this line is reached:

_d_diffem_d_actual_mean_en[_qp] =
    _diff_interpolation.sampleDerivative(std::exp(_mean_en[_qp] - _em[_qp])) * _time_units;

Sure enough, the value of std::exp(_mean_en[_qp] - _em[_qp]) is "-nan", which obviously cannot be used as an interpolation value. I'm curious why this doesn't just cause the solve to fail instead of crashing. Interestingly, when I change "_d_diffem_d_actual_mean_en" and the following three derivative terms to a SplineInterpolation, the error isn't thrown. It is instead thrown a few lines later:

_diffem[_qp].value() =
    _diff_interpolation2->sample(std::exp(_mean_en[_qp].value() - _em[_qp].value())) * _time_units;

which is the next linear interpolation. (And it is necessarily a linear interpolation since it's an ADMaterialProperty.) 

To investigate this I modified ADGasElectronMoments so there were no duplicate material names, and then I turned both ADGasElectronMoments and GasElectronMoments on at the same time and ran with the non-AD kernels and BCs. In this case the ADGasElectronMoments material wasn't actually doing anything, but it was still being calculated so the error was thrown. When the AD Material is off the error is never thrown, even if I use LinearInterpolations instead of splines in GasElectronMoments. To make things stranger, if I turn the AD material on and use LinearInterpolations for _diffem and _muem in the GasElectronMoments material (the hand-coded one, not the AD one) I don't get the error. Headscratcher.  

Regardless, there's clearly something going on with the AD implementation and linear interpolations. I've uploaded the problematic file to a branch: https://github.com/keniley1/zapdos/tree/ad_error (the input file is argon_dbd_broken_manual.i). The ADGasElectronMoments material is only included in order to trigger the error; all of the BCs and Kernels use hand-coded Jacobians. I'm not sure if I'm doing something wrong that's causing this error or if this is just an issue with LinearInterpolations, so I don't know how to go about fixing it. Right now the only ways I can think of to work around linear interpolations are to (a) use hand-coded Jacobians only for the time being, or (b) use some kind of fitted functional form for the transport and rate/Townsend coefficients instead of interpolating tabulated values.

Alexander Lindsay

Apr 19, 2020, 4:03:25 PM
to zapdos...@googlegroups.com
There is a parallel debugging section here: https://mooseframework.org/application_development/debugging.html


Shane Keniley

Apr 19, 2020, 4:09:37 PM
to zapdos-users
Yeah, I tried following that. None of the steps in the "Actually parallel debugging" section worked with gdb for me. For example, if I ran with 4 processes I did get four windows to open, but the debugger ran on its own in the main window. When this error was caught the debugger terminated the program immediately. Setting breakpoints in each of the four windows didn't work. 

I'm sure I'm doing something wrong there, but for the moment I already found the error. I'm just trying to figure out how to fix it now. 

Alexander Lindsay

Apr 19, 2020, 4:20:43 PM
to zapdos...@googlegroups.com

I thought there were still some things that were unknown based on your head scratcher comments?

It would be really nice to know the full stack trace including the MPI_Abort, since, as you say, in opt mode a NaN shouldn't necessarily be a fatal error.


Shane Keniley

Apr 19, 2020, 5:31:22 PM
to zapdos-users

Ha, yes, that was it! I didn't notice - the debugger always spews out a bunch of information right when it runs so I just missed that! 

Okay, here's the stack trace. Timestep 147, run with mpiexec -n 4: 

#0  0x00007fa873e19ca0 in PMPI_Abort ()
   from /opt/moose/mpich-3.3/gcc-9.2.0/lib/libmpi.so.12
#1  0x00007fa87a1a93f5 in libMesh::libmesh_terminate_handler ()
    at /home/shane/projects/moose/scripts/../libmesh/src/base/libmesh.C:321
#2  0x00007fa872bcccf6 in __cxxabiv1::__terminate(void (*)()) ()
    at ../../../../gcc-9.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#3  0x00007fa872bccd41 in std::terminate ()
    at ../../../../gcc-9.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#4  0x00007fa872bccf74 in __cxxabiv1::__cxa_throw (obj=<optimized out>, 
    tinfo=0x7fa872ef1308 <typeinfo for std::out_of_range>, 
    dest=0x7fa872be2150 <std::out_of_range::~out_of_range()>)
    at ../../../../gcc-9.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#5  0x00007fa881119bc0 in LinearInterpolationTempl<double>::sampleDerivative (
    this=0x2f35298, x=@0x7ffff9eff298: -nan(0x8000000000000))
    at /home/shane/projects/moose/framework/src/utils/LinearInterpolation.C:82
#6  0x00007fa883e7ea14 in ADGasElectronMoments::computeQpProperties (
    this=0x2f34010)
    at /home/shane/projects/zapdos/src/materials/ADGasElectronMoments.C:106
#7  0x00007fa880d21951 in Material::computeProperties (this=0x2f34010)
    at /home/shane/projects/moose/framework/src/materials/Material.C:106
#8  0x00007fa880d23dc1 in MaterialData::reinit (this=0x2b70b50, 
    mats=std::__debug::vector of length 12, capacity 16 = {...})
    at /home/shane/projects/moose/framework/src/materials/MaterialData.C:73
#9  0x00007fa880a8ff04 in FEProblemBase::reinitMaterials (this=0x2514010, blk_id=1, tid=0, swap_stateful=true)
    at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:2985
#10 0x00007fa88060b7c0 in ComputeElemAuxVarsThread<AuxKernelTempl<double> >::onElement (this=0x7ffff9f01120, elem=0x2bf1760)
    at /home/shane/projects/moose/framework/src/loops/ComputeElemAuxVarsThread.C:109
#11 0x00007fa88068e1b8 in ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator() (this=0x7ffff9f01120, range=..., bypass_threading=false)
    at /home/shane/projects/moose/framework/build/header_symlinks/ThreadedElementLoopBase.h:209
#12 0x00007fa87ffdddd0 in libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeElemAuxVarsThread<AuxKernelTempl<double> > > (range=..., body=...)
    at /home/shane/projects/moose/scripts/../libmesh/installed/include/libmesh/threads_pthread.h:380
#13 0x00007fa87ff9c423 in AuxiliarySystem::computeElementalVarsHelper<AuxKernelTempl<double> > (this=0x2b86010, warehouse=..., 
    vars=std::__debug::vector of length 1, capacity 1 = {...}, timer=269)
    at /home/shane/projects/moose/framework/src/systems/AuxiliarySystem.C:658
#14 0x00007fa87ff73ea6 in AuxiliarySystem::computeElementalVars (this=0x2b86010, type=...)
    at /home/shane/projects/moose/framework/src/systems/AuxiliarySystem.C:590
#15 0x00007fa87ff724d6 in AuxiliarySystem::compute (this=0x2b86010, type=...)
    at /home/shane/projects/moose/framework/src/systems/AuxiliarySystem.C:376
#16 0x00007fa880a9e10b in FEProblemBase::computeResidualTags (this=0x2514010, tags=std::__debug::set with 3 elements = {...})
    at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:5125
#17 0x00007fa880a9d7f3 in FEProblemBase::computeResidualInternal (this=0x2514010, soln=..., residual=..., 
    tags=std::__debug::set with 3 elements = {...}) at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:4988
#18 0x00007fa880a9d504 in FEProblemBase::computeResidual (this=0x2514010, soln=..., residual=...)
    at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:4943
#19 0x00007fa880a9d31d in FEProblemBase::computeResidualSys (this=0x2514010, soln=..., residual=...)
    at /home/shane/projects/moose/framework/src/problems/FEProblemBase.C:4918
#20 0x00007fa87ff7434e in ComputeResidualFunctor::residual (this=0x1f73c10, soln=..., residual=..., sys=...)
    at /home/shane/projects/moose/framework/src/systems/ComputeResidualFunctor.C:23


Alexander Lindsay

Apr 20, 2020, 10:58:46 AM
to zapdos...@googlegroups.com
Ok, below is the code for LinearInterpolation::sampleDerivative. The issue is that with the `-nan`, we're hitting the `std::out_of_range`. In MOOSE we catch and handle MooseExceptions and some libMesh exceptions, but we don't catch and handle exceptions from the standard library. Maybe we should be able to catch and handle some; hard to say.

template <typename T>
T
LinearInterpolationTempl<T>::sampleDerivative(const T & x) const
{
  // endpoint cases
  if (x < _x[0])
    return 0.0;
  if (x >= _x[_x.size() - 1])
    return 0.0;

  for (unsigned int i = 0; i + 1 < _x.size(); ++i)
    if (x >= _x[i] && x < _x[i + 1])
      return (_y[i + 1] - _y[i]) / (_x[i + 1] - _x[i]);

  throw std::out_of_range("Unreachable");
  return 0;
}

What *you* could do is wrap the possibly dangerous ADGasElectronMoments code in a try-catch block:

try
{
  // e.g. the interpolation call that produces the out_of_range:
  _d_diffem_d_actual_mean_en[_qp] =
      _diff_interpolation.sampleDerivative(std::exp(_mean_en[_qp] - _em[_qp])) * _time_units;
}
catch (std::out_of_range &)
{
  throw MooseException("We went out of range! Cut the timestep!");
}

As to why you hit this with an AD version and not a hand-coded version... it's possible that NaN comparisons with Reals and NaN comparisons with ADReals shake out differently, e.g. maybe -NaN < _x[0] evaluates to true for Reals but false for ADReals (it's hard for me to imagine, but it's possible). I would say we should definitely try to make them consistent, but per https://stackoverflow.com/a/38798823/4493669: "C/C++ does not require specific floating-point representation and does not require that any comparison against NaN is false." So different compilers could probably implement different behaviors. In summary, I think the best thing for you to do is to wrap those material property calculations in a try-catch block and catch std::out_of_range exceptions when they happen.

One final option I thought of: you could always use regular LinearInterpolations, and then when interpolating in `computeQpProperties`, use MetaPhysicL::raw_value on the argument to the interpolator. MetaPhysicL::raw_value converts a dual number to its strictly Real representation (i.e. dropping the derivatives, which you don't use in the interpolator anyway):

  _d_diffem_d_actual_mean_en[_qp] =
      _diff_interpolation.sampleDerivative(MetaPhysicL::raw_value(std::exp(_mean_en[_qp] - _em[_qp]))) * _time_units;
