NaN errors in passive scalars


archive-bwl003 account

Oct 2, 2023, 3:52:43 AM10/2/23
to Nek5000
Hi Neks,

I'm attempting to run a case on a cluster. It went well initially, but after I changed a parameter and re-ran the job, I suddenly got NaN errors in my passive scalars. Please see below for part of my logfile.

I changed the single parameter back and ran makenek to recompile, but ended up getting the errors again. I thought something had gone wrong during compilation, so I removed the entire folder, recompiled, and re-ran the case, but the errors persisted.

I have not seen this on my personal computer before. In fact, I copied the exact same folder that produced the errors to my personal computer, and there were no NaN errors when running the case there. That suggests it's not a physical problem or a bug in my code.

I actually saw this problem a few weeks ago and was able to fix it by changing "lelt=lelg/lpmin + 3" to "lelt=lelg/lpmin + 10". This time, increasing the constant further, to say 20, did not help.
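For readers unfamiliar with Nek5000's static memory sizing: lelt is set in the case's SIZE file and bounds the number of elements per MPI rank, so it must be at least lelg/lpmin with some headroom for uneven partitions. A minimal sketch (parameter names lelg, lpmin, lelt are the standard ones; the numeric values here are made-up examples, not Brandon's):

```fortran
c     SIZE file sketch (illustrative values only)
      parameter (lelg=4096)            ! total number of elements in the mesh
      parameter (lpmin=16)             ! minimum number of MPI ranks
      parameter (lelt=lelg/lpmin + 3)  ! max elements per rank, with headroom
```

Since these are compile-time parameters, any change to SIZE forces a full recompile of the case.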

Has anyone seen this problem before? Did the compiler fail somewhere, or did my source code get corrupted between jobs? I'd appreciate it if someone could point me in the right direction for troubleshooting this.

Thanks,
Brandon

Step      6, t= 3.0000000E-03, DT= 5.0000000E-04, C=  0.509 1.0257E+00 1.8815E-01
             Solving for Hmholtz scalars
          6  Hmholtz TEMP       4   3.7846E-07   1.8465E-05   1.0000E-06
          6  Hmholtz PS 1      20   9.8061E-07   2.0766E-02   1.0000E-06
          6  Error Hmholtz PS 2     200          NaN          NaN   1.0000E-06
          6  Error Hmholtz PS 3     200          NaN          NaN   1.0000E-06
          6  Error Hmholtz PS 4     200          NaN          NaN   1.0000E-06
          6  Scalars done  3.0000E-03  8.0898E-02
             Solving for fluid
          6  Project PRES           2.9466E+02   1.3712E+02   4.6535E-01   1   8
          6  PRES gmres       200   4.1340E-01   1.0981E+00   1.0000E-04   2.6536E-02   7.6057E-02    F
          6  Helmh3 fluid      35   6.4234E-07   1.4000E+00   1.0000E-06
             L1/L2 DIV(V)          -3.9036E-06   7.9104E-03
             L1/L2 QTL             -5.8284E-09   7.4198E-03
             L1/L2 DIV(V)-QTL      -3.8978E-06   2.6779E-03
          6  Fluid done  3.0000E-03  9.1479E-02
             Dan:            1.7000E+01

YuHsiang Lan

Oct 4, 2023, 12:23:37 AM10/4/23
to Nek5000
Hi Brandon,

I doubt that changing lelt between "lelt=lelg/lpmin + 3" and "lelt=lelg/lpmin + 10" would make such a difference.
Here are some thoughts that could help dig into the issue.

You mentioned you copied the folder from one machine to another. Did you recompile the code after that?
In fact, changing lelt forces a recompile of the whole case, which is why I think your trick worked that one time.
If you are moving to a new machine, have you confirmed Nek5000 runs fine there? (Running one of the examples under the short tests would be a good check.)
Also, double-check that makenek points to the correct MPI compiler; by default it adjusts the compiler flags for different environments.

Compilers can treat uninitialized numbers differently. For example, some environments zero-initialize variables on declaration, and if you then happen to divide by such a variable, you can get Infs or NaNs.

Second, you should try to figure out where the first NaN is produced.
What happens during the first five timesteps? Does the NaN first show up at the sixth one? Are the results fully converged for the first few timesteps (check the iteration counts for PS 2 - PS 4)? Can you compare the working logfile against the failing one?
You can also add the flag FFLAGS="-ffpe-trap=invalid" to makenek; "-Wall" and "-g" help too.
Maybe try a debugger like gdb as well.
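For concreteness, the flags above go into the makenek script as a sketch like the following (FFLAGS is the standard makenek variable; the specific flags assume gfortran, so adjust for your compiler):

```shell
# In makenek (gfortran assumed): trap invalid floating-point operations so
# the run aborts at the first NaN instead of silently propagating it, and
# keep warnings and debug symbols for use with gdb.
FFLAGS="-ffpe-trap=invalid -Wall -g"
```

With the trap enabled, the job dies with a floating-point exception at the first invalid operation, and a debugger or core dump can then show the offending line.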

Hope this helps,
Yu-Hsiang
--

archive-bwl003 account

Oct 4, 2023, 2:56:53 AM10/4/23
to Nek5000
Hi Yu-Hsiang,

Thank you for the prompt response. I thought I had replied to this thread already, but apparently it didn't post for everyone. Here was my response:
I figured out a temporary solution: I had to set lpmin to 16, and it has worked for a few cases so far. I'm not sure why, but I did generate my restart file from a 16-core run. Setting lpmin to a value above 16 doesn't seem to work.

In response to your message:
I did recompile the code when moving the folder from one machine to another (the second machine has 16 cores, compared to 256 on the first). It worked on the second machine with 16 cores and lpmin=16. As mentioned above, this might be related to my restart file having been generated by a 16-core run.
I confirmed a couple of the short tests are working fine.
I did check the mpi compiler and I think it works.
I already get NaNs during the very first timestep, so I suspect it is related to the gfldr routine or my restart file.

In summary: I created a restart file on a 16-core machine (with lpmin=16), and I'm now trying to read it with gfldr on a 256-core machine (with lpmin=256). Could that be the issue? Setting lpmin=16 on the 256-core machine seemed to fix the problem.
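For context, gfldr is Nek5000's generic field reader: it interpolates a field file written by another session onto the current mesh, so in principle the source run's processor count should not matter. A sketch of the usage in the case's .usr file (the file name here is hypothetical, and the exact call site varies by case setup):

```fortran
c     .usr file sketch: read initial conditions from a restart file
c     written by a different run. The file name is a hypothetical
c     example; gfldr interpolates onto the current partitioning.
      call gfldr('restart0.f00001')
```

If lpmin affects the outcome, one thing worth checking is whether the SIZE-file arrays sized from lpmin are large enough on the 256-rank run for the buffers gfldr needs while reading.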

Best,
Brandon
