Hey all,
I'm running into very frustrating behavior trying to run Zapdos on INL's HPC system (Falcon1, specifically). On my own computer, running with 8 cores, the simulations run fine, but when I run the exact same simulation on INL's HPC I consistently get this error after about an hour of runtime:
ERROR: LU factorization failed with info=1
This consistently happens whenever the nonlinear residual gets very large, something like 10^51. On my own computer this error never occurs no matter what happens with the residual: if it suddenly gets very large, the solve just fails and the timestep is cut, as expected. On HPC the error is thrown and the simulation crashes instead. I have the following PETSc options (again, these are identical to the ones on my PC, which runs fine):
petsc_options_iname = '-pc_type -pc_factor_shift_type -pc_factor_shift_amount -snes_stol'
petsc_options_value = 'lu NONZERO 1.e-10 0'
(I know LU isn't scalable, but I'm only running with 8-12 cores.)
I can avoid the issue by using -pc_factor_mat_solver_type superlu_dist, but this is unbearably slow. A simulation that takes ~24 hours on my computer with no mat solver type selected has been running for several days on HPC with superlu_dist, and it's still nowhere near done. I've also tried swapping out LU for ASM as my preconditioner, but at the moment that isn't a workable option: it avoids the factorization error, but the simulation gets stuck with even smaller timesteps than LU with superlu_dist.
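For reference, the superlu_dist variant is just the options above with the solver type appended. This is a sketch rather than a copy of my input file: I'm assuming the options sit in the [Executioner] block (the usual MOOSE location, though they can also live under [Preconditioning]), and everything else in the block is omitted.

```text
[Executioner]
  # other Executioner settings omitted
  petsc_options_iname = '-pc_type -pc_factor_shift_type -pc_factor_shift_amount -snes_stol -pc_factor_mat_solver_type'
  petsc_options_value = 'lu NONZERO 1.e-10 0 superlu_dist'
[]
```

The names and values are matched by position, so the extra name and the extra value both go at the end of their lists.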
I can't track down what causes the LU factorization error, and nobody on the MOOSE user group seems to have run into it. Any advice would be appreciated.
(On a side note, I tried using -pc_factor_mat_solver_type mumps on HPC, but I get the error that mumps isn't installed. I thought mumps was the default?)