Dear MOOSE users,
Recently I have been working on cases that couple Cahn-Hilliard, Allen-Cahn, and thermal conduction. They run perfectly on small 2D and 3D meshes. I then moved to a mesh file of over 400 MB and tried the "Split/Distributed Mesh" approach with 100 processors (2400 MB of memory per process). Unfortunately, I ran into some problems.
Here is the status output I had:
Parallelism:
  Num Processors: 100
  Num Threads: 1
Mesh:
  Parallel Type: distributed
  Mesh Dimension: 3
  Spatial Dimension: 3
  Nodes:
    Total: 2065551
    Local: 22774
  Elems:
    Total: 2000000
    Local: 20006
  Num Subdomains: 1
  Num Partitions: 100
  Partitioner: parmetis
Nonlinear System:
  Num DOFs: 18589959
  Num Local DOFs: 204966
  Variables: { "c" "w" "T" "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" }
  Finite Element Types: "LAGRANGE"
  Approximation Orders: "FIRST"
Auxiliary System:
  Num DOFs: 10065551
  Num Local DOFs: 102798
  Variables: "bnds" { "var_indices" "unique_grains" } { "M" "dM/dT" }
  Finite Element Types: "LAGRANGE" "MONOMIAL" "MONOMIAL"
  Approximation Orders: "FIRST" "CONSTANT" "CONSTANT"
Relationship Managers:
  Geometric : GrainTrackerHaloRM (2 layers)
Execution Information:
  Executioner: Transient
  TimeStepper: IterationAdaptiveDT
  Solver Mode: Preconditioned JFNK
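As a quick sanity check on the numbers above, the DOF totals can be reproduced from the node and element counts. This is only a sketch in Python (not part of MOOSE). It assumes one DOF per node for each first-order Lagrange variable and one DOF per element for each constant monomial variable, and it treats the reported local DOF count as belonging to a single process.

```python
# Figures copied from the status output above.
total_nodes = 2065551
total_elems = 2000000
num_procs = 100
reported_local_dofs = 204966  # "Num Local DOFs" of the nonlinear system

nonlinear_vars = ["c", "w", "T", "gr0", "gr1", "gr2", "gr3", "gr4", "gr5"]

# Assumption: first-order Lagrange gives one DOF per node per variable.
nonlinear_dofs = len(nonlinear_vars) * total_nodes

# Assumption: "bnds" is nodal (first-order Lagrange); the remaining four
# auxiliary variables are constant monomial, i.e. one DOF per element.
aux_dofs = 1 * total_nodes + 4 * total_elems

# Average nonlinear DOFs per process, and how far the reported local
# count sits above that average.
mean_local_dofs = nonlinear_dofs / num_procs
imbalance = reported_local_dofs / mean_local_dofs

print("nonlinear DOFs:", nonlinear_dofs)
print("auxiliary DOFs:", aux_dofs)
print("mean nonlinear DOFs per process:", round(mean_local_dofs))
print("reported local / mean:", round(imbalance, 2))
```

Both totals match the reported 18589959 and 10065551, and the reported local count is roughly 10% above the per-process average.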
Here are the problems I encountered:
1) When I use
petsc_options_iname = '-pc_type -ksp_gmres_restart -sub_ksp_type -sub_pc_type -pc_asm_overlap -pc_factor_mat_solver_package'
petsc_options_value = 'asm 1201 preonly ilu 4 superlu_dist'
then I get the following errors, or the run dumps core directly:
1 Nonlinear |R| = 1.362830e+05
[1537473700.924691] [hpb0315:14918:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473700.996639] [hpb0315:14911:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473701.109700] [hpb0315:14927:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473719.355914] [hpb0313:16349:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473719.460046] [hpb0313:16368:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473719.786913] [hpb0313:16352:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473720.138977] [hpb0313:16369:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473720.255803] [hpb0313:16365:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473720.800708] [hpb0313:16369:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473721.175852] [hpb0313:16366:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473721.181278] [hpb0313:16366:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473722.047582] [hpb0313:16347:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473722.816282] [hpb0313:16367:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473723.035251] [hpb0313:16353:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473723.497716] [hpb0313:16353:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473736.753293] [hpb0313:16364:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473736.854541] [hpb0313:16364:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473737.164661] [hpb0313:16359:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
[1537473739.069100] [hpb0313:16355:0] knem_ep.c:84 UCX ERROR KNEM inline copy failed, err = -1 Invalid argument
0 Linear |R| = 1.362830e+05
When I instead use
petsc_options_iname = '-pc_type -ksp_gmres_restart -sub_pc_type -sub_ksp_type -pc_asm_overlap -pc_factor_mat_solver_package'
petsc_options_value = 'asm 1201 ksp preonly 4 mumps'
the problem is the same.
2) When I use
petsc_options_iname = '-pc_type -pc_factor_mat_solver_package -ksp_gmres_restart'
petsc_options_value = ' ksp mumps 1201'
it runs and converges well at the beginning, but it slows down after a few time steps and eventually stops without any output or log; no core files are generated.
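To make it easier to see which value each PETSc option receives in the three option sets above, here is a small Python sketch (not part of MOOSE; `pair_options` is a hypothetical helper). It assumes the entries of petsc_options_iname and petsc_options_value are matched by position.

```python
def pair_options(inames: str, values: str) -> dict:
    """Pair option names with values by position (assumed matching rule)."""
    names = inames.split()
    vals = values.split()
    if len(names) != len(vals):
        raise ValueError(
            f"{len(names)} option names but {len(vals)} values"
        )
    return dict(zip(names, vals))


# Option strings copied verbatim from the post.
case_1a = pair_options(
    "-pc_type -ksp_gmres_restart -sub_ksp_type -sub_pc_type "
    "-pc_asm_overlap -pc_factor_mat_solver_package",
    "asm 1201 preonly ilu 4 superlu_dist",
)
case_1b = pair_options(
    "-pc_type -ksp_gmres_restart -sub_pc_type -sub_ksp_type "
    "-pc_asm_overlap -pc_factor_mat_solver_package",
    "asm 1201 ksp preonly 4 mumps",
)
case_2 = pair_options(
    "-pc_type -pc_factor_mat_solver_package -ksp_gmres_restart",
    " ksp mumps 1201",
)

for label, opts in [("1a", case_1a), ("1b", case_1b), ("2", case_2)]:
    print(f"case {label}:")
    for name, value in opts.items():
        print(f"  {name} = {value}")
```

Under that assumption, the first set gives `-sub_pc_type` the value `ilu`, the second gives it `ksp`, and the third sets `-pc_type` to `ksp`.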
I tried modifying the parameters and other preconditioning options, but the problem is much the same. I don't know whether I did something wrong or whether there is a PETSc option better suited to problems with such a large mesh. I would appreciate any advice.
Sincerely,
Yang