I can run the 5-spot and CO2 problems in shortcourse/examples on one node with no problems. When I increase the number of nodes without altering the input file, the example still runs to completion, but the output is interspersed with:
(... similar errors removed ...)
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5Dio.c line 228 in H5Dwrite(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5D.c line 391 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5D.c line 141 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5Dio.c line 228 in H5Dwrite(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5D.c line 391 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5G.c line 766 in H5Gclose(): not a group
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
#000: H5F.c line 1991 in H5Fclose(): decrementing file ID failed
major: Object atom
minor: Unable to close file
#001: H5I.c line 1450 in H5I_dec_app_ref(): can't decrement ID ref count
major: Object atom
minor: Unable to decrement reference count
#002: H5F.c line 1767 in H5F_close(): can't close file, there are objects still open
major: File accessability
minor: Unable to close file
HDF5: infinite loop closing library
D,G,S,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F
1 fnrm: 3.32E-12 xnrm: 1.59E+09 pnrm: 2.47E+02 inrmr: 1.80E-12 inrmu: 1.48E+02 rsn: 0
2 fnrm: 9.59E-14 xnrm: 1.59E+09 pnrm: 2.33E+01 inrmr: 6.76E-14 inrmu: 2.32E+01 rsn: 0
3 fnrm: 1.34E-16 xnrm: 1.59E+09 pnrm: 1.94E-01 inrmr: 7.55E-18 inrmu: 1.71E-01 rsn: stol
If I run 100_100_100 in pflotran-dev/example_problems on one node, I have no problems, but if I increase the number of nodes to two, I get this:
srun: First task exited 30s ago
srun: tasks 16-19,21-22,24-29,31: running
srun: tasks 0-15,20,23,30: exited
srun: Terminating job step 1002177.0
slurmd[venus2]: *** STEP 1002177.0 KILLED AT 2012-10-24T12:07:54 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[venus2]: *** STEP 1002177.0 KILLED AT 2012-10-24T12:07:54 WITH SIGNAL 9 ***
I'm not sure what the difference is between the problems in shortcourse and the one in example_problems. I believe we alter the NXYZ line when we want to change the grid resolution, but how do we know which values to use so that we don't run into errors for some grid sizes?