I’m having an issue with very slow I/O. Each model integration step takes approximately 3-5 seconds (a few take 100-200 s), but the reported time for stream output is consistently over 4000-5000 seconds.
I’m running MPAS v5.1 on 224 cores with the x1.40962 quasi-uniform mesh. The model produces 6-hourly standard and diagnostic output for 90 forecast days.
Libraries (all compiled with GCC):
PIO 1.9.23
MPICH 3.1.3
ZLIB 1.2.8
HDF5 1.8.14
Parallel-netCDF 1.7.0
netCDF-c 4.4.0
netCDF-fortran 4.4.3
Commands to compile MPAS:
make gfortran CORE=init_atmosphere PRECISION=single DEBUG=true
make clean CORE=atmosphere
make gfortran CORE=atmosphere PRECISION=single DEBUG=true
Debug mode is used so that a log file is written for every core. I’m not sure whether this is an issue with MPAS or with the cluster. I’ve removed some output variables to reduce the history file sizes (from ~350 MB to ~180 MB); history files are written every 6 forecast hours. Diagnostic files are 17 MB and are also written every 6 forecast hours. Restart files are written every 15 forecast days. Log files and additional info from a current (ongoing) MPAS run are attached.
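For reference, the 6-hourly cadence is controlled by the output_interval attribute on each output stream in streams.atmosphere. A minimal sketch of the relevant attributes (the filename templates here are assumptions, not copied from the attached file):

```xml
<stream name="output"
        type="output"
        filename_template="history.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="6:00:00" />

<stream name="diagnostics"
        type="output"
        filename_template="diag.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="6:00:00" />
```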
Nick Dawson, Ph.D.
ATMOSPHERIC SCIENTIST
Idaho Power | Power Supply
Work 208-388-5291
Mobile 208-860-0954
Attachments: log.0000.err, namelist.atmosphere, streams.atmosphere, log.0000.out
SUMMARY:
After additional testing, the problem appears to be with MPAS communication between nodes. More details are in the ADDITIONAL INFO section below, but when running MPAS on a single node (28 cores) the performance is very good. When scaled beyond one node, MPAS spends >90% of its time communicating between processes, which slows I/O by roughly four orders of magnitude (0.5 s to 4000-5000 s)!
I/O time for MPAS on 1 node (28 cores; pio_ntasks=1, stride=28): 0.5 seconds
I/O time for MPAS on 2 nodes (56 cores; pio_ntasks=2, stride=28): 4000-5000 seconds
I/O time for MPAS on 8 nodes (224 cores; pio_ntasks=8, stride=28): 4000-5000 seconds
I can run additional tests or provide more specific info (e.g., cluster specs) if you would like. These results surprised me, as WRF v3.9.1 scales much better on the same cluster (though it was built with Intel compilers/math libraries rather than GCC).
ADDITIONAL INFO:
I forgot that I edited the I/O code when originally compiling MPAS, due to a floating-point exception that occurred when using parallel-netCDF. To work around this error, I edited src/framework/mpas_io.F to force I/O to use netCDF. The suggestion is located in this thread: https://groups.google.com/forum/?fromgroups#!searchin/mpas-atmosphere-help/segmentation$20fault%7Csort:relevance/mpas-atmosphere-help/_QhRwmDjPok/JKjtOIQ6ZgkJ
Specifically, lines 260 and 263 of src/framework/mpas_io.F (MPAS v5.1) were changed from "pio_iotype = PIO_iotype_pnetcdf" to "pio_iotype = PIO_iotype_netcdf". After this change, MPAS runs without error. It also means that MPAS was already using standard netCDF for I/O in the model runs above.
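For clarity, the edit amounts to the following one-line change (a sketch; the surrounding code in mpas_io.F is omitted):

```fortran
! src/framework/mpas_io.F (MPAS v5.1), around lines 260 and 263
! before:  pio_iotype = PIO_iotype_pnetcdf
pio_iotype = PIO_iotype_netcdf   ! force serial netCDF through PIO
```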
I changed the config_pio_num_iotasks to the number of nodes (=8) and the config_pio_stride to the total number of cores per node (=28) following the advice here: https://groups.google.com/forum/?fromgroups#!category-topic/mpas-atmosphere-help/compilation/KhBbc8pPn1c
Turning debugging off and setting config_pio_num_iotasks and config_pio_stride did not change I/O times. The issue appears only when running MPAS on more than one node.
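For reference, these options live in the &io group of namelist.atmosphere; the 8-node configuration described above corresponds to (a sketch):

```fortran
&io
    config_pio_num_iotasks = 8   ! one I/O task per node
    config_pio_stride      = 28  ! cores per node
/
```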
Our cluster admins said that ~90% of the software activity consists of system calls when more than one node is used. In the admins’ experience, this happens when standard WRF (not MPAS) is run on too many cores and spends all of its time communicating between processes.
Dom,
After our admins monitored the cluster logs while MPAS was running, they discovered a thread on one of the nodes that was essentially dead, with only intermittent communication, which greatly slowed I/O (halting it at times). The issue was traced to a network driver fault, and the driver is being replaced. I will repeat the tests after the replacement and reply with updated results.
I’m pretty sure that when using multiple nodes, my jobs kept landing on the faulty node, and that was causing the I/O problems. Tests after the repairs should quickly confirm or rule out this assumption.
I wasn’t sure if the problem was with MPAS software, cluster hardware/software, or something I messed up. Hopefully you haven’t had to invest too much time into the investigation of the problem since it appears the issue was with the cluster. Thank you for looking into this issue!
Dom,
First, some info on our system (cluster name is r2) for each compute node.
Motherboard: Dell PE R630
CPU: Dual Intel Xeon E5-2680 v4 14 core 2.4GHz
Memory: 192 GB
Ethernet: Quad Port GigE
Infiniband: Mellanox ConnectX-3 VPI FDR, QSFP+, 40/56 Gb/s
MPAS (and all libraries) compiled with GNU compilers. Library information is in my first email.
SUMMARY OF TESTS:
The best performance came from 2 nodes with the iotasks and stride namelist options set to 1 and 28, respectively. Tests with larger node counts performed worse. In general, the 1/28 PIO combination performed best within each set of node tests; in other words, setting iotasks=1 and stride = cores per node worked best. Possible slow-down suspects: the PIO library, running GNU-compiled software on Intel CPUs, and using netCDF instead of parallel-netCDF.
ADDITIONAL INFORMATION:
The issue we had with the Infiniband driver is now fixed. This has improved performance, but the PIO settings in the namelist can still change performance quite a bit. Attached are results from a few tests (all times in seconds) of a 7-day forecast. Integration times and total times are taken from the log.0000.out files, and I/O times (for the history stream) are estimated from the log.0000.err file. Note: highlighted tests in the attached Excel file are estimates, because I don’t have time to let them run to completion today and won’t be back in the office until late next week.
Info about the model runs:
MPAS running in single-precision mode, debug=off
120-km quasi-uniform mesh (converted to single precision)
Initialized with GDAS data from 12Z 2017/11/30 (f00)
Mesoscale reference physics
netCDF used for I/O (parallel-netCDF causes floating-point errors)
Diagnostic and history output generated every 6 forecast hours
Jobs submitted via Slurm with the run command "mpirun -n ## /path/atmosphere_model", where ## is the number of cores
Stream output lists also attached (subset of master lists)