slow MPAS i/o


Dawson, Nick

Aug 29, 2017, 10:07:34 AM
to mpas-atmos...@googlegroups.com

I’m having an issue with very slow I/O. Each model integration step takes approximately 3-5 seconds (a few take 100-200 s). However, the timing for stream output is consistently > 4000-5000 seconds.

 

I’m running MPAS v5.1 on 224 cores with the x1.40962 quasi-uniform mesh. The model produces 6-hourly standard and diagnostic output for 90 forecast days.

 

Libraries (everything compiled with gcc):

PIO 1.9.23

MPICH 3.1.3

ZLIB 1.2.8

HDF5 1.8.14

Parallel-netCDF 1.7.0

netCDF-c 4.4.0

netCDF-fortran 4.4.3

 

Commands to compile MPAS:

make gfortran CORE=init_atmosphere PRECISION=single DEBUG=true

make clean CORE=atmosphere

make gfortran CORE=atmosphere PRECISION=single DEBUG=true

 

Debug mode is used so that log files are written for every core. I’m not sure whether this is an issue with MPAS or with the cluster. I’ve removed some of the output variables to reduce the history file sizes (from ~350 MB to ~180 MB); history files are written every 6 forecast hours. Diagnostic files are 17 MB and are also written every 6 forecast hours. Restart files are written every 15 forecast days. Attached are log files and additional info from a current (ongoing) MPAS run.

 

 

Nick Dawson, Ph.D.

ATMOSPHERIC SCIENTIST

Idaho Power | Power Supply

 

Work 208-388-5291

Mobile 208-860-0954

 


log.0000.err
namelist.atmosphere
streams.atmosphere
log.0000.out

Dominikus Heinzeller

Aug 29, 2017, 10:16:39 AM
to Dawson, Nick, mpas-atmos...@googlegroups.com
Hi Nick.

The I/O performance you are reporting is much worse than what one would expect. Since you didn’t specify an I/O format, it will use pnetcdf by default.

Can you try using a different iotype (netcdf4 - should be slower, in principle, but given your timings one never knows) or play with the number of I/O tasks per node, i.e. modify

&io
    config_pio_num_iotasks = 0
    config_pio_stride = 1
/

in the namelist? For example one iotask per node, just for a test.
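For illustration (assuming, say, a machine with 28 cores per node and 224 MPI ranks in total), one I/O task per node would mean something like the following in namelist.atmosphere; the values are placeholders and should match your own node count and cores per node:

```fortran
&io
    config_pio_num_iotasks = 8    ! one I/O task per node (8 nodes in total)
    config_pio_stride = 28        ! cores per node, so the I/O tasks land on different nodes
/
```

With 224 MPI ranks, this places the I/O tasks on ranks 0, 28, 56, ..., 196.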

The debug mode shouldn’t have much influence on the I/O timing, but what happens if you compile the model with debug=.false.?

Cheers

Dom





Dawson, Nick

Aug 29, 2017, 2:40:59 PM
to Dominikus Heinzeller, mpas-atmos...@googlegroups.com

SUMMARY:

After additional testing, the problem appears to be with MPAS communication between nodes. More insights are in the ADDITIONAL INFO section below, but when running MPAS on one node (28 cores), performance is very good. When scaled to more than one node, MPAS spends >90% of its time communicating between processes, which slows I/O by roughly four orders of magnitude!

 

I/O time for MPAS on 1 node (28 cores; pio_ntasks=1, stride=28): 0.5 seconds

I/O time for MPAS on 2 nodes (56 cores; pio_ntasks=2, stride=28): 4000-5000 seconds

I/O time for MPAS on 8 nodes (224 cores; pio_ntasks=8, stride=28):  4000-5000 seconds

 

I can run additional tests or provide more specific info (e.g. cluster specs) if you would like. These results surprised me as WRFv3.9.1 scales much better on the cluster (but uses intel compilers/math as opposed to gcc).

 

ADDITIONAL INFO:

I forgot that I had edited the I/O code when originally compiling MPAS, due to a floating-point exception that occurred when using pnetcdf. To get around this error, I edited /src/framework/mpas_io.F to force I/O to use netcdf. The suggestion is located in this thread: https://groups.google.com/forum/?fromgroups#!searchin/mpas-atmosphere-help/segmentation$20fault%7Csort:relevance/mpas-atmosphere-help/_QhRwmDjPok/JKjtOIQ6ZgkJ

 

Specifically, lines 260 and 263 of /src/framework/mpas_io.F (MPAS v5.1) were changed from “pio_iotype = PIO_iotype_pnetcdf” to “pio_iotype = PIO_iotype_netcdf”. After this change, MPAS runs successfully without error. This means that MPAS was already using standard netcdf for I/O in my model runs.
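For reference, the edit amounts to the excerpt below (the line numbers above apply to MPAS v5.1 only; the comment annotations are mine):

```fortran
! src/framework/mpas_io.F, where the default PIO iotype is selected
! pio_iotype = PIO_iotype_pnetcdf   ! original default (parallel netCDF)
pio_iotype = PIO_iotype_netcdf      ! forces serial netCDF I/O instead
```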

 

I changed the config_pio_num_iotasks to the number of nodes (=8) and the config_pio_stride to the total number of cores per node (=28) following the advice here: https://groups.google.com/forum/?fromgroups#!category-topic/mpas-atmosphere-help/compilation/KhBbc8pPn1c

 

Turning debugging off and specifying config_pio_num_iotasks and config_pio_stride did not impact I/O times. The issue appears only when running MPAS on more than one node.

 

Our cluster admins said that ~90% of the software activity consists of system calls when more than one node is used. In the admins’ experience, this happens when standard WRF (not MPAS) uses too many cores and spends all of its time communicating between processes.

 

 


 

Dominikus Heinzeller

Aug 31, 2017, 4:14:29 PM
to Dawson, Nick, mpas-atmos...@googlegroups.com
Hi Nick.

Thanks for all this information, investigation is ongoing. Can you tell us about the system you are running MPAS on?

Also, there is no need to modify mpas_io.F, you can choose the io_type at runtime in streams.init_atmosphere and streams.atmosphere (see user’s guide).
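For example, a history stream in streams.atmosphere can request a specific io_type like the sketch below; the filename template and interval are just typical defaults, and valid io_type values per the user’s guide include "pnetcdf", "pnetcdf,cdf5", "netcdf", and "netcdf4":

```xml
<stream name="output"
        type="output"
        io_type="netcdf4"
        filename_template="history.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="6:00:00">
        <file name="stream_list.atmosphere.output"/>
</stream>
```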

Cheers,

Dom

Dawson, Nick

Aug 31, 2017, 4:59:02 PM
to Dominikus Heinzeller, mpas-atmos...@googlegroups.com

Dom,

 

After our admins monitored the cluster log files while MPAS was running, they discovered a thread on one of the nodes that was essentially dead, with only intermittent communication, which was greatly slowing down I/O (and halting it at times). The issue was narrowed down to a network driver fault, and the driver is being replaced. I will repeat the tests after the driver has been replaced and reply with updated information.

 

I’m pretty sure that when using multiple nodes, my jobs kept landing on the faulty node, and that was causing the I/O problems. Tests after the repairs should quickly confirm or rule out this assumption.

 

I wasn’t sure whether the problem was with the MPAS software, cluster hardware/software, or something I messed up. Hopefully you haven’t had to invest too much time into investigating the problem, since it appears the issue was with the cluster. Thank you for looking into it!

Dominikus Heinzeller

Sep 1, 2017, 6:08:48 AM
to Dawson, Nick, mpas-atmos...@googlegroups.com
Nick,

thank you for that update! I am really interested to see if the driver update solves your problem.

I have run different versions of MPAS over the years on many different systems, but it was only yesterday that I encountered the same problem you did, on a Cray XC40. Using only one PIO task (i.e. setting the PIO stride to the total number of MPI tasks across all nodes) worked as a workaround: not nearly as slow as trying to use multiple I/O tasks on different nodes, but still not acceptable.

My solution on this system was to use MPAS v5.2 (or at least the modifications that went into it) with PIO2 instead of PIO1. This seems to work, despite some spurious error messages

/zhome/academic/HLRS/ifu2/ifudh/thirdparty/src/pio-2.x.y_20170901/ParallelIO/src/clib/bget.c 778 memory request exceeds block size 108724248 33554432

which I still need to figure out. Thus, I advise using PIO2 with caution.

Cheers,

Dom

Dawson, Nick

Sep 1, 2017, 2:15:21 PM
to Dominikus Heinzeller, mpas-atmos...@googlegroups.com

Dom,

 

First, some info on our system (cluster name is r2) for each compute node.

 

Motherboard: Dell PE R630

CPU: Dual Intel Xeon E5-2680 v4 14 core 2.4GHz

Memory: 192 GB

Ethernet: Quad Port GigE

Infiniband: Mellanox ConnectX-3 VPI FDR, QSFP+, 40/56Gbe

 

MPAS (and all libraries) compiled with GNU compilers. Library information is in my first email.

 

SUMMARY OF TESTS:

The best performance was with 2 nodes, with the iotasks and stride namelist options set to 1 and 28, respectively. Tests with a higher number of nodes performed worse. Generally, the 1/28 PIO combination performed best for each set of node tests; in other words, setting iotasks=1 and stride = number of cores per node worked best. Possible causes of the slowdown: the PIO library, running GNU-compiled software on Intel cores, or using netcdf instead of pnetcdf.

 

ADDITIONAL INFORMATION:

The issue we had with the Infiniband driver is now fixed. This has improved performance, but the PIO settings in the namelist can change performance quite a bit. Attached are results from a few tests (all times in seconds) of a 7-day forecast. Integration times and total times are taken from the log.0000.out files, and I/O times (for the history stream) are estimated from the log.0000.err file. Note: the highlighted tests in the attached Excel file are estimates, because I don’t have time to let them fully run today and won’t be back in the office until late next week.

 

Info about the model runs:

MPAS running in single precision mode, debug=off

120-km quasi-uniform mesh (converted to single precision)

Initialized with gdas data from 12z 2017/11/30 (f00)

Mesoscale reference physics

netcdf is used for i/o (pnetcdf causes issues with floating point errors).

Diagnostic and history output generated every 6 forecast hours

Jobs submitted via Slurm with the run command “mpirun -n ## /path/atmosphere_model”, where ## is the number of cores

Stream output lists also attached (subset of master lists)
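A minimal Slurm batch script matching that run command might look like the sketch below; the job name, time limit, and paths are placeholders, not values from this thread:

```shell
#!/bin/bash
#SBATCH --job-name=mpas_test      # placeholder job name
#SBATCH --nodes=8                 # 8 nodes ...
#SBATCH --ntasks-per-node=28      # ... with 28 cores each = 224 MPI ranks
#SBATCH --time=04:00:00           # placeholder wall-clock limit

cd /path/to/run_directory         # placeholder path
mpirun -n 224 /path/atmosphere_model
```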

stream_list.atmosphere.diagnostics
stream_list.atmosphere.output
mpas_tests_r2.xlsx

Dominikus Heinzeller

Sep 7, 2017, 3:44:15 PM
to Dawson, Nick, mpas-atmos...@googlegroups.com
Hi again, Nick,

the solution to get around the memory error with PIO2 was to use the following flag when configuring PIOv2 with cmake:

-DPIO_USE_MALLOC=On
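For completeness, a full PIO2 cmake invocation with that flag might look like the following sketch; all paths are placeholders, and the NetCDF/PnetCDF path variables assume the standard PIO2 cmake build:

```shell
cmake -DPIO_USE_MALLOC=On \
      -DNetCDF_C_PATH=/path/to/netcdf-c \
      -DNetCDF_Fortran_PATH=/path/to/netcdf-fortran \
      -DPnetCDF_PATH=/path/to/pnetcdf \
      -DCMAKE_INSTALL_PREFIX=/path/to/pio2-install \
      /path/to/ParallelIO
```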

Cheers,

Dom