ATTN: Alan Wallcraft


Fernandez, Alvaro

Oct 26, 2021, 9:36:50 AM
to fo...@hycom.org

[AMD Official Use Only]


Dear Dr. Wallcraft,

 

I am looking into benchmarking HYCOM on our latest EPYC CPUs and refreshing our white paper.

 

We last benchmarked HYCOM in mid-2019.  Is HYCOM still under active development?  

 

Looking at the GitHub repository, it seems that the latest version is at least a year old.

 

Is the HYCOM GitHub page (HYCOM.org · GitHub) still the best place to find the latest version?

 

Sincerely,

 

Alvaro Fernández, Ph.D.

Senior Member of Technical Staff | AMD

HPC Applications Engineering

E: alvaro.f...@amd.com

+1.409.218.8799

 


 

Alan Wallcraft

Oct 26, 2021, 11:09:02 AM
to HYCOM.org Forum, Fernandez, Alvaro
HYCOM was added to GitHub in 2019, starting with version 2.3.00.  The latest commit to the master branch was on May 19.  In general this is a mature code base with relatively little new development.

Benchmark cases that were available in early 2019 most likely used version 2.2.98 or 2.2.99F (say).  These used f77 fixed format (*.f or *.F) for the source code, so they are difficult to compare directly to the current .F90 version, although they are largely identical under the hood.  So one option would be to use the same source code from 2019.

If you want to upgrade to a clone of the HYCOM-src master, the only issue is that blkdat.input has a few new entries.  I can update the blkdat.input files to allow this if you want to go this route.
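If you do clone, something like this is all it should take (a sketch; it assumes the HYCOM-src repository under the HYCOM organization on GitHub, which is where the master lives):

git clone https://github.com/HYCOM/HYCOM-src.git   # clone the HYCOM source
cd HYCOM-src
git log -1 --format='%h %ad %s'                    # confirm the most recent master commit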

We have config files for a Cray Shasta system (AMD EPYC 7H12) on GitHub, but note that the Intel Fortran configurations work with intel/2021.1 and probably do not work with intel/2021.3.0.

Alan.

Fernandez, Alvaro

Oct 26, 2021, 12:42:51 PM
to Alan Wallcraft, HYCOM.org Forum

[AMD Official Use Only]


Thanks Alan,

We've gone through two processor generations since the last paper. 

I think it best to update as much as possible. 

If I understand correctly, I could build the latest version, but blkdat.input needs to be updated in the GitHub repo, and there is a Cray Shasta system configuration for 2nd-generation EPYC (which sounds useful). I can manage the differences in compilers.

Are the latest, most challenging data sets available at the GitHub repo? We have 256 MB of L3 cache per processor, and we find that older test cases fit entirely in cache; we want to stress the systems.



Alan Wallcraft

Oct 27, 2021, 1:22:57 PM
to HYCOM.org Forum, Fernandez, Alvaro, Alan Wallcraft
Download:



This is our latest 0.04 degree global test case.  It can in principle run on anywhere from ~100 to ~64K cores.  For DoD benchmarking it has been configured for one model day but "emulates" 4 model days by reducing the time step.  This is so the original atmospheric forcing files (covering 1 day) can still be used while producing a long enough run time.  It would be OK to change the limits file to cover 0.25 days (replace 1.00000 with 0.25000) and hence "emulate" 1 model day to reduce the wall-clock time (currently about 50 minutes on 4087 cores).

narwhal03 121> head limits
        0.00000        1.00000
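If you do want the shorter run, a one-line edit along these lines is enough (a sketch; it assumes the limits file contains only that single start/end pair):

sed -i 's/1\.00000/0.25000/' limits   # end time 1.00000 -> 0.25000, i.e. "emulate" 1 model day
head limits                           # should now show 0.00000 and 0.25000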

This is set up for HYCOM 2.3.01, and you can clone the HYCOM-src master rather than using the tar-bundle source code, without changing any data files, if you want.

On 1018 cores it requires about 1 GB of memory per MPI task and takes 5 wall hours to run 4 model days.  I have not tried fewer cores than this, but in principle on 127 cores it will need 8 GB per MPI task (and take at least 40 hours to run).
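As a rough guide (this assumes the per-task memory simply scales inversely with the task count, which is only approximately true), you can estimate other core counts like this:

ncores=128    # whatever task count you plan to use
awk -v n="$ncores" 'BEGIN { printf "~%.1f GB per MPI task, ~1 TB total\n", 1018.0 / n }'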

Alan.

Fernandez, Alvaro

Nov 1, 2021, 2:13:11 PM
to Alan Wallcraft, HYCOM.org Forum

[AMD Official Use Only]

 

Hi Alan,

 

Thanks for sending me the data; I finally got back to this.

 

I’m starting off by re-running GLBT0.08 before moving to your new dataset, both as a sanity check and as a way to see the generational uplift.

 

I’m getting the error below when running with 123 MPI ranks, no OpenMP:

 

input: nreg =    3

    timer statistics, processor    1 out of  123
   -----------------------------------------------

   xc****   calls =        1   time =    0.00000   time/call =    0.00000000
   total    calls =        1   time =    0.01427   time/call =    0.01427158

processor     1: memory (words) now,high =               0               0
processor     1: memory (GB)    now,high =               0.000           0.000
processor     1: eq. 3-D arrays now,high =               0.000           0.000

**************************************************
xcspmd: patch.input for wrong nreg
**************************************************

mpi_finalize called on processor            1
mpi_finalize called on processor           49
 

I am running 123 ranks and the patch file appears to be the right one.

 

  npes   npe   mpe   idm   jdm  ibig  jbig  nreg  minsea  maxsea  avesea

   123    12    12  4500  3298   375   275     3       0  103125   72798

 

I don’t recall having to worry about the nreg parameter at all. Any ideas what this might be about?

 

Alvaro

AMD

M: 409.218.8799

Alan Wallcraft

Nov 2, 2021, 9:29:37 AM
to HYCOM.org Forum, Fernandez, Alvaro, Alan Wallcraft
nreg=3 is correct.  It is for global tripole domains, which have a different halo exchange along the j=jdm edge.  This is configured at compile time, and the way it is done has changed a bit in version 2.3.

If you cloned from GitHub, you need to change Make.csh to be similar to the one in CODE.tar.gz:

72c62
< setenv OCN_EOS -DEOS_9T  ## EOS  9-term
---
> #setenv OCN_EOS -DEOS_9T  ## EOS  9-term
74c64
< #setenv OCN_EOS -DEOS_17T ## EOS 17-term
---
> setenv OCN_EOS -DEOS_17T ## EOS 17-term
78,79c68,69
< setenv OCN_GLB -DARCTIC ## global tripolar simulation
< #setenv OCN_GLB ""
---
> #setenv OCN_GLB -DARCTIC ## global tripolar simulation
> setenv OCN_GLB ""

In particular:

setenv OCN_EOS -DEOS_9T  ## EOS  9-term

# Optional CPP flags
# Global or regional
setenv OCN_GLB -DARCTIC ## global tripolar simulation


Alan.

Fernandez, Alvaro

Nov 8, 2021, 4:36:01 PM
to Alan Wallcraft, HYCOM.org Forum

[AMD Official Use Only]

 

Hi Alan,

 

It’s possible the data tarball was corrupted when I downloaded it.

 

I’ve successfully compiled HYCOM with the instructions below and the AMD compiler, but the GLBT0.04 tarball fails halfway through decompression.

 

Are there checksums you can share for these tarballs?

 

The download link has of course expired, so if the file is corrupt I may have to pester you for another one – sorry…

Alan Wallcraft

Nov 9, 2021, 8:49:51 AM
to HYCOM.org Forum, Fernandez, Alvaro, Alan Wallcraft
When I ask for a link I get the same files as before:


COAPS 1892> md5sum hycom4day_GLBT0.04_987_CODE.tar.gz
2ed6960ab01d6ba0ed3865c2ec2dc93a  hycom4day_GLBT0.04_987_CODE.tar.gz

COAPS 1893> md5sum hycom4day_GLBT0.04_987_DATA.tar.gz
a0ef48a96a9b2c49ce68e8b4b53d0873  hycom4day_GLBT0.04_987_DATA.tar.gz

COAPS 1894> ll hycom4d*
-rw-r-----. 1 awallcraft awallcraft      494579 Oct 26 13:38 hycom4day_GLBT0.04_987_CODE.tar.gz
-rw-r-----. 1 awallcraft awallcraft 10582682540 Oct 26 13:59 hycom4day_GLBT0.04_987_DATA.tar.gz
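If you want to check your copies against these, something along these lines should do it (an sh/bash sketch using the sums above; gzip -t is just a stream integrity test):

cat > hycom4day.md5 << 'EOF'
2ed6960ab01d6ba0ed3865c2ec2dc93a  hycom4day_GLBT0.04_987_CODE.tar.gz
a0ef48a96a9b2c49ce68e8b4b53d0873  hycom4day_GLBT0.04_987_DATA.tar.gz
EOF
md5sum -c hycom4day.md5
gzip -t hycom4day_GLBT0.04_987_DATA.tar.gz && echo "gzip stream OK"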

Fernandez, Alvaro

Nov 11, 2021, 1:04:27 PM
to Alan Wallcraft, HYCOM.org Forum

[AMD Official Use Only]

 

Good morning Alan,

 

Single-node runs are in progress on two separate nodes, currently at 1060 and 1200 time steps.

 

Can you confirm the nominal memory footprint for HYCOM running this workload, as well as the nominal I/O expected?

 

We appear to be using 1 TB of RAM in each, and writing ~3.54 GiB every iteration. See below for details.

 

Details

 

  1. I have 1 TB of RAM and 128 cores (2 sockets x 64 cores each) per node:

 

[alvaro@pluto31 ~]$ ps aux | head -1; ps aux | sort -rnk 4 | head -127

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

alvaro     73955 99.7  0.8 149599944 9047248 ?   Rl   Nov10 1616:58 ./hycom

alvaro     73954 99.6  0.8 149507356 9239548 ?   Rl   Nov10 1616:12 ./hycom

alvaro     73953 99.6  0.8 149167828 9051004 ?   Rl   Nov10 1616:13 ./hycom

 

  2. As per the ps output above, each of the 127 ranks is using about 0.8% of that 1 TB (a one-liner to total this directly follows this list).
  3. The total footprint is thus ~1 TB. We have swap turned off on these boxes.
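A quick way to total the resident memory directly (a sketch; it assumes all ranks show up as ./hycom in ps):

ps -C hycom -o rss= | awk '{ sum += $1 } END { printf "total RSS ~ %.1f GiB over %d ranks\n", sum / 1024 / 1024, NR }'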

 

I/O is directed to local NVMe drives in this cluster, not to a parallel filesystem.

 

Examining the iostat difference between two iterations, the kB written is the largest contributor.

 

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

nvme0n1p1         4.01       344.17      3618.79   59091541  621320860

 

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

nvme0n1p1         4.04       343.55      3633.82   59091661  625030952

 

Subtracting the kB written in one iteration from the next:

 

625030952 kB wrtn

-621320860 kB wrtn

===========

3,710,092 kB ~ 3.54 GiB written every time step
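Or as a one-liner, using the two kB_wrtn values from the iostat samples above:

awk 'BEGIN { d = 625030952 - 621320860; printf "%d kB ~ %.2f GiB\n", d, d / 1024 / 1024 }'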

Alan Wallcraft

Nov 12, 2021, 8:42:47 AM
to HYCOM.org Forum, Fernandez, Alvaro, Alan Wallcraft
An estimate of the explicitly allocated memory per MPI task is written into the .log file.  The smallest number of cores I have run is 506:

processor     1: memory (GB)    now,high =               1.726           1.726
processor    13: memory (GB)    now,high =               1.726           1.726
processor   506: memory (GB)    now,high =               1.699           1.699
processor     1: memory (GB)    now,high =               1.764           1.764
processor    13: memory (GB)    now,high =               1.764           1.764
processor   506: memory (GB)    now,high =               1.737           1.737

The 1st 3 are just before the 1st time step, so you should have them, and the last 3 are at the end of the run.

The estimate will be a bit low, e.g. due to I/O buffers.
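Those lines are easy to pull straight out of the run log, e.g. (the log name here is just the one from my run below; substitute yours):

grep '^processor.*memory (GB)' HY02046.log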

All the input files are read once.  There is a "surface" archive written every model hour and a full archive written every 6 model hours, with 120 time steps per model hour.

I/O is not a significant overhead (on the order of 60 wall seconds total) when using a high-performance parallel file system (and MPI-2 I/O from the 1st MPI task in each row of the 2-D domain decomposition, which is the default).  It might be significant on a serial filesystem, and in that case it may be better to use serial I/O from the 1st MPI task only by setting the compile-time macro SERIAL_IO:

setenv OCN_MISC "-DSERIAL_IO"

HYCOM will run very poorly if it pages to virtual memory, but I think it should fit in 1 TB of physical memory.  On 128 cores, a full run should take perhaps 16 wall hours.  If you look at the time stamps on the output *archs*.a files (one per model hour), they should be written about every 40 minutes, with the larger *archm*.a files (one every 6 model hours) written about every 4 wall hours.
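An easy way to watch that progress is to list the archives by modification time (this assumes they are being written into the run directory):

ls -ltr *archs*.a *archm*.a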

The total I/O wall time is from the "zaio" timers:

gaffney06 490> grep zaio HY02046.log
zaiost - Array I/O is MPI-2 I/O from one task per row
  zaiost  1st: memory (words) now,high =         3321026         3321026
  zaiost  1st: memory (GB)    now,high =               0.025           0.025
  zaiost  1st: eq. 3-D arrays now,high =               3.280           3.280
  zaiost last: memory (words) now,high =         2718026         2718026
  zaiost last: memory (GB)    now,high =               0.020           0.020
  zaiost last: eq. 3-D arrays now,high =               2.684           2.684
zaio_hints:
   zaio**   calls =       18   time =    0.16205   time/call =    0.00900267
   zaiord   calls =       35   time =    3.86891   time/call =    0.11054016
zaio_hints:
   zaio**   calls =       56   time =    0.98672   time/call =    0.01761996
   zaiord   calls =       27   time =    3.39283   time/call =    0.12566023
   zaiowr   calls =     1612   time =   56.09647   time/call =    0.03479930
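Summing the "time =" fields from that grep output is a quick sanity check against the roughly 60-second figure above (a sketch, not an official diagnostic):

grep zaio HY02046.log | awk '{ for (i = 1; i < NF; i++) if ($i == "time") sum += $(i + 2) } END { printf "total zaio time ~ %.1f s\n", sum }'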


Alan.