Errors when running example/GLBT0.08/expt_73.7

xing cong

May 17, 2024, 6:59:25 AM
to Forum
Dear HYCOM Support Team,

I hope this message finds you well.

I obtained the data successfully, compiled the source code with Intel MPI and OpenMP, and wrote a Slurm script to submit the job. However, I get an error when I run the program with this data. Can you help me?

The error is:
xcspmd: patch.input for wrong nreg

Screenshot of the error: 截屏2024-05-17 18.57.45.png

Screenshot of the patch.input file: 截屏2024-05-17 18.57.33.png

Alan Wallcraft

May 17, 2024, 11:09:59 AM
to Forum, xing cong
Support for global tripole (Arctic bi-pole) grids has to be enabled at compile time. Set OCN_GLB in Make.csh. For a global tripolar run, uncomment the -DARCTIC line and comment out the empty setting:

# Global or regional
#setenv OCN_GLB -DARCTIC ## global tripolar simulation
setenv OCN_GLB ""

Then delete everything from the old make:

/bin/rm *.o *.mod hycom *.log

and remake.

If you are going to run both global and regional models then you need two executables and hence two src directories.


xing cong

May 17, 2024, 12:15:53 PM
to Forum, Alan Wallcraft, xing cong
Thank you very much, the error is solved.

But now I have another error. My major is computer science, so I do not understand some of the HYCOM terms. 🙏🙏

I really need to run HYCOM on my HPC cluster. 😭

The error is:
tsofrq = 180
error in blkini - input tsofrq but should be mtracr

Related code in the file blkdat.F90:
      if (mnproc.eq.1) then
      endif !1st tile
      call blkini(tsofrq,'tsofrq')
      call blkinr(tofset,'tofset','(a6," =",f10.4," degC/century")')
      call blkinr(sofset,'sofset','(a6," =",f10.4,"  psu/century")')

Alan Wallcraft

May 17, 2024, 4:29:30 PM
to Forum, xing cong, Alan Wallcraft
There are two new blkdat.input entries in the latest version of HYCOM (mtracr and lbmont) that are missing from HYCOM-examples:

   0      'trcflg' = tracer flags      (one digit per tr, most sig. replicated)
   0      'mtracr' = number of diagnostic tracers

   0      'lbflag' = lateral barotropic bndy flag (0=none, 1=port, 2=input)
   0      'lbmont' = baro nesting archives have sshflg=2 (0=F,1=T)

Add these lines and the case should run.
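
Because the model reads blkdat.input entries in a fixed order and checks each quoted name, the position of the new lines matters. As a rough illustration only, the Python sketch below (not part of HYCOM; `entry_names` and `missing_after` are hypothetical helper names) scans the text of a blkdat.input file and reports which of the new entries do not directly follow the entry listed above them. It assumes every entry carries its name in single quotes before an equals sign, as in the lines quoted in this thread. The 'tsofrq' description in the sample is a placeholder.

```python
import re

# Matches the quoted variable name on a blkdat.input line, e.g.
#    0      'mtracr' = number of diagnostic tracers
NAME_RE = re.compile(r"'\s*(\w+)\s*'\s*=")

def entry_names(text):
    """Return the quoted entry names in file order."""
    names = []
    for line in text.splitlines():
        match = NAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names

def missing_after(text, pairs):
    """Return the (previous, new) pairs where `new` does not directly follow `previous`."""
    names = entry_names(text)
    problems = []
    for previous, new in pairs:
        if previous not in names:
            problems.append((previous, new))
            continue
        idx = names.index(previous)
        if idx + 1 >= len(names) or names[idx + 1] != new:
            problems.append((previous, new))
    return problems

# Pairs taken from the reply above: each new entry goes right after the older one.
NEW_ENTRIES = [("trcflg", "mtracr"), ("lbflag", "lbmont")]

# Sample text: 'mtracr' is missing, 'lbmont' is present.
sample = """\
   0      'trcflg' = tracer flags      (one digit per tr, most sig. replicated)
 180      'tsofrq' = placeholder description for the next entry
   0      'lbflag' = lateral barotropic bndy flag (0=none, 1=port, 2=input)
   0      'lbmont' = baro nesting archives have sshflg=2 (0=F,1=T)
"""

print(missing_after(sample, NEW_ENTRIES))
```

For the sample text this reports the ('trcflg', 'mtracr') pair, which corresponds to the "input tsofrq but should be mtracr" message earlier in the thread.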


xing cong

May 19, 2024, 3:09:02 AM
to Forum, Alan Wallcraft, xing cong
Thanks, that was very helpful. Just adding the line `   0      'mtracr' = number of diagnostic tracers` to blkdat.input solved the error.

Now I can run this example for about 40 minutes, but then it hits another error 😂. The error is:

31174470 (2019/198 21) region-wide mean Density Dev: 2.6279637576
error in zaiowr - can't write record 96 on array I/O unit 13.
mpierr = 268493344
Abort(9) on node 893 (rank 893 in comm 496): application called MPI_Abort(comm=0x84000002, 9) - process 893
mlx5: h07r2n13: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000011 00000000 00000000 00000000
00000000 00008914 0a014211 38a99dd2
[h07r2n13:21971:0:21971] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x168)

I am very grateful for your previous assistance and hope you can help me again ❤️❤️.

xing cong

May 20, 2024, 2:29:28 AM
to Forum, xing cong, Alan Wallcraft
Dear HYCOM Support Team, I hope this message finds you well. 

The problem is solved. It was caused by insufficient memory to store the data (can't write record 96 on array I/O unit 13).

Currently, I can run the example for about an hour.

When it finishes, I receive this log. Does this mean I have successfully run the code and simulated the data?

step 31174560 day 43298.00000 -- archiving completed --

timer statistics, processor 293 out of 1525

xcaget calls = 2324 time = 5.87015 time/call = 0.00252588
xcaput calls = 330 time = 0.22571 time/call = 0.00068396
xcsum calls = 3758 time = 197.19908 time/call = 0.05247448
xcmaxr calls = 7047 time = 320.04184 time/call = 0.04541533
xctilr calls = 83706 time = 1149.04671 time/call = 0.01372717
zaio** calls = 68 time = 7.11530 time/call = 0.10463669
zaiord calls = 330 time = 76.45293 time/call = 0.23167556
zaiowr calls = 2324 time = 444.27128 time/call = 0.19116664
xc**** calls = 1 time = 3342.55012 time/call = 3342.55011995
cnuity calls = 720 time = 256.02532 time/call = 0.35559072
tsadvc calls = 720 time = 268.26082 time/call = 0.37258448
momtum calls = 720 time = 402.06764 time/call = 0.55842727
barotp calls = 720 time = 593.22508 time/call = 0.82392372
thermf calls = 720 time = 28.88789 time/call = 0.04012208
ic**** calls = 720 time = 140.44647 time/call = 0.19506454
mx**** calls = 720 time = 383.37529 time/call = 0.53246569
conv** calls = 720 time = 0.00005 time/call = 0.00000006
diapf* calls = 720 time = 0.00005 time/call = 0.00000006
hybgen calls = 720 time = 167.46184 time/call = 0.23258588
overtn calls = 1 time = 1.65750 time/call = 1.65749601
archiv calls = 25 time = 451.80732 time/call = 18.07229282
incupd calls = 720 time = 0.00037 time/call = 0.00000052
aslsav calls = 720 time = 20.92804 time/call = 0.02906672
asseln calls = 720 time = 26.02908 time/call = 0.03615149
total calls = 1 time = 3031.60495 time/call = 3031.60495153

processor 1: memory (words) now,high = 28644078 28644078
processor 1: memory (GB) now,high = 0.213 0.213
processor 1: eq. 3-D arrays now,high = 75.479 75.479

processor 12: memory (words) now,high = 28644078 28644078
processor 12: memory (GB) now,high = 0.213 0.213
processor 12: eq. 3-D arrays now,high = 75.479 75.479

processor 1525: memory (words) now,high = 28147522 28147522
processor 1525: memory (GB) now,high = 0.210 0.210
processor 1525: eq. 3-D arrays now,high = 74.171 74.171


Alan Wallcraft

May 20, 2024, 11:59:58 AM
to Forum, xing cong, Alan Wallcraft
There is an accuracy criterion in README.GLBT0.08_737, but since you are not using this for a formal benchmark, you have successfully run the code.

However, 50 minutes is 5x slower than the EXAMPLE (which was run on an HPE Cray EX system, i.e. a high-end supercomputer).

narwhal08 203> ../total.csh
HY01525.log:   total    calls =        1   time =  623.0 10.38 wall mins

The I/O is 19x slower:

narwhal08 209> ../zaio.csh
HY01525.log:   zaiowr   calls =     2324   time =   23.3

This might be right. One way to check is to try running on other core counts and confirm that the scaling is good (constant core-hours across core counts is perfect scaling).
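
As a small worked example of the comparison above, the following Python sketch recomputes the slowdown factors from the timer values quoted in this thread and shows how core-hours could be tabulated for a scaling test. The helper names `core_hours` and `slowdown` are hypothetical, and the sketch assumes both runs used 1525 MPI tasks (inferred from "out of 1525" in the posted log and from the file name HY01525.log).

```python
def core_hours(cores, wall_seconds):
    """Core-hours consumed by one run."""
    return cores * wall_seconds / 3600.0

def slowdown(local_seconds, reference_seconds):
    """How many times slower the local run is than the reference run."""
    return local_seconds / reference_seconds

# Timer values quoted in this thread, in seconds.
local_total, ref_total = 3031.60495, 623.0      # 'total' timer
local_zaiowr, ref_zaiowr = 444.27128, 23.3      # 'zaiowr' timer

print(f"total  slowdown: {slowdown(local_total, ref_total):.1f}x")
print(f"zaiowr slowdown: {slowdown(local_zaiowr, ref_zaiowr):.1f}x")

# Scaling check: add one entry per core count, using the measured 'total'
# time from each of your own runs. Only the 1525-task run is known here.
runs = {1525: local_total}
for cores, seconds in sorted(runs.items()):
    print(cores, f"{core_hours(cores, seconds):.1f} core-hours")
```

If the core-hours column stays roughly constant as the core count changes, the run scales well. A column that grows with the core count points to communication or I/O overhead.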


xing cong

Jun 14, 2024, 2:38:11 PM
to Forum, Alan Wallcraft, xing cong
Dear Professor,

I hope this message finds you well. I would like to extend my sincere gratitude for your invaluable assistance in the past. Your guidance has been instrumental in my research endeavors.

I am currently facing an issue with running an OMPI-type program on a platform that exclusively uses the GNU compiler. Previously, I successfully executed similar programs on a platform equipped with the Intel compiler and OMPI. However, upon transitioning to the GNU compiler environment, I encountered the following error:

Error termination. Backtrace:
At line 2453 of file blkdat.F90 (unit = 2099, file = './blkdat.input')
Fortran runtime error: Bad integer for item 1 in list input

FC = mpifort
FCFFLAGS = -fbacktrace -march=native -O2 -fdefault-real-8 -ffloat-store -fopenmp -w -mcmodel=small
-fdefault-double-8 -fPIC -fno-second-underscore
CC = mpicc
CPP = cpp -P
LD = $(FC)

Thank you once again for your time and consideration. I look forward to your response.

Warm regards,

Alan Wallcraft

Jun 15, 2024, 12:15:28 PM
to Forum, xing cong, Alan Wallcraft
Please provide the last few lines of model output before the model dies.  These will have been printed out after reading them from blkdat.input, e.g.:

iversn =        22
iexpt  =       998
idm    =       500
jdm    =       382


incflg =         0
incstp =        12
incupf =         1
ra2fac =    0.1250
wbaro  =    0.1250
btrlfr =         T
btrmas =         F


relax  =         F
trcrlx =         F
priver =         F
epmass =         F

Also send the blkdat.input lines around the last few in the printout. The error is probably for the blkdat variable immediately after the last one in the printout.
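
The lookup described above can be sketched in Python. This is an illustration under stated assumptions, not a HYCOM tool: `last_echoed` and `suspect_entry` are hypothetical names, the model output is assumed to echo each variable as `name = value`, and blkdat.input is assumed to quote each name before an equals sign. The sample blkdat.input lines are illustrative, with a deliberately malformed value on the third line. If a name occurs more than once in blkdat.input, this sketch only finds the first occurrence.

```python
import re

ECHO_RE = re.compile(r"^\s*(\w+)\s*=")          # model output: "iexpt  =       998"
INPUT_RE = re.compile(r"'\s*(\w+)\s*'\s*=")     # blkdat.input: " 998  'iexpt ' = ..."

def last_echoed(output_text):
    """Name of the last 'name = value' line in the model output, or None."""
    last = None
    for line in output_text.splitlines():
        match = ECHO_RE.match(line)
        if match:
            last = match.group(1)
    return last

def suspect_entry(output_text, blkdat_text):
    """Return the blkdat.input line right after the last echoed variable, or None."""
    name = last_echoed(output_text)
    lines = blkdat_text.splitlines()
    for i, line in enumerate(lines):
        match = INPUT_RE.search(line)
        if match and match.group(1) == name:
            return lines[i + 1] if i + 1 < len(lines) else None
    return None

# Illustrative inputs: the output stops after iexpt, so idm is the suspect.
model_output = """\
iversn =        22
iexpt  =       998
"""
blkdat_input = """\
  22      'iversn' = hycom version number x10
 998      'iexpt ' = experiment number x10
 500.0    'idm   ' = longitudinal array size
"""

print(suspect_entry(model_output, blkdat_input))
```

For these sample inputs the function returns the 'idm   ' line, which is the first entry the model had not yet echoed.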
