forrtl: severe (67): input statement requires too much data


Andreas Schreiber

Feb 20, 2017, 10:27:44 AM
to 'Andreas Schreiber' via pencil-code-discuss
Dear all,

I have a run with

###########################################
integer, parameter :: ncpus=1600,nprocx=4,nprocy=20,nprocz=20
integer, parameter :: nxgrid=400,nygrid=400,nzgrid=400
integer, parameter :: npar=nxgrid*nygrid*nzgrid*10, mpar_loc=npar/160, npar_mig=204800
integer, parameter :: npar_stalk=1.0e4
###########################################

and run it on a computer where I am the only user! 

But after the first run (max_walltime=16000 ! hard walltime = 4.5 hr) it does not continue, giving the error message
###########################################
Subcommand: run
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc1077/var.dat
Image              PC                Routine            Line        Source             
run.x              0000000000411FFC  Unknown               Unknown  Unknown
run.x              00000000004460DE  Unknown               Unknown  Unknown
run.x              00000000004437FB  Unknown               Unknown  Unknown
run.x              000000000073BE5D  Unknown               Unknown  Unknown
run.x              000000000073A3B9  Unknown               Unknown  Unknown
run.x              00000000005092F5  Unknown               Unknown  Unknown
run.x              00000000004039DE  Unknown               Unknown  Unknown
libc-2.22.so       00002B7801FED6E5  __libc_start_main     Unknown  Unknown
run.x              00000000004038E9  Unknown               Unknown  Unknown
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc983/var.dat
Image              PC                Routine            Line        Source             
run.x              0000000000411FFC  Unknown               Unknown  Unknown
run.x              00000000004460DE  Unknown               Unknown  Unknown
run.x              00000000004437FB  Unknown               Unknown  Unknown
run.x              000000000073BE5D  Unknown               Unknown  Unknown
run.x              000000000073A3B9  Unknown               Unknown  Unknown
run.x              00000000005092F5  Unknown               Unknown  Unknown
run.x              00000000004039DE  Unknown               Unknown  Unknown
libc-2.22.so       00002B3F6BAFB6E5  __libc_start_main     Unknown  Unknown
run.x              00000000004038E9  Unknown               Unknown  Unknown
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc505/var.dat
###########################################

Furthermore, the timestamps become inconsistent:
###########################################
...
SVN: -------            v.         (                   ) $Id$
 The verbose level is ip=          14  (ldebug= F )
 This is a 3-D run
 nxgrid, nygrid, nzgrid=         400         400         400
 Lx, Ly, Lz=  2.000000000000000E-002  2.000000000000000E-002
  2.000000000000000E-002
       Vbox=  8.000000000000001E-006
 Timestamps in snapshot INCONSISTENT. Using (max) t=  8.307890670577762E-002 
 with ireset_tstart=           2 .
 Timestamps in snapshot INCONSISTENT. Using t=  8.307890670577762E-002 .
###########################################

The first run ends with
###########################################
   7500     0.080 1.065E-05  1.542E+04  1.279E-08  5.282E-08  1.209E-06  2.246E-05  7.209E-07  7.209E-07  4.866E-03  1.184E-05  9.470E-11  9.999E-01  1.000E+00  1.000E+00  0.000E+00  7.854E+00  3.600E+01 -2.715E-03  2.356E+00  6.467E+00 -1.943E-22  1.467E-08  9.485E-08  3.049E-07  3.077E-05  4.661E-05
   7600     0.081 1.065E-05  1.564E+04  1.294E-08  5.291E-08  1.227E-06  2.255E-05  7.292E-07  7.292E-07  4.877E-03  1.189E-05  9.513E-11  9.999E-01  1.000E+00  1.000E+00  0.000E+00  7.854E+00  3.200E+01 -2.376E-02  2.356E+00  6.571E+00 -2.776E-22  1.486E-08  9.480E-08  3.113E-07  3.072E-05  4.655E-05
 
 Maximum walltime exceeded
 
 Simulation finished after         7684  time-steps
 
 Writing final snapshot at time t =  8.201366450196135E-002
 
 Wall clock time [hours] =   4.44     (+/-  2.78E-10)
 Wall clock time/timestep/(meshpoint+particle) [microsec] = 2.958E-03
 

Fri Feb 17 19:18:30 2017
Running
  pc_deprecated_slice_links
###########################################

So everything looks fine.

Makefile.local is:
###########################################
MPICOMM   =   mpicomm

HYDRO     =   hydro
DENSITY   =   density
ENTROPY   = noentropy
MAGNETIC  = nomagnetic
RADIATION = noradiation
PSCALAR   = nopscalar

GRAVITY   = nogravity
FORCING   = noforcing
SHEAR     =   shear

SHOCK     = shock_highorder

PARTICLES = particles_dust
PARTICLES_MAP = particles_map
FOURIER = fourier_fftpack

REAL_PRECISION = double
SELFGRAVITY = selfgravity
POISSON     = poisson
PARTICLES_SELFGRAVITY = particles_selfgravity
PARTICLES_STALKER     =   particles_stalker
###########################################

The computer is fresh and new, so maybe I am making a compiler error? My guess is that I am somehow allocating too much storage for the particles and the snapshots are not written completely. Or maybe the output somehow becomes too large and is not read back consistently, since not enough storage is allocated when resuming the run.

Does anyone have an idea what is going on?

best,
Andreas

Frederick Gent

Feb 20, 2017, 10:57:53 AM
to pencil-co...@googlegroups.com
When I see such output, the usual cause is one or more corrupt var files. An easy way to check is to inspect the data with
ls -l data/proc*/var.dat
I expect you will find at least one var file reporting a different size from the rest. If this is not the case, let us know.
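To make that check a little more direct, you can count how many processors report each file size: a healthy run shows exactly one distinct size. Below is a sketch built on a mock data tree so it runs anywhere; the `data/procN` layout matches the paths in the error messages above, but on a real run you would drop the setup lines and point the pipeline at the run directory.

```shell
# Setup: build a mock data tree with one deliberately short var.dat.
# (On a real run, skip this block.)
tmp=$(mktemp -d)
for p in 0 1 2; do
  mkdir -p "$tmp/data/proc$p"
  head -c 4096 /dev/zero > "$tmp/data/proc$p/var.dat"
done
head -c 4000 /dev/zero > "$tmp/data/proc2/var.dat"   # the "corrupt" file

# The actual check: one output line per distinct file size, with a count
# of how many processors have it. More than one line means trouble.
ls -l "$tmp"/data/proc*/var.dat | awk '{print $5}' | sort -n | uniq -c

rm -rf "$tmp"
```

With 1600 processor directories, this is far quicker to scan than reading `ls -l` output by eye.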

Cheers,

Fred
--
You received this message because you are subscribed to the Google Groups "pencil-code-discuss" group.

Andreas Schreiber

Feb 20, 2017, 11:19:31 AM
to 'Andreas Schreiber' via pencil-code-discuss
Indeed, some are a tiny bit smaller. But from the code-handling and cluster perspective there should be no reason for this, since the run was not aborted.

best
Andreas


------------------------------
Andreas Schreiber
Max-Planck-Institute for Astronomy
Koenigstuhl 17
D-69117 Heidelberg, Germany

Room: E 114 (Elsässer Lab)

Alex Richert

Feb 20, 2017, 5:04:10 PM
to pencil-co...@googlegroups.com
I may have run into something similar a while back; try setting lpersist=F in init_pars and run_pars.
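For reference, a minimal sketch of where that flag goes, assuming the usual Pencil Code file names (start.in for &init_pars, run.in for &run_pars) and keeping all other entries unchanged:

```
! in start.in
&init_pars
  ! ... existing entries ...
  lpersist=F
/

! in run.in
&run_pars
  ! ... existing entries ...
  lpersist=F
/
```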
Alex

Frederick Gent

Feb 22, 2017, 4:09:40 AM
to pencil-co...@googlegroups.com
If you have any snapshots, you could copy the earlier snapshots of the correct size over the mis-sized var.dat files. Provided you do not have any persistent variables, which could become unsynchronised, the restart should work with the later default time, and any jumps between processors will likely smooth out over a few iterations.

see comment below

Cheers,

Fred
This indicates that on at least one processor the var.dat file was not updated to the latest snapshot. Why this happened while the job appeared to complete OK needs investigating. This usually occurs when the allocated job time runs out before the files have finished writing, but the output below seems to contradict that possibility.

My output looks like this on CSC; the 'Done' does not print until writing is actually complete. For such a large array this could plausibly take a long time. Try a very short run with a significantly larger wall-time limit to check how long the writing takes.

 Writing final snapshot at time t =   38644.1952107956

 Wall clock time [hours] =  0.290     (+/-  2.78E-10)
 Wall clock time/timestep/meshpoint [microsec] = 5.532E-02

0.125u 0.304s 17:30.01 0.0%     0+0k 0+88io 0pf+0w
Tue Jan 10 01:44:30 EET 2017
9.462u 30.143s 0:46.95 84.3%    0+0k 0+1744io 0pf+0w
Done

Philippe-A. Bourdin

Feb 22, 2017, 7:14:33 AM
to pencil-code-discuss
Hello,


> you could copy the earlier snapshots of the correct size to the mis-sized var.dat files.

I would strongly suggest going back to fully consistent snapshot files by copying correct "VAR#" files over "var.dat" and adjusting "tsnap.dat" and "vsnap.dat" inside the data directory accordingly.
This will also regenerate the currently broken snapshot files.
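A sketch of that restore, assuming the last complete numbered snapshot is VAR2 (substitute whatever your run actually wrote). It is built on a mock data tree so the snippet is self-contained; in a real run directory you would drop the setup block, run only the copy loop, and then edit tsnap.dat and vsnap.dat by hand, since their contents are version-specific.

```shell
# Setup: mock data tree with a consistent VAR2 and a "broken" var.dat
# on each processor. (On a real run, skip this block.)
tmp=$(mktemp -d)
for p in 0 1; do
  mkdir -p "$tmp/data/proc$p"
  printf 'snapshot at t2' > "$tmp/data/proc$p/VAR2"
  printf 'broken'         > "$tmp/data/proc$p/var.dat"
done

# The restore: overwrite var.dat with the consistent snapshot on
# EVERY processor, so all restart files refer to the same time.
for d in "$tmp"/data/proc*; do
  cp "$d/VAR2" "$d/var.dat"
done

cat "$tmp/data/proc0/var.dat"; echo   # -> snapshot at t2
rm -rf "$tmp"
```

Restoring on every processor, rather than only the mis-sized ones, is what removes the "Timestamps in snapshot INCONSISTENT" warning at the next restart.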

Best greetings,
Philippe.
 

Andreas Schreiber

Feb 23, 2017, 8:14:11 AM
to 'Andreas Schreiber' via pencil-code-discuss
Dear all,

the error seems to be on the cluster side, not on the Pencil Code side! My guess is that the nodes crash when files are accessed by multiple processes. I had already observed this on the login node when doing "cat output.out" while a Pencil Code run was writing its stdout into output.out. I am still waiting for an explanation from our cluster team, who have fixed it; by now Pencil Code runs can be continued. I will inform you once I get a reply.

best and thanks for being so supportive!
Andreas
