forrtl: severe (67): input statement requires too much data


Andreas Schreiber

Feb 20, 2017, 10:27:44 AM
to 'Andreas Schreiber' via pencil-code-discuss
Dear all,

I have a run with

###########################################
integer, parameter :: ncpus=1600,nprocx=4,nprocy=20,nprocz=20
integer, parameter :: nxgrid=400,nygrid=400,nzgrid=400
integer, parameter :: npar=nxgrid*nygrid*nzgrid*10, mpar_loc=npar/160, npar_mig=204800
integer, parameter :: npar_stalk=1.0e4
###########################################

and run it on a computer where I am the only user! 

But after the first run (max_walltime=16000 ! hard walltime = 4.5 hr) it does not continue, giving the error message
###########################################
Subcommand: run
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc1077/var.dat
Image              PC                Routine            Line        Source             
run.x              0000000000411FFC  Unknown               Unknown  Unknown
run.x              00000000004460DE  Unknown               Unknown  Unknown
run.x              00000000004437FB  Unknown               Unknown  Unknown
run.x              000000000073BE5D  Unknown               Unknown  Unknown
run.x              000000000073A3B9  Unknown               Unknown  Unknown
run.x              00000000005092F5  Unknown               Unknown  Unknown
run.x              00000000004039DE  Unknown               Unknown  Unknown
libc-2.22.so       00002B7801FED6E5  __libc_start_main     Unknown  Unknown
run.x              00000000004038E9  Unknown               Unknown  Unknown
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc983/var.dat
Image              PC                Routine            Line        Source             
run.x              0000000000411FFC  Unknown               Unknown  Unknown
run.x              00000000004460DE  Unknown               Unknown  Unknown
run.x              00000000004437FB  Unknown               Unknown  Unknown
run.x              000000000073BE5D  Unknown               Unknown  Unknown
run.x              000000000073A3B9  Unknown               Unknown  Unknown
run.x              00000000005092F5  Unknown               Unknown  Unknown
run.x              00000000004039DE  Unknown               Unknown  Unknown
libc-2.22.so       00002B3F6BAFB6E5  __libc_start_main     Unknown  Unknown
run.x              00000000004038E9  Unknown               Unknown  Unknown
forrtl: severe (67): input statement requires too much data, unit 88, file /isaac/u/aschreib/pc_projects/scaling_tests/a_3d_max_scaling/1600_42020/data/proc505/var.dat
###########################################

Furthermore, the timestamps become inconsistent:
###########################################
...
SVN: -------            v.         (                   ) $Id$
 The verbose level is ip=          14  (ldebug= F )
 This is a 3-D run
 nxgrid, nygrid, nzgrid=         400         400         400
 Lx, Ly, Lz=  2.000000000000000E-002  2.000000000000000E-002
  2.000000000000000E-002
       Vbox=  8.000000000000001E-006
 Timestamps in snapshot INCONSISTENT. Using (max) t=  8.307890670577762E-002 
 with ireset_tstart=           2 .
 Timestamps in snapshot INCONSISTENT. Using t=  8.307890670577762E-002 .
###########################################

The first run ends with
###########################################
   7500     0.080 1.065E-05  1.542E+04  1.279E-08  5.282E-08  1.209E-06  2.246E-05  7.209E-07  7.209E-07  4.866E-03  1.184E-05  9.470E-11  9.999E-01  1.000E+00  1.000E+00  0.000E+00  7.854E+00  3.600E+01 -2.715E-03  2.356E+00  6.467E+00 -1.943E-22  1.467E-08  9.485E-08  3.049E-07  3.077E-05  4.661E-05
   7600     0.081 1.065E-05  1.564E+04  1.294E-08  5.291E-08  1.227E-06  2.255E-05  7.292E-07  7.292E-07  4.877E-03  1.189E-05  9.513E-11  9.999E-01  1.000E+00  1.000E+00  0.000E+00  7.854E+00  3.200E+01 -2.376E-02  2.356E+00  6.571E+00 -2.776E-22  1.486E-08  9.480E-08  3.113E-07  3.072E-05  4.655E-05
 
 Maximum walltime exceeded
 
 Simulation finished after         7684  time-steps
 
 Writing final snapshot at time t =  8.201366450196135E-002
 
 Wall clock time [hours] =   4.44     (+/-  2.78E-10)
 Wall clock time/timestep/(meshpoint+particle) [microsec] = 2.958E-03
 

Fri Feb 17 19:18:30 2017
Running
  pc_deprecated_slice_links
###########################################

So everything looks fine.

Makefile.local is:
###########################################
MPICOMM   =   mpicomm

HYDRO     =   hydro
DENSITY   =   density
ENTROPY   = noentropy
MAGNETIC  = nomagnetic
RADIATION = noradiation
PSCALAR   = nopscalar

GRAVITY   = nogravity
FORCING   = noforcing
SHEAR     =   shear

SHOCK     = shock_highorder

PARTICLES = particles_dust
PARTICLES_MAP = particles_map
FOURIER = fourier_fftpack

REAL_PRECISION = double
SELFGRAVITY = selfgravity
POISSON     = poisson
PARTICLES_SELFGRAVITY = particles_selfgravity
PARTICLES_STALKER     =   particles_stalker
###########################################

The computer is fresh and new, so maybe I am making a compiler error? My guess is that I am somehow allocating too much storage for the particles and the snapshots are not written completely. Or maybe the output somehow becomes too large and is not read back consistently, since not enough storage is allocated when resuming the run.

Does anyone have an idea what is going on?

best,
Andreas

Frederick Gent

Feb 20, 2017, 10:57:53 AM
to pencil-co...@googlegroups.com
When I see such output, the usual cause is one or more corrupt var files. An easy way to check is to inspect the data with
ls -l data/proc*/var.dat
I expect you will find at least one var file reporting a different size from the rest. If this is not the case, let us know.
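To make that check a little more direct, you can count how many processors report each file size: a healthy run shows exactly one distinct size. Below is a sketch built on a mock data tree so it runs anywhere; the `data/procN` layout matches the paths in the error messages above, but on a real run you would drop the setup lines and point the pipeline at the run directory.

```shell
# Setup: build a mock data tree with one deliberately short var.dat.
# (On a real run, skip this block.)
tmp=$(mktemp -d)
for p in 0 1 2; do
  mkdir -p "$tmp/data/proc$p"
  head -c 4096 /dev/zero > "$tmp/data/proc$p/var.dat"
done
head -c 4000 /dev/zero > "$tmp/data/proc2/var.dat"   # the "corrupt" file

# The actual check: one output line per distinct file size, with a count
# of how many processors have it. More than one line means trouble.
ls -l "$tmp"/data/proc*/var.dat | awk '{print $5}' | sort -n | uniq -c

rm -rf "$tmp"
```

With 1600 processor directories, this is far quicker to scan than reading `ls -l` output by eye.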

Cheers,

Fred
--
You received this message because you are subscribed to the Google Groups "pencil-code-discuss" group.

Andreas Schreiber

Feb 20, 2017, 11:19:31 AM
to 'Andreas Schreiber' via pencil-code-discuss
Indeed, some are a tiny bit smaller. But from the code-handling and cluster perspective there should be no reason for this, since the run was not aborted.

best
Andreas


------------------------------
Andreas Schreiber
Max-Planck-Institute for Astronomy
Koenigstuhl 17
D-69117 Heidelberg, Germany

Room: E 114 (Elsässer Lab)

Alex Richert

Feb 20, 2017, 5:04:10 PM
to pencil-co...@googlegroups.com
I may have run into something similar a while back; try setting lpersist=F in init_pars and run_pars.
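For reference, a minimal sketch of where that flag goes, assuming the usual Pencil Code file names (start.in for &init_pars, run.in for &run_pars) and keeping all other entries unchanged:

```
! in start.in
&init_pars
  ! ... existing entries ...
  lpersist=F
/

! in run.in
&run_pars
  ! ... existing entries ...
  lpersist=F
/
```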
Alex

Frederick Gent

Feb 22, 2017, 4:09:40 AM
to pencil-co...@googlegroups.com
If you have any snapshots, you could copy the earlier snapshots of the correct size over the mis-sized var.dat files. Provided you do not have any persistent variables, which could become unsynchronised, the restart should work with the later default time, and any jumps between processors will likely smooth out over a few iterations.

see comment below

Cheers,

Fred
This indicates that on at least one processor the var.dat file was not updated to the latest snapshot. Why this happened while the job appeared to complete OK needs investigating. This usually occurs when the allocated job time runs out before the files have finished writing, but the output below seems to contradict that possibility.

My output looks like this on CSC; the 'Done' does not print until writing is actually complete. For such a large array this could plausibly take a long time. Try a very short run with a significantly larger wall-time limit to check how long the writing takes.

 Writing final snapshot at time t =   38644.1952107956

 Wall clock time [hours] =  0.290     (+/-  2.78E-10)
 Wall clock time/timestep/meshpoint [microsec] = 5.532E-02

0.125u 0.304s 17:30.01 0.0%     0+0k 0+88io 0pf+0w
Tue Jan 10 01:44:30 EET 2017
9.462u 30.143s 0:46.95 84.3%    0+0k 0+1744io 0pf+0w
Done

Philippe-A. Bourdin

Feb 22, 2017, 7:14:33 AM
to pencil-code-discuss
Hello,


> you could copy the earlier snapshots of the correct size to the mis-sized var.dat files.

I would strongly suggest going back to fully consistent snapshot files by copying correct "VAR#" files over "var.dat" and adjusting "tsnap.dat" and "vsnap.dat" inside the data directory accordingly.
This will also regenerate the currently broken snapshot files.
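A sketch of that restore, assuming the last complete numbered snapshot is VAR2 (substitute whatever your run actually wrote). It is built on a mock data tree so the snippet is self-contained; in a real run directory you would drop the setup block, run only the copy loop, and then edit tsnap.dat and vsnap.dat by hand, since their contents are version-specific.

```shell
# Setup: mock data tree with a consistent VAR2 and a "broken" var.dat
# on each processor. (On a real run, skip this block.)
tmp=$(mktemp -d)
for p in 0 1; do
  mkdir -p "$tmp/data/proc$p"
  printf 'snapshot at t2' > "$tmp/data/proc$p/VAR2"
  printf 'broken'         > "$tmp/data/proc$p/var.dat"
done

# The restore: overwrite var.dat with the consistent snapshot on
# EVERY processor, so all restart files refer to the same time.
for d in "$tmp"/data/proc*; do
  cp "$d/VAR2" "$d/var.dat"
done

cat "$tmp/data/proc0/var.dat"; echo   # -> snapshot at t2
rm -rf "$tmp"
```

Restoring on every processor, rather than only the mis-sized ones, is what removes the "Timestamps in snapshot INCONSISTENT" warning at the next restart.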

Best greetings,
Philippe.
 

Andreas Schreiber

Feb 23, 2017, 8:14:11 AM
to 'Andreas Schreiber' via pencil-code-discuss
Dear all,

the error seems to be on the cluster side, not on the Pencil Code side! My guess is that the nodes crash when files are accessed by multiple processes. I had already observed this on the login node when doing "cat output.out" while a Pencil Code run was writing its stdout into output.out. I am still waiting for an explanation from our cluster team, who have fixed it; by now Pencil Code runs can be continued. I will inform you once I get a reply.

best and thanks for being so supportive!
Andreas
