calculation hangs without error message after few time steps


Matei Radulescu

Sep 23, 2008, 7:07:28 PM
to amrita-ebook
Dear James,

In a calculation of a detonation interacting with a half-cylinder, in
a large domain and high resolution, the calculation eventually hangs
after a number of time steps.

An example is the script run_circle_detonation, which I uploaded today.
After the 39th output (a few hours on a single-processor machine), the
calculation hangs. Running "top" to list the running processes shows
that the Amrita process has stopped, and only a Perl process remains,
which uses 100% of my CPU but consumes practically no RAM.

I am puzzled why the calculation hangs without a NaN or any warning
that it has quit. The problem typically goes away, or happens at a
later time in the calculation, for less ambitious domains /
resolutions. I have tried increasing some of the array sizes, but it
does not seem to make any difference. The calculation itself consumes
only some 20% of my RAM, so I am not lacking the hardware resources. Any
idea what the problem is and how I can fix it?
Thanks,
matei

Gary Sharpe

Sep 24, 2008, 5:49:02 AM
to amrita-ebook
Matei,
have you tried restarting the calculation from the 39th output and
seeing if it gets any further? I sometimes (but not reproducibly)
encounter this problem running in parallel on a cluster. However, I can
always restart from a saved model and it will keep going beyond where
it stopped.
Gary

James Quirk

Sep 24, 2008, 7:31:00 AM
to amrita-ebook
Matei,

On Tue, 23 Sep 2008, Matei Radulescu wrote:

>
> Dear James,
>
> In a calculation of a detonation interacting with a half-cylinder, in
> a large domain and high resolution, the calculation eventually hangs
> after a number of time steps.
>
> An example is the script run_circle_detonation i uploaded today.
> After the 39th output (few hours on a single processor machine), the
> calculation hangs. Doing "top" to obtain the processes running shows
> that the Amrita process has stopped, and only a Perl process remains,
> which controls 100% of my CPU, but consumes practically no RAM.

As described in the VKI notes, AMRITA has a front-end and a back-end.
The Perl job is the AMRITA interpreter that parses your script, and the
back-end is the mesh refinement engine that carries out the
actual simulation. The two ends run as separate UNIX processes
that communicate via named pipes. Change directory to
/tmp/amrita-xyz/amrita/pipe, where xyz is your login name,
and you'll see a series of islin/islout files that are used
for the communication. Thus what you're observing is the back-end
dying, leaving the front-end hanging. Note that amrkill can be used to
clean up a hanging front-end.
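
The symptom described above (back-end gone, front-end still spinning) can be detected from outside with a liveness check on the back-end's process ID. The following is a minimal Python sketch and not part of AMRITA; the function name backend_alive is hypothetical, and it relies only on the POSIX convention that signal 0 tests whether a process exists without delivering anything to it.

```python
import os

def backend_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)           # signal 0: existence check, nothing is delivered
    except ProcessLookupError:    # no such process: the back-end has died
        return False
    except PermissionError:       # process exists but is owned by another user
        return True
    return True

if __name__ == "__main__":
    # Checking our own PID as a trivial demonstration.
    print(backend_alive(os.getpid()))
```

A wrapper script could poll such a check on the PID shown by "top" and call amrkill once the back-end has disappeared.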

At this end your script runs fine, see the attached file sch46.pdf that
shows a time shot around iteration 150. Therefore without having access
to your machine so that I can check what compiler you're using and examine
your shell environment, there's no real way for me to pinpoint the
problem.

Gary's issue, when running in parallel, is slightly different.
There is a known problem with MPI that AMRITA's error messages
get lost owing to the way MPI buffers stderr from Fortran programs.
But this issue is fixed in my development version.

You should, however, take up his suggestion of check-pointing
intermediate solutions so that you can resume an aborted run.
There is also a problem with your script that suggests you are
not using Amrita's folding editor. Specifically, there is a closing
brace missing in the fold where ArraySizes is called; I inserted
a line 23 containing: " }".
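
A missing brace of this kind can also be caught mechanically before a script is sent round. The sketch below is a generic brace counter in Python, not an AMRITA tool; the function name is hypothetical, and it assumes every "{" and "}" in the file is structural (it does not skip braces inside strings or comments).

```python
def first_unbalanced_brace(text: str):
    """Return (line_number, kind) for the first brace problem, or None if balanced.

    kind is "unmatched }" for a closer with no opener, or "unclosed {" for an
    opener that is never closed (reported at the last opener still open).
    """
    open_lines = []                       # line numbers of '{' not yet closed
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch == "{":
                open_lines.append(lineno)
            elif ch == "}":
                if not open_lines:
                    return (lineno, "unmatched }")
                open_lines.pop()
    if open_lines:
        return (open_lines[-1], "unclosed {")
    return None

if __name__ == "__main__":
    script = "outer {\n  inner {\n  }\n"  # the closing brace of 'outer' is missing
    print(first_unbalanced_brace(script))
```

The reported line is only a hint at where the fold was opened; the closing brace may belong anywhere after it.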

Without the added line, the entire script is mangled when viewed with
amrgi, and as a matter of etiquette you can't really expect A.N. Other
to wade through such a listing.

You should also get into the habit of removing detritus from a script.
In the example you sent me, you do not actually plot vorticity
but all the custom code remains in the script. And it may well
turn out that it is the vorticity stuff that is causing your problem
rather than the base AMRITA installation; Brian has already alluded
to one problem you had in an earlier message. Thus as a first step
you should whittle the script down to its bare essentials and
then see if the spurious hanging remains.

James

Matei Radulescu

Sep 24, 2008, 6:03:23 PM
to amrita-ebook
James, Gary,
I'll try your recommendations and will get back to you regarding my
success, shortly, hopefully.
Thanks for your help.
matei
> At this end your script runs fine, see the attached file sch46.pdf that
> shows a time shot around iteration 150.  Therefore without having access
> to your machine so that I can check what compiler you're using and examine
> your shell environment, there's no real way for me to pinpoint the
> problem.
(You did not attach the file). However, I do not understand why it
bugs on my system and not on yours.
I can give you access to my machine if you feel that checking the
system can help here. Any particular hints of where and why it may
fail?
I'll give it a shot first and let you know.

>
> Gary's issue, when running in parallel, is slightly different.
> There is a known problem with MPI that AMRITA's error messages
> get lost owing to the way MPI buffers stderr from Fortran programs.
> But this issue is fixed in my development version.
>
> You should, however, take up his suggestion of check-pointing
> intermediate solutions so that you can resume an aborted run.

Yes, I will try that first. I had tried this before on a slightly
different script, restarting with cfl as low as 0.1, but to no avail.

> There is also a problem with your script that suggests you are
> not using Amrita's folding editor. Specifically, there is a closing
> brace missing in the fold  where ArraySizes is called; I inserted
> a line 23 containing: "            }".

yes, I realize that. I still do not feel too comfortable with the
origami editor.

>
> Without the added line, the entire script is mangled when viewed with
> amrgi, and as a matter of etiquette you can't really expect A.N. Other
> to wade through such a listing.
>
Point well taken.

> You should also get into the habit of removing detritus from a script.
> In the example you sent me, you do not actually plot vorticity
> but all the custom code remains in the script. And it may well
> turn out that it is the vorticity stuff that is causing your problem
> rather than the base AMRITA installation; Brian has already alluded
> to one problem you had in an earlier message. Thus as a first step
> you should whittle the script down to its bare essentials and
> then see if the spurious hanging remains.

Will clean up and try it on a bare bones version.

James Quirk

Sep 24, 2008, 6:42:10 PM
to Matei Radulescu, amrita-ebook
Matei,

On Wed, 24 Sep 2008, Matei Radulescu wrote:

>
> James, Gary,
> I'll try your recommendations and will get back to you regarding my
> success, shortly, hopefully.
> Thanks for your help.
> matei
> > At this end your script runs fine, see the attached file sch46.pdf that
> > shows a time shot around iteration 150.  Therefore without having access
> > to your machine so that I can check what compiler you're using and examine
> > your shell environment, there's no real way for me to pinpoint the
> > problem.
> (You did not attach the file). However, I do not understand why it
Sorry. See the attachment here.

> bugs on my system and not on yours.
It could be something in your .bashrc/.cshrc file, or it could
be a bug in your version of gcc, which has happened before.


> I can give you access to my machine if you feel that checking the
> system can help here. Any particular hints of where and why it may
> fail?
I am gearing up to release AMRITAv3.03 and so I'm unlikely
to have the time to track the problem down. What I suggest is that
I give you a pre-release in the next week to 10 days and
if 3.03 fixes the problem, great. If not, I will add it as
a bug to be fixed.

James
sch46.pdf

Matei Radulescu

Sep 24, 2008, 7:02:20 PM
to amrita-ebook
> > > At this end your script runs fine, see the attached file sch46.pdf that
> > > shows a time shot around iteration 150.  Therefore without having access
> > > to your machine so that I can check what compiler you're using and examine
> > > your shell environment, there's no real way for me to pinpoint the
> > > problem.
> > (You did not attach the file).  However, I do not understand why it
>
> Sorry. See the attachment here.

Hmm. Something is very odd about the time steps the solver takes.
My run fails after the detonation diffracts
around the cylinder, when a new detonative Mach stem is forming on the
bottom horizontal boundary. It seems that your counter runs faster
than mine!?
In any case, can you run the script and see if the detonation makes it
to the end of the domain? You may have to add more steps.
>
> > bugs on my system and not on yours.
>
> It could be something in your .bashrc/.cshrc file, or it could
> be a bug in your version of gcc, which has happened before.

my .bashrc simply contains:
eval `$HOME/AMRITA/AMRITAv3.00/tools/amrshell -setup bash`

typing
$ gcc --version
i get
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2_27)
---funny, that's exactly a day short of a year!
do you recognize any problems with this particular version?
>
> > I can give you access to my machine if you feel that checking the
> > system can help here.  Any particular hints of where and why it may
> > fail?
>
> I am gearing up to release AMRITAv3.03 and so I'm unlikely
> to have the time to track the problem down. What I suggest is that
> I give you a pre-release in the next week to 10 days and
> if 3.03 fixes the problem, great. If not, I will add it as
> a bug to be fixed.

Let's first sort out this difference in time steps. Can you run a
fixed number of time steps, say 10 steps with cfl=0.5, using my script
and output the time and schlieren, while I do the same, and we compare?
>  sch46.pdf

Matei Radulescu

Sep 24, 2008, 8:00:00 PM
to amrita-ebook
James,
>
> Let's first sort out this difference in time steps. can you run a
> fixed number of time steps, say 10 steps with cfl =0.5 using my script
> and output the time and schlieren, and I do the same, and we compare?
>
Just to set the clock straight: after marching 10 steps with my script
with cfl=0.5,
I get to the time 0.4303065231255236.
I have uploaded the corresponding mailit, entitled
run_circle_detonation_test.mailit,
which contains the ps files for the schlieren, etc.

matei

James Quirk

Sep 24, 2008, 9:00:56 PM
to amrita-ebook
Matei,

The discrepancy arose because I misread your original e-mail. You said
that the problem occurred after the 39th output, but I took that to mean
the 39th iteration. Hence when I ran the script I reduced the march
from 20 steps to 5 so that I would have enough checkpointed solutions to
look at what was going on.

Now given that you were able to run the simulation out to late time
I'm fairly sure I know what the problem is. Amr_sol, as shipped, has an
internal limit on the number of mesh patches set at 2^10 i.e. 4096.
This limit is irrespective of the number fed into ArraySizes and
arises from the way the mesh interconnectivity was shoehorned
into an 8MByte machine back in 1991. Given the memory available
today I have revisited the data storage and v3.03 will allow for
2^20 meshes, which should be good for the next 10-15 years.

Normally the mesh limit does not come into play, but
you are using five grid levels with a refinement ratio of 2
and so you end up with an unnecessarily large number of small
patches (there are 2000+ at the early time I ran). I would
probably only have run 2 levels with a ratio of 4 and adjusted
the base grid to suit. Note there is not much to be gained
by using extra grid levels as the work on the finest mesh
dominates. Also, the more levels you use the less frequently
the time step can be adjusted to suit the CFL condition,
which is not good for highly dynamic flows.

Anyhow, I suggest you do the following. Modify your
script to output checkpointed solutions with the flowout command.
Run the simulation until it fails then read in the last
available solution and run the following:

set npatches = 0
datastructure
do l=0,5
set npatches += $$nga'l
end do
echo $npatches

and I'm fairly sure that you'll find the number of patches
is close to the 2^10 limit.

If this proves to be the case, I'll ship you a prerelease
of 3.03 with its enlarged storage tables to see if that
fixes the problem.

James

On Wed, 24 Sep 2008, Matei Radulescu wrote:

Matei Radulescu

Sep 25, 2008, 10:57:22 AM
to amrita-ebook

> internal limit on the number of mesh patches set at 2^10 i.e. 4096.

you mean 2^12=4096?

> This limit is irrespective of the number fed into ArraySizes and
> arises from the way the mesh interconnectivity was shoehorned
> into an 8MByte machine back in 1991. Given the memory available
> today I have revisited the data storage and v3.03 will allow for
> 2^20 meshes, which should be good for the next 10-15 years.
>
> Normally the mesh limit does not come into play, but
> you are using five grid levels with a refinement ratio of 2
> and so you end up with an unnecessarily large number of small
> patches (there are 2000+ at the early time I ran). I would
> probably only have run 2 levels with a ratio of 4 and adjusted
> the base grid to suit. Note there is not much to be gained
> by using extra grid levels as the work on the finest mesh
> dominates.

The reason here was to be able to do an appropriate resolution study
and not have to compare apples with oranges.

> Also, the more levels you use the less frequently
> the time step can be adjusted to suit the CFL condition,
> which is not good for highly dynamic flows.

Isn't the time step always adjusted based on your most resolved Dx?
In that sense, should it matter how many levels there are "above"
that?

>
> Anyhow, I suggest you do the following. Modify your
> script to output checkpointed solutions with the flowout command.
> Run the simulation until it fails then read in the last
> available solution and run the following:
>
> set npatches = 0
> datastructure
> do l=0,5
> set npatches += $$nga'l
> end do
> echo $npatches
>
> and I'm fairly sure that you'll find the number of patches
> is close to the 2^10 limit.

If indeed the max no. of patches is 2^12=4096, then indeed, at a
time slightly earlier than the breaking point, there are 4081 patches,
and the count is likely to increase as the run advances due to the physical
situation.

It thus seems that you have identified the problem. I suspected something
along those lines, since the breaking point didn't correlate with a
numerical difficulty per se, but rather with the size of the domain
and the grid levels I was using (see my original post).
>
> If this proves to be the case, I'll ship you a prerelease
> of 3.03 with its enlarged storage tables to see if that
> fixes the problem.
>
I'd very much appreciate that. It's been a few weeks now that I've been
trying to work around this problem!
I can give you access to my machine, or I can download it via ftp from
wherever.

Matei Radulescu

Sep 25, 2008, 11:01:44 AM
to amrita-ebook
The lack of an error message upon crashing, reported in my original
post, was due to the missing brace pointed out by James,
which perhaps prevented the error from being captured properly. Adding the
"}" at the critical location makes the error capturing work, and I
get:
Error at line 216 of file run_circle_detonation_simple:
march generated NaN!

Line 216 is:
march 20 steps with cfl=0.5

error near:


James Quirk

Sep 25, 2008, 11:34:29 AM
to Matei Radulescu, amrita-ebook
On Thu, 25 Sep 2008, Matei Radulescu wrote:

>
>
> > internal limit on the number of mesh patches set at 2^10 i.e. 4096.
>
> you mean 2^12=4096?

Sorry for the brain fade, but I'm suffering with a bad head cold
at the moment. If you take a look in the file:

$AMRITA/include/f77/AMR_SOL/AMRITA

you will see a variable MASK2 which is three nibbles wide i.e. 0xFFF.
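
The arithmetic behind the limit can be checked in a few lines. This Python sketch only restates the numbers in this thread (a three-nibble mask and the 4081 patches reported earlier); how AMRITA actually uses MASK2, and whether any index value is reserved, is not shown here and is an assumption.

```python
MASK2 = 0xFFF                        # three nibbles, as quoted from the include file

bits = MASK2.bit_length()            # 3 nibbles * 4 bits = 12 bits
max_index = MASK2                    # largest value a 12-bit field can hold
capacity = 1 << bits                 # number of distinct values: 2^12

patches_seen = 4081                  # patch count reported earlier in this thread
headroom = capacity - patches_seen   # patches left before the field overflows

print(bits, max_index, capacity, headroom)   # 12 4095 4096 15
```

With only 15 patches of headroom, any further refinement as the detonation front grows would be enough to overflow a 12-bit patch index.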


>
> > This limit is irrespective of the number fed into ArraySizes and
> > arises from the way the mesh interconnectivity was shoehorned
> > into an 8MByte machine back in 1991. Given the memory available
> > today I have revisited the data storage and v3.03 will allow for
> > 2^20 meshes, which should be good for the next 10-15 years.
> >
> > Normally the mesh limit does not come into play, but
> > you are using five grid levels with a refinement ratio of 2
> > and so you end up with an unnecessarily large number of small
> > patches (there are 2000+ at the early time I ran). I would
> > probably only have run 2 levels with a ratio of 4 and adjusted
> > the base grid to suit. Note there is not much to be gained
> > by using extra grid levels as the work on the finest mesh
> > dominates.
>
> The reason here was to be able to do an appropriate resolution study
> and not have to compare apples with oranges.
>
> > Also, the more levels you use the less frequently
> > the time step can be adjusted to suit the CFL condition,
> > which is not good for highly dynamic flows.
>
> Isn't the time step always adjusted based on your most resolved Dx?
> In that sense, should it not matter how many levels "above" that there
> are?

The time step is decided by looking at the most restrictive case
from all the grid levels. But because of the way the temporal refinement
is orchestrated, the stable time step can only be selected when the
coarse grid is integrated. Thus with lmax=5 and r=2, there will
be 2^5 steps on the fine mesh with a fixed dt.
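
As a quick check of the count above, here is a small Python sketch. It assumes every level is refined in time by the same ratio as in space; the function name is hypothetical and nothing here is AMRITA code.

```python
def fine_steps_per_coarse_step(refinement_ratio: int, levels: int) -> int:
    """Finest-grid steps taken with a fixed dt per coarse-grid step,
    assuming each level subcycles by the same refinement ratio."""
    return refinement_ratio ** levels

print(fine_steps_per_coarse_step(2, 5))   # 32: the lmax=5, r=2 case
print(fine_steps_per_coarse_step(4, 2))   # 16: two levels with a ratio of 4
```

Under that assumption, the two-level, ratio-4 setup suggested earlier halves the number of fine steps taken between opportunities to re-select dt.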


>
> >
> > Anyhow, I suggest you do the following. Modify your
> > script to output checkpointed solutions with the flowout command.
> > Run the simulation until it fails then read in the last
> > available solution and run the following:
> >
> > set npatches = 0
> > datastructure
> > do l=0,5
> > set npatches += $$nga'l
> > end do
> > echo $npatches
> >
> > and I'm fairly sure that you'll find the number of patches
> > is close to the 2^10 limit.
>
> If indeed the max no. of patches is 2^12=4096, then indeed, at a
> time slightly earlier than the breaking point, there are 4081 patches,
> and the count is likely to increase as the run advances due to the physical
> situation.

Good. That's what I figured.

>
> It thus seems that you identified the problem. I suspected something
> along those lines, since the breaking point didn't correlate with a
> numerical difficulty per se, but rather with the size of the domain
> and grid levels i was using. (see my original post)
> >
> > If this proves to be the case, I'll ship you a prerelease
> > of 3.03 with its enlarged storage tables to see if that
> > fixes the problem.
> >
> I'd very much appreciate that. It's been a few weeks now that I've been
> trying to work around this problem!
> I can give you access to my machine, or I can download it via ftp from
> wherever.
> Thanks for your help.

I'll put something together for you, but it will probably be
towards the end of next week. I have the serial case working,
with the upped storage limits, but I want to get the parallel
version working before I ship it, just to make sure that
I haven't missed any snafus.

James



Matei Radulescu

Sep 26, 2008, 9:42:16 AM
to amrita-ebook
> > > Also, the more levels you use the less frequently
> > > the time step can be adjusted to suit the CFL condition,
> > > which is not good for highly dynamic flows.
>
> > Isn't the time step always adjusted based on your most resolved Dx?
> > In that sense, should it not matter how many levels "above" that there
> > are?
>
> The time step is decided by looking at the most restrictive case
> from all the grid levels. But because of the way the temporal refinement
> is orchestrated, the stable time step can only be selected when the
> coarse grid is integrated. Thus with lmax=5 and r=2, there will
> be 2^5 steps on the fine mesh with a fixed dt.

Does that mean that I could run into the possibility that the CFL
condition is not satisfied during the time in which dt is fixed?
Say, all of a sudden, something ignites and gives large characteristic
velocities?
Do you have a flag somewhere to check that the CFL condition is not
broken during the fixed time steps on the finest grids needed to
make up one big step on the coarsest grid?
Thanks, let me know when this becomes available.

Speaking of which, if you release a new version, you could also fix a
minor bug in the 1-step chemistry routine:
$HOME/AMRITA/AMRITAv3.00/stdlib/equations/lib/ReactiveEulerEquations/ComputeZndProfile.amr
On line 85, it reads:
DT = 1.0/1000
I had to change this to something like
C DT = 0.25/FLOAT(Npts) #mir
but I haven't tested it sufficiently.
I ran into this when I was studying the small-scale flowfield of the
triple-point structure and needed a very fine initial discretization
of the ZND profile, with up to 1000 pts per half reaction length.
The original line 85 restricted my ZND profile to a max of ~256 pts
per half reaction length or so.

James Quirk

Sep 26, 2008, 10:21:52 AM
to Matei Radulescu, amrita-ebook
On Fri, 26 Sep 2008, Matei Radulescu wrote:

>
> > > > Also, the more levels you use the less frequently
> > > > the time step can be adjusted to suit the CFL condition,
> > > > which is not good for highly dynamic flows.
> >
> > > Isn't the time step always adjusted based on your most resolved Dx?
> > > In that sense, should it not matter how many levels "above" that there
> > > are?
> >
> > The time step is decided by looking at the most restrictive case
> > from all the grid levels. But because of the way the temporal refinement
> > is orchestrated, the stable time step can only be selected when the
> > coarse grid is integrated. Thus with lmax=5 and r=2, there will
> > be 2^5 steps on the fine mesh with a fixed dt.
>
> Does that mean that I could run into the possibility that the CFL
> condition is not satisfied during the time in which dt is fixed?
> Say, all of a sudden, something ignites and gives large characteristic
> velocities?

Yes, thermal runaway processes could present a problem.

> Do you have a flag somewhere to check that the CFL condition is not
> broken during the fixed time steps on the finest grids needed to
> make up one big step on the coarsest grid?

No. But there is nothing to prevent the savvy user from building
one into his or her patch-integrator.
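
As an illustration of what such a user-built flag might compute, here is a minimal Python sketch. It is not AMRITA's patch-integrator interface; the function names are hypothetical, and the stability limit of 1.0 is an assumption that depends on the scheme in use.

```python
def courant_number(max_wave_speed: float, dt: float, dx: float) -> float:
    """Courant number for one cell: fastest characteristic speed times dt / dx."""
    return max_wave_speed * dt / dx

def cfl_violated(max_wave_speed: float, dt: float, dx: float,
                 cfl_limit: float = 1.0) -> bool:
    """True if the fixed dt is too large for the current wave speeds."""
    return courant_number(max_wave_speed, dt, dx) > cfl_limit

# dt chosen at the coarse step for cfl=0.5 with a wave speed of 2.0 on dx=0.01
dx = 0.01
dt = 0.5 * dx / 2.0

print(cfl_violated(2.0, dt, dx))   # wave speed unchanged: Courant number about 0.5
print(cfl_violated(5.0, dt, dx))   # sudden ignition: Courant number about 1.25
```

Evaluated on every fine-grid step, a check like this would at least report when a thermal runaway has outrun the dt fixed at the last coarse step.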

> > > I'd very much appreciate that. It's been a few weeks now that I've been
> > > trying to work around this problem!
> > > I can give you access to my machine, or I can download it via ftp from
> > > wherever.
> > > Thanks for your help.
> >
> > I'll put something together for you, but it will probably be
> > towards the end of next week. I have the serial case working,
> > with the upped storage limits, but I want to get the parallel
> > version working before I ship it, just to make sure that
> > I haven't missed any snafus.
> >
> Thanks, let me know when this becomes available.

I'm currently running your case in serial mode and have just got
to phase 32. Therefore I should know later today whether or not
v3.03 fixes the problem.

>
> Speaking of which, if you release a new version, you could also fix a
> minor bug in the 1-step chemistry routine:
> $HOME/AMRITA/AMRITAv3.00/stdlib/equations/lib/ReactiveEulerEquations/ComputeZndProfile.amr
> On line 85, it reads:
> DT = 1.0/1000
> I had to change this to something like
> C DT = 0.25/FLOAT(Npts) #mir
> but I haven't tested it sufficiently.
> I ran into this when I was studying the small-scale flowfield of the
> triple-point structure and needed a very fine initial discretization
> of the ZND profile, with up to 1000 pts per half reaction length.
> The original line 85 restricted my ZND profile to a max of ~256 pts
> per half reaction length or so.

I will take a look at it to see how DT can be parameterized.

James

James Quirk

Sep 28, 2008, 11:42:26 AM
to Matei Radulescu, amrita-ebook
Matei,

Using AMRITAv3.03, I've been able to run your script out to 1000
iterations; see the attached tmp.pdf. And so it looks like the upgrade I
mentioned has done the trick. Of course, I would not be surprised if,
were you to run a full-blown parameter study for a detonation diffracting
around a cylinder, you sooner or later ran into problems, for the
roe-glister integrator is not positivity preserving, i.e. it can result in
negative pressures.
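
To make "negative pressures" concrete, the following Python sketch recovers pressure from conserved variables and flags unphysical states. It assumes a one-dimensional, non-reactive ideal gas with gamma=1.4; a reactive model would also subtract the chemical energy, and none of these names come from AMRITA.

```python
def pressure(rho: float, mom: float, energy: float, gamma: float = 1.4) -> float:
    """Ideal-gas pressure from density, momentum and total energy per volume:
    p = (gamma - 1) * (E - 0.5 * m^2 / rho)."""
    return (gamma - 1.0) * (energy - 0.5 * mom * mom / rho)

def is_physical(rho: float, mom: float, energy: float, gamma: float = 1.4) -> bool:
    """A state is physical only if both density and pressure are positive."""
    return rho > 0.0 and pressure(rho, mom, energy, gamma) > 0.0

# Gas at rest: all of the energy is internal, so the pressure is positive.
print(is_physical(1.0, 0.0, 2.5))
# Kinetic energy (4.5) exceeds total energy (4.0): the pressure comes out negative.
print(is_physical(1.0, 3.0, 4.0))
```

The second state shows how the failure arises: when an update leaves the total energy below the kinetic energy, the internal energy, and hence the pressure, goes negative.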

James

tmp.pdf

Matei Radulescu

Sep 29, 2008, 8:59:43 AM
to amrita-ebook
James,
Thanks, I'm glad the problem is resolved.
Regarding the possibility of negative pressures, that was originally
my concern. However, looking at the profiles, I have not encountered
this problem yet. Which part of the flow is likely to suffer from
this type of problem: the steady expansion originating from the
throat, which reduces the pressure to low values?

Can you recommend another scheme that is positivity preserving, and
would be appropriate for this type of calculation?
matei
> tmp.pdf

James Quirk

Sep 29, 2008, 9:24:15 AM
to Matei Radulescu, amrita-ebook
Matei,

On Mon, 29 Sep 2008, Matei Radulescu wrote:

>
> James,
> Thanks, I'm glad the problem is resolved.
> Regarding the possibility of negative pressures, that was originally
> my concern. However, looking at the profiles, I did not encounter
> this problem, yet. What part of the flow is likely to suffer from
> this type of problem, the steady expansion originating from the
> throat, which reduces the pressure to low values?

Yes. Any region of strongly expanding flow is a candidate.

>
> Can you recommend another scheme that is positivity preserving, and
> would be appropriate for this type of calculation?

It's not as simple as switching to a more robust scheme, for you'll
likely find its resolution is poor. A better way forward is to
employ two schemes and switch between them dynamically. But BCG
is not set up to deliver such a scheme.

James
