Fork failing, but WHY??

ce...@compuserve.com

Sep 17, 1999, 3:00:00 AM

I need some help trying to figure out why a client's program is crashing with
an EAGAIN (11) error returned from fork(S).

Here's the environment (basically the same for the two machines described
below): Clean install of OSR 5.0.5 with RS505A, UDK Compatibility and
OSS497B installed, at least 128Mb ram, at least 512Mb swap. Same stune file.
All testing / sleuthing on the client machine was done on a Saturday, with
no other users.

Here's the scenario: I wrote a program using a 4GL called Speedware, which
shells to the OS to create a temporary file (with touch), run a section of
code that uses that file, and upon return, shells out again to remove the
temporary file.

On my development system, I can run this procedure an "infinite" number of
times successfully. On my client's system, things start failing quickly.
Namely, upon the first time running it, the first fork succeeds, creating the
temporary file. Upon returning, however, the second fork fails, leaving the
temporary file behind. If the logic is re-run, the first fork fails, so the
program crashes (with "file not found") because it does not find the
temporary file.

I turned on auditing, which basically told me that the fork(s) was failing
with EAGAIN (11, no more processes). Nothing else seemed strange in the
audit report.

Checking the messages(M) man page, it says that EAGAIN is caused by (a) the
process table being full (but this is dynamically allocated in OSR 5.0.5, and
I have proven that the table will grow on my client's machine, so it seems
unlikely that this is the problem); (b) swap is full (but sar and other swap
reporting programs tell me that it isn't even being used (512Mb free of
512Mb)); (c) EXEC failed due to insufficient number of pages available to
load executable (but I'm not exec(S)-ing...); or (d) a lock failed on a file
or record that was already locked (which I suppose is a possibility, but how
do I find out? And why wouldn't this be the case on my machine?).

I have tried raising some of the kernel tunable parameters. I changed MAXUP (#
processes per user id) from 100 to 200, and the client machine crashes in
exactly the same spot. (I would expect that if this was the problem then I'd
be able to run the program a few more times before a crash, if it crashed at
all.) I changed NOFILES (number of open files per process) from 110 to 220
and then to 2500, and again, the program crashed in the exact same spot. (lsof
shows a maximum of about 75 files for any given process for the user at any
given time when running simultaneously with a test run that causes the crash.)

I have made sure that the two systems are running the same version of
Speedware, and that all of Speedware's configuration files are the same.

I have downloaded my client's database to my machine, and was able to run the
process without any crashes. I copied my test database to my client's
machine, and it crashes every time. This leads me to believe the problem is
not data related.

There are no messages displayed on the system console, or in
/usr/adm/messages, or /usr/adm/syslog, when this problem arises.

The client wasn't having this problem initially with the system. It seemed
to start after I was on site one Saturday. Unfortunately, they didn't tell
me this for several weeks, and my notes are sketchy as to anything out of the
ordinary that I did. (My memory is out of the question! :->) I may have
installed OSS497B that day. So, I tried to uninstall OSS497B, but that
didn't help. (I have since re-installed it.)

The fork(S) man page didn't seem to shed any light on the subject.

It was suggested to me that there might be something different in the
environment, such as PATH and permissions. But if that were the case,
wouldn't the first fork fail the first time it was executed? (I don't
believe the Speedware program should be changing the environment while
it's running.)

I ran Verify System (thorough), and nothing of note showed up.

I just ordered SarCheck, so maybe that will help (but I need to wait for it
to be shipped to me). I'm going to try using the Skunkware version later
today...

If anyone has any suggestions, or needs additional information, or just wants
to console me (:->), please post.

The only thing left that I can think of is to re-install the OS, but that may
take quite a long time. (I'll probably try an "in place upgrade" first, so I
don't have to reconfigure everything if it works. Otherwise, I guess I'll
try a clean install <UGH!>.)

Thanks for any help, comments, or suggestions!

Carl Sopchak

Sent via Deja.com http://www.deja.com/
Share what you know. Learn what you don't.

ce...@compuserve.com

Sep 17, 1999, 3:00:00 AM

Some additional information (I think this got lost the first time I tried to
post it)...

1) The program crashes when a 'normal' user runs it, and when root runs it.
This leads me to believe that it is not a permissions problem.

2) I just ran SarCheck UltraLight (from Skunkware 98) over the output from
"sar -A 5 36" which was run in the background while I caused the program to
crash a couple of times. Besides complaining of a sampling interval that was
too short, it gave the following:

<Quote>

The following indication(s) of a memory shortage were seen: The reclaim rate
was at least one quarter of the page fault rate in only 0.0 percent of the
samples. This statistic can be used to confirm the presence of an occasional
memory-poor condition.

The average swap out transfer request rate was 23.8 per second, which is an
indication of a memory-poor condition.

The amount of freeswp did not change during the monitoring period, indicating
that the system has plenty of memory.

</Quote>

I'm confused! Two paragraphs claim a memory-poor system, while the third
says that there's plenty of memory. Might this be due to the short sampling
period? Would these messages be given if there were NO swapping over the
sampling period? I recall running across information in SCO documentation
stating that EAGAIN could be given in cases of insufficient memory. Is there
a kernel parameter that limits memory usage per process or per user id?

I can't wait to get the full version of SarCheck to see if it can shed any
more light on this...

If anyone has any suggestions or comments, they would be greatly appreciated!

Thanks,

Carl

larry

Sep 17, 1999, 3:00:00 AM

I'm not familiar with Speedware, but some thoughts anyway.

ce...@compuserve.com wrote in message <7rtv3s$jke$1...@nnrp1.deja.com>...

>I need some help trying to figure out why a client's program is crashing
>with an EAGAIN (11) error returned from fork(S).
>
>Here's the environment (basically the same for the two machines described
>below): Clean install of OSR 5.0.5 with RS505A, UDK Compatibility and
>OSS497B installed, at least 128Mb ram, at least 512Mb swap. Same stune
>file. All testing / sleuthing on the client machine was done on a Saturday,
>with no other users.
>
>Here's the scenario: I wrote a program using a 4GL called Speedware, which
>shells to the OS to create a temporary file (with touch), run a section of
>code that uses that file, and upon return, shells out again to remove the
>temporary file.

I'm not familiar with using shell as a verb without at least an implied
reference to shellfish or military ordnance. I assume that by "shell to" you
mean starting another instance of the shell program, and that "shell out"
means exit.

>On my development system, I can run this procedure an "infinite" number of
>times successfully. On my client's system, things start failing quickly.
>Namely, upon the first time running it, the first fork succeeds, creating
>the temporary file. Upon returning, however, the second fork fails, leaving
>the temporary file behind. If the logic is re-run, the first fork fails, so
>the program crashes (with "file not found") because it does not find the
>temporary file.
>
>I turned on auditing, which basically told me that the fork(s) was failing
>with EAGAIN (11, no more processes). Nothing else seemed strange in the
>audit report.
>
>Checking the messages(M) man page, it says that EAGAIN is caused by (a) the
>process table being full (but this is dynamically allocated in OSR 5.0.5,
>and I have proven that the table will grow on my client's machine, so it
>seems unlikely that this is the problem); (b) swap is full (but sar and
>other swap reporting programs tell me that it isn't even being used (512Mb
>free of 512Mb)); (c) EXEC failed due to insufficient number of pages
>available to load executable (but I'm not exec(S)-ing...);

Of course you are. How do you think the new instance of the shell is being
run?

Try monitoring the process from another terminal or multiterm and run ps.

Write as much information as you can to the screen and/or a file before and
right after the call: free memory, processes, file existence, and so on. You
can look up the details in the docs more easily than I can.

Try deleting stuff from the program until you only have what causes the
problem and nothing extraneous.

>
>Carl Sopchak

hth

larry

Bela Lubkin

Sep 17, 1999, 3:00:00 AM

Carl Sopchak wrote:

> I need some help trying to figure out why a client's program is crashing with
> an EAGAIN (11) error returned from fork(S).
>
> Here's the environment (basically the same for the two machines described
> below): Clean install of OSR 5.0.5 with RS505A, UDK Compatibility and
> OSS497B installed, at least 128Mb ram, at least 512Mb swap. Same stune file.
> All testing / sleuthing on the client machine was done on a Saturday, with
> no other users.
>
> Here's the scenario: I wrote a program using a 4GL called Speedware, which
> shells to the OS to create a temporary file (with touch), run a section of
> code that uses that file, and upon return, shells out again to remove the
> temporary file.

I'm not familiar with Speedware. But the fact that you've installed the
UDK runtime sets off a possible alarm: are the Speedware binaries you're
running UnixWare binaries? I've done some kernel analysis below. I
didn't look into the UDK shim libraries. It is possible that the UDK
libc's fork() function does more than just call the underlying
OpenServer fork(); it might do other things which could also return
EAGAIN; or it might in fact set errno = EAGAIN itself, under certain
conditions. (I'm not saying this is *likely*, just that I haven't
investigated it, making a hole in my analysis below.)

> On my development system, I can run this procedure an "infinite" number of
> times successfully. On my client's system, things start failing quickly.
> Namely, upon the first time running it, the first fork succeeds, creating the
> temporary file. Upon returning, however, the second fork fails, leaving the
> temporary file behind. If the logic is re-run, the first fork fails, so the
> program crashes (with "file not found") because it does not find the
> temporary file.

If I hadn't seen the rest of the description, I would think to myself:
"aha -- the process that contains the Speedware execution environment
has grown tremendously between forks #1 and #2; so much so that by
attempt #2, it cannot be contained twice in memory". fork() does *not*
actually duplicate much of the process. However, it sets all of the
process's writable pages to copy-on-write. This invokes kernel memory
accounting which worries that those pages might *actually* get copied;
thus, fork() will fail if the system couldn't contain two complete
copies of the process.
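
The accounting described above can be pictured with a toy model. This is only a sketch of the reservation idea, not the OpenServer kernel's actual logic; the class name `SwapLedger`, its methods and all of the page counts are made up for illustration.

```python
# Toy model of fork() memory accounting (illustrative only; not kernel code).
# The name SwapLedger and all page counts are hypothetical.

class SwapLedger:
    """Tracks pages that may still be promised to processes."""

    def __init__(self, available_pages):
        self.available_pages = available_pages

    def fork(self, writable_pages):
        """Reserve room for a full copy of the parent's writable pages.

        Returns True if the reservation fits, False where a real fork()
        would fail with EAGAIN. Nothing is copied; pages are only promised.
        """
        if writable_pages > self.available_pages:
            return False
        self.available_pages -= writable_pages
        return True

    def child_exit(self, writable_pages):
        """Give the reservation back when the child goes away."""
        self.available_pages += writable_pages


ledger = SwapLedger(available_pages=100_000)

# First fork: the parent is still small, so the reservation fits.
first = ledger.fork(writable_pages=20_000)
ledger.child_exit(writable_pages=20_000)

# The parent then grows past what is left, so the second fork is refused
# even though no page has actually been copied or swapped out.
second = ledger.fork(writable_pages=120_000)

print(first, second)
```

The point of the model is that the refusal depends on what *could* be copied, which is why a tool that only reports pages actually in use can show plenty of free swap while fork() still fails.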

So one thing to do is examine the size of the process at both points in
time. Put sleep(1000) calls in to give yourself time to poke around
(you can prod the process back to life with `kill -ALRM <pid>`).
`ps -o vsz -p <pid>` gives a decent approximation of process size.

> I turned on auditing, which basically told me that the fork(s) was failing
> with EAGAIN (11, no more processes). Nothing else seemed strange in the
> audit report.

"no more processes" is an archaic translation of EAGAIN. The current
one is "resource temporarily unavailable".

> Checking the messages(M) man page, it says that EAGAIN is caused by (a) the
> process table being full (but this is dynamically allocated in OSR 5.0.5, and
> I have proven that the table will grow on my client's machine, so it seems
> unlikely that this is the problem); (b) swap is full (but sar and other swap
> reporting programs tell me that it isn't even being used (512Mb free of
> 512Mb)); (c) EXEC failed due to insufficient number of pages available to
> load executable (but I'm not exec(S)-ing...); or (d) a lock failed on a file
> or record that was already locked (which I suppose is a possibility, but how
> do I find out? And why wouldn't this be the case on my machine?).

You can get EAGAIN from locking system calls, meaning "you can't lock
that now, someone else has it [resource temporarily unavailable], but
feel free to try again later". Doesn't apply to fork().
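
For comparison, here is roughly what the locking flavour of EAGAIN looks like. This is a sketch under assumptions: it uses flock() through Python's `fcntl` module as a stand-in for whatever locking calls a given program uses, and the temporary file and the helper name `try_lock` exist only for the demonstration.

```python
# Sketch: a non-blocking lock request refused because the lock is held.
# flock() stands in for the locking calls discussed above; the temporary
# file and the helper name try_lock are made up for this example.
import errno
import fcntl
import os
import tempfile

def try_lock(fd):
    """Try to take an exclusive lock without waiting.

    Returns 0 on success, or the errno value if the lock is held elsewhere.
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return 0
    except OSError as exc:
        return exc.errno

tmp_fd, tmp_path = tempfile.mkstemp()
holder = os.open(tmp_path, os.O_RDWR)      # first open of the file
contender = os.open(tmp_path, os.O_RDWR)   # second, independent open

first_result = try_lock(holder)       # takes the lock
second_result = try_lock(contender)   # refused: someone else has it

print(first_result, errno.errorcode.get(second_result))

for fd in (holder, contender, tmp_fd):
    os.close(fd)
os.unlink(tmp_path)
```

The second request fails immediately with "try again later" semantics, which is a different situation from fork() being unable to reserve resources.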

Suppose Speedware had an environment variable
EAT_THIS_MUCH_MEMORY_AFTER_FIRST_FORK... Setting it to 400MB, only on
the client system, would cause the behavior you're seeing. Ok, not very
likely. But suppose it has a bug that makes it sensitive to some other
environmental difference, and leads to the same behavior.

Use `lsof` to get a look at the shared objects linked into each process.
Confirm that they're the same on both systems.

> I ran Verify System (thorough), and nothing of note showed up.
>
> I just ordered SarCheck, so maybe that will help (but I need to wait for it
> to be shipped to me). I'm going to try using the Skunkware version later
> today...
>
> If anyone has any suggestions, or needs additional information, or just wants
> to console me (:->), please post.
>
> The only thing left that I can think of is to re-install the OS, but that may
> take quite a long time. (I'll probably try an "in place upgrade" first, so I
> don't have to reconfigure everything if it works. Otherwise, I guess I'll
> try a clean install <UGH!>.)

Reinstalling the OS to fix mysterious problems is a Windows solution.

=============================================================================

As far as I can determine from current OpenServer 5.0.5 source, fork()
can return EAGAIN in the following cases:

- in various special cases having to do with virtual 8086 processes
  (Merge or VP/ix), which I will not try to enumerate

* in many ways which print kernel warnings on the console, which I
  will not try to enumerate; the warnings make them obvious; generally
  having to do with lack of memory

* if creating the process would exceed MAXUP (maximum processes per
  user ID -- run `sysdef | grep MAXUP`)

* if there is no memory to grow a page table

* if the kernel is unlicensed and this process would exceed the limit
  for processes on an unlicensed kernel

- if one of its regions would represent the 61440th simultaneous
  in-core mapping of a file

- if one of its regions has the "grow down" flag but isn't the stack

- if the stack region doesn't have the "grow down" flag

- if one of its regions starts or ends above virtual address BFFFFFFF

- if one of its regions has end address < start address

- if two of its regions overlap

I've marked with "*" the cases which seem at all likely.

Make sure you've checked syslog for any kernel warnings (but you already
have). Check MAXUP against how many processes the test user is running
(`ps -u user`) (but you already have). Check available memory with
`sar -r 1 1`: on 5.0.5 this displays availrmem and availsmem, which are
really the critical variables.

The only probable candidate is the unlicensed kernel limit. You can
test this by e.g. running:

    sleep 1000 &

over and over; if you can generate 100 of those, your kernel is properly
licensed. (You could use the license manager etc., but the question
isn't "does the license manager think we have a valid license", it is
"does the *kernel* think so?".)

All of the region stuff is extremely improbable. But you could try to
check it anyway, just to be sure. Change the 4GL program so that the
two things it calls are shell scripts which start by doing a long sleep,
giving you some time to fiddle around in a debugger. During the sleep,
capture region information for the process. In `crash`, run "preg #pid"
on both the parent and child (the first time through); and on the parent
only, the 2nd time, since there is no child (fork failed!). If you want
to get fancy, you can also capture information about the regions
themselves. The "preg" command shows you which regions the process has
mapped in, e.g.:

    > preg #4084
    SLOT PREG REG#      REGVA  TYPE   FLAGS
      74    0  600  0x8046000  stack  rd wr cm
            1  643  0x8048000  text   rd ex cm
            2  746  0x8062000  data   rd wr cm
            3  611 0x80001000  lbtxt  rd ex pr
            4  763 0x80051000  lbdat  rd wr ex pr

You can carry the REG# information into the "reg" command:

    > reg 600 643 746 611 763
    REGION TABLE SIZE = 853
    Region-list:

    SLOT PGSZ VALID SMEM NONE SOFF REF SWP NSW FORW BACK INOX TYPE FLAGS
     600    2     2    2    0   70   1   0   0  746 ract    - priv nosh stack
     643   26    17    0    0   72   1   0   0  766  746  763 stxt nosh nosmem
     746  124   103   95    0   98   1   0   0  643  600  763 priv nosh
     611   80    61    0    0    1   1   0   0  730  763   24 map  nosh nosmem
     763   10    10    9    0   81   1   0   0  611    5   24 map  nosh

Then you can crosscheck; e.g. this process's pregion #3 maps region #611
starting at process virtual address 80001000. We see that region #611
is 80 pages long. 80 * 4K is 0x50000, so the region spans virtual
addresses 80001000-80050FFF. Pregion #4 starts at 80051000: no overlap,
and the regions butt up against each other, which seems like a probable
arrangement.
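
The crosscheck arithmetic can also be written down as a small sketch. It assumes 4K pages, as in the calculation above; the function names `region_span` and `overlaps` are made up for the example, and the inputs are the REGVA and PGSZ values from the sample output.

```python
# Sketch of the region crosscheck: compute each region's address span from
# its start address and page count, then test two spans for overlap.
# Assumes 4K pages; the function names are hypothetical.

PAGE_SIZE = 4096

def region_span(start, pages):
    """Return (first, last) virtual address covered by a region."""
    return start, start + pages * PAGE_SIZE - 1

def overlaps(a, b):
    """True if two (first, last) spans share any address."""
    return a[0] <= b[1] and b[0] <= a[1]

# Pregion #3 maps region 611 (80 pages); pregion #4 maps region 763 (10 pages).
lbtxt = region_span(0x80001000, 80)
lbdat = region_span(0x80051000, 10)

print(hex(lbtxt[0]), hex(lbtxt[1]))   # 0x80001000 0x80050fff
print(overlaps(lbtxt, lbdat))         # False
print(lbdat[0] == lbtxt[1] + 1)       # True: the regions butt up
```

Running the same check over every pair of pregions for the process would cover the "two of its regions overlap" case from the list.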

Good luck...

>Bela<

Warren Young

Sep 19, 1999, 3:00:00 AM

ce...@compuserve.com wrote:
>
> I need some help trying to figure out why a client's program is crashing with
> an EAGAIN (11) error returned from fork(S).

After reading through your thorough post, I can't suggest anything. I
came to the same conclusion you did: you're running out of either
process table entries or system memory, due to the fact that fork()
clones the entire process's memory space.

All I can think of is to try a simple test program that clones itself
repeatedly (up to a limit of, say, 50 processes) to see if it's
something basic in the system that's failing. If you can clone a basic
process this many times, you know the fork() syscall works, and that you
can create enough processes. The only thing left would be not having
enough memory to clone the process.
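
A minimal sketch of such a test program follows. It is written in Python, whose `os.fork()` calls the system's fork(); the limit of 50 comes from the suggestion above, while the function name `fork_many` and the way the children are parked and cleaned up are arbitrary choices for illustration.

```python
# Sketch of a basic fork() test: create up to LIMIT idle children and
# report how many were created before fork() failed (if it did).
# The function name and the cleanup strategy are illustrative choices.
import errno
import os
import signal

LIMIT = 50

def fork_many(limit):
    """Fork up to `limit` idle children; return (count, errno or None)."""
    children = []
    failure = None
    for _ in range(limit):
        try:
            pid = os.fork()
        except OSError as exc:
            failure = exc.errno
            break
        if pid == 0:
            # Child: stay alive until the parent ends the test.
            signal.pause()
            os._exit(0)
        children.append(pid)
    created = len(children)
    # Clean up: kill and reap every child that was created.
    for pid in children:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return created, failure

if __name__ == "__main__":
    created, failure = fork_many(LIMIT)
    if failure is None:
        print(f"created {created} processes")
    else:
        print(f"fork failed after {created}: {errno.errorcode.get(failure)}")
```

Because this process is tiny, a failure here would point at process limits or something basic in the system rather than at the size of the process being cloned.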

By the way, this post would be better in comp.unix.sco.programmer.

Good luck,
--
= Warren Young: www.cyberport.com/~tangent | Yesterday it worked.
= ICBM Address: 36.8274040N, 108.0204086W, | Today it is not working.
= alt. 1714m | Windows is like that.

Warren Young

Sep 20, 1999, 3:00:00 AM

Warren Young wrote:
>
> ce...@compuserve.com wrote:
> >
> > I need some help trying to figure out why a client's program is crashing with
> > an EAGAIN (11) error returned from fork(S).
>
> After reading through your thorough post, I can't suggest anything. I
> came to the same conclusion you did: you're running out of either
> process table entries or system memory, due to the fact that fork()
> clones the entire process's memory space.

Actually "running out of process table entries" isn't quite accurate.
Most Unix systems limit the number of processes a single user can create
not because of an internal table limit, but in order to limit the amount
of damage a "fork bomb" (a.k.a. "rabbit") can do.

So, one difference between your system and your customer's might be that
your customer is creating more processes on their machine than you are,
and not necessarily with the same programs. E.g., they might have a
background web server that is creating several dozen children, etc.

ce...@compuserve.com

Sep 24, 1999, 3:00:00 AM

I FIGURED IT OUT!!

Apparently, I WAS running out of swap, although I'm not too sure why at this
point. But since I've spent over 50 hours trying to nail this thing down,
I'm not going to bother wasting any more time on it.

Apparently, the fork() is copying the entire process's memory [or at least
reserving it]. It also appears that the process's memory allocation grew
between the first and second fork() due to the amount of data it was
processing, so the second and subsequent fork()s were running out of swap.
Also, apparently, all of the tools that I was using to look at swap were
giving point-in-time values (or perhaps averages), and because the child
process was only doing a 'touch', it wasn't lasting very long, so the lack
of swap was not apparent.

I have added swap with 'swap -a', and tripled the swap space for the
machine. The program doesn't crash (predictably) any more!

I'd like to thank those that posted suggestions, especially Bela, since his
post got me to try 'sar -r 1 1'. This running in the background showed me
that availsmem went from over 150,000 to under 67,000, which means that
80,000 pages(?) were used. If it needed to take another 80,000, it wouldn't
be available. Voila! My problem seemed apparent. Doubling the swap to 1Gb
showed availsmem going from over 278,000 to 7,909, and back to over 278,000
two seconds later! (But the program ran without crashing!) Again, I don't
know why the process is using 270,000 pages of swap, but as long as there's
enough, I don't care! I added another 512Mb swap, for safety's sake. If they
continue to run into the problem, I'll just add more swap!!
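
For anyone who wants the arithmetic spelled out, here is a rough sketch. It assumes availsmem is counted in 4K pages (I'm not certain of that), uses the approximate readings quoted above as sample values, and the function name `reservation` is made up.

```python
# Sketch: sizing the dip in availsmem from a series of samples. The dip is
# only visible if a sample lands while the short-lived child exists.
# Assumes 4K pages; sample values are the approximate readings quoted above.

PAGE_SIZE = 4096

def reservation(samples):
    """Return (pages, megabytes) for the largest dip below the first sample."""
    baseline = samples[0]
    pages = baseline - min(samples)
    return pages, pages * PAGE_SIZE / (1024 * 1024)

# With 512Mb of swap: one fork fits, a second of the same size would not.
pages_512, mb_512 = reservation([150_000, 67_000])
second_fits = pages_512 <= 67_000
print(pages_512, round(mb_512), second_fits)

# With 1Gb of swap: the full dip is visible and still (barely) fits.
pages_1g, mb_1g = reservation([278_000, 7_909, 278_000])
print(pages_1g, round(mb_1g))
```

Under those assumptions the dip seen with 1Gb of swap works out to a little over 1Gb of reserved memory, which is more than the original 512Mb of swap could ever have covered.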

Anyway, thanks again.

Carl
