Aaaarrghhh core file still not created!!! :-(

Carlos Moreno

Mar 22, 2003, 11:24:42 AM

Hi again,

I'm still clueless as to what other reason might be keeping
the core file from being created when my application crashes
due to a SIGSEGV.

The application is a server (based on sockets communications)
running on a RedHat Linux 7.3, on a dual-Athlon machine.

So, I don't want to run it through the debugger, since the
application is systematically crashing between midnight and
5AM (more or less at random, but always around that time),
so I'm not in front of the monitor to respond to it and
analyze. (and that would keep the service down for several
hours)

That's why I want the system to generate a core file when
the application crashes (I have a script that detects when
it crashes, and automatically restarts it), so that I can
come the morning after and analyze what happened. (BTW,
the application -- all the modules -- is compiled and
linked with the -g switch)

I put the line:

ulimit -c unlimited

in both .bashrc and .bash_profile (I didn't want to put
it in /etc/profile, since that would apply it to
all users as well). If I log in and run "ulimit -a", it does
report unlimited for the corresponding line.

I'm starting the application through the /etc/rc.local
script, with the line:

su -l -s /bin/sh -c "/home/user/app >> logout 2>> logerr &" user

If I'm not mistaken, this should make a true login,
including the execution of .bash_profile and .bashrc,
before executing the application.

Any ideas on why it's not working? Is it possible
that the core file is so excessively large that there
is no space left on the disk to create it? (I mean,
maybe the crash is happening because of an infinite
loop that allocates memory and that doesn't fail
gracefully when allocation fails?). Somehow I don't
think this would make sense -- the machine has 1GB
of memory, and the /home filesystem is mapped to a
partition that has more than 10GB of free space.

If creation of the core file failed, should I find
any error message in any logs? Where would I find
it?

Thanks,

Carlos
--

Kasper Dupont

Mar 22, 2003, 11:50:18 AM

Carlos Moreno wrote:
>
> If creation of the core file failed, should I find
> any error message in any logs? Where would I find
> it?

Do any of these reasons apply to your problem?
http://www.daimi.au.dk/~kasperd/comp.os.linux.development.faq.html#core

--
Kasper Dupont -- who spends too much time on usenet.
For sending spam use mailto:aaa...@daimi.au.dk
for(_=52;_;(_%5)||(_/=5),(_%5)&&(_-=2))putchar(_);

Carlos Moreno

Mar 22, 2003, 2:30:45 PM

Kasper Dupont wrote:
> Carlos Moreno wrote:
>
>>If creation of the core file failed, should I find
>>any error message in any logs? Where would I find
>>it?
>
> Do any of these reasons apply to your problem?
> http://www.daimi.au.dk/~kasperd/comp.os.linux.development.faq.html#core

Thanks!! That is a very informative/helpful link!!

The only reason that seems to apply is the fact that
my application does use threads (four or five threads).

I don't know if the kernel I'm using can dump core
for a threaded application (rpm -q kernel reports two
lines: 2.4.18-3 and 2.4.18-17.x -- not sure what it
means that it displays two lines). BTW,
the system is a dual-athlon, so it uses the SMP
version of the kernel.

In your FAQ, you mention that such kernels create
a file called core followed by the PID. That's what
I observed when doing some tests for the core file
creation (i.e., to verify the creation of the core
file, I wrote a test program purposely coded to
crash; it did, of course, and it created core
files with filenames core.<PID number>). Not sure
if that means that it should create the core file
for my multi-threaded application.

About the possibility of a kernel Oops. I'm not
sure where I should check for it. What I did was:

# cd /var/log
# grep -i oops *

And it returned nothing (well, it returned a few
lines with the words loops and moops, but nothing
about a kernel oops). Am I looking in the right
place?

At this rate, it would seem like I'll have to
settle for running directly through the debugger
(I'll switch my biological clock: sleep during
the day, and be awake overnight to debug the
bugger! :-))

Thanks!

Carlos
--

Kasper Dupont

Mar 22, 2003, 2:51:59 PM

Carlos Moreno wrote:
>
> Kasper Dupont wrote:
> > Carlos Moreno wrote:
> >
> >>If creation of the core file failed, should I find
> >>any error message in any logs? Where would I find
> >>it?
> >
> > Do any of these reasons apply to your problem?
> > http://www.daimi.au.dk/~kasperd/comp.os.linux.development.faq.html#core
>
> Thanks!! That is a very informative/helpful link!!
>
> The only reason that seems to apply is the fact that
> my application does use threads (four or five threads).
>
> I don't know if the kernel I'm using can dump core
> for a threaded application (rpm -q kernel reports two
> lines: 2.4.18-3 and 2.4.18-17.x -- not sure what it
> means that it displays two lines).

IIRC 2.4.18 can dump multithreaded programs. The two
version numbers sound like something modified by RedHat,
though I think the latter is slightly inaccurate. Don't
you mean 2.4.18-1.7.x or 2.4.18-17.7.x? You can see the
version number of the one you are using by typing
uname -r

> BTW,
> the system is a dual-athlon, so it uses the SMP
> version of the kernel.

Then the version number of the kernel should end with
smp. Look at the output of uname -r

>
> In your FAQ, you mention that such kernels create
> a file called core followed by the PID.

In recent kernels that can also be enabled for
single-threaded programs. In RH7.3 the default is to append
the PID to all core dumps. Change /etc/sysctl.conf if you
want core dumps without the PID. (The PID is still used for
multithreaded programs IIRC.)
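
A quick way to see which naming scheme is active is to read the
setting back from /proc. The sketch below assumes the setting in
question is the kernel.core_uses_pid sysctl, exposed as
/proc/sys/kernel/core_uses_pid; that name is an assumption, since
it is not spelled out above. The function name core_uses_pid() is
made up for illustration.

```c
#include <stdio.h>

/*
 * Read the (assumed) kernel.core_uses_pid setting.
 * Returns the value found in /proc (nonzero means core files get
 * a ".PID" suffix), or -1 if the file is missing or unreadable.
 */
int core_uses_pid(void)
{
    FILE *f = fopen("/proc/sys/kernel/core_uses_pid", "r");
    int value = -1;

    if (f == NULL)
        return -1;              /* sysctl not available on this kernel */
    if (fscanf(f, "%d", &value) != 1)
        value = -1;             /* unexpected file contents */
    fclose(f);
    return value;
}
```

Logging the result once at startup tells you whether to look for
"core" or "core.<PID>" after a crash.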

> Not sure
> if that means that it should create the core file
> for my multi-threaded application.

I think it is able to.

>
> About the possibility of a kernel Oops. I'm not
> sure where I should check for it. What I did was:
>
> # cd /var/log
> # grep -i oops *
>
> And it returned nothing (well, it returned a few
> lines with the words loops and moops, but nothing
> about a kernel oops). Am I looking in the right
> place?

Yes, here is how it looks in my case:

[root:pts/3:/var/log] grep -i oops *
ksyms.0:c02d2cac loops_per_jiffy_Rba497f13
ksyms.1:c02d2cac loops_per_jiffy_Rba497f13
ksyms.2:c02d2bac loops_per_jiffy_Rba497f13
ksyms.3:c02d2bac loops_per_jiffy_Rba497f13
ksyms.4:c02d2bac loops_per_jiffy_Rba497f13
ksyms.5:c02d2bac loops_per_jiffy_Rba497f13
ksyms.6:c02d2bac loops_per_jiffy_Rba497f13
messages:Mar 5 22:05:20 marvin kernel: Oops: 0002
rpmpkgs:ksymoops-2.4.4-1.i386.rpm
[root:pts/3:/var/log]

If you are using RH you will find Oopses logged
in /var/log/messages. As you see I have had one
since I installed the system one month ago. (And
a lot of panics:-( )

>
> At this rate, it would seem like I'll have to
> settle for running directly through the debugger
> (I'll switch my biological clock: sleep during
> the day, and be awake overnight to debug the
> bugger! :-))

Well, it might still be that the core limit is
zero after all. You could try printing/changing
the core limit from within the program.
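
Printing and raising the limit from inside the program could look
roughly like this. It is only a sketch using getrlimit() and
setrlimit() with RLIMIT_CORE; the function names
report_core_limit() and raise_core_limit() are invented for the
example. Doing it in the program itself has the advantage that the
result no longer depends on which startup files the shell launched
by su actually reads.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, showing RLIM_INFINITY as "unlimited". */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        fprintf(stderr, "%s=unlimited", label);
    else
        fprintf(stderr, "%s=%llu", label, (unsigned long long)value);
}

/* Log the core-size limits this process really has. 0 on success. */
int report_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    fprintf(stderr, "core limit: ");
    print_limit("soft", rl.rlim_cur);
    fprintf(stderr, " ");
    print_limit("hard", rl.rlim_max);
    fprintf(stderr, "\n");
    return 0;
}

/* Raise the soft limit as far as the hard limit allows. 0 on success. */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may go up to the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    return 0;
}
```

Calling report_core_limit() first thing in main(), then
raise_core_limit(), then report_core_limit() again shows in the
error log what the process inherited and what it ended up with. If
the hard limit itself is 0, only root can raise it.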

Carlos Moreno

Mar 22, 2003, 8:06:29 PM

Well, my newsreader is not showing me the reply to this
message (for a change! )8-[ )

Anyway, thanks Kasper for the further advice (which I
read through google)

> I don't know if the kernel I'm using can dump core
> for a threaded application (rpm -q kernel reports two
> lines: 2.4.18-3 and 2.4.18-17.x

You are correct -- it is 2.4.18-17.7.x; and uname -r
reports 2.4.18-3smp

I'll probably follow your advice and place a call to
"ulimit -a" at the beginning of main() in my program,
to see if the attempt to set it to unlimited before
running the app is failing.

Thanks!

Carlos
--

Mark

Mar 24, 2003, 8:15:04 AM

> So, I don't want to run it through the debugger, since the
> application is systematically crashing between midnight and
> 5AM (more or less at random, but always around that time),
> so I'm not in front of the monitor to respond to it and
> analyze. (and that would keep the service down for several
> hours)
>

I had something similar once with a driver I wrote. I don't know if it might
be the same problem or not, but I'll tell you anyway. It might give you
something to start with.

I was testing my driver to see if it was stable or not, and it seemed to be
pretty stable. But when I ran it overnight, it would fail and lock up
the kernel as well. This always happened at the same time. I don't know the
exact time, but that is not very interesting anyway. So after trying it a
couple of times I got really depressed because I was not able to reproduce
the lockup. So I checked my logs to see what else happened around the time of
the crash, but I didn't find anything useful. But what I did find when I
was checking my cron jobs was that a couple of minutes before the system
crashed, updatedb was started. I have no clue what it does, but I do know
that it uses lots of CPU and hard disk time. This caused the rest of the
system to slow down, and caused my driver to get out of sync. It had
something to do with the interrupt handler and some other stuff. I could
also reproduce the error by compiling a kernel.

But maybe because you are using a couple of threads, your threads might get
out of sync under a high system load. So if I were you I'd check my cron
jobs. It does not help you to find the problem, but it might help you to
reproduce the problem when you want to, instead of having to wait up
during the night hoping that it will crash as soon as possible.

Good luck,
Mark

Kasper Dupont

Mar 24, 2003, 9:36:33 AM

Mark wrote:
>
> But what I did find when I
> was checking my cron jobs was that a couple of minutes before the system
> crashed, updatedb was started. I have no clue what it does, but I do know
> that it uses lots of CPU and hard disk time.

IIRC it builds a list of all files on your system.
You can then look up files using the slocate command.

Carlos Moreno

Mar 24, 2003, 3:05:39 PM

Mark wrote:
>
> But maybe because you are using a couple of threads, your threads might get
> out of sync under a high system load. So if I were you I'd check my cron
> jobs. It does not help you to find the problem, but it might help you to
> reproduce the problem when you want to, instead of having to wait up
> during the night hoping that it will crash as soon as possible.

Yep, that would definitely be something that would help me,
if I could make something to guarantee that it's going to
crash within a few minutes.

I checked the /var/log/messages file, and found nothing
around the time of the crash (BTW, it's not always at a
particular time -- it's more like between midnight and
6AM, maybe 80% of the time between 3 and 5AM).

We have a cron table entry to create a backup of the
critical tables of our application (managed by postgres),
but that one runs every day at exactly 5:10AM, and none
of the crashes have occurred at exactly 5:10 or 5:11, so
I have no reason to suspect that those could be causing
it.

The funny thing is that approx. 4AM to 6AM is precisely
the lowest-activity period of the day for our system; it
would seem that, after the application has been running for
a while, the lack of activity is what triggers the condition
(I mean, it's not the lack of activity alone, because
otherwise it would tend to re-crash often, shortly
after it is restarted following a crash).

Well, I guess back to reviewing the code.

Thanks!

Carlos
--

Saul

Apr 9, 2003, 3:13:47 AM

Carlos Moreno <moreno_at_mo...@xx.xxx> wrote in message
[snip]

>
> Thanks!! That is a very informative/helpful link!!
>
> The only reason that seems to apply is the fact that
> my application does use threads (four or five threads).
>

Try setting a signal handler for SIGSEGV, and in it do a fork(). The
forked child (which usually gets created) will return from the handler
and dump.

Saul

Kasper Dupont

Apr 9, 2003, 5:39:20 AM

Saul wrote:
>
> Try setting a signal handler for SIGSEGV, and in it do a fork(). The
> forked child (which usually gets created) will return from the handler
> and dump.

That is a nice trick. It can be used for more cases. I guess it will
help getting a core dump on systems where it is not supported for
threads. And this dump will be as close to the real situation where
the SIGSEGV occurred as it can possibly be with other threads still
running and changing the state.

But it can also be used if you want a core dump without killing the
process. In fact SysVinit does that. Not that init will often get
SIGSEGV, but if it does, the handler will fork a child which dumps
core immediately, and IIRC the parent sleeps for 30 seconds before
it returns and retries the faulting instruction.

It is also possible to perform a chdir in the handler to change the
location where the core is dumped. Could be nice if you don't know
where it was, or if you do not have write permissions in the current
directory.

You probably need to reset the handler to default before returning,
otherwise you could end up with the handler being invoked again.
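
Put together, the fork-in-handler trick with the chdir and the
handler reset might look something like the sketch below. Several
things in it are assumptions made for illustration only: the dump
directory /tmp, the names segv_handler() and
install_segv_fork_handler(), and the choice to have the parent
simply wait for the child and then exit (on the assumption that an
external script restarts the server). It is not how SysVinit does
it.

```c
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict -std modes */

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical dump location; it must be writable by the process. */
static const char *core_dir = "/tmp";

static void segv_handler(int signo)
{
    pid_t child = fork();   /* the child contains only the calling thread */

    if (child == 0) {
        /* Child: pick the dump directory, restore the default action,
         * then return. Returning re-executes the faulting instruction,
         * which faults again and this time takes the default action
         * (terminate and dump core, if the core limit allows it). */
        if (chdir(core_dir) != 0) {
            /* could not change directory; dump in the current one */
        }
        signal(signo, SIG_DFL);
        return;
    }

    /* Parent (or fork failure): wait for the dump to finish, then quit. */
    if (child > 0)
        waitpid(child, NULL, 0);
    _exit(EXIT_FAILURE);
}

/* Install the handler for SIGSEGV. Returns 0 on success, -1 on error. */
int install_segv_fork_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Call install_segv_fork_handler() once near the start of main().
Note that the child only re-faults when the signal came from a real
bad memory access; a SIGSEGV sent with kill() or raise() has no
faulting instruction to retry, so the child would just carry on
running.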