'crash' utility failure

Fernando Ronci

Jan 12, 2002, 10:16:04 PM

to sco...@xenitec.on.ca

Hi,

What can be wrong with a dump image after a kernel panic (0x0000000E) of OSR
5.0.0 that it cannot be read by the 'crash' utility?
I saved the image in /usr/tmp/dump.1201 and when I issue:
crash -d /usr/tmp/dump.1201

I get:
dumpfile = /usr/tmp/dump.1201, namelist = /unix, outfile = stdout
Read error on page table entry at 0xfc60ff0

What does the "Read error on page table entry at 0xfc60ff0" error mean?

I need to know why my system began panicking at random intervals (about 3
times a day) after a major hardware upgrade.
I moved the old SCSI HD to a new motherboard with a PIII 900 MHz, new
memory, new NIC, legacy ISA computone and legacy Adaptec card. After reading
Tony's document at http://www.aplawrence.com/Unixart/trape.shtml
I'm almost sure the culprit is either the memory or the new NIC driver (a
very ordinary Encore PCI card), but 'crash' needs to be able to read the
dump image to confirm that.

BTW, I picked the dump image from /dev/swap with the command:
dd if=/dev/swap of=/usr/tmp/dump.1201 bs=1024 count=`expr \`memsize /dev/swap\` / 1024 + 1`
as exemplified in TA # 105840. Swap and dump devices are the same.
Hints and comments to help me get rid of the trap E are all welcome.

Thank you,

Fernando Ronci
E-mail: fern...@waycom.com.ar

Bill Vermillion

Jan 13, 2002, 10:02:20 AM

In article <00a001c19be0$a6975640$7001a8c0@estacion3>,
Fernando Ronci <fern...@waycom.com.ar> wrote:

...

>I need to know why my system began panicking at random intervals
>(about 3 times a day) after a major hardware upgrade.

>I moved the old SCSI HD to a new motherboard with a PIII
>900 MHz, new memory, new NIC, legacy ISA computone and
>legacy Adaptec card. After reading Tony's document at
>http://www.aplawrence.com/Unixart/trape.shtml I'm almost sure the
>culprit is either the memory or the new NIC driver (a very ordinary
>Encore PCI card), but 'crash' needs to be able to read the dump
>image to confirm that.

It could even be a problem on the motherboard. Why not just pull
the NIC card to see if the system is stable? I'm guessing that a
problem occurred when the dump happened and corrupted that file.
That could also be memory. You also said you have a SCSI system,
and it could be a cabling/termination problem on the new system.
Sometimes just moving cables around in a SCSI
environment can cause crashes if the cables or the
termination aren't good.

>Hints and comments to help me get rid of the trap E are all welcome.

Most of the time I saw a trap E it was memory. Is the new system
using ECC memory? I'd trust nothing less in a critical server.

Just guesses.

Bill

--
Bill Vermillion - bv @ wjv . com

Bela Lubkin

Jan 14, 2002, 5:04:56 AM

to sco...@xenitec.on.ca

Fernando Ronci wrote:

> What can be wrong with a dump image after a kernel panic (0x0000000E) of OSR
> 5.0.0 that it cannot be read by the 'crash' utility?
> I saved the image in /usr/tmp/dump.1201 and when I issue:
> crash -d /usr/tmp/dump.1201
>
> I get:
> dumpfile = /usr/tmp/dump.1201, namelist = /unix, outfile = stdout
> Read error on page table entry at 0xfc60ff0
>
> What does the "Read error on page table entry at 0xfc60ff0" error mean?

It means that the dump image is corrupt, for crash's purposes.

During a panic, the system is in an unstable state (by definition). Not
all panics produce clean, readable dumps.

It's also possible that it was a clean dump, but you corrupted it when
you saved it.

> BTW, I picked the dump image from /dev/swap with the command:
> dd if=/dev/swap of=/usr/tmp/dump.1201 bs=1024 count=`expr \`memsize
> /dev/swap\` / 1024 + 1` as exemplified in TA # 105840. Swap and dump devices
> are the same.

Next time the system panics, go to single-user mode and immediately run:

crash -d /dev/swap

If this works, you have a good dump image in /dev/swap; your copying
methodology is broken.
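
A minimal sketch for sanity-checking the copy step, assuming memsize
prints the dump size in bytes (which the arithmetic in the quoted dd
command implies, but which is not confirmed here). The function names
dump_blocks and copy_is_complete are made up for illustration, and the
64 MB figure is only an example value:

```shell
#!/bin/sh
# dump_blocks BYTES
# Number of 1024-byte blocks dd must copy to cover BYTES bytes, using the
# same arithmetic as the quoted command: integer divide by 1024, then add
# one block so a partial final block is not lost.
dump_blocks() {
    expr "$1" / 1024 + 1
}

# copy_is_complete BYTES FILE
# Succeeds only if FILE is at least BYTES long, i.e. the saved image was
# not truncated (for example by a full filesystem).
copy_is_complete() {
    actual=$(wc -c < "$2" | tr -d ' ')
    [ "$actual" -ge "$1" ]
}

# Example with a made-up dump size of 64 MB.
dump_blocks 67108864    # prints 65537
```

On the real system one might compare the memsize figure with the size of
the saved file before blaming the dump itself; a short file would point
at the copy rather than at the panic.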

You should get the tools /etc/crash, /etc/sysdump, /etc/scodb from an
OSR506 system; and the sysdump(ADM), scodb(ADM) man pages. The tools
are backwards-compatible. /etc/sysdump can save a dump image, does a
better job than `dd`, and it merges together the kernel and dump so that
you don't have to worry about getting them out of sync. The 5.0.6
version of crash has a number of improvements. /etc/scodb is a
user-level version of the kernel debugger; it can give you a different
perspective on a crash dump.

> I need to know why my system began panicking at random intervals (about 3
> times a day) after a major hardware upgrade.
> I moved the old SCSI HD to a new motherboard with a PIII 900 MHz, new
> memory, new NIC, legacy ISA computone and legacy Adaptec card. After reading
> Tony's document at http://www.aplawrence.com/Unixart/trape.shtml
> I'm almost sure the culprit is either the memory or the new NIC driver (a
> very ordinary Encore PCI card), but 'crash' needs to be able to read the
> dump image to confirm that.

It's probably the memory (or possibly the CPU, if something is grossly
wrong, like the heat sink fan is broken). There isn't really anything
to confirm if it's memory.

Even without being able to examine the crash dump, you should write down
the EIP addresses reported by each panic. If they're all different,
it's some sort of flaky hardware problem like bad RAM or overheating
CPU. If they're all the same, you can use various tools on the kernel
(crash; adb; nm) to translate the address to a symbolic label. If the
label is inside the NIC driver or some other obvious place like that,
you'll know the culprit.
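
A rough sketch of that triage, under stated assumptions: the EIP values
are collected by hand, one hex address per line, and the symbol table
has already been reduced to two columns, "hexaddress name" (real nm
output has more columns and would need trimming first). The function
names eip_verdict and nearest_symbol are hypothetical, not part of any
SCO tool:

```shell
#!/bin/sh
# eip_verdict: read one EIP value per line on stdin and report whether
# every panic hit the same address or the addresses differ.
eip_verdict() {
    # Fold hex digits to lower case so F0020A4C and f0020a4c count as one.
    distinct=$(tr 'A-F' 'a-f' | sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -le 1 ]; then
        echo "same address: look it up in the kernel symbol table"
    else
        echo "$distinct different addresses: suspect flaky hardware"
    fi
}

# nearest_symbol ADDR SYMFILE
# SYMFILE holds "hexaddress name" pairs (assumed format). Prints the
# symbol with the highest address not above ADDR, plus the offset into it.
nearest_symbol() {
    target=$(printf '%d' "0x${1#0x}")
    best_addr=-1
    best_name=""
    while read -r addr name; do
        value=$(printf '%d' "0x${addr#0x}")
        if [ "$value" -le "$target" ] && [ "$value" -gt "$best_addr" ]; then
            best_addr=$value
            best_name=$name
        fi
    done < "$2"
    if [ -n "$best_name" ]; then
        printf '%s+0x%x\n' "$best_name" $((target - best_addr))
    fi
}
```

Feeding the addresses noted from each panic into eip_verdict gives the
first answer; only if they all match is it worth running nearest_symbol
against the symbol list to see which routine the address falls in.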

>Bela<

Fernando Ronci

Jan 18, 2002, 7:51:47 AM

Bela Lubkin <be...@caldera.com> wrote in message news:<2002011402...@mammoth.ca.caldera.com>...

Hi all again!
It's been a week since I did the hardware upgrade. Panics persist and
I still haven't fully figured out what is causing them. Firstly I
changed the memory, but panics continued to randomly appear two or
three times a day, in the same fashion as they did before the memory
change.
However, after a lot of observation, I now have a strong feeling
that they're keyboard-related panics. In fact, two of them occurred
while I was working at the console keyboard.
Then after some research I found (and applied) OSS424A, the keyboard
BTLD for OSR 5.0.0 running on fast CPUs. Quoting from TA 105014:
"SLS OSS424A contains the "kbp" BTLD which modifies a constant in the
console driver that controls the frequency at which the keyboard is
polled."

Furthermore, I also installed SLS OSS627A, the Intel Pentium Pro and
Pentium II Microcode Driver for SCO OpenServer 5.0.0 and 5.0.2.
Quoting from TA 111421:
"Support Level Supplement (SLS) OSS627A, the Intel Pentium Pro and
Pentium II Microcode Driver, provides the proper microcode fixes to
apply to the system's Pentium Pro, Pentium II, Pentium II Xeon, or
Pentium II Celeron processor. It also adds support for erasure of the
Pentium III Processor Serial Number (PSN) and may resolve other
Pentium III issues."

In this past week, I changed no less than half a dozen PS/2 keyboards,
some very new, others very old. None of them, together with OSS424A,
was able to stop the panics. Besides, I couldn't read any of the crash
images with 'crash -d /dev/swap', in either single-user or multi-user
mode. I always got an error like "Read error on page table entry at
0xfc60ff0". The address was not always the same. I couldn't take a
close look at the EIP addresses either, as I had to set PANICBOOT=YES in
/etc/default/boot. This server is a critical one, performs a lot of
tasks (radius auth., SMTP, POP3, PPP, DNS) and I can't leave it down
for more than a few minutes, considering that most panics occur while
I'm out of office. If I cannot fix this between today and tomorrow, I
will unplug the keyboard and try to boot the server without it.

As all of you know OpenServer far better than I do, I will be
infinitely thankful if someone helps me get rid of this problem. OSR
5.0.0 is rather old, has memory leak issues and so forth, but I had it
up and running for almost 6 years without a single panic! So
I'm convinced I've got a piece of hardware (probably the keyboard)
that is interrupting in such a way that OSR 5.0.0 doesn't know how to
handle and thus panics.

Tom Melvin

Jan 18, 2002, 11:07:26 AM

to sco...@xenitec.on.ca

Fernando Ronci commented on:

> Bela Lubkin <be...@caldera.com> wrote in message news:<2002011402...@mammoth.ca.caldera.com>...
> > Fernando Ronci wrote:
> >
> > > What can be wrong with a dump image after a kernel panic (0x0000000E) of OSR
> > > 5.0.0 that it cannot be read by the 'crash' utility?
> > > I saved the image in /usr/tmp/dump.1201 and when I issue:
> > > crash -d /usr/tmp/dump.1201

[snip]

> Hi all again!
> It's been a week since I did the hardware upgrade. Panics persist and
> I still haven't fully figured out what is causing them. Firstly I
> changed the memory, but panics continued to randomly appear two or
> three times a day, in the same fashion as they did before the memory
> change.

[snip]

> /etc/default/boot. This server is a critical one, performs a lot of
> tasks (radius auth., SMTP, POP3, PPP, DNS) and I can't leave it down
> for more than a few minutes, considering that most panics occur while
> I'm out of office. If I cannot fix this between today and tomorrow, I
> will unplug the keyboard and try to boot the server without it.

It may not work in all cases BUT have a look at

http://www.tkrh.demon.co.uk/panic.html

There is a script there that will grab a dump on the fly, run crash
on it and mail you the results.

I don't think it will cause any damage, but if you have any early
/etc/rc2.d scripts that overwrite swap, it won't work.

Tom

--
========================================================================
Tom Melvin t...@tkrh.demon.co.uk http://www.tkrh.demon.co.uk
Veterinary Solutions Ltd Sysop Compuserve Unixforum
========================================================================

Bill Vermillion

Jan 18, 2002, 2:39:26 PM

In article <1c9ceead.0201...@posting.google.com>,

Fernando Ronci <fern...@waycom.com.ar> wrote:
>Bela Lubkin <be...@caldera.com> wrote in message news:<2002011402...@mammoth.ca.caldera.com>...
>> Fernando Ronci wrote:

>> > What can be wrong with a dump image after a kernel panic (0x0000000E) of OSR
>> > 5.0.0 that it cannot be read by the 'crash' utility?
>> > I saved the image in /usr/tmp/dump.1201 and when I issue:
>> > crash -d /usr/tmp/dump.1201
>> >
>> > I get:
>> > dumpfile = /usr/tmp/dump.1201, namelist = /unix, outfile = stdout
>> > Read error on page table entry at 0xfc60ff0

>> > What does the "Read error on page table entry at 0xfc60ff0" error mean?

>> It means that the dump image is corrupt, for crash's purposes.

>> During a panic, the system is in an unstable state (by
>> definition). Not all panics produce clean, readable dumps.

>> It's also possible that it was a clean dump, but you corrupted it
>> when you saved it.

>> > I need to know why my system began panicking at random
>> > intervals (about 3 times a day) after a major hardware
>> > upgrade. I moved the old SCSI HD to a new motherboard with
>> > a PIII 900 MHz, new memory, new NIC, legacy ISA computone
>> > and legacy Adaptec card. After reading Tony's document at
>> > http://www.aplawrence.com/Unixart/trape.shtml I'm almost sure
>> > the culprit is either the memory or the new NIC driver (a very
>> > ordinary Encore PCI card), but 'crash' needs to be able to read
>> > the dump image to confirm that.

The bad dump, the crashes, the errors coming in with different
locations still sound to me like SCSI problems. You said OLD SCSI
drive above. But did you move the old controller, and cables, and
everything that worked before, or just the old SCSI drive to a new
computer? It really smells/feels like cabling/termination. The
higher speed of the new CPU could be causing enough RF-type
problems to cause SCSI failure. Make sure you have an >>ACTIVE<<
terminator on the line. So often spurious SCSI problems give
messages that make you think the error is elsewhere.

>In this past week, I changed no less than half a dozen PS/2 keyboards,
>some very new, others very old.

I don't know why you keep thinking it is a keyboard error. The
results you displayed don't indicate that to me.

>As all of you know OpenServer far better than I do, I will be
>infinitely thankful if someone helps me get rid of this problem. OSR
>5.0.0 is rather old, has memory leak issues and so forth, but I had it
>up and running for almost 6 years without a single panic! So
>I'm convinced I've got a piece of hardware (probably the keyboard)
>that is interrupting in such a way that OSR 5.0.0 doesn't know how to
>handle and thus panics.

But you had OSR5 up and running on an old motherboard. I really
suspect that if you put OSR 5.0.6 on that old system it would
run rock solid. You are attributing what appear to me to be HW
errors to SW. Almost invariably that is not the case.

Fernando Ronci

Jan 18, 2002, 8:44:39 PM

b...@wjv.com (Bill Vermillion) wrote in message news:<Gq5Fx...@wjv.com>...

> In article <1c9ceead.0201...@posting.google.com>,
> Fernando Ronci <fern...@waycom.com.ar> wrote:
>

[snip]

> >In this past week, I changed no less than half a dozen PS/2 keyboards,
> >some very new, others very old.
>
> I don't know why you keep thinking it is a keyboard error. The
> results you displayed don't indicate that to me.
>

I've given up on the keyboard theory now.
I'm back to investigating memory. I'm going to try another brand (ECC, of course).

Thank you,

Fernando Ronci
E-mail: fern...@waycom.com.ar