
SCO OSR5.0.6 System Crashed during LoneTAR Verify

Lucky Leavell

Jun 13, 2003, 9:40:23 AM
to comp.unix.sco.misc
OS: OSR5.0.6a Enterprise
Lone-TAR: v3.2.4.1

The last two nights, a client's system has died while in the LT verify pass.
I do not know if this is related or a coincidence.

I find nothing at all in the /usr/adm/messages or syslog to indicate any
problem. I have both AUTOBOOT and PANICBOOT in /etc/default/boot set to YES
but when they come in the system is down and must be manually rebooted.

I upgraded the system from OSR5.0.5 Host to 5.0.6 Enterprise on May 26 and
this did happen once or twice the first week and then not again until
yesterday.

Any ideas as to why this is happening? Are there any other places I can
check for clues?

Thank you,
Lucky

Lucky Leavell Phone: (800) 481-2393 (US/Canada)
UniXpress - Your Source for SCO OR: (812) 366-4066
1560 Zoar Church Road NE FAX: (812) 366-3618
Corydon, IN 47112-7374 Email: lu...@UniXpress.com
WWW Home Page: http://www.UniXpress.com

to...@aplawrence.com

Jun 13, 2003, 10:25:37 AM
Lucky Leavell <lu...@unixpress.com> wrote:
: OS: OSR5.0.6a Enterprise
: Lone-TAR: v3.2.4.1

: The last two nights, a client's system has died while in the LT verify pass.
: I do not know if this is related or a coincidence.

: I find nothing at all in the /usr/adm/messages or syslog to indicate any
: problem. I have both AUTOBOOT and PANICBOOT in /etc/default/boot set to YES
: but when they come in the system is down and must be manually rebooted.

So what does it show on the screen?

: I upgraded the system from OSR5.0.5 Host to 5.0.6 Enterprise on May 26 and


: this did happen once or twice the first week and then not again until
: yesterday.

: Any ideas as to why this is happening? Are there any other places I can
: check for clues?

The clue is in the panic message if that's what is happening. If not,
then there's stuff on my web site in the faq (which unfortunately has been dead
for a few hours now - I have Interland working on it but it's
still down) about tracing shutdown and haltsys.

So, since it might be down for a while (they've had to escalate the
ticket - apparently something very broken), here it is:

(from the http://aplawrence.com/SCOFAQ)

How do I find out who or what halted my system?

First, look in crontab for a call to haltsys or init. Someone may have
added this for silly reasons.

If you think some privileged user or process has run /etc/haltsys, add
these lines to it right after the PATH= line
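{
echo $0 `tty` `id`
NEXTPROC=$$
# Walk up the process tree; stop at PID 0 or if ps returns nothing
while [ -n "$NEXTPROC" ] && [ "$NEXTPROC" != 0 ]
do
ps -lp $NEXTPROC
MYPROC=$NEXTPROC
# ppid= suppresses the header; tr strips the column padding
NEXTPROC=`ps -p $MYPROC -o "ppid=" | tr -d ' '`
done
} | mail -s "haltsys was run" root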


This will give you a full trace of where it was called from. You can
use a similar technique with /etc/shutdown.
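
As a rough sketch of that "similar technique", the tracing loop could be packaged as a function and reused in /etc/shutdown. The name trace_ancestry is made up for illustration, and the sketch assumes ps accepts -p and -o ppid= the way the haltsys snippet above does:

```sh
#!/bin/sh
# trace_ancestry PID
# Print a `ps -l` entry for PID and for each of its ancestors.
# (Hypothetical helper name; not part of the FAQ.)
trace_ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" != 0 ]
    do
        ps -lp "$pid" || break
        # ppid= suppresses the header; tr strips the column padding
        pid=`ps -p "$pid" -o ppid= | tr -d ' '`
    done
}

# Possible use near the top of /etc/shutdown (not run here):
#   { echo "$0" `tty` `id`; trace_ancestry $$; } | mail -s "shutdown was run" root
```

The function only prints; piping it to mail is left to the caller so the same loop can serve haltsys, shutdown, or anything else.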

You might also write a "K" script and put it in /etc/rc0.d.
Unfortunately, by that time there isn't as much information to glean
from the system. Adding to /etc/rc0 doesn't gain you much either, but
at least you know it was not a crash and you *might* still see a
suspect process in a ps listing.

If your only concern is when the system went down,
who -a /etc/wtmp | grep uadmin

will give you that. Note that on "out of the box" systems, the
information in /etc/wtmp is cleared out weekly by a cron job that runs
/etc/cleanup; you may want to adjust this script if you need longer
records.

Jeff Hyman tells me that the old 3.2v4.2 "last" included shutdown
information, so
last | grep shutdown

would work on those releases. It doesn't on OSR5.

Bela Lubkin commented:
Change this to:

} | mail -s "$0 $* was run" root
sync
sleep 5
sync

The sync and pause routine is necessary because mail delivery can take a
while (especially if you've installed spamassassin ;-). You don't want
to fire off mail and have it still sitting undelivered when the very
next thing you're doing is shutting down.

> This will give you a full trace of where it was called from. You can use
> a similar technique with /etc/shutdown.

This is true enough, but misses /etc/reboot as well as /etc/uadmin.

_All_ of the SCO-provided shutdown techniques (init [056], shutdown,
haltsys, reboot) eventually funnel through /etc/uadmin. So the best way
to do this is to move /etc/uadmin to /etc/uadmin.real and use the above
script bit as /etc/uadmin, ending it with:

exec /etc/uadmin.real "$@"

The `ps` chain is cute, but unnecessary -- better to give `ps -elf`
output and let the reader figure out the chaining. The actual cause of
shutdown might not be in the parenthood of the process doing the
shutdown. (e.g. if someone ran `sd shutdown`.) So the entire script
can be reduced to:

#!/bin/sh
{
echo Process $$, on tty `tty`, user `id`, ran:
echo " $0 $@"
echo
ps -elf
} | mail -s "uadmin was run" root
sync; sleep 5; sync
exec /etc/uadmin.real "$@" # "real" /etc/uadmin was renamed /etc/uadmin.real
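
A minimal sketch of the rename-and-replace step described above, for anyone who would rather script it than do it by hand. install_wrapper is a hypothetical helper name, the wrapper file path is whatever you saved the script as, and mode 755 is an assumption; check the ownership and permissions of the real /etc/uadmin before relying on it:

```sh
#!/bin/sh
# install_wrapper TARGET WRAPPER
# Rename TARGET to TARGET.real and put WRAPPER in its place.
# Refuses to run twice so the real binary is never overwritten.
# (Hypothetical helper; mode 755 is an assumption.)
install_wrapper() {
    target=$1
    wrapper=$2
    if [ -f "$target.real" ]; then
        echo "$target.real already exists; not touching anything" >&2
        return 1
    fi
    mv "$target" "$target.real" || return 1
    if cp "$wrapper" "$target" && chmod 755 "$target"; then
        return 0
    fi
    # Copy failed: put the real binary back so shutdown still works
    mv "$target.real" "$target"
    return 1
}

# Possible use as root (not run here):
#   install_wrapper /etc/uadmin /tmp/uadmin.wrapper
```

The rollback matters here: a machine left without a working /etc/uadmin cannot be shut down cleanly.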

--
to...@aplawrence.com Unix/Linux resources: http://aplawrence.com
Inexpensive phone/email support
Download Free SCO Skills Test: http://pcunix.com/skilltests.html

Lucky Leavell

Jun 13, 2003, 12:24:43 PM
to to...@aplawrence.com, comp.unix.sco.misc
On Fri, 13 Jun 2003 to...@aplawrence.com wrote:

> Lucky Leavell <lu...@unixpress.com> wrote:
> : OS: OSR5.0.6a Enterprise
> : Lone-TAR: v3.2.4.1
>
> : The last two nights, a client's system has died while in the LT verify pass.
> : I do not know if this is related or a coincidence.
>
> : I find nothing at all in the /usr/adm/messages or syslog to indicate any
> : problem. I have both AUTOBOOT and PANICBOOT in /etc/default/boot set to YES
> : but when they come in the system is down and must be manually rebooted.
>
> So what does it show on the screen?
>

It was asking if they wanted to save the dump; they didn't. I asked them
to call me next time and I will talk them through saving the dump to tape.

At least that eliminates some process initiating a reboot ...

My gut feeling at this point is hardware but we'll see ...

to...@aplawrence.com

Jun 13, 2003, 12:38:40 PM
Lucky Leavell <sco...@unixpress.com> wrote:

: On Fri, 13 Jun 2003 to...@aplawrence.com wrote:

:> Lucky Leavell <lu...@unixpress.com> wrote:
:> : OS: OSR5.0.6a Enterprise
:> : Lone-TAR: v3.2.4.1
:>
:> : The last two nights, a client's system has died while in the LT verify pass.
:> : I do not know if this is related or a coincidence.
:>
:> : I find nothing at all in the /usr/adm/messages or syslog to indicate any
:> : problem. I have both AUTOBOOT and PANICBOOT in /etc/default/boot set to YES
:> : but when they come in the system is down and must be manually rebooted.
:>
:> So what does it show on the screen?
:>
: It was asking if they wanted to save the dump; they didn't. I asked them
: to call me next time and I will talk them through saving the dump to tape.

You don't want to save the dump. You want to know what the line that
begins with k_trap says. That tells you what the panic is from. Most
often it's a trap E, which is an invalid memory reference, but
this can also come from missing patches.

At http://aplawrence.com/Unixart/trape.html is an article that
explains this, but since my site is still down:

Trap 0x0000000E

Don't panic. That's the kernel's job - Jeff Liebermann

Of all the possible reasons for a system panic, a "trap 0x0000000E" is
the one most often seen (see SCO's "What are Traps, Interrupts and
Exceptions?" for other reasons). Technically, an E trap is a page
fault that referenced an impossible page: the CPU tries to access an
address that does not exist and can't be accessed. As page references
are normally very carefully managed, the usual cause for this is bad
(defective) RAM; scrambled bits point the CPU toward disaster and it
blindly follows. Therefore, if you have a Trap E panic on a machine
that otherwise has been running along for months or years, bad RAM is
the most likely suspect.

You can't expect that the so-called "memory test" that runs when your
computer starts is going to catch bad RAM. That testing is very
superficial, and really can only find RAM that's totally screwed up;
subtle problems just will not be seen by that test. There are tests
available that can really stress memory, but the best ones need to run
a very long time, so if the suspect machine is critical, you probably
don't have the time to do this.

RAM is pretty cheap nowadays, so you may just want to replace all of
it, or you can pull individual sticks and swap things around until you
determine where the problem is. In taking this approach, try booting
with as little RAM as possible; the bad chips can be found more easily
that way (see Memory).

Of course, there are other possibilities. A bus card that uses shared
memory can mess up the CPU by misaddressing itself into an area that
the CPU doesn't expect it to be in. If it writes its own patterns into
some of that shared memory, your CPU can once again be presented with
an insane memory reference and it will react accordingly. So, when you
open the machine to see what you can do about the memory, try pulling
all non-essential cards that use any shared memory (multiport cards,
etc. use shared memory. Any card that does will usually show that in
"hwconfig"). If you aren't sure, just pull anything that you don't
need to boot- if the problem goes away, one of those cards is the
problem; put them back one at a time until you know which one.

It's also possible that you just need patches- some of these crashes
are caused by problems in the OS- be sure to search the TA database
for symptoms and messages that match yours, and be sure that you do
have the recommended minimum patches for your OS version.

You can do a little panic analysis yourself:
http://stage.caldera.com/cgi-bin/ssl_reference?106181 explains how to
determine if panics are consistent, that is, are they happening in the
same kernel routine. If they are, your problem still could be
hardware, but now you will have more info to narrow it down. If the
panics are inconsistent, it's surely hardware.
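
One rough way to check consistency, assuming you have saved a stack trace from each panic into its own text file, is to tally which routine was innermost each time. The sketch assumes frame lines look like two hex addresses followed by a routine name (the layout of the trace excerpt Bela posts later in this thread) and that the first such line is the innermost frame; the function names and file names are made up:

```sh
#!/bin/sh
# innermost_routine FILE
# Print the routine name from the first stack-frame line in FILE.
# Assumed frame layout: <hex addr> <hex addr> <routine> (args...)
innermost_routine() {
    awk '$1 ~ /^[0-9a-f]+$/ && $2 ~ /^[0-9a-f]+$/ && NF >= 3 { print $3; exit }' "$1"
}

# tally_panics FILE...
# Count how often each routine was innermost, most frequent first.
tally_panics() {
    for f in "$@"
    do
        innermost_routine "$f"
    done | sort | uniq -c | sort -rn
}

# Possible use (hypothetical file names, not run here):
#   tally_panics /usr/adm/panic.1 /usr/adm/panic.2 /usr/adm/panic.3
```

If one routine dominates the tally, the panics are consistent in the sense described above; a scattering of unrelated names points back toward hardware.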

It's also possible that a bad CPU will cause the same symptoms because
it gets the info it read from RAM scrambled internally. That's much
rarer: the CPU self-tests and would usually halt long before
you'd get to boot. Of course, a bad motherboard can cause good RAM to
deliver bad bits to a good CPU also, but again the POST (Power On Self
Test) will usually catch this sort of thing unless it is very subtle.

A bad driver can also do this by trampling flowerbeds (stepping on RAM
that the CPU needs for its own sanity). If the system has been running
up to this point you can usually discount that, but if you've just
installed a new driver, this could be the cause. Try booting
"unix.old"; if that works, the new driver could very well be at fault.

Finally, disk corruption can cause otherwise good code to be read
incorrectly from disk, which ends up being the same as a bad driver:
misread bits send the kernel off on a rampage where it ultimately
steps on its own tail and panics. Using "unix.old" or other kernels
you may have can sometimes get around this at least long enough to
save the data. Emergency Boot Floppies can also help here, and if you
don't have those, you can break into the original install disks boot
and get at the drive with that. The method for doing that varies with
the release of SCO. With modern SCO, just type "tools" at the install
floppy boot prompt. See SCO FAQ's for a listing of different methods
for older versions.

If it is disk corruption, and can't be gotten around with alternate
kernels or boot floppies, the disk recovery guys can suck your data
down to cdrom or other media:
* Data Recovery Labs
* Ontrack
* Excalibur

It's unlikely that you'd need to go to this extent for a trap E
problem, though: if the disk isn't obviously trashed in other ways, a
local problem that happens to be in the kernel tracks should be able
to be gotten around with one of the other methods suggested above.

co...@rudaz.co.uk added this:
_________________________________________________________________

The problem with the above article, though, is that it doesn't take into
account ECC-protected memory as you commonly find on Intel servers.
Single-bit failures are corrected by hardware and are invisible to the
operating system. When multiple-bit failures occur, a hardware NMI
should be raised and subsequently caught by the OpenServer nmi kernel
driver, resulting in a PANIC message that contains "FATAL:Parity error
address unknown".

To date I have managed to get Caldera to include a second note on
this subject:

http://stage.caldera.com/cgi-bin/ssl_reference?111621

rudaz
_________________________________________________________________

Lucky Leavell

Jun 13, 2003, 2:02:34 PM
to to...@aplawrence.com, comp.unix.sco.misc
On Fri, 13 Jun 2003 to...@aplawrence.com wrote:

> Lucky Leavell <sco...@unixpress.com> wrote:
> :>
> :> So what does it show on the screen?
> :>
> : It was asking if they wanted to save the dump; they didn't. I asked them
> : to call me next time and I will talk them through saving the dump to tape.
>
> You don't want to save the dump.

Oh, then I should set PANICBOOT="NO" in /etc/default/boot? When they see
the screen, it has already rebooted and is sitting there asking if they
want to save the dump to tape. Doesn't the dump have the panic info in it?

to...@aplawrence.com

Jun 13, 2003, 3:30:23 PM
Lucky Leavell <sco...@unixpress.com> wrote:

: On Fri, 13 Jun 2003 to...@aplawrence.com wrote:

:> Lucky Leavell <sco...@unixpress.com> wrote:
:> :>
:> :> So what does it show on the screen?
:> :>
:> : It was asking if they wanted to save the dump; they didn't. I asked them
:> : to call me next time and I will talk them through saving the dump to tape.
:>
:> You don't want to save the dump.
: Oh, then I should set PANICBOOT="NO" in /etc/default/boot? When they see
: the screen, it has already rebooted and is sitting there asking if they
: want to save the dump to tape. Doesn't the dump have the panic info in it?


Do you really want to go through the trouble of saving, extracting, and
analyzing a dump? Do you even have the tools and knowledge to
do so?

For most of us, a dump is nearly useless. The trap and cpu information is
enough to tell us the type of problem and whether it is repeating
(which could indicate a specific driver), but dumps can't give
us much more. I can read assembly language fairly well,
and I know a little bit about hardware and vm and so on, but even if I
had source code to help me, I wouldn't bother with it, at least
not initially. You need intimate knowledge and experience to get
very far. The trap/cpu registers give enough for my teeny brain
to work with, thank you.

Unless of course you are planning to throw a bunch of cash SCO's
way and have them look at it?

Tom Melvin

Jun 13, 2003, 5:50:29 PM
Lucky Leavell made comment on Fri Jun 13 19:02:34 2003 :

> On Fri, 13 Jun 2003 to...@aplawrence.com wrote:
>
> > Lucky Leavell <sco...@unixpress.com> wrote:
> > :>
> > :> So what does it show on the screen?
> > :>
> > : It was asking if they wanted to save the dump; they didn't. I asked them
> > : to call me next time and I will talk them through saving the dump to tape.
> >
> > You don't want to save the dump.
> Oh, then I should set PANICBOOT="NO" in /etc/default/boot? When they see
> the screen, it has already rebooted and is sitting there asking if they
> want to save the dump to tape. Doesn't the dump have the panic info in it?

You can get the m/c to mail you some of the 'important' stuff.

See http://www.tkrh.demon.co.uk/panic.html

Tom

(Apologies for any broken links on that page - it's on the to do list).


--
========================================================================
Tom Melvin t...@tkrh.demon.co.uk http://www.tkrh.demon.co.uk
Veterinary Solutions Ltd Sysop Compuserve Unixforum
========================================================================

Bela Lubkin

Jun 14, 2003, 4:26:53 PM
to sco...@xenitec.ca
Tony Lawrence wrote:

> Do you really want to go through the trouble of saving, extracting, and
> analyzing a dump? Do you even have the tools and knowledge to
> do so?
>
> For most of us, a dump is nearly useless. The trap and cpu information is
> enough to tell us the type of problem and whether it is repeating
> (which could indicate a specific driver), but dumps can't give
> us much more. I can read assembly language fairly well,
> and I know a little bit about hardware and vm and so on, but even if I
> had source code to help me, I wouldn't bother with it, at least
> not initially. You need intimate knowledge and experience to get
> very far. The trap/cpu registers give enough for my teeny brain
> to work with, thank you.
>
> Unless of course you are planning to throw a bunch of cash SCO's
> way and have them look at it?

I have to firmly disagree with you on this. You don't have to analyze a
dump yourself, you are typically going to go ask someone for help on it
-- whether that's the newsgroups, SCO, or a local consultant. Whoever
it is, one of the first things they're going to ask for is a symbolic
traceback of the panic. That can be gotten out of a panic dump with
little skill; can't be gotten out of the panic _message_ at all.

The message printed at panic time is full of hexadecimal addresses that
are _meaningless_ without examining the kernel that produced them.

Anyone wanting help with a panic should _at least_ allow a dump to be
saved in /dev/swap, then run:

# crash -d /dev/swap
> panic -w /tmp/panic

and post the resulting contents of /tmp/panic.

>Bela<

to...@aplawrence.com

Jun 14, 2003, 6:05:27 PM
Bela Lubkin <be...@sco.com> wrote:
>Tony Lawrence wrote:

>> Do you really want to go through the trouble of saving, extracting, and
>> analyzing a dump? Do you even have the tools and knowledge to
>> do so?
>>
>> For most of us, a dump is nearly useless. The trap and cpu information is
>> enough to tell us the type of problem and whether it is repeating
>> (which could indicate a specific driver), but dumps can't give
>> us much more. I can read assembly language fairly well,
>> and I know a little bit about hardware and vm and so on, but even if I
>> had source code to help me, I wouldn't bother with it, at least
>> not initially. You need intimate knowledge and experience to get
>> very far. The trap/cpu registers give enough for my teeny brain
>> to work with, thank you.
>>
>> Unless of course you are planning to throw a bunch of cash SCO's
>> way and have them look at it?

>I have to firmly disagree with you on this. You don't have to analyze a

Well, we'll have to disagree. More than 99% of the time panics are
either bad ram or other hardware or missing patches. I see no point
in paying or taking the time to analyze a dump for that.

>dump yourself, you are typically going to go ask someone for help on it
>-- whether that's the newsgroups, SCO, or a local consultant. Whoever

Quick, name three consultants with the resources to analyze a dump :-)

>it is, one of the first things they're going to ask for is a symbolic
>traceback of the panic. That can be gotten out of a panic dump with
>little skill; can't be gotten out of the panic _message_ at all.

>The message printed at panic time is full of hexadecimal addresses that
>are _meaningless_ without examining the kernel that produced them.

>Anyone wanting help with a panic should _at least_ allow a dump to be
>saved in /dev/swap, then run:

> # crash -d /dev/swap
> > panic -w /tmp/panic

>and post the resulting contents of /tmp/panic.

Well, tell you what: I'll be sure to suggest that from now on, but
my bet is that unless you happen to be reading that day, it will
get them nothing.

Lucky Leavell

Jun 14, 2003, 6:22:27 PM
to Bela Lubkin, comp.unix.sco.misc
On Sat, 14 Jun 2003, Bela Lubkin wrote:

> Tony Lawrence wrote:
>
> > Do you really want to go through the trouble of saving, extracting, and
> > analyzing a dump? Do you even have the tools and knowledge to
> > do so?
> >

> > Unless of course you are planning to throw a bunch of cash SCO's
> > way and have them look at it?
>

Hardly, I already do enough of that <G>...

> I have to firmly disagree with you on this. You don't have to analyze a
> dump yourself, you are typically going to go ask someone for help on it
> -- whether that's the newsgroups, SCO, or a local consultant.

First, let me say I have the utmost respect for Tony and his opinions and
certainly his treasure trove of a web site. Perhaps it is because I was a
systems programmer (Xerox CP-V) in a former life but the crash tool and
the dump from a crash do provide important clues beyond the panic display
like the currently scheduled processes.

One excellent reason to go this route is the thought of having a novice
try to write down all those registers and hex contents without making a
mistake, not to mention the time required while their users are breathing
down their neck. (I am nearly an hour away so I can't just pop over
myself.)

In this case I ARE the consultant (bad grammar intentional for emphasis)!

I like Tom Melvin's script which runs the crash analysis:
panic
stack
trace
user
proc, and, of course,
quit

against the dump and emails the result though I might be tempted to save
the dump to a file before cleaning it out of swap all without requiring
any user intervention or delaying the system's coming back up.
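
For anyone who wants to try this before fetching Tom's page, here is a minimal sketch of the idea. It is not Tom's script (which isn't reproduced in this thread); it simply feeds the subcommands listed above to crash on standard input. That crash reads its commands from stdin when not run from a terminal is an assumption, as are the variable and function names:

```sh
#!/bin/sh
# Assumed defaults; override in the environment if yours differ.
CRASH=${CRASH:-crash}          # the crash utility
DUMPDEV=${DUMPDEV:-/dev/swap}  # where the panic dump was left

# panic_report
# Run the crash subcommands from this thread against the dump and
# print the combined output. (Hypothetical helper name.)
panic_report() {
    "$CRASH" -d "$DUMPDEV" <<EOF
panic
stack
trace
user
proc
quit
EOF
}

# Possible use from a boot-time script (not run here):
#   panic_report | mail -s "panic report from `uname -n`" root
```

Keeping the report step separate from the mail step makes it easy to also write the output to a file before the dump is cleared out of swap.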

Bela Lubkin

Jun 14, 2003, 7:28:23 PM
to sco...@xenitec.ca
Tony Lawrence wrote:

> Bela Lubkin <be...@sco.com> wrote:
> >Tony Lawrence wrote:
>
> >> Do you really want to go through the trouble of saving, extracting, and
> >> analyzing a dump? Do you even have the tools and knowledge to
> >> do so?
> >>
> >> For most of us, a dump is nearly useless. The trap and cpu information is
> >> enough to tell us the type of problem and whether it is repeating
> >> (which could indicate a specific driver), but dumps can't give
> >> us much more. I can read assembly language fairly well,
> >> and I know a little bit about hardware and vm and so on, but even if I
> >> had source code to help me, I wouldn't bother with it, at least
> >> not initially. You need intimate knowledge and experience to get
> >> very far. The trap/cpu registers give enough for my teeny brain
> >> to work with, thank you.
> >>
> >> Unless of course you are planning to throw a bunch of cash SCO's
> >> way and have them look at it?
>
> >I have to firmly disagree with you on this. You don't have to analyze a
>
> Well, we'll have to disagree. More than 99% of the time panics are
> either bad ram or other hardware or missing patches. I see no point
> in paying or taking the time to analyze a dump for that.

76.3% of statistics are made up out of thin air. Well, at least these
two are.

Nevertheless, the middle sentence in that paragraph is an argument in
_favor_ of providing a stack trace. You are correct that _some_ subset
of panics are caused by missing patches. Posting a stack trace from
such a panic allows anyone who can search the SCO Technical Articles to
match it up with the trace in the corresponding TA. For instance,
suppose someone posted that their system panic'd with this trace
(excerpted):

e0000c34 e0000cec kstrcpy (u+0xd74,0x1,0x4e0,u+0x1148)
e0000cf4 e0000d8c Scfgioctl (0x2700,0xd,0x8047ec4,0x3)
e0000d94 e0000dbc devfs_ioct (inode+0x4290,0xd,0x8047ec4,0x3)

You could track that down to TA #109675 in about 30 seconds.

If instead they posted that it was a "trap E", you would spend half a
dozen question-answer cycles before you got to the cause.
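
To make the lookup concrete, a small sketch that pulls just the routine names out of a saved trace so they can be pasted into the TA search. It assumes frame lines have the layout of the excerpt above (two hex addresses, then the routine name); trace_routines is a made-up name:

```sh
#!/bin/sh
# trace_routines [FILE...]
# Print the routine name from every stack-frame line, in stack order.
# Reads standard input when no file is given.
# Assumed frame layout: <hex addr> <hex addr> <routine> (args...)
trace_routines() {
    awk '$1 ~ /^[0-9a-f]+$/ && $2 ~ /^[0-9a-f]+$/ && NF >= 3 { print $3 }' "$@"
}

# Possible use (not run here):
#   trace_routines /tmp/panic | tr '\n' ' '
```

For the three frames quoted above this would list kstrcpy, Scfgioctl and devfs_ioct, which are the strings worth searching for.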

> >dump yourself, you are typically going to go ask someone for help on it
> >-- whether that's the newsgroups, SCO, or a local consultant. Whoever
>
> Quick, name three consultants with the resources to analyze a dump :-)

Name 20 frequent posters to this group who can look up a string in the
TA search engine.

> >it is, one of the first things they're going to ask for is a symbolic
> >traceback of the panic. That can be gotten out of a panic dump with
> >little skill; can't be gotten out of the panic _message_ at all.
>
> >The message printed at panic time is full of hexadecimal addresses that
> >are _meaningless_ without examining the kernel that produced them.
>
> >Anyone wanting help with a panic should _at least_ allow a dump to be
> >saved in /dev/swap, then run:
>
> > # crash -d /dev/swap
> > > panic -w /tmp/panic
>
> >and post the resulting contents of /tmp/panic.
>
> Well, tell you what: I'll be sure to suggest that from now on, but
> my bet is that unless you happen to be reading that day, it will
> get them nothing.

Well, even aside from the fact that I read almost everything on this
group, you're still wrong. Function names in the trace lead very
quickly to a general area of suspicion. If there are a bunch of
functions named "bladxyz" on the stack, not much "analysis" is required
to offer the suggestion that they make sure their "blad" driver is the
most recent version.

You're advising people to leave out the most salient information they
are likely to have available in a panic situation. That makes no sense.

>Bela<

to...@aplawrence.com

Jun 14, 2003, 7:39:44 PM
Lucky Leavell <sco...@unixpress.com> wrote:
>
>First, let me say I have the utmost respect for Tony and his opinions and
>certainly his treasure trove of a web site. Perhaps it is because I was a
>systems programmer (Xerox CP-V) in a former life but the crash tool and
>the dump from a crash do provide important clues beyond the panic display
>like the currently scheduled processes.

Sure, if you have the skills and knowledge, go for it. I don't, and
I don't know many people who can get very far with this.

>One excellent reason to go this route is the thought of having a novice
>try to write down all those registers and hex contents without making a
>mistake, not to mention the time required while their users are breathing
>down their neck. (I am nearly an hour away so I can't just pop over
>myself.)

Well, as I said, most of the time it's hardware and the analysis is
not going to help you a lot. There's also the issue of time: ripping
out ram is quick, simple and cheap.. it may not be the issue at all,
but it's easy to do. Other things like nics are easy too. It's
also often easy enough to pop the drives into a whole other box.

I had one very recently that apparently ended up being the nic. I say
apparently because I wasn't there, but this was a machine that wouldn't
even stay up long enough for a backup unless it cooled off for an
hour first. Time down can cost big, big money when you have employees
sitting around doing nothing. Better to rip into the hardware and
if you are lucky, you fix it quickly.

But we all approach things with the tools and skills we feel most
confident in. One of the things we seldom see here is examples of
tracing panics and the diagnosis. If I remember, Bela might have
blessed us with something like that once or twice, and I probably
saved it at my site, but we sure don't see much of that. That,
I think, lends weight to my argument that it's a confusing swamp
for most of us and that few have gotten useful results.

But I'd be happy to be proved wrong :-) So if your analysis helps
you solve the problem, do a big favor for other folks and write it up.
Post it here or I'd happily publish it at http://aplawrence.com and
index it seven ways from Sunday.

to...@aplawrence.com

Jun 14, 2003, 7:55:08 PM
Bela Lubkin <be...@sco.com> wrote:
>Tony Lawrence wrote:

>> Bela Lubkin <be...@sco.com> wrote:
>> >Tony Lawrence wrote:
>>
>> >> Do you really want to go through the trouble of saving, extracting, and
>> >> analyzing a dump? Do you even have the tools and knowledge to
>> >> do so?
>> >>
>> >> For most of us, a dump is nearly useless. The trap and cpu information is
>> >> enough to tell us the type of problem and whether it is repeating
>> >> (which could indicate a specific driver), but dumps can't give
>> >> us much more. I can read assembly language fairly well,
>> >> and I know a little bit about hardware and vm and so on, but even if I
>> >> had source code to help me, I wouldn't bother with it, at least
>> >> not initially. You need intimate knowledge and experience to get
>> >> very far. The trap/cpu registers give enough for my teeny brain
>> >> to work with, thank you.
>> >>
>> >> Unless of course you are planning to throw a bunch of cash SCO's
>> >> way and have them look at it?
>>
>> >I have to firmly disagree with you on this. You don't have to analyze a
>>
>> Well, we'll have to disagree. More than 99% of the time panics are
>> either bad ram or other hardware or missing patches. I see no point
>> in paying or taking the time to analyze a dump for that.

>76.3% of statistics are made up out of thin air. Well, at least these
>two are.

OK. 99% of panics *I have been involved in*. Which probably doesn't
begin to be a noticeable percentage of your experience, but it is
not insignificant. In fact, I'd almost say 100% but there was one
that I was only tangentially involved with and I think it went out
for analysis. Dunno what the results were though. Everything else
I have ever seen has been simple lack of patches, bad ram, or a defective
card or motherboard.

I think we're probably talking about different things though: if
a machine is panicking now and then, or it's a new install, maybe
that is the best way to do it.

But when it's been running for some time and suddenly the darn thing
won't stay up long or won't boot at all, my rip
and replace is (again, my opinion) a far better approach. The
probability of hardware is very, very high.

If it's been running a while but has "always" panicked now and then,
I suspect patches and drivers. As it's usually the case that the real
major and important ones aren't applied, I apply 'em. If it fixes it,
great. If not, start ripping..

But then I do lack culture and finesse..

Bela Lubkin

Jun 14, 2003, 8:51:15 PM
to sco...@xenitec.ca
Tony Lawrence wrote:

> Lucky Leavell <sco...@unixpress.com> wrote:
>
> >One excellent reason to go this route is the thought of having a novice
> >try to write down all those registers and hex contents without making a
> >mistake, not to mention the time required while their users are breathing
> >down their neck. (I am nearly an hour away so I can't just pop over
> >myself.)
>
> Well, as I said, most of the time it's hardware and the analysis is
> not going to help you a lot. There's also the issue of time: ripping
> out ram is quick, simple and cheap.. it may not be the issue at all,
> but it's easy to do. Other things like nics are easy too. It's
> also often easy enough to pop the drives into a whole other box.

I guess we have very different ideas of what is easy. Running a couple
of commands and posting the resulting text on USENET is easy, to me.
Scheduling some down time, (possibly driving or flying across the
country to the problem system), getting everyone off the system,
shutting down, swapping out bits of hardware, bringing it back up, then
waiting days or weeks to see whether an intermittent panic has stopped
-- that's not easy.

> But we all approach things with the tools and skills we feel most
> confident in. One of the things we seldom see here is examples of
> tracing panics and the diagnosis. If I remember, Bela might have
> blessed us with something like that once or twice, and I probably
> saved it at my site, but we sure don't see much of that. That,
> I think, lends weight to my argument that it's a confusing swamp
> for most of us and that few have gotten useful results.

It's only a made-up number, but I would estimate that I have helped
solve about 100 different problems by way of analyzing panic stack
traces posted on USENET. Perhaps your internal counter of these doesn't
increment because your eyes glaze over when the discussion goes in that
direction.

dejagoogle search for "panic group:comp.unix.sco.misc author:bela" finds
300+ articles.

Without going back to analyze those articles in detail, I am simply
stating from _my_ experience that a symbolic trace of the panic is _very
useful_ in a large proportion of panic cases. By "large proportion" I
mean something more than 20% and less than 80% -- I can't be more
accurate than that without going off to do a bunch of analysis which
would be a waste of time. Even if it is only 20% likely to help, it is
still totally worth the poster's while to obtain and post a stack trace.

Heck, even knowing the operating system version number is probably only
_really_ needed about 20% of the time, but we _all_ whine if a poster
doesn't include that...

>Bela<

to...@aplawrence.com

Jun 15, 2003, 8:57:17 AM
Bela Lubkin <be...@sco.com> wrote:
>Tony Lawrence wrote:

>> Well, as I said, most of the time it's hardware and the analysis is
>> not going to help you a lot. There's also the issue of time: ripping
>> out ram is quick, simple and cheap.. it may not be the issue at all,
>> but it's easy to do. Other things like nics are easy too. It's
>> also often easy enough to pop the drives into a whole other box.

>I guess we have very different ideas of what is easy. Running a couple
>of commands and posting the resulting text on USENET is easy, to me.
>Scheduling some down time, (possibly driving or flying across the

Scheduling down time? Bela, I'm talking about systems that are crashing,
that won't stay running long enough to get work done, that are screwing
up data and stopping business. There's no problem scheduling the time.

Again, I think we probably have different viewpoints. You probably
see more installation problems and I see more existing systems. It's
an entirely different condition: an installation can be put off, a
running system needs to be fixed now.

There's another issue too: let's say it's NOT hardware and it's NOT
patches. What the heck am I going to do about it? Some esoteric
bug in a driver and no replacement. Unlikely on a running system,
but it could happen because of growth: more users, new resource demands,
whatever. My approach doesn't fix the problem (I don't think I
have ever had one like this, but maybe). But what happens next?
Well, maybe we go your way and maybe eventually somebody can patch
whatever needs patching and fix it. Well, I'm seldom in a position
to wait for that. I'll change out the whole damn machine, or figure
out a kludge that will get us around the problem and get business flowing
again.

Bela Lubkin

Jun 15, 2003, 5:34:45 PM
to sco...@xenitec.ca
Tony Lawrence wrote:

> Bela Lubkin <be...@sco.com> wrote:
> >Tony Lawrence wrote:
>
> >> Well, as I said, most of the time it's hardware and the analysis is
> >> not going to help you a lot. There's also the issue of time: ripping
> >> out ram is quick, simple and cheap.. it may not be the issue at all,
> >> but it's easy to do. Other things like nics are easy too. It's
> >> also often easy enough to pop the drives into a whole other box.
>
> >I guess we have very different ideas of what is easy. Running a couple
> >of commands and posting the resulting text on USENET is easy, to me.
> >Scheduling some down time, (possibly driving or flying across the
>
> Scheduling down time? Bela, I'm talking about systems that are crashing,
> that won't stay running long enough to get work done, that are screwing
> up data and stopping business. There's no problem scheduling the time.

Then you are talking about a small subset of the entire spectrum of
systems that panic. I've seen everything from systems that always panic
in the same place during bootup, all the way to systems that panic
around once a year. Nevertheless, it is _still_ useful to look at a
panic trace from any of those.

> Again, I think we probably have different viewpoints. You probably
> see more installation problems and I see more existing systems. It's
> an entirely different condition: an installation can be put off, a
> running system needs to be fixed now.

That isn't the dividing line at all. I also am mostly talking about
already-up systems which panic periodically. You seem to be saying that
under your care, you only see systems that panic _frequently_ so that it
isn't necessary to "schedule downtime". Of course I have seen systems
like that, but I've also seen ones that panic around once a week --
which means that if you went onsite to witness it, it might happen
immediately, it might take 3 weeks to reproduce. With such a system you
would not want to wait for a panic before applying a potential fix. You
would want to bring it down deliberately -- which means negotiating some
sort of scheduled downtime with the users. Yes, if the system has been
panicking a lot they're probably used to unscheduled downtime and will
let you bring it down when you want to, but it's still a lot more polite
to _check_ with them about that!

> There's another issue too: let's say it's NOT hardware and it's NOT
> patches. What the heck am I going to do about it? Some esoteric
> bug in a driver and no replacement. Unlikely on a running system,
> but it could happen because of growth: more users, new resource demands,
> whatever. My approach doesn't fix the problem (I don't think I
> have ever had one like this, but maybe). But what happens next?
> Well, maybe we go your way and maybe eventually somebody can patch
> whatever needs patching and fix it. Well, I'm seldom in a position
> to wait for that. I'll change out the whole damn machine, or figure
> out a kludge that will get us around the problem and get business flowing
> again.

Ok, that may be your mode of operation. It still makes no sense to me,
and I really wish you wouldn't push it to other readers of the
newsgroup.

You are projecting your own lack of knowledge about kernel internals
onto other readers. That alone wouldn't be bad, but you're also
(somehow) arguing that they can't benefit from the _plethora_ of
knowledge to which their question can be exposed, by posting it here.

It's like ... suppose you went to a foreign grocery store and were given
a free package of, I dunno, pudding mix, along with the stuff you were
intentionally buying. It has instructions in some language that you
can't read (but which uses the same character set as English). Your
argument is that since you can't read the instructions, you should just
throw out the product. Mine is that, since you're a member of a
discussion group where many other readers _do_ read that language, you
should type it in and see if someone can translate for you.

Sure, it _might_ not work, but what's the cost of trying? A few minutes
of typing? vs. having to go onsite to swap out bits of hardware, wait
to see if it helped, repeat until successful. Which might be forever,
if it really is a software problem unrelated to bad hardware.

>Bela<

to...@aplawrence.com

Jun 15, 2003, 7:36:15 PM
Bela Lubkin <be...@sco.com> wrote:

>Ok, that may be your mode of operation. It still makes no sense to me,
>and I really wish you wouldn't push it to other readers of the
>newsgroup.

Well, Bela, no offense meant, but it's a public newsgroup. I
respect your opinion, but I don't agree with it, and unlike
certain other venues, I can continue to express my opinions
here in spite of your dislike of them.

>You are projecting your own lack of knowledge about kernel internals
>onto other readers. That alone wouldn't be bad, but you're also
>(somehow) arguing that they can't benefit from the _plethora_ of
>knowledge to which their question can be exposed, by posting it here.


I've been reading and posting to this newsgroup since 1991. I
haven't seen the benefits you claim.

It is my opinion that you are wrong in this regard, and nothing you
have said has changed my opinion. However, I've put everything you
have had to say on my site, and later on will cross reference it
so that my odious opinions are balanced by yours.
