It was crashing for a few days, but it isn't the Debian OS because it
also crashed when using a DSL LiveCD and a Knoppix LiveCD. I thought
that those tests would eliminate the Debian OS as well as the HDD.
It ran for a week without crashing when running from tomsrtbt on a floppy.
I fired it up again on the Debian 4.0 installed on the HDD and it ran OK
for a week until I accidentally killed the AC power feed which ended the
week's uptime. Then I added a UPS and rebooted it.
It ran for 5 days without failure, until a short while ago.
I had a tail -f running on the syslog and just now when I checked it I
noticed the box was in trouble.
john@optima12:~$ date
Tue Apr 29 22:57:46 CDT 2008
john@optima12:~$ uptime
08:18:02 up 4 days, 10:23, 3 users, load average: 0.04, 0.02, 0.00
john@optima12:~$
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: Oops: 0000 [#1]
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: CPU: 0
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: EIP is at drain_array+0x10/0x7f
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: eax: 00000000 ebx: c7f0a220 ecx: 07ecb000 edx:
c7f0a220
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: esi: c7fb48a0 edi: 07ecb000 ebp: c7fb48a0 esp:
c7b51f54
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: ds: 007b es: 007b ss: 0068
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: Process events/0 (pid: 3, ti=c7b50000 task=c7b40030
task.ti=c7b50000)
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: Stack: c7f0a220 c7fb48a0 c7b60740 00000000 c014703b
00000000 00000000 00000292
Message from syslogd@optima12 at Wed Apr 30 11:56:39 2008 ...
optima12 kernel: c036bd80 c012075a c014700d c7b60750 c7b60740
c7b60748 00000000 c0120c30
Message from syslogd@optima12 at Wed Apr 30 11:56:40 2008 ...
optima12 kernel: 00000001 00000000 c7b40ab0 00010000 00000000
00000000 c7b40030 c0111cba
Message from syslogd@optima12 at Wed Apr 30 11:56:40 2008 ...
optima12 kernel: Call Trace:
Message from syslogd@optima12 at Wed Apr 30 11:56:40 2008 ...
optima12 kernel: Code: 24 08 83 c5 04 8b 44 24 04 39 44 24 08 0f 8c 40
ff ff ff 83 c4 0c 5b 5e 5f 5d c3 55 57 56 53 89 c5 89 cf 8b 44 24 14 85
c9 74 6a <83> 39 00 74 65 83 79 0c 00 74 0d 85 c0 75 09 c7 41 0c 00 00 00
Message from syslogd@optima12 at Wed Apr 30 11:56:40 2008 ...
optima12 kernel: EIP: [<c01461d8>] drain_array+0x10/0x7f SS:ESP
0068:c7b51f54
Then I tried to see if the box was still alive:
john@optima12:~$ date
Segmentation fault
john@optima12:~$
Message from syslogd@optima12 at Wed Apr 30 12:09:59 2008 ...
optima12 kernel: Oops: 0000 [#2]
Message from syslogd@optima12 at Wed Apr 30 12:09:59 2008 ...
optima12 kernel: CPU: 0
Message from syslogd@optima12 at Wed Apr 30 12:09:59 2008 ...
optima12 kernel: EIP is at cache_alloc_refill+0xf0/0x3ea
Message from syslogd@optima12 at Wed Apr 30 12:09:59 2008 ...
optima12 kernel: eax: 00000000 ebx: c7fb08e0 ecx: 00000000 edx:
00000000
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: esi: 05836020 edi: c7ecc3e0 ebp: c7e990c0 esp:
c3b91ce8
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: ds: 007b es: 007b ss: 0068
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: Process bash (pid: 1092, ti=c3b90000 task=c7dd5030
task.ti=c3b90000)
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: Stack: c3b91dc0 000000d0 c7fb4840 0000001b 00000000
00000000 c88d33c4 018b2900
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: 00000000 c015bff7 00000282 c7ed4e00 c111a260
c7ed4e00 c0146357 00000001
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: Call Trace:
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: Code: 00 00 8b 4d 00 8b 5e 14 8b 44 24 08 8b 50 10 0f
af d3 03 56 0c 8b 04 24 40 89 46 10 8b 44 9e 1c 89 46 14 89 54 8d 10 41
89 4d 00 <8b> 56 10 89 14 24 8b 4c 24 08 3b 51 1c 73 0b ff 4c 24 0c 83 7c
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: EIP: [<c014644f>] cache_alloc_refill+0xf0/0x3ea SS:ESP
0068:c3b91ce8
Message from syslogd@optima12 at Wed Apr 30 12:10:00 2008 ...
optima12 kernel: c88dd2d1 c015c10d 00000001 00000000 c111a260
c015cbfd c88d6989 00000001
At this point the SSH connection stopped responding, and I cannot
reconnect remotely. I'll have to physically visit the computer and see
what it is doing and reboot it.
Oops is briefly described here:
http://en.wikipedia.org/wiki/Linux_kernel_oops
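For reading the header lines of the two reports above, here is a rough Python sketch. It assumes the usual 32-bit x86 meaning of the low three bits of the number printed after "Oops:" (page present or not, read or write, kernel or user mode), and it only looks at the "Oops:" and "EIP is at" lines. The function names `decode_oops_code` and `summarize` are made up for this illustration and are not part of any kernel tool.

```python
import re

def decode_oops_code(code_hex):
    """Decode the low three bits of an x86 page-fault error code."""
    code = int(code_hex, 16)
    return {
        "cause": "protection fault" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }

OOPS_RE = re.compile(r"Oops: ([0-9a-f]{4}) \[#(\d+)\]")
EIP_RE = re.compile(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def summarize(lines):
    """Return one dict per oops found in the given log lines."""
    reports = []
    for line in lines:
        m = OOPS_RE.search(line)
        if m:
            report = {"seq": int(m.group(2))}
            report.update(decode_oops_code(m.group(1)))
            reports.append(report)
            continue
        m = EIP_RE.search(line)
        if m and reports:
            # Attach the faulting function to the most recent oops header.
            reports[-1]["function"] = m.group(1)
            reports[-1]["offset"] = int(m.group(2), 16)
            reports[-1]["size"] = int(m.group(3), 16)
    return reports

# Lines copied from the two reports above.
sample = [
    "optima12 kernel: Oops: 0000 [#1]",
    "optima12 kernel: EIP is at drain_array+0x10/0x7f",
    "optima12 kernel: Oops: 0000 [#2]",
    "optima12 kernel: EIP is at cache_alloc_refill+0xf0/0x3ea",
]
for report in summarize(sample):
    print(report)
```

Under that assumption, both reports decode as a kernel-mode read of a page that was not present. That fits a bad pointer being followed, but it does not by itself say whether hardware or software produced the bad pointer.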
I have Debian 4.0 running on several other computers without any
problems, and Debian 3.1 running on identical PC hardware without
problems. I don't suspect Debian, but if there is a clue to a hardware
failure location, it would be of great assistance to pinpoint it.
Any ideas of the cause and solution?
Should this be provided to Debian for bug tracking? If so, what is the
best URL?
--
John
No Microsoft, Apple, Intel, Trend Micro, nor Ford products were used in the preparation or transmission of this message.
The EULA sounds like it was written by a team of lawyers who want to tell me what I can't do. The GPL sounds like it was written by a human being, who wants me to know what I can do.
Try doing a memory test, like memtest86+.
Robert
Took your advice, even though I didn't suspect a memory issue, since
this server has run for many days in the past with various OSes (see above).
The memtest86 has run for seven straight days now without any failure.
The kerneloops.org site says to report oopses from distros to the
various distros instead of their site. So it looks like I'll have to
report it to Debian, but would like to figure it out myself since it
must be my hardware (other Etch PCs aren't oopsing).
Do you have any knowledge of what the oops message indicates, such as a
particular piece of hardware?
Here it is again:
Well, that is in the slab allocator, so it could well be a driver
problem. Does your kernel include any non-free drivers (e.g. nvidia or a
Windows wireless driver)?
Robert
>
> [snip]
> Well, that is in the slab allocator, so it could well be a driver
> problem. Does your kernel include any non-free drivers (e.g. nvidia or a
> Windows wireless driver)?
>
> Robert
Not that I'm aware of. I don't use nvidia or wireless (or Windows). This
is a headless server, but presently with a keyboard and monitor for
testing. It is also on a LAN.
It is an old AMD-K6 266 MHz box with 128 MB RAM. It was running Debian
3.1 until I upgraded it to Etch. I've done the same on several other
identical computers with 266 and 233 MHz CPUs, and RAM down as low as 48 MB.
I'd rule out the kernel, Robert, because it fails with Debian Etch, and
from the DSL and Knoppix LiveCDs. ;-)
It did run a full week with tomsrtbt. That, if anything, would lead me
to suspect a hard drive issue. Perhaps the floppy-based tomsrtbt and the
memtest86 do not look at the hard drive, so they do not fail. The
LiveCDs, DSL and Knoppix, run from the CD-ROM drive, but might be
occasionally looking at the hard drive.
It also might be the IDE circuitry, which both the hard drive and CD-ROM
would utilize. Or even the power supply could be drooping during a disk
head movement, etc. But if so, why did Debian 3.1 run for months with no
failure? I just don't suspect software at all.
I hate shotgunning intermittent troubles because of the excessive time
required to eliminate this and that, and to isolate the problem to a
specific piece of hardware. This one PC has been "cooking" now for over
one month while testing various theories, and still no conclusive proof.
If I don't have a tail on the log, then there is no way to see any
logged failure. I was hoping the oops info could pin it down to a
specific area, then I could concentrate on that, like maybe
disconnecting the hard disk and running a LiveCD, disconnecting the
CD-ROM's IDE and/or power connector, or even swapping the hard disk,
etc. Each test can take many days, and if there is no failure, it still
isn't conclusive.
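One way to avoid depending on an open tail session would be to copy oops lines to another machine as they appear. The following is only a sketch under assumptions: the marker strings are guesses based on the reports above, the log path `/var/log/syslog` and the receiving host and port are placeholders, and the helper names are invented here. It does not handle log rotation, and if the box is already too far gone to run user processes, nothing will be sent.

```python
import socket
import time

# Strings ASSUMED to mark kernel oops output, based on the reports above.
MARKERS = ("Oops:", "EIP is at", "EIP:", "Call Trace:")

def is_oops_line(line):
    """True if a syslog line looks like part of a kernel oops report."""
    return "kernel:" in line and any(marker in line for marker in MARKERS)

def follow(path, poll_seconds=1.0):
    """Yield lines appended to path, similar to tail -f."""
    with open(path, "r", errors="replace") as log:
        log.seek(0, 2)  # jump to the current end of the file
        while True:
            line = log.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)

def forward_oopses(path, host, port):
    """Send every oops-looking line to host:port as a UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for line in follow(path):
        if is_oops_line(line):
            sock.sendto(line.encode("utf-8", "replace"), (host, port))

# Quick check of the filter on two sample lines (the second is invented).
sample = [
    "optima12 kernel: EIP is at drain_array+0x10/0x7f",
    "optima12 sshd[1092]: session opened for user john",
]
print([line for line in sample if is_oops_line(line)])
```

It would be started with something like `forward_oopses("/var/log/syslog", "192.0.2.10", 5140)`, where the address and port stand in for whatever machine is listening. UDP is used so that the sender never blocks waiting for the receiver.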
As you know, the final conclusive test is: it will likely fail after the
cover is replaced, shoved back in the server rack, and put back into
service. ;-)
That is the one item I plan on isolating first. It's not the CD drive
though. Read on....
The CD-ROM in the PC was flaky when I received it in 2005. I had to
remove it, open it, and I found the tray was stuck and the laser LED
carrier wouldn't move. I applied a small amount of white grease, did a
lot of testing, and then reassembled the drive.
Then it worked well, so I could install Debian Sarge.
After two years, when I tried to open the CD-ROM to insert the Etch CD,
the drive wouldn't open. This is probably a mechanical problem because I
can feel some vibration.
Using a thin hook (cable lashing sewing needle), I was able to get the
door to open and the tray followed. After working it a few times it
"self-lubricated" and worked without sticking.
Then I tried to install Etch, but there were occasional read errors from
the CD. I removed this CD-ROM drive and installed a spare CD-R/W just to
run the Etch CD. It ran perfectly, and Etch was installed (or actually
Sarge was upgraded).
Now I'm kinda leaning more toward the IDE channel since LiveCDs also are
crashing, as well as the Etch on the hard drive. The only time it has
run for several days is when using tomsrtbt from a floppy. It also ran
for several days when I accessed the BIOS screen and left it in that
condition so I would know if it had rebooted or not.
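The elimination reasoning so far can be written down as a small Python sketch. The component sets are assumptions about what each test exercises, not measurements, and the pass/fail results are the ones reported in this thread. A component stays a suspect if it was involved in every crashing run and in none of the clean runs.

```python
# (test name, components ASSUMED to be exercised, did it crash?)
RUNS = [
    ("Debian Etch on HDD", {"cpu", "ram", "psu", "ide", "hdd"}, True),
    ("DSL LiveCD", {"cpu", "ram", "psu", "ide", "cdrom"}, True),
    ("Knoppix LiveCD", {"cpu", "ram", "psu", "ide", "cdrom"}, True),
    ("tomsrtbt from floppy", {"cpu", "ram", "psu", "floppy"}, False),
    ("memtest86", {"cpu", "ram", "psu"}, False),
    ("BIOS setup screen", {"cpu", "psu"}, False),
]

def suspects(runs):
    """Components common to all crashing runs and absent from all clean runs."""
    failing = [parts for _, parts, crashed in runs if crashed]
    passing = [parts for _, parts, crashed in runs if not crashed]
    common = set.intersection(*failing) if failing else set()
    cleared = set().union(*passing)
    return common - cleared

print(sorted(suspects(RUNS)))
```

With these assumed sets the only remaining suspect is "ide". The model is crude: a clean week does not prove a part is good when the fault is intermittent, and it treats the power supply as cleared even though it might only droop under disk activity.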
So I'm going to disconnect the CD-ROM drive, both power and IDE, and see
if it runs without trouble. If so, in a week or so, I'll plug the power
back into the CD-ROM and test further. The next step would be to plug
the IDE cable back into the drive and run it for another week.
Should that fail, I'll disconnect the hard drive instead, and boot a
LiveCD of Knoppix and test that way.
Right now it is running the memtest86 from an Ubuntu 7.04 Feisty Fawn
LiveCD, and the hard drive is still connected. A week has passed with no
failures.
Time will tell, and it certainly takes time.
> If it's the channel, well ... you'll know what to buy ;)
>
I bought 13 of these identical school district junkers for $5.00 each,
and wound up with 13 that still worked!
After adding a $2.50 CMOS battery to each, they are worth $7.50 now. ;-)
I wouldn't put any money into them since I am really only running two
24/7. If I wanted, I could buy an ISA or PCI IDE controller card, but
that probably would be too much.
The original light bulb that went on above my head spelled out Beowulf. ;-)
> Good plan. FYI, if you're not already aware - I've seen that you *are*
> savvy - a LiveCD doesn't need the HD to be attached to operate. Instead,
> it'll be forced to run in RAM alone. Some, like Ubuntu, don't even make
> a ramdisk and the HD is entirely ignored. This is the _only_ means by
> which I test them, as a Sabayon DVD wrote to all users' /home/.mozilla &
> .mozilla-thunderbird directories, overwriting browser home pages with
> theirs, and all email settings were gone.
>
Yuck! Sounds like my recent problem of Suse 10 renumbering my UID when I
set it up to use a common /home partition on a multiboot with five
distros that got along together (Fedora Core 3, Ubuntu 6.06 and 7.10,
Debian 3.1 and 4.0).
>> Time will tell, and it certainly takes time.
>>
>
> It will and I wish you good hunting.
Thank you.