systems hangs every few days


Chris Purves

Jun 18, 2013, 9:10:04 AM
After upgrading to wheezy, I get a system hang every one or two days where the system becomes completely unresponsive and I need to do a cold boot.

This is an older machine with an Athlon processor. I'm not running X. I don't see anything unusual in the logs. The last entry in syslog is typically a cron job, but not always the same one. The system seems to freeze without any warning.
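
For anyone wanting to collect the last syslog entry before each freeze, here is a minimal sketch. It assumes you have picked a marker string that appears on the first line logged after a reboot (check your own log to see what that line is); the function name and the sample lines are made up for illustration.

```python
def last_lines_before_boot(lines, marker):
    """Return the log line that immediately precedes each boot marker."""
    found = []
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if marker in line and previous is not None:
            found.append(previous)
        previous = line
    return found


# Made-up sample lines; "BOOT" stands in for whatever marker your log uses.
sample = [
    "Jun 17 03:05:01 host CRON[1234]: (root) CMD (job-a)",
    "Jun 17 09:12:44 host BOOT first line after restart",
    "Jun 18 04:10:01 host CRON[2345]: (root) CMD (job-b)",
    "Jun 18 08:59:07 host BOOT first line after restart",
]
for entry in last_lines_before_boot(sample, "BOOT"):
    print(entry)
```

To run it against a real log, pass an open file object (for example the result of open() on the syslog file) in place of the sample list.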

I tried downgrading the kernel back to the squeeze version (2.6) and it still locks up. Before upgrading to wheezy I resized a few of the partitions. Other than that, nothing else has changed and everything had been running fine for years.

I'd appreciate any help in debugging this problem.

Thanks.


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"I can't have a lobotomy just because I've got a pirate costume on." - Christine Purves


--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/51C0599B...@northfolk.ca

Rob Owens

Jun 18, 2013, 9:30:02 AM
----- Original Message -----
> From: "Chris Purves" <ch...@northfolk.ca>
> To: debia...@lists.debian.org
> Sent: Tuesday, June 18, 2013 8:59:07 AM
> Subject: systems hangs every few days
>
> After upgrading to wheezy, I get a system hang every one or two days
> where the system becomes completely unresponsive and I need do a
> cold boot.
>
That sounds a lot like bad RAM. Run MEMTEST86 on it. I'm not sure if the Debian install CDs contain it, but Knoppix usually has it as a boot option. The full test could take more than a day. If you're lucky, it'll find an error in the first 10 minutes.

-Rob



Chris Purves

Jun 18, 2013, 10:40:01 AM
On 2013-06-18 10:20, Rob Owens wrote:
> ----- Original Message -----
>> From: "Chris Purves" <ch...@northfolk.ca>
>> To: debia...@lists.debian.org
>> Sent: Tuesday, June 18, 2013 8:59:07 AM
>> Subject: systems hangs every few days
>>
>> After upgrading to wheezy, I get a system hang every one or two days
>> where the system becomes completely unresponsive and I need do a
>> cold boot.
>>
> That sounds a lot like bad RAM. Run MEMTEST86 on it. I'm not sure if the debian install cds contain that, but Knoppix usually has that as a boot option. The full test could take more than a day. If you're lucky, it'll find an error in the first 10 minutes.
>

I already ran memtest86 for about 20 minutes without error. (I forgot to mention that in the original post.) I will run it overnight tonight to see if anything else pops up.

I am skeptical that it's bad RAM since the problem occurred after upgrading and I would have expected some kernel panic errors in the logs or something like that, although I suppose bad RAM can manifest itself in different ways.


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"...he was right in front of me trying to block my way, so I took him out." - Jean Chrétien



Karl E. Jorgensen

Jun 18, 2013, 11:40:02 AM
On Tue, Jun 18, 2013 at 01:59:07PM +0100, Chris Purves wrote:

> After upgrading to wheezy, I get a system hang every one or two days
> where the system becomes completely unresponsive and I need do a
> cold boot.

> This is an older machine with an Athlon processor. I'm not running
> X. I don't see anything unusual in the logs. The last entry in
> syslog is typically a cron job, but not always the same one. The
> system seems to freeze without any warning.

When the system freezes, is there anything useful on the console? If
the kernel craps out, the result may not be visible in the log
files (because things can halt before buffers are flushed etc).

It may be useful to disable screen blanking on the console for this -
the kernel may (or may not) wake up the console upon death. (I call
that the JFK syndrome: He never knew what hit him).

A couple of candidates spring to mind:

* Overheating? If the system is old, it may be full of dust and thus
the fans may struggle. Or the bearings get worn out. Insufficient
airflow and cooling does tend to make things go pop - except for
CPUs which (I believe) shut themselves down due to a built-in
self-preservation instinct courtesy of the hardware engineers.

* Struggling power supply? If the power supply is just barely
providing enough power, random things which require more power may
cause voltage drops that some components take a dislike to. Although
the system *should* be consuming its peak amount of power during
power-on, peaks may also occur later.

* Bad RAM? (already covered in a different part of the thread)

* Bad capacitors? Older motherboards are more likely to suffer from
the capacitors going "pop". A web search for "Capacitor plague" is
probably more reliable and informative than I can achieve in this
email.

> I tried downgrading the kernel back to the squeeze version (2.6) and
> it still locks up. Before upgrading to wheezy I resized a few of the
> partitions. Other than that, nothing else has changed and everything
> had been running fine for years.

Assuming that the resize was healthy, all should be ok.

But... Since there are no clear suspects, paranoia dictates a run of
fsck on the affected file systems. Just in case. At least it is a
harmless check if you can afford the downtime while the file systems
are unmounted.

Hope this helps

--
Karl E. Jorgensen



Stan Hoeppner

Jun 18, 2013, 12:10:02 PM
On 6/18/2013 7:59 AM, Chris Purves wrote:
> After upgrading to wheezy, I get a system hang every one or two days where the system becomes completely unresponsive and I need do a cold boot.
>
> This is an older machine with an Athlon processor. I'm not running X. I don't see anything unusual in the logs. The last entry in syslog is typically a cron job, but not always the same one. The system seems to freeze without any warning.
>
> I tried downgrading the kernel back to the squeeze version (2.6) and it still locks up. Before upgrading to wheezy I resized a few of the partitions. Other than that, nothing else has changed and everything had been running fine for years.
>
> I'd appreciate any help in debugging this problem.

It's not a kernel/software problem as you're not seeing kernel panics,
nothing in the logs. Could be DRAM but it's unlikely. Given that
marginal silicon typically fails within hours/days of initial use and
rarely thereafter, it's probably not a DIMM gone bad as someone else
suggested.

You said this system is "older" and housing an Athlon CPU. There were 5
generations of Athlon produced from 1999 to 2005. Thus this box could
be anywhere from 7 to 14 years old. On machines of this age you need to
check/test/troubleshoot/replace hardware in the following order:

1. CPU fan -- these rarely last 7 years, let alone 14. Some models may lose
80% of their nominal RPM with age, yet without emitting noticeable
noise. The heatsink may get just enough airflow to allow a few
days of run time. When the fan fails completely, the box locks up
in a few minutes.

2. PSU fan -- a failing one will cause MOSFET/cap/etc overheating, which
can cause "random" lockups, reboots, and other odd behavior.

3. PSU itself -- a failed fan can permanently damage MOSFETs/caps/etc.
Even with a good fan, PSU components can fail with age.

4. Removable media drives -- floppy/CD/DVD-ROM can fail in odd ways
sending spurious high voltage signals or shorting wires, locking up
the motherboard, or causing random reboots. Disconnect their
data cable and power leads and run without them.

5. The motherboard. Even with good cooling over the life of a machine
the motherboard can still simply fail. You may not be able to find
bulged caps nor burn marks on VRMs, no visible signs of failure.

Case in point: I had a Biostar Socket A nForce2 400 motherboard
w/Athlon XP 2500 simply give up the ghost in 2011 in a similar
manner. It locked up a few times over a period of a week or so,
then simply wouldn't POST.

I built that machine in Aug 2003 and it lasted 8 years. I started
with two 92x25mm Panaflo case fans, plus the 80x25 PSU fan.
I replaced the PSU fan with an NMB Boxer, and the case fans with
two Nidec Beta Vs, twice during the life of the box. All of the
fans were fully functional at the time of replacement. This was
proactive maintenance. The box had 110 CFM of properly directed
airflow during its lifespan. Compare this to the ~30 CFM of a
quiet Dell, HP, or IBM machine. Anyone who knows hardware knows
that these are all top shelf 12VDC fans. The PSU is still running,
in another box, as are the two DIMMs and the CPU. The motherboard
simply gave up the ghost after 8 years of 24x7 operation. Let's
hope that isn't the case here.


--
Stan



Stan Hoeppner

Jun 18, 2013, 12:20:03 PM
On 6/18/2013 10:38 AM, Karl E. Jorgensen wrote:

> But... Since there are no clear suspects, paranoia dictates a run of
> fsck on the affected file systems. Just in case. At least it is a
> harmless check if you can afford the downtime while the file systems
> are unmounted.

If the PSU is headed south this may not be harmless. An fsck is going
to generate serious head movement while seeking the metadata inodes.
Head movement increases power draw, sometimes double that of the drive
at idle spin. If errors are found and corrections being written when
the system locks up, due to voltage drop caused by increased power draw
by the drive, then you may have a more corrupted filesystem, not less.

Find the cause of the lockup and fix it, before performing a destructive
repair of a filesystem. If a system is acting flaky the last thing
anyone should do is a destructive repair, as it just might be that.

--
Stan



Jape Person

Jun 18, 2013, 1:10:03 PM
On 06/18/2013 10:31 AM, Chris Purves wrote:
> On 2013-06-18 10:20, Rob Owens wrote:
>> ----- Original Message -----
>>> From: "Chris Purves" <ch...@northfolk.ca>
>>> To: debia...@lists.debian.org
>>> Sent: Tuesday, June 18, 2013 8:59:07 AM
>>> Subject: systems hangs every few days
>>>
>>> After upgrading to wheezy, I get a system hang every one or two days
>>> where the system becomes completely unresponsive and I need do a
>>> cold boot.
>>>
>> That sounds a lot like bad RAM. Run MEMTEST86 on it. I'm not sure if the debian install cds contain that, but Knoppix usually has that as a boot option. The full test could take more than a day. If you're lucky, it'll find an error in the first 10 minutes.
>>
>
> I ran memtest86 already for about 20 minutes without error. (I forgot to mention that in original post). I will run it overnight tonight to see if anything else pops up.
>
> I am skeptical that it's bad RAM since the problem occurred after upgrading and I would have expected some kernel panic errors in the logs or something like that, although I suppose bad RAM can manifest itself in different ways.

One thing I noted in your first message, Chris, was that it usually happens when
a cron job is running. I immediately thought of two possibilities:

1. One or more of the cron jobs are running a process that is "tickling" a
driver that doesn't work as well with the newer kernel as it did with the
previous one.

2. A scheduled job is doing something CPU-intensive that could be making the
system overheat. (I suppose load caused by a given job might be worse under
wheezy than it was under squeeze.)

Unfortunately, I didn't keep the earlier messages and can't refer back to them
to be certain whether or not my addled brain is remembering the details properly.

I hope you find the gremlin and oust him!

Regards,
J.



Karl E. Jorgensen

Jun 18, 2013, 1:30:02 PM
On Tue, Jun 18, 2013 at 05:18:25PM +0100, Stan Hoeppner wrote:
> On 6/18/2013 10:38 AM, Karl E. Jorgensen wrote:
>
> > But... Since there are no clear suspects, paranoia dictates a run of
> > fsck on the affected file systems. Just in case. At least it is a
> > harmless check if you can afford the downtime while the file systems
> > are unmounted.
>
> If the PSU is headed South this may not be harmless. An fsck is going
> to generate serious head movement while seeking the metadata inodes.
> Head movement increases power draw, sometimes double that of the drive
> at idle spin. If errors are found and corrections being written when
> the system locks up, due to voltage drop caused by increased power draw
> by the drive, then you may have a more corrupted filesystem, not less.
>
> Find the cause of the lockup and fix it, before performing a destructive
> repair of a filesystem. If a system is acting flaky the last thing
> anyone should do is a destructive repair, as it just might be that.

Ah yes. That is a very good point. Thanks for highlighting that.

In other words, it should be treated as an unexploded hand
grenade. Carefully.

Perhaps an fsck with -n (to prevent actual repair) would be safer?
That depends on the filesystem in use...
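
A rough sketch of such a check-only run, under a few assumptions: the filesystem-specific checker honours -n (e2fsck does, others may not), the device name passed in is only an example, and the mounted test is best effort because /proc/mounts may list the same device under a different name. The helper names are made up.

```python
import subprocess


def is_mounted(device, mounts_text):
    """True if device is listed as a source in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False


def readonly_fsck_command(device):
    """Build a check-only invocation; -n asks the checker not to repair."""
    return ["fsck", "-n", device]


def check_readonly(device, mounts_path="/proc/mounts"):
    """Refuse to check a mounted device, otherwise run fsck -n on it."""
    with open(mounts_path) as mounts:
        if is_mounted(device, mounts.read()):
            raise RuntimeError("{} is mounted; unmount it first".format(device))
    # The exit status is returned as-is; see fsck(8) for its meaning.
    return subprocess.call(readonly_fsck_command(device))
```

Nothing is written to the filesystem this way, so a lockup in the middle of the check should not make matters worse.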


--
Karl E. Jorgensen



Klaus

Jun 18, 2013, 3:00:02 PM
On 18/06/13 13:59, Chris Purves wrote:
> I'm not running X.
>
Do you have (a) a serial or (b) a graphical console attached to the
mo-bo, or (c) do you access the unit remotely via telnet / ssh? If b or
c, could you try to hook up a serial console while the system hangs and
check whether you regain access? Maybe that way you get a chance to do
more debugging. (I'm having similar issues on an NSLU2...)

--
Klaus



Chris Purves

Jun 19, 2013, 8:20:02 AM
On 2013-06-18 10:20, Rob Owens wrote:
> ----- Original Message -----
>> From: "Chris Purves" <ch...@northfolk.ca>
>> To: debia...@lists.debian.org
>> Sent: Tuesday, June 18, 2013 8:59:07 AM
>> Subject: systems hangs every few days
>>
>> After upgrading to wheezy, I get a system hang every one or two days
>> where the system becomes completely unresponsive and I need do a
>> cold boot.
>>
> That sounds a lot like bad RAM. Run MEMTEST86 on it. I'm not sure if the debian install cds contain that, but Knoppix usually has that as a boot option. The full test could take more than a day. If you're lucky, it'll find an error in the first 10 minutes.

I ran memtest86+ for three hours without finding any errors. I may try again for longer at a later time, but for now I'm going to rule out bad RAM.


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"It's not the end of the world, but you can see it from there." - Pierre Trudeau



Chris Purves

Jun 19, 2013, 8:30:01 AM
On 2013-06-18 15:57, Klaus wrote:
> On 18/06/13 13:59, Chris Purves wrote:
>> I'm not running X.
>>
> Do you have (a) a serial or (b) a graphical console attached to the
> mo-bo, or (c) do you access the unit remotely via telnet / ssh? If b
> or c, could you try to hook up a serial console while the system
> hangs and check whether you re-gain access? Maybe that way you get a
> chance to do more debugging. (I'm having similar issues on a
> NSLU2...)
>

I'll give that a try, thanks.


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"I am not afraid of an army of lions led by a sheep; I am afraid of an army of sheep led by a lion." - Alexander the Great



Chris Purves

Jun 19, 2013, 8:50:01 AM
On 2013-06-18 12:38, Karl E. Jorgensen wrote:
> On Tue, Jun 18, 2013 at 01:59:07PM +0100, Chris Purves wrote:
>
>> After upgrading to wheezy, I get a system hang every one or two days
>> where the system becomes completely unresponsive and I need do a
>> cold boot.
>
>> This is an older machine with an Athlon processor. I'm not running
>> X. I don't see anything unusual in the logs. The last entry in
>> syslog is typically a cron job, but not always the same one. The
>> system seems to freeze without any warning.
>
> When the system freezes, is there anything useful on the console? If
> the kernel craps out, the result may not be visible in the log
> files (because things can halt before buffers are flushed etc).
>
> It may be useful to disable screen blanking on the console for this -
> the kernel may (or may not) wake up the console upon death. (I call
> that the JFK syndrome: He never knew what hit him).

I disabled screen blanking and powersave on the console. We'll see if anything useful is displayed the next time it goes down.


> A couple of candidates spring to mind:
>
> * Overheating? If the system is old, it may be full of dust and thus
> the fans may struggle. Or the bearings get worn out. Insufficient
> airflow and cooling does tend to make things go pop - except for
> CPUs which (I believe) shut themselves down due to a built-in
> self-preservation instinct courtesy of the hardware engineers.

The CPU fan is about four years old and the CPU temperature hovers around 50-55 C; however, perhaps some other component on the motherboard is getting too hot. I can check into that.

> * Struggling power supply? If the power supply is just barely
> providing enough power, random things which require more power may
> cause voltage drops that some component take a dislike to. Although
> the system *should* be consuming peak amount of power during
> power-on peaks may also occur later.

I suspect that it may be the power supply. I replaced it about two years ago, but maybe it's time again. If I don't get any good information from the console I may try replacing the power supply.

> * Bad RAM? (already covered in a different part of the thread)
>
> * Bad capacitors? Older motherboards are more likely to suffer from
> the capacitors going "pop". A web search for "Capacitor plague" is
> probably more reliable and informative than I can achieve in this
> email.

I can take a look for that.

>> I tried downgrading the kernel back to the squeeze version (2.6) and
>> it still locks up. Before upgrading to wheezy I resized a few of the
>> partitions. Other than that, nothing else has changed and everything
>> had been running fine for years.
>
> Assuming that the resize was healthy, all should be ok.
>
> But... Since there are no clear suspects, paranoia dictates a run of
> fsck on the affected file systems. Just in case. At least it is a
> harmless check if you can afford the downtime while the file systems
> are unmounted.
>
> Hope this helps
>


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"Nobody goes there no more; it's too crowded!" - Yogi Berra



Chris Purves

Jun 19, 2013, 9:00:02 AM
On 2013-06-18 13:05, Stan Hoeppner wrote:
> On 6/18/2013 7:59 AM, Chris Purves wrote:
>> After upgrading to wheezy, I get a system hang every one or two days where the system becomes completely unresponsive and I need do a cold boot.
>>
>> This is an older machine with an Athlon processor. I'm not running X. I don't see anything unusual in the logs. The last entry in syslog is typically a cron job, but not always the same one. The system seems to freeze without any warning.
>>
>> I tried downgrading the kernel back to the squeeze version (2.6) and it still locks up. Before upgrading to wheezy I resized a few of the partitions. Other than that, nothing else has changed and everything had been running fine for years.
>>
>> I'd appreciate any help in debugging this problem.
>
> It's not a kernel/software problem as you're not seeing kernel panics,
> nothing in the logs. Could be DRAM but it's unlikely. Given that
> marginal silicon typically fails within hours/days of initial use and
> rarely thereafter, it's probably not a DIMM gone bad as someone else
> suggested.
>
> You said this system is "older" and housing an Athlon CPU. There were 5
> generations of Athlon produced from 1999 to 2005. Thus this box could
> be anywhere from 7 to 14 years old. On machines of this age you need to
> check/test/troubleshoot/replace hardware in the following order:
>
> 1. CPU fan -- rarely last 7 years, let alone 14. Some models may lose
> 80% of their nominal RPM with age, yet without emitting noticeable
> noise. The heatsink may get just enough airflow to allow a few
> days of run time. When the fan fails completely, the box locks up
> in a few minutes.

CPU fan is about three years old, and while trying to debug this previously I was running 'sensors' in a cron job every five minutes; the CPU temp never exceeded 60 C.
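
A minimal sketch of that kind of temperature logging, not the script actually used. It assumes 'sensors' prints lines shaped like "temp1:  +52.0°C  (high = +70.0°C)" (labels and layout vary by chip and lm-sensors version), and the log path is hypothetical. Each line is synced to disk so the last reading before a freeze is not lost in a buffer.

```python
import os
import re
import subprocess
import time

# Assumed line shape: "label:  +52.0°C ..."; fan and voltage lines do not match.
TEMP_LINE = re.compile(r"^([^:\n]+):\s*([+-]?\d+(?:\.\d+)?)\s*°?C\b", re.MULTILINE)


def parse_temps(sensors_text):
    """Map each temperature label to its reading in degrees Celsius."""
    return {label.strip(): float(value)
            for label, value in TEMP_LINE.findall(sensors_text)}


def log_temps(path="/var/log/temps.log"):  # hypothetical path
    """Append one timestamped line of readings and force it to disk."""
    text = subprocess.check_output(["sensors"]).decode("utf-8", "replace")
    temps = parse_temps(text)
    readings = " ".join("{}={:.1f}".format(name, value)
                        for name, value in sorted(temps.items()))
    with open(path, "a") as log:
        log.write(time.strftime("%Y-%m-%d %H:%M:%S") + " " + readings + "\n")
        log.flush()
        os.fsync(log.fileno())
```

Calling log_temps() from a cron job every five minutes would give a history to compare against the time of each lockup.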

> 2. PSU fan -- while failing will cause MOSFET/cap/etc overheating which
> can cause "random" lockups, reboots, and other odd behavior
>
> 3. PSU itself -- failed fan can permanently damage MOSFETS/caps/etc
> Even with a good fan, PSU components can fail with age.

PSU is near the top of my list. I will be replacing it soon barring some new discovery.

> 4. Removable media drives -- floppy/CD/DVD-ROM can fail in odd ways
> sending spurious high voltage signals or shorting wires, locking up
> the motherboard, or causing random reboots. Disconnect their
> data cable and power leads and run without them.

These were already disconnected.

> 5. The motherboard. Even with good cooling over the life of a machine
> the motherboard can still simply fail. You may not be able to find
> bulged caps nor burn marks on VRMs, no visible signs of failure.

It could be that the motherboard is at the end. I've had it for eight years and it was used when I got it. My only issue is that this didn't happen until I upgraded to wheezy. If this had started happening before the upgrade or a week or two after the upgrade I would more readily suspect a hardware failure.

> Point in fact: I had a Biostar Socket A nForce2 400 motherboard
> w/Athlon XP 2500 simply give up the ghost in 2011 in a similar
> manner. It locked up a few times over a period of a week or so,
> then simply wouldn't post.
>
> I built that machine in Aug 2003 and it lasted 8 years. I started
> with two 92x25mm Panaflow case fans, plus the 80x25 PSU fan.
> I replaced the PSU fan with an NMB boxer, and the case fans with
> two Nidec Beta Vs, twice during the life of the box. All of the
> fans were fully functional at the time of replacement. This was
> proactive maintenance. The box had 110 CFM of properly directed
> airflow during its lifespan. Compare this to the ~30 CFM of a
> quiet Dell, HP, or IBM machine. Anyone who knows hardware knows
> that these are all top shelf 12VDC fans. The PSU is still running,
> in another box, as are the two DIMMS and the CPU. The motherboard
> simply gave up the ghost after 8 years of 24x7 operation. Let's
> hope that isn't the case here.
>
>


--
Chris Purves
Visit my blog: http://chris.northfolk.ca



Stan Hoeppner

Jun 19, 2013, 11:00:01 AM
On 6/19/2013 7:52 AM, Chris Purves wrote:

> CPU fan is about three years old and while trying to debug this previously I was running 'sensors' in a cron-job every five minutes and the CPU temp never exceeded 60 C.

60C seems high for an Athlon especially if idle. Given that you're
apparently near Moncton where recent daytime temps have been less than
21C, that's really high. Are you sure you have sensors.conf set up
correctly for that CPU, and I'll guess a Winbond w83782d? Did you maybe
apply too much TIM when you replaced the HSF a few years back? Did you
use decent TIM? If you used silicone-based TIM that might explain the
high temp as well. It's more of a thermal insulator than a conductor.

> PSU is near the top of my list. I will be replacing it soon barring some new discovery.

Test it first if you have a multitester. Digital peak/hold types work
best as you can easily see the swing. Make sure you test at idle and at
load. Load is where you'll see the big voltage drop, if indeed the PSU
is going south or is already there. burnP6 is great for this.

> It could be that the motherboard is at the end. I've had it for eight years and it was used when I got it. My only issue is that this didn't happen until I upgraded to wheezy. If this had started happening before the upgrade or a week or two after the upgrade I would more readily suspect a hardware failure.

It's very likely simply coincidence. Just in case, turn ACPI off in the
system BIOS. Nothing else kernel related could cause lockups like this,
that I'm aware of.

Also, for shits and giggles remove the chassis reset switch 2 pin
connector from the mobo header, along with the PC speaker. Unplug
everything that's not necessary for the machine to run.

Oh, one last thing. It's a PS/2 KB. Swap it with another one. I've
seen DIN5 and PS/2 KBs go bad and cause this type of lockup behavior.
If you're using a mechanical KVM take it out of the loop as they can
cause the same behavior on occasion.

--
Stan



Chris Purves

Jul 3, 2013, 3:10:03 PM
On 2013-06-18 14:03, Jape Person wrote:
> On 06/18/2013 10:31 AM, Chris Purves wrote:
>> On 2013-06-18 10:20, Rob Owens wrote:
>>> ----- Original Message -----
>>>> From: "Chris Purves" <ch...@northfolk.ca>
>>>> To: debia...@lists.debian.org
>>>> Sent: Tuesday, June 18, 2013 8:59:07 AM
>>>> Subject: systems hangs every few days
>>>>
>>>> After upgrading to wheezy, I get a system hang every one or two days
>>>> where the system becomes completely unresponsive and I need do a
>>>> cold boot.
>>>>
>>> That sounds a lot like bad RAM. Run MEMTEST86 on it. I'm not sure if the debian install cds contain that, but Knoppix usually has that as a boot option. The full test could take more than a day. If you're lucky, it'll find an error in the first 10 minutes.
>>>
>>
>> I ran memtest86 already for about 20 minutes without error. (I forgot to mention that in original post). I will run it overnight tonight to see if anything else pops up.
>>
>> I am skeptical that it's bad RAM since the problem occurred after upgrading and I would have expected some kernel panic errors in the logs or something like that, although I suppose bad RAM can manifest itself in different ways.
>
> One thing I noted in your first message, Chris, was that it usually happens when
> a cron job is running. I immediately thought of two possibilities:
>
> 1. One or more of the cron jobs are running a process that is "tickling" a
> driver that doesn't work as well with the newer kernel as it did with the
> previous one.
>
> 2. A scheduled job is doing something CPU-intensive that could be making the
> system overheat. (I suppose load caused by a given job might be worse under
> wheezy than it was under squeeze.)
>


I looked more closely at the syslog just before the system hangs. Not always, but often, the last line is a cron job calling a script I wrote that uses phantomjs to log into the web GUI of my router to see if the IP address has changed. I started running the script in a loop, and twice the system froze up after about 300 iterations. I removed that script and have since reached an uptime of 8 days.
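
A rough sketch of that kind of loop test, not the script actually used. The command passed in is whatever you suspect (the path in the usage note is hypothetical). The iteration number is synced to disk before each run, so after a hard freeze the progress file still shows how far the loop got.

```python
import os
import subprocess


def run_in_loop(command, iterations, progress_path):
    """Run command repeatedly, syncing the iteration count to disk first."""
    completed = 0
    with open(progress_path, "a") as progress:
        for i in range(1, iterations + 1):
            progress.write("starting iteration {}\n".format(i))
            progress.flush()
            os.fsync(progress.fileno())  # survive a freeze during this run
            status = subprocess.call(command)
            progress.write("iteration {} exited with {}\n".format(i, status))
            completed = i
    return completed
```

For example, run_in_loop(["/usr/local/bin/check-router-ip"], 1000, "/root/loop-progress.log") would try the suspect script up to 1000 times; if the last line in the file after a cold boot is "starting iteration N", the freeze happened during run N.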

Interestingly, phantomjs is one of two packages that couldn't be upgraded to wheezy. I'm still using the squeeze version. I have no idea how that could be the cause.

I'll report back in a few weeks to confirm if things are still running properly.


--
Chris Purves
Visit my blog: http://chris.northfolk.ca

"I can calculate the motion of heavenly bodies, but not the madness of people." - Sir Isaac Newton

