Detecting I/O error and Halting System

zine el abidine Hamid

27 March 2006, 10:00:11
Hi everybody,

I have an I/O error that occurs on servers based on a
VIA VT82C686 chipset, and I have to prevent or stop the
error. I spent a lot of time searching for a solution
but didn't find anything, so I want to write a module
that will monitor the HDD and halt the system after
sending a mail.

I have read a lot of documents about the kernel and
about writing modules, but I don't know how to
start...? Help, please.

I'm not closed to other solutions (like smartd, or
writing classical programs).

Best regards.

Zine

PS: these errors are due to the VIA chipset; does
anyone know what is happening...?


Feb 12 04:46:03 porte_de_clignancourt_nds_b kernel: hda: timeout waiting for DMA
Feb 12 04:46:06 alesia_nds_b ucd-snmp[812]: Connection from 104.25.3.11
Feb 12 04:46:23 porte_de_clignancourt_nds_b kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Feb 12 04:46:23 porte_de_clignancourt_nds_b kernel: hda: status timeout: status=0xd0 { Busy }
(the disk adapter reports a DMA busy status)
Feb 12 04:46:23 porte_de_clignancourt_nds_b kernel: hda: drive not ready for command
Feb 12 04:46:23 porte_de_clignancourt_nds_b ucd-snmp[813]: Connection from 104.1.3.11
Feb 12 04:46:23 porte_de_clignancourt_nds_b ucd-snmp[813]: Connection from 104.1.3.11
Feb 12 04:46:23 porte_de_clignancourt_nds_b last message repeated 3 times
Feb 12 04:46:23 porte_de_clignancourt_nds_b kernel: ide0: reset: success
Feb 12 10:22:38 porte_de_clignancourt_nds_b kernel: hda: timeout waiting for DMA
Feb 12 10:24:46 porte_de_clignancourt_nds_b kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Feb 12 10:24:46 porte_de_clignancourt_nds_b kernel: hda: status timeout: status=0xd0 { Busy }
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: drive not ready for command
Feb 12 10:24:47 porte_de_clignancourt_nds_b ucd-snmp[813]: Connection from 104.1.3.11
Feb 12 10:24:47 porte_de_clignancourt_nds_b last message repeated 4 times
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: ide0: reset timed-out, status=0x80
(the first reset of ide0 failed)
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: status timeout: status=0x80 { Busy }
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: drive not ready for command
Feb 12 10:24:47 porte_de_clignancourt_nds_b ucd-snmp[813]: Connection from 104.1.3.11
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: ide0: reset: success

Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: irq timeout: status=0xd0 { Busy }

Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: DMA disabled

Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: ide0: reset timed-out, status=0x80
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: status timeout: status=0x80 { Busy }
(another reset failure)
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: hda: drive not ready for command
Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel: ide0: reset: success

Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: ide0: reset timed-out, status=0x80
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: hda: status timeout: status=0x80 { Busy }
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: hda: drive not ready for command
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: ide0: reset timed-out, status=0x80
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: end_request: I/O error, dev 03:02 (hda), sector 102263
Feb 12 13:45:38 porte_de_clignancourt_nds_b syslogd: /var/log/maillog: Input/output error
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: end_request: I/O error, dev 03:02 (hda), sector 110720
Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel: end_request: I/O error, dev 03:02 (hda), sector 110728
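
As an illustration of the "classical program" route, a user-space script
could scan the kernel log for the messages shown above instead of doing
the work in a module. The following is only a minimal sketch: the function
names, the warning/fatal split and the threshold of three I/O errors are
assumptions made for this example, not something taken from an existing
tool.

```python
import re

# Message fragments copied from the log above. Treating them as
# "warning" versus "fatal" is an assumption of this sketch.
WARNING_PATTERNS = [
    re.compile(r"hd[a-z]: timeout waiting for DMA"),
    re.compile(r"hd[a-z]: status timeout"),
    re.compile(r"ide\d+: reset timed-out"),
]
FATAL_PATTERN = re.compile(
    r"end_request: I/O error, dev \S+ \((?P<dev>hd[a-z])\), sector (?P<sector>\d+)"
)


def classify(line):
    """Return ("fatal", dev, sector), ("warning", None, None) or None."""
    match = FATAL_PATTERN.search(line)
    if match:
        return ("fatal", match.group("dev"), int(match.group("sector")))
    if any(pattern.search(line) for pattern in WARNING_PATTERNS):
        return ("warning", None, None)
    return None


def should_halt(lines, max_fatal=3):
    """True once at least max_fatal I/O errors appear (hypothetical threshold)."""
    fatal = 0
    for line in lines:
        result = classify(line)
        if result is not None and result[0] == "fatal":
            fatal += 1
    return fatal >= max_fatal
```

The caller would still have to decide how to send the mail and halt the
machine. Note that once the disk is gone, the log file itself may no longer
be writable (see the syslogd error above), so reading the kernel ring buffer
rather than a file on the failing disk may be the safer source.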




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

linux-os (Dick Johnson)

27 March 2006, 10:30:26

Maybe you should just fix the problem rather than attempting
to work around it! The problem is that your system has
difficulty communicating with a hard disk. This is caused
by one of:

(1) Bad hard disk.
(2) Bad cable.
(3) Improper configuration of one or more disks.

Since a reset timed out, it is likely that one of the disks
that share the same cable is defective, not necessarily /dev/hda
if you have another drive on the same cable. If you have only
one drive per cable (or only one drive), it is likely that
/dev/hda is (or is becoming) defective. Note that the disk can
fail if it gets too hot.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.42 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.

Alan Cox

27 March 2006, 11:50:26
On Mon, 2006-03-27 at 16:55 +0200, zine el abidine Hamid wrote:
> hda: status timeout: status=0xd0 { Busy }
> (the disk adapter reports a DMA busy status)

If I'm reading the translation right then your hard disk decided
it was busy and then never came back

> Feb 12 04:46:23 porte_de_clignancourt_nds_b kernel:
> ide0: reset: success

So the IDE layer tried to reset it

> Feb 12 10:22:38 porte_de_clignancourt_nds_b kernel:
> hda: timeout waiting for DMA

Which didn't help

> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> ide0: reset: success

Still trying



> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> hda: irq timeout: status=0xd0 { Busy }
>
> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> hda: DMA disabled

We gave up on DMA to see if PIO would help


>
>
> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> ide0: reset timed-out, status=0x80
> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> hda: status timeout: status=0x80 { Busy }
> (another reset failure)
> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> hda: drive not ready for command
> Feb 12 10:24:47 porte_de_clignancourt_nds_b kernel:
> ide0: reset: success

And reset..


> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> hda: status timeout: status=0x80 { Busy }
> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> hda: drive not ready for command
> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> ide0: reset timed-out, status=0x80
> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> end_request: I/O error, dev 03:02 (hda), sector 102263
> Feb 12 13:45:38 porte_de_clignancourt_nds_b syslogd:
> /var/log/maillog: Input/output error
> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> end_request: I/O error, dev 03:02 (hda), sector 110720
> Feb 12 13:45:38 porte_de_clignancourt_nds_b kernel:
> end_request: I/O error, dev 03:02 (hda), sector 110728

Eventually we give up.


First thing to check would be the disk and the temperature, then the
cabling. In particular make sure the *long* part of the cable is between
the drive and the controller.
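
For reference, the status values in the log (0xd0 and 0x80) can be decoded
by hand. The following is a minimal sketch, assuming the standard ATA status
register layout; the bit names are chosen for this example. When the busy
bit (0x80) is set, the remaining bits are not meaningful, which is
consistent with the log showing only { Busy } for both values.

```python
# ATA status register bits, most significant first.
# The names are labels chosen for this sketch.
BUSY = 0x80
OTHER_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the list of flag names set in an ATA status byte.

    While the busy bit is set the other bits are not valid,
    so only "Busy" is reported in that case.
    """
    if value & BUSY:
        return ["Busy"]
    return [name for mask, name in OTHER_BITS if value & mask]


if __name__ == "__main__":
    for status in (0xD0, 0x80, 0x50):
        print(hex(status), decode_status(status))
```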

zine el abidine Hamid

28 March 2006, 10:10:12
First of all, thank you for your analysis.

I don't think that it's an HDD problem or a cable
problem, because the servers are new. We have tried
different HDDs (Seagate, Maxtor) but that did not
help.
It may be a temperature problem, but we have run a
lot of tests under harsh conditions (high temperature)
successfully...

We think that the problem comes from the VIA chipset
VT82c686 (that is also the opinion of Dick Johnson
(linux-os), who advised me to try UDMA33 instead of
UDMA66).

How can I determine the problem?

I want to add that the HDD seems to be disconnected
(the BIOS can't find any drive to boot from) after a
simple reset. We must power the servers off to get
them working again.
However, it takes a long time (4 months or more)
before the HDD fails. I want to work around this by
writing a module that will supervise the HDD. I know
how to write a module (I used the lkmpg guide,
http://www.tldp.org/LDP/lkmpg/), but how can I
shut down Linux from inside a module...?

best regards.

Zine.


--- Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:



Alan Cox

28 March 2006, 10:30:19
On Tue, 2006-03-28 at 17:07 +0200, zine el abidine Hamid wrote:
> I don't think that it's a HDD problem nor a cable
> problem because the servers are new. We have tried

New. That's another word for "untested" I believe 8)

> How can I determine the problem?

I would consult the hardware vendor

> I want to add that the HDD seems to be disconnected
> (the BIOS can't find any drive for boot) after a
> simple reset. We must switch off the servers to get
> them work again.

That strongly indicates a hardware problem.

> (http://www.tldp.org/LDP/lkmpg/) but how can I
> shutdown Linux from inside a module...?

See the softdog driver for an example.

Gene Heskett

28 March 2006, 13:00:20

I take it that you are aware of a drive monitoring utility called
smartd? By querying the drive after a new powerup, you may be able to
extract useful information about its health.

--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

zine el abidine Hamid

30 March 2006, 03:20:10

I know about smartd, but the HDDs are OK. When the
problem happens, we have to switch the servers off and
on, and then they run without any errors; after that
the servers work as if nothing had happened...

--- Gene Heskett <gene.h...@verizon.net> wrote:



Bernd Eckenfels

30 March 2006, 07:30:18
Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
> See the softdog driver for an example.

The usermode agent (watchdog(8)) can, btw, monitor the availability of a
file, so there is no need to write a module. Maybe this feature was added
after somebody took that code over from you? :)

watchdog.conf(5) says:

# file = <filename> Set file name for file mode. This option can be
# given as often as you like to check several files.

Bernd

zine el abidine Hamid

2 April 2006, 18:40:18
Hi Dick,

Excuse me for the silence.

Can you tell me why the drive stops responding? Where
does the problem start? As I said before, we tried
different drives and that did not help us.

On Friday, the manufacturer came to a meeting in our
offices and affirmed that their hardware is OK.

Have you read any articles that point out failures
of the VIA VT82c686 south bridge or other chipsets
of the VIA PN133 architecture?
(http://www.via.com.tw/en/products/chipsets/legacy/pn133/)

If it can help, here is the hardware description:


00:00.0 Host bridge: VIA Technologies, Inc. VT8605 [ProSavage PM133] (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8605 [PM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. USB (rev 1a)
00:07.3 USB Controller: VIA Technologies, Inc. USB (rev 1a)
00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:07.5 Multimedia audio controller: VIA Technologies, Inc. VT82C686 AC97 Audio Controller (rev 50)
00:10.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
01:00.0 VGA compatible controller: S3 Inc. ProSavage PM133 (rev 02)
[root@Porte_de_Clignancourt_nds_b root]# lspci -n
00:00.0 Class 0600: 1106:0605 (rev 01)
00:01.0 Class 0604: 1106:8605
00:07.0 Class 0601: 1106:0686 (rev 40)
00:07.1 Class 0101: 1106:0571 (rev 06)
00:07.2 Class 0c03: 1106:3038 (rev 1a)
00:07.3 Class 0c03: 1106:3038 (rev 1a)
00:07.4 Class 0680: 1106:3057 (rev 40)
00:07.5 Class 0401: 1106:3058 (rev 50)
00:10.0 Class 0200: 8086:1229 (rev 08)
01:00.0 Class 0300: 5333:8a25 (rev 02)

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 998.393
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1992.29


zine el abidine Hamid

2 April 2006, 19:00:19

watchdog isn't helpful, because some parts of the
drive (some directories and some commands) seem to
be accessible while most aren't. I think that the
files that seem to be readable are in the cache.
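
One way a user-space check could avoid being fooled by cached data is to
probe with a small write that is forced out to the device, rather than a
read that the cache may satisfy. The following is a minimal sketch under
that assumption; the function name and the probe file prefix are
hypothetical.

```python
import os
import tempfile


def disk_write_ok(directory, payload=b"probe\n"):
    """Write a small file in `directory` and force it to the device.

    Returns True if the write and fsync succeed, False if any step
    raises OSError (for example an Input/output error).
    """
    path = None
    try:
        fd, path = tempfile.mkstemp(prefix=".diskprobe-", dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to push the data to the device
        finally:
            os.close(fd)
        return True
    except OSError:
        return False
    finally:
        # Best-effort cleanup of the probe file.
        if path is not None:
            try:
                os.unlink(path)
            except OSError:
                pass
```

On a drive that has stopped responding, the fsync call may block for a long
time instead of failing quickly, so a real monitor would probably run the
probe in a child process and treat a timeout as a failure too.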


--- Bernd Eckenfels <be-n...@lina.inka.de> wrote:

