[SLUG] detecting hard drive failure?


Voytek Eymont

Nov 10, 2010, 4:21:30 PM
to sl...@slug.org.au
I have a brand new QNAP NAS with 4 SATA HDs as 'Striping Disk Volume: Drive
1 2 3 4', installed a couple of months ago

when first installed, using the QNAP web i/f, I ran SMART tests; all were
100%, etc

yesterday, it seems HD3 suffered a total failure; it says:

---------------------
Summary HD3
Hard disk does not exist.
---------------------
(though, LCD panel says disk 4: "HD4 ejected")

I can ssh to the NAS:

- what sort of tests or whatever can I run before I pull the unit down ?

- what sort of utility can I run to 'detect and notify' should such a
failure occur again ?

# uname -a
Linux NAS01 2.6.33.2 #1 SMP Tue Sep 28 00:54:34 CST 2010 i686 unknown

# df -h
Filesystem Size Used Available Use% Mounted on
/dev/ram 124.0M 109.7M 14.3M 88% /
tmpfs 32.0M 92.0k 31.9M 0% /tmp
/dev/sda4 310.0M 160.5M 149.5M 52% /mnt/ext
/dev/md9 509.5M 41.3M 468.2M 8% /mnt/HDA_ROOT
/dev/md0 3.6T 2.5T 1.1T 69% /share/MD0_DATA

--
Voytek

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html

Ben Donohue

Nov 10, 2010, 4:45:13 PM
to sl...@slug.org.au
Hi,

Sometimes you can get a loose connection. If it is in a raid set you
should be able to pull it out and put it back in again, and it will
automatically rebuild into the raid set (depending on the raid
controller...).
It might just need the connectors reseated. First thing I'd try...

Thanks,
Ben Donohue
dono...@icafe.com.au

Andrew Cowie

Nov 10, 2010, 5:21:27 PM
to sl...@slug.org.au
On Thu, 2010-11-11 at 08:21 +1100, Voytek Eymont wrote:

> I can ssh to the NAS:
>
> - what sort of tests or whatever can I run before I pull the unit down ?

Does `mdadm` work?

# mdadm --detail /dev/md0

The way QNap lays out their filesystems is slightly over the top, though,
so you have to be careful interpreting the results [one of the reasons we
switched to running a full Linux distro on them rather than QNap's].
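If mdadm works, it can also cover the "detect and notify" part itself: its
monitor mode watches arrays and mails on events. A sketch of the init-script
line, assuming a working mailer on the NAS (the address is a placeholder);
note that md0 here is raid0, so a failed member takes the whole array
offline rather than degrading it:

```shell
# Start mdadm's monitor daemon (e.g. from an init script); it mails on
# events such as Fail and DeviceDisappeared. Address is a placeholder.
mdadm --monitor --scan --daemonise --mail=admin@example.com
```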

AfC
Sydney

--
Andrew Frederick Cowie

Operational Dynamics is an operations and engineering consultancy
focusing on IT strategy, organizational architecture, systems
review, and effective procedures for change management: enabling
successful deployment of mission critical information technology in
enterprises, worldwide.

http://www.operationaldynamics.com/

Sydney New York Toronto London


Voytek Eymont

Nov 10, 2010, 6:04:26 PM
to sl...@slug.org.au

On Thu, November 11, 2010 9:21 am, Andrew Cowie wrote:
> On Thu, 2010-11-11 at 08:21 +1100, Voytek Eymont wrote:

> # mdadm --detail /dev/md0

Andrew, thanks

[/] # mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sat Jun 19 04:35:02 2010
Raid Level : raid0
Array Size : 3900774400 (3720.07 GiB 3994.39 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sat Jun 19 04:35:02 2010
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Chunk Size : 64K

UUID : 79e23cd2:b3f9618d:58a8936b:5e0d814b
Events : 0.1

Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
3 8 51 3 active sync /dev/sdd3
[/] #


> although the way QNap lays out their filesystems is slightly over the top,
> so you have to be careful interpreting the results [one of the reasons we
> switched to running a full Linux distro on them rather than QNap's]

what distro do you run ?
it seems like a nice system, though, I'm coming from zero prior experience

Richard Ibbotson

Nov 11, 2010, 6:53:54 AM
to sl...@slug.org.au
> - what sort of tests or whatever can I run before I pull the
> unit down ?
>
> - what sort of utility can I run to 'detect and notify' should
> such failure occurs again ?


You might try these... one or more of them should tell you
something:

http://www.meiring.org.uk/sheflug/mailarchive/2010/10/msg00016.html

--
Richard
http://www.sheflug.org.uk

Voytek Eymont

Nov 11, 2010, 11:45:17 PM
to sl...@slug.org.au

On Thu, November 11, 2010 8:58 am, Richard Ibbotson wrote:
>>> - what sort of tests or whatever can I run before I pull the unit
>>> down ?
>>> - what sort of utility can I run to 'detect and notify' should
>>> such failure occurs again ?
>
> You might try these... one of more of these should tell you
> something........

Richard, thanks

in this case, there is no real need to recover any data, as 'nothing
happened';

this is rolling storage for cameras, so we only lost data that would soon
have been overwritten anyhow

but, what can I use to be notified should such a failure occur again ?

--
Voytek

Richard Ibbotson

Nov 12, 2010, 6:25:04 AM
to sl...@slug.org.au
On Friday 12 November 2010 04:45:17 Voytek Eymont wrote:
> this is a rolling storage for cameras, so we lost data that would
> soon be overwritten, anyhow
>
> but, what can I use to be notified should such a failure occur
> again ?

Can't think of anything other than occasional checks with the tools
I've shown you. Other than that, a bash script <scratches head>.
But I can't remember the script, and looking through my bookshelf doesn't
reveal anything either. Something that does a umount, then an fsck, and
then drops a log or sends the output to an e-mail address? These
might or might not be helpful:

http://www.cyberciti.biz/tips/linux-find-out-if-harddisk-failing.html
http://www.linuxjournal.com/magazine/monitoring-hard-disks-smart

Hope this helps.

--
Richard
http://sleepypenguin.homelinux.org/blog/

pe...@chubb.wattle.id.au

Nov 12, 2010, 4:32:13 PM
to richard....@gmail.com, sl...@slug.org.au
>>>>> "Richard" == Richard Ibbotson <richard....@googlemail.com> writes:

>> > - what sort of tests or whatever can I run before I pull the unit
>> > down ?
>> >
>> > - what sort of utility can I run to 'detect and notify' should
>> > such failure occurs again ?

Can you use SMART?

Also, the main problem I've seen has been random bit-flips in the RAM
before writing to disk, to solve which you need end-to-end CRCs.

Dr Peter Chubb peter DOT chubb AT nicta.com.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
All things shall perish from under the sky/Music alone shall live, never to die

Richard Ibbotson

Nov 12, 2010, 5:13:24 PM
to sl...@slug.org.au
> Can you use SMART?

Thinking was a bit fuzzy this morning. I was thinking that you might
be able to write a bash script using smartmontools. Any bash script would
have to have something like this in it...

smartctl -d ata -H /dev/sdb

or whatever would be relevant. Also some part of the script would
send out a text file by e-mail or dump into /var/log with a cron job.
Of course, I'm supposing that Voytek is using a Linux or UNIX
operating system.
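A minimal sketch of such a script, assuming smartctl and a working `mail`
command are available on the box (the device list and address are
placeholders); the health check is split into a function so the parsing
part is easy to test on its own:

```shell
#!/bin/sh
# Classify `smartctl -H` output: prints "ok" if the overall self-assessment
# passed, "fail" otherwise (FAILED, missing drive, smartctl error, ...).
check_health() {
    case "$1" in
        *PASSED*) echo ok ;;
        *)        echo fail ;;
    esac
}

# Main loop, intended to be run from cron. Devices and address are
# placeholders for this sketch -- adjust for the actual box.
run_checks() {
    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        out=$(smartctl -d ata -H "$d" 2>&1)
        if [ "$(check_health "$out")" = fail ]; then
            echo "$out" | mail -s "SMART alert: $d" admin@example.com
        fi
    done
}

# Uncomment to run the checks when executed directly:
# run_checks
```

Dropped into /etc/cron.daily (or a crontab line) this would mail only when
a drive stops reporting PASSED, which covers the "detect and notify" ask.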

--
Richard
http://sleepypenguin.homelinux.org/blog/

David Balnaves

Nov 13, 2010, 1:52:03 AM
to sl...@slug.org.au
Hi Voytek,

I'm not really sure what the best indicators are of a failing hard drive.
I've used SMART on a lot of hard drives; I've seen undocumented SMART
values, and even hard drives that function fine for a number of years while
SMART reports they are "FAILING NOW". I've also seen some drives enter a
state where they won't allow further SMART tests (on/offline) to be run or
aborted. This has led me to believe that SMART as an indicator needs to be
considered on a per-model basis and run carefully within the capabilities
of the drive. The whole process has given me more questions than answers.

I try to detect a failure by monitoring huge changes in the SMART
attributes. I've configured munin to monitor the SMART attributes; it
wouldn't be too hard to change the plugin to monitor these values on your
NAS (I imagine you can ssh/telnet to it). You will notice some variance in
things like temperature and ECC, but unless they start behaving erratically
I wouldn't worry.
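A rough sketch of that "flag huge changes" idea without munin: keep a daily
snapshot of the raw SMART attribute values and compare it against the
previous one. The attribute names, file paths, and threshold below are
illustrative, not from the thread:

```shell
#!/bin/sh
# Compare two snapshots of "Attribute_Name raw_value" lines and print any
# attribute whose raw value grew by more than the threshold.
flag_jumps() {   # usage: flag_jumps old_snapshot new_snapshot threshold
    awk -v t="$3" '
        NR == FNR { old[$1] = $2; next }                 # 1st file: remember
        ($1 in old) && ($2 - old[$1] > t) { print $1 }   # 2nd file: compare
    ' "$1" "$2"
}
```

Snapshots in that format could come from something like
`smartctl -A /dev/sda | awk '{print $2, $10}'` in the same cron job
(assuming smartctl is installable on the box).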

Hope this helps in 'detecting and notifying' potential failures.

Best Regards,
David Balnaves

Voytek Eymont

Nov 13, 2010, 8:39:39 PM
to sl...@slug.org.au

On Sat, November 13, 2010 8:32 am, pe...@chubb.wattle.id.au wrote:

>>>> - what sort of tests or whatever can I run before I pull the unit
>>>> down ?
>>>>
>>>> - what sort of utility can I run to 'detect and notify' should
>>>> such failure occurs again ?

> Can you use SMART?
> Also, the main problem I've seen has been random bit-flips in the RAM
> before writing to disk, to solve which you need end-to-end CRCs.

Peter, thanks

well, the NAS box has 'smart' in its web admin config
but I'm not sure how much access I get to it
when I first got it, I enabled daily or weekly smart test
all the tests were 100%
I then disabled the daily test as I wanted to see if that was the spike in
my cacti cpu graph
then one hard drive simply disappeared

but according to ps ax, smartd is running

I still haven't pulled the NAS down; I can log in to the web admin, and
everything is supposedly OK; under "HDD SMART":

----------------------------------------
Monitor hard disk health, temperature, and usage status by the hard disk
S.M.A.R.T. mechanism.
Select Hard Disk [3]
* Summary
* Hard Disk Information
* SMART Information
* Test
* Settings
Summary


Hard disk does not exist.

---------------------------------------

hmmmmm, above is what I *see*

but, when I copy/paste, it also has:

======================================
Loading data, please wait...
This is an error message!!!

Monitor hard disk health, temperature, and usage status by the hard disk
S.M.A.R.T. mechanism.
.....
======================================

hmmm, is this the error message that I'm supposed to see ?

it doesn't show in FF/Chrome/IE

in ssh I get:

# uname -a
Linux NAS01 2.6.33.2 #1 SMP Tue Sep 28 00:54:34 CST 2010 i686 unknown

# ps ax | grep smart
2071 admin 1528 S /sbin/qsmartd -d
18713 admin 580 S grep smart

# smartctl
-sh: smartctl: command not found

# qsmartd --help
Usage: qsmartd [-d] [-o log_file] [-v] [-h]

Smart monitor daemon

Options:
-d --daemon Running on daemon
-o log_file --output log_file Save debug message in log_file
-v --versbose Verbose mode
-h --help Help
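since smartctl isn't on the box, one crude fallback would be a cron job
that just notices when a disk vanishes, the way HD3 did. A sketch (the
expected sd[a-d] device names and the mail address are assumptions):

```shell
#!/bin/sh
# Count how many of the expected sd[a-d] block devices appear in a listing.
count_disks() {
    printf '%s\n' "$@" | grep -c '^sd[a-d]$'
}

# Cron body: compare against the 4 drives the NAS should have.
# The mail address is a placeholder.
check_present() {
    n=$(count_disks $(ls /sys/block))
    if [ "$n" -lt 4 ]; then
        echo "only $n of 4 disks visible in /sys/block" \
            | mail -s "NAS disk alert" admin@example.com
    fi
}
```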

--
Voytek

Voytek Eymont

Nov 13, 2010, 8:43:49 PM
to sl...@slug.org.au

On Sat, November 13, 2010 9:13 am, Richard Ibbotson wrote:
>> Can you use SMART?
>>
>
> Thinking was a bit fuzzy this morning. I was thinking that you might
> be able to write a bash script using smartmon. Any bash script would have
> to have something like this in it...
>
> smartctl -d ata -H /dev/sdb
>
> or whatever would be relevant. Also some part of the script would send
> out a text file by e-mail or dump into /var/log with a cron job. Of
> course, I'm supposing that Voytek is using a Linux or UNIX operating
> system.

Richard, thanks

the NAS runs Linux, though I'm not sure how much I'm allowed to do to it;
I can ssh to it, but it doesn't seem to have 'smartctl', though it runs
'smartd'

# uname -a
Linux NAS01 2.6.33.2 #1 SMP Tue Sep 28 00:54:34 CST 2010 i686 unknown

# smartctl
-sh: smartctl: command not found

# ps ax | grep smart
2071 admin 1528 S /sbin/qsmartd -d
24006 admin 580 S grep smart


--
Voytek

Voytek Eymont

Nov 13, 2010, 8:57:59 PM
to sl...@slug.org.au

On Sat, November 13, 2010 5:52 pm, David Balnaves wrote:

> I'm not really sure what the best indicators are of a failing hard drive.
> I've used smart on a lot of hard drives; I've seen undocumented smart
> values and even hard drives function fine for a number of years when smart
> reports they are "FAILING NOW'. I've also seen some drives enter a
> state where they wont allow further smart tests (on/offline) to be run or
> aborted. This has lead me to believe that smart as an indicator needs to
> be considered on a per model basis and run carefully within the
> capabilities of the drive. The whole process has given me more questions
> than answers.
>
> I try to detect a failure by monitoring huge changes in the smart
> attributes. I've configured munin to monitor the smart attributes; It
> wouldn't be too hard to change the plugin to monitor these values on your
> NAS (I imagine you can ssh/telnet to it). You will notice some variance
> in things like temperature and ECC, but unless they start behaving
> erratically then I wouldn't worry.
>
> Hope this helps in 'detecting and notifying' potential failures.

David, thanks

yes, I can ssh to it

I'm not very familiar with the raid utilities (beyond knowing what the
acronym stands for...)

but I get:

# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sat Jun 19 04:35:02 2010
Raid Level : raid0
Array Size : 3900774400 (3720.07 GiB 3994.39 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sat Jun 19 04:35:02 2010
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Chunk Size : 64K

UUID : 79e23cd2:b3f9618d:58a8936b:5e0d814b
Events : 0.1

Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
3 8 51 3 active sync /dev/sdd3


# mount
/proc on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
sysfs on /sys type sysfs (rw)
tmpfs on /tmp type tmpfs (rw,size=32M)
none on /proc/bus/usb type usbfs (rw)
/dev/sda4 on /mnt/ext type ext3 (rw)
/dev/md9 on /mnt/HDA_ROOT type ext3 (rw)
/dev/md0 on /share/MD0_DATA type ext4
(rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,nodelalloc)

# ls /share/MD0_DATA
ls: /share/MD0_DATA/Web: Input/output error
ls: /share/MD0_DATA/Network Recycle Bin: Input/output error
ls: /share/MD0_DATA/lost+found: Input/output error
ls: /share/MD0_DATA/Download: Input/output error
ls: /share/MD0_DATA/aquota.user: Input/output error
ls: /share/MD0_DATA/Multimedia: Input/output error
ls: /share/MD0_DATA/Usb: Input/output error
ls: /share/MD0_DATA/Recordings: Input/output error
ls: /share/MD0_DATA/Public: Input/output error
cameras/

Ben Donohue

Nov 14, 2010, 12:17:43 AM
to sl...@slug.org.au
Hi,
I'm not sure about NAS boxes... but HP raid stores the raid array config
on the disks themselves. Such that you could take out 4 disks of a raid
array and put them in another server and the raid would come up ok. And
this is on a different raid controller.

So... if you have a backup of the data, have you tried to just take out
the disks and put them back in the same NAS box in different places?
Perhaps the connector is faulty. See whether the problem follows the
disk or the problem follows the slot where the disk is.

Thanks,
Ben Donohue
dono...@icafe.com.au

Voytek Eymont

Nov 15, 2010, 4:53:01 PM
to sl...@slug.org.au

On Sun, November 14, 2010 4:17 pm, Ben Donohue wrote:

> I'm not sure about NAS boxes... but HP raid stores the raid array config
> on the disks themselves. Such that you could take out 4 disks of a raid
> array and put them in another server and the raid would come up ok. And
> this is on a different raid controller.
>
> So... if you have a backup of the data, have you tried to just take out
> the disks and put them back in the same NAS box in different places?
> Perhaps the connector is faulty. See whether the problem follows the
> disk or the problem follows the slot where the disk is.

Ben,

thanks

I pulled the unit down yesterday, pulled the drives out, stared hard at
the bare drives for a little while, then pushed them back in (same slots)

started up, it all seemed to work

I wasn't quite sure what to run from console, so used the web i/f to
run smart tests on all drives, all OK

but 'check disk' failed to run on one of several power-ups (I was
also testing UPS shutdown/restart)

so I reformatted the whole volume, and now it seems OK

EXCEPT, the LCD panel has a message about HD4 ejected, and googling for HD
info (WDC WD1002FBYS-02A6B03.0) brought up another QNAP user (RAID5)
reporting the same drive failing after just a few weeks from brand new...

anyhow, I think I'll see how it goes over the next few days

---------
[Strip Disk Volume: Drive 1 2 3 4] The file system is not clean. It is
suggested that you run "check disk".
--------
20:19:02 System 127.0.0.1 localhost [Strip Disk Volume: Drive 1 2 3 4]
Examination failed.
20:04:03 System 127.0.0.1 localhost [Strip Disk Volume: Drive 1 2 3 4]
Start examination.
-------

08:36:07 System 127.0.0.1 localhost [Strip Disk Volume: Drive 1 2 3 4]
Examination completed.
08:26:02 System 127.0.0.1 localhost [Strip Disk Volume: Drive 1 2 3 4]
Start examination.


--
Voytek
