Failed drive in a zpool replaced, but pool is still degraded

Michael

Oct 23, 2012, 5:48:30 PM

Hi,

One drive failed in this zpool today. The spare did not kick in, so I ran:
# zpool replace pool00 c0t50014EE058B642CAd0 c0t50014EE058B6241Dd0

But after the resilver, the mirror pair is still degraded!

Why?

bash-4.1# zpool status -v
  pool: pool00
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1.96T in 7h8m with 0 errors on Tue Oct 23 23:07:37 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        pool00                       DEGRADED     0     0     0
          mirror-0                   DEGRADED     0     0     0
            c0t50014EE0AE0C2D19d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c0t50014EE058B642CAd0  OFFLINE      0     0     0
              c0t50014EE058B6241Dd0  ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c0t50014EE058B67BB9d0    ONLINE       0     0     0
            c0t50014EE003612F7Cd0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c0t50014EE058B6AB26d0    ONLINE       0     0     0
            c0t50014EE00360E486d0    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c0t50014EE2B1B56B85d0    ONLINE       0     0     0
            c0t50014EE2070AB80Dd0    ONLINE       0     0     0
        spares
          c0t50014EE058B6241Dd0      INUSE     currently in use
          c0t50014EE00360A1B8d0      AVAIL
          c0t50014EE25C5FACF0d0      AVAIL
          c0t50014EE2070A8ADAd0      AVAIL
          c0t50014EE2B1B59B97d0      AVAIL

errors: No known data errors

Ian Collins

Oct 23, 2012, 5:52:19 PM

On 10/24/12 10:48, Michael wrote:
> Hi,
>
> One drive failed in this zpool today. The spare did not kick in, so I ran:
> # zpool replace pool00 c0t50014EE058B642CAd0 c0t50014EE058B6241Dd0
>
> But after the resilver, the mirror pair is still degraded!
>
> Why?
>
> bash-4.1# zpool status -v
>   pool: pool00
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: resilvered 1.96T in 7h8m with 0 errors on Tue Oct 23 23:07:37 2012
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         pool00                       DEGRADED     0     0     0
>           mirror-0                   DEGRADED     0     0     0
>             c0t50014EE0AE0C2D19d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c0t50014EE058B642CAd0  OFFLINE      0     0     0
>               c0t50014EE058B6241Dd0  ONLINE       0     0     0

If you detach the offline device (c0t50014EE058B642CAd0) from the pool,
the state will return to ONLINE.
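
For reference, the detach step could look like the sketch below. The pool
and device names are taken from the status output above. The `run` wrapper
and the `DRY_RUN` variable are hypothetical helpers, not part of ZFS, added
so that the commands are only printed by default.

```shell
#!/bin/sh
# Sketch only: pool and device names are copied from the status output above.
POOL=pool00
FAILED=c0t50014EE058B642CAd0

# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

detach_failed() {
    # Detach the offline disk from the spare-1 vdev in mirror-0.
    run zpool detach "$POOL" "$FAILED"
    # Check the pool state again afterwards.
    run zpool status -v "$POOL"
}

detach_failed
```

Setting DRY_RUN=0 runs the commands for real. Once the offline disk is
detached, the resilvered spare should take its place in mirror-0
permanently and drop out of the spares list.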

--
Ian Collins

Michael

Oct 23, 2012, 5:59:09 PM

Hi,

Thanks for the super-prompt reply. Yes, I forgot to detach the failed drive :)

But why did the machine hang when the drive failed? I had to reset the
machine, did not even make it into the ALOM console!

I have each drive connected to a dedicated SAS channel on the HBA, so the
drives should not block the controller, or should they?

The disks are SATA, while SAS is dual channel. Is that an issue?

I am really getting sick of failing SATA drives. Now I have cheap drives;
before, I had Enterprise SATA (: SCSI was so, so much better!

Is replacing all drives with SAS drives the only way?

Ian Collins

Oct 23, 2012, 6:04:07 PM

On 10/24/12 10:59, Michael wrote:
> Hi,
>
> Thanks for the super-prompt reply. Yes, I forgot to detach the failed drive :)
>
> But why did the machine hang when the drive failed? I had to reset the
> machine, did not even make it into the ALOM console!

Which OS version and patch SRU are you running?

> I have each drive connected to a dedicated SAS channel on the HBA, so the
> drives should not block the controller, or should they?

Bugs happen...

> The disks are SATA, while SAS is dual channel. Is that an issue?

No.

> I am really getting sick of failing SATA drives. Now I have cheap drives;
> before, I had Enterprise SATA (: SCSI was so, so much better!
>
> Is replacing all drives with SAS drives the only way?

I have several pools of SATA drives (including a couple of x4540s) that
have been going for years without a failure. Are your drives well cooled?

--
Ian Collins

Michael

Oct 23, 2012, 6:28:40 PM

Hi,

On 10/24/12 12:04 AM, Ian Collins wrote:
> On 10/24/12 10:59, Michael wrote:
>> Hi,
>>
>> Thanks for the super-prompt reply. Yes, I forgot to detach the failed drive :)
>>
>> But why did the machine hang when the drive failed? I had to reset the
>> machine, did not even make it into the ALOM console!
>
> Which OS version and patch SRU are you running?
>

SPARC T2000 with two LSI HBAs, S11 11/11, and CPU_1204

>> I have each drive connected to a dedicated SAS channel on the HBA, so the
>> drives should not block the controller, or should they?
>
> Bugs happen...

:)

>
>> The disks are SATA, while SAS is dual channel. Is that an issue?
>
> No.
>
>> I am really getting sick of failing SATA drives. Now I have cheap drives;
>> before, I had Enterprise SATA (: SCSI was so, so much better!
>>
>> Is replacing all drives with SAS drives the only way?
>
> I have several pools of SATA drives (including a couple of x4540s) that
> have been going for years without a failure. Are your drives well cooled?
>

Yes.
This time the failure occurred during a cp of 300 GB of data on the server.

Do you scrub on a regular basis? I find that it is during heavy load that
issues arise!

/michael

Ian Collins

Oct 23, 2012, 6:43:57 PM

Once every month or two. Some of our systems are under fairly
continuous load, receiving streams and spooling to tape. Neither of the
x4540s has lost a drive in over 3 years of continuous operation, which
included two seismic events with >2G ground acceleration!
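
For reference, a scrub can be started by hand or from cron. The sketch
below assumes the pool name pool00 from earlier in the thread. The `run`
wrapper, the `DRY_RUN` variable and the monthly schedule are illustrative
assumptions, not a description of any particular setup.

```shell
#!/bin/sh
# Sketch only: the pool name comes from this thread; the schedule is assumed.
POOL=pool00

# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

scrub_pool() {
    # Start a scrub, which verifies the checksums of all data in the pool.
    run zpool scrub "$POOL"
    # Show scrub progress and any errors found so far.
    run zpool status -v "$POOL"
}

scrub_pool

# Example root crontab entry for a monthly scrub at 02:00 on the 1st
# (the path /usr/sbin/zpool is an assumption for this system):
#   0 2 1 * * /usr/sbin/zpool scrub pool00
```

Setting DRY_RUN=0 runs the commands for real. Since a scrub adds sustained
read load on every disk, scheduling it outside busy hours seems sensible.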

--
Ian Collins

Michael

Oct 23, 2012, 6:45:44 PM

Hi,

Are these Sun drives, or what model are they?

/michael

Ian Collins

Oct 23, 2012, 6:54:02 PM

Yes, those ones are.

My other systems use Seagate and WD drives, both have been reliable.

--
Ian Collins