Advice on dealing with permanent errors
From: Budric <bud...@gmail.com>
Date: Wed, 30 May 2012 11:03:50 -0700 (PDT)
Local: Wed, May 30 2012 2:03 pm
Subject: Advice on dealing with permanent errors
Hi,

I wanted to replace a drive in a functional RaidZ ("raid 5")
consisting of 3 drives.  The new drive is bigger in size, and I want
to make the pool bigger by replacing all 3 drives eventually.  The
last scrub completed without errors Monday May 28th, so I did not
issue another scrub before starting this on Wednesday.  I removed one
of the drives, started the system, pool was in degraded mode (one
drive couldn't be opened, obviously).  I issued "zpool replace tank
/dev/old_drive /dev/new_drive" and it went on resilvering.

After a few hours of resilvering I see that there are 368 permanent
errors in various files.  Some in a snapshot, some on the pool
itself.  So I think there must have been errors on the 2 remaining
drives, and now that it's trying to recover the information it can't.

I know hindsight is 20/20.  I should have scrubbed the pool before
replacing drives.  I should have left the drive in the pool and
replaced it while it was active, not doing it in degraded mode.  But,
any advice on what I can do now?  I still have the drive I removed.
If I shut down the machine now (while it's still resilvering), put the old
drive in alongside the new drive and restart the computer, what will
happen?  Will the device that could not be opened become part of the pool
again?  Will the pool be in a good state, and will the resilver onto the
bigger drive work?

Thanks for any suggestions.
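
A minimal sketch of the ordering described in hindsight above (scrub first, then replace while the old drive is still attached), assuming the pool name "tank" and the placeholder device paths from the post. `replacement_plan` is a hypothetical helper, not part of any ZFS tooling; it only builds the command lines and never runs them.

```python
def replacement_plan(pool, old_dev, new_dev):
    """Return the zpool command lines, in order, for an in-place disk replacement.

    Nothing is executed here; the caller decides whether and how to run them.
    """
    return [
        # 1. Verify all data while the vdev still has full redundancy.
        ["zpool", "scrub", pool],
        # 2. Confirm the scrub finished and reported no errors.
        ["zpool", "status", "-v", pool],
        # 3. Replace with the old disk still attached, so the pool is not degraded.
        ["zpool", "replace", pool, old_dev, new_dev],
        # 4. Watch resilver progress and error counts.
        ["zpool", "status", "-v", pool],
    ]


if __name__ == "__main__":
    # Device paths are the placeholders used in the post, not real devices.
    for argv in replacement_plan("tank", "/dev/old_drive", "/dev/new_drive"):
        print(" ".join(argv))
```

Step 2 is a manual checkpoint: the replace in step 3 should only be issued once the scrub has actually completed cleanly.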


 
From: Emmanuel Anne <emmanuel.a...@gmail.com>
Date: Wed, 30 May 2012 21:22:37 +0200
Local: Wed, May 30 2012 3:22 pm
Subject: Re: [zfs-fuse] Advice on dealing with permanent errors

368 permanent errors appearing 2 days after a full scrub, just when you
replace one of the drives: it sounds very much like there is a big bug
somewhere, or you are one of the unluckiest people on earth! A
nightmare in this situation. I'm not sure there is much of a solution here.

You could try playing with the options --log-uberblocks / --min-uberblock-txg,
but I'm not sure they would help here: the errors are probably at the pool
level and these options are for the fs level.
Along the same lines: remove the new disk and run a scrub again, but it
will take a long time and it's unlikely to work.
Beyond that I have no ideas. Hmm, maybe try to import the pool with another
zfs version, possibly one from OpenSolaris using a live DVD, if any are
still available, to see how it behaves. I know nothing about
OpenSolaris, so I can't help there though.

--
my zfs-fuse git repository :
http://rainemu.swishparty.co.uk/cgi-bin/gitweb.cgi?p=zfs;a=summary

 
From: Björn Kahl <googlelo...@bjoern-kahl.de>
Date: Wed, 30 May 2012 21:46:20 +0200
Local: Wed, May 30 2012 3:46 pm
Subject: Re: [zfs-fuse] Advice on dealing with permanent errors

On 30.05.12 20:03, Budric wrote:

> I wanted to replace a drive in a functional RaidZ ("raid 5")
> consisting of 3 drives.  The new drive is bigger in size, and I want
> to make the pool bigger by replacing all 3 drives eventually.  The
> last scrub completed without errors Monday May 28th, so I did not
> issue another scrub before starting this on Wednesday.  
> After a few hours of resilvering I see that there are 368 permanent
> errors in various files.

 More than 360 errors in two days sounds wrong.

 How are the three drives connected to the system?  The original three
 and the new ones?

 To me, this looks like you are getting random bit errors while reading,
 and/or suffering from misdirected writes (blocks meant for the new drive
 hitting one of the two old drives).

 I had a similar problem once with an external 4-bay eSATA enclosure.
 My specific combination of enclosure, controller card and driver
 managed to screw up the port multiplexing and occasionally wrote blocks
 to the wrong drive.  Even ZFS doesn't like such hardware.

 Best

    Björn
--
|     Bjoern Kahl   +++   Siegburg   +++    Germany     |
| "googlelogin@-my-domain-"   +++   www.bjoern-kahl.de  |
| Languages: German, English, Ancient Latin (a bit :-)) |

From: Seth Heeren <sghee...@hotmail.com>
Date: Wed, 30 May 2012 13:31:44 -0700 (PDT)
Local: Wed, May 30 2012 4:31 pm
Subject: Re: Advice on dealing with permanent errors
My thought would be: kill the resilver and check the hardware.
Try moving the drives to a different machine (PSU, memory and SATA controllers).

You might move to another ZFS implementation, but my main suspect
would be the hardware.



 
From: Bud Bundy <bud...@gmail.com>
Date: Thu, 31 May 2012 10:34:15 -0400
Local: Thurs, May 31 2012 10:34 am
Subject: Re: [zfs-fuse] Re: Advice on dealing with permanent errors

Thanks for the suggestions.  I shut down the system, put the drive I took
out back in, and turned it back on to start resilvering.  The errors
remained.  There are 444 now.  Most of the errors were in a snapshot I
took; I didn't need it, so I deleted it, bringing the errors down to 85
files, which I'll try to recover from backup or lose.
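
A rough sketch of how the error list could be split into snapshot entries and live files before deciding what to restore from backup. It assumes the usual "errors: Permanent errors have been detected in the following files:" section of `zpool status -v`, with snapshot entries written as dataset@snapshot:/path. The sample text and both helper names are made up for illustration; they are not the actual output from this pool.

```python
def permanent_error_paths(status_text):
    """Collect the entries listed after the 'Permanent errors' header line."""
    paths = []
    in_list = False
    for line in status_text.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_list = True
            continue
        if in_list and line.strip():
            paths.append(line.strip())
    return paths


def split_by_snapshot(paths):
    """Heuristic: an entry is in a snapshot if the part before ':' contains '@'."""
    snapshots, live = [], []
    for path in paths:
        head, sep, _ = path.partition(":")
        if sep and "@" in head:
            snapshots.append(path)
        else:
            live.append(path)
    return snapshots, live


# Illustrative sample only (hypothetical dataset and file names).
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        tank/data@monday:/photos/a.jpg
        /tank/data/docs/b.txt
"""

if __name__ == "__main__":
    snaps, live = split_by_snapshot(permanent_error_paths(SAMPLE))
    print(len(snaps), "entries in snapshots;", len(live), "live files to restore")
```

Entries that exist only in a snapshot go away when that snapshot is destroyed, which is why the count can drop so sharply; the live list is what needs restoring.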

The resilver process that I started a second time eventually crashed.
Syslog has this, but to me it's not very helpful: a segfault from
zfs-fuse followed by cron-hourly complaining.

May 31 07:16:35 backup kernel: [44605.511111] zfs-fuse[1157]: segfault at 2f93b8c6a6d0 ip 00002b93b9038e84 sp 00002b93bbd64d00 error 4 in libpthread-2.$
May 31 07:17:02 backup CRON[6993]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 31 07:19:11 backup kernel: [44761.792350] INFO: task zfs-fuse:1153 blocked for more than 120 seconds.
May 31 07:19:11 backup kernel: [44761.792494] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 31 07:19:11 backup kernel: [44761.792638] zfs-fuse        D 0000000000000000     0  1153      1 0x00000000
May 31 07:19:11 backup kernel: [44761.792652]  ffff88020ccf5cc8 0000000000000082 0000000000000000 ffffffffffffffe0
May 31 07:19:11 backup kernel: [44761.792667]  ffff88020ccf5fd8 ffff88020ccf5fd8 ffff88020ccf5fd8 0000000000013780
May 31 07:19:11 backup kernel: [44761.792679]  ffff88021068c4d0 ffff88020d49c4d0 ffff88020d49c4d0 ffff88020e4dad80
May 31 07:19:11 backup kernel: [44761.792692] Call Trace:
May 31 07:19:11 backup kernel: [44761.792713]  [<ffffffff8165a88f>] schedule+0x3f/0x60
May 31 07:19:11 backup kernel: [44761.792726]  [<ffffffff8106b915>] exit_mm+0x85/0x130
May 31 07:19:11 backup kernel: [44761.792737]  [<ffffffff8106bb2e>] do_exit+0x16e/0x420
May 31 07:19:11 backup kernel: [44761.792748]  [<ffffffff8109d9bf>] ? __unqueue_futex+0x3f/0x80
May 31 07:19:11 backup kernel: [44761.792761]  [<ffffffff8107a2ca>] ? __dequeue_signal+0x6a/0xb0
May 31 07:19:11 backup kernel: [44761.792772]  [<ffffffff8106bf84>] do_group_exit+0x44/0xa0
May 31 07:19:11 backup kernel: [44761.792783]  [<ffffffff8107ce0c>] get_signal_to_deliver+0x21c/0x420
May 31 07:19:11 backup kernel: [44761.792795]  [<ffffffff81013865>] do_signal+0x45/0x130
May 31 07:19:11 backup kernel: [44761.792806]  [<ffffffff8106652b>] ? do_fork+0x15b/0x2e0
May 31 07:19:11 backup kernel: [44761.792816]  [<ffffffff810a0a0c>] ? do_futex+0x7c/0x1b0
May 31 07:19:11 backup kernel: [44761.792825]  [<ffffffff810a0c4a>] ? sys_futex+0x10a/0x1a0
May 31 07:19:11 backup kernel: [44761.792835]  [<ffffffff81013b15>] do_notify_resume+0x65/0x80
May 31 07:19:11 backup kernel: [44761.792846]  [<ffffffff8101c668>] ? sys_clone+0x28/0x30
May 31 07:19:11 backup kernel: [44761.792856]  [<ffffffff81665050>] int_signal+0x12/0x17
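
For anyone sifting through a longer syslog for the same pattern, here is a small sketch that pulls out segfault and hung-task records. It assumes each record sits on one line (mail clients tend to wrap them), and the regular expressions are only modelled on the two messages quoted in this post, so other kernels may word them differently. `scan_syslog` is a hypothetical helper name.

```python
import re

# Patterns modelled on the two kernel messages quoted in this thread.
SEGFAULT = re.compile(r"(?P<proc>[\w-]+)\[(?P<pid>\d+)\]: segfault at ")
HUNG = re.compile(
    r"INFO: task (?P<proc>[\w-]+):(?P<pid>\d+) blocked for more than \d+ seconds"
)


def scan_syslog(lines):
    """Return (kind, process, pid) tuples for segfault and hung-task records."""
    events = []
    for line in lines:
        match = SEGFAULT.search(line)
        if match:
            events.append(("segfault", match.group("proc"), int(match.group("pid"))))
            continue
        match = HUNG.search(line)
        if match:
            events.append(("hung_task", match.group("proc"), int(match.group("pid"))))
    return events


# Two records from the log in this post, rejoined onto single lines.
SAMPLE_LINES = [
    "May 31 07:16:35 backup kernel: [44605.511111] zfs-fuse[1157]: segfault at "
    "2f93b8c6a6d0 ip 00002b93b9038e84 sp 00002b93bbd64d00 error 4 in libpthread-2.$",
    "May 31 07:19:11 backup kernel: [44761.792350] INFO: task zfs-fuse:1153 "
    "blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    for kind, proc, pid in scan_syslog(SAMPLE_LINES):
        print(kind, proc, pid)
```

On this excerpt it reports the segfault in zfs-fuse thread 1157 first and the hung task 1153 second. That ordering fits the call trace, which shows task 1153 inside do_exit.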

My kernel version is: Linux backup 3.2.0-24-generic #39-Ubuntu, on Ubuntu
12.04.
At the moment I don't have different hardware to try it on.  It's
resilvering for the third time; if that fails I'll try memtest or something.

 