zpool replace


robsee

Mar 8, 2011, 12:33:08 AM
to KQStor ZFS Discussion
Hi,

While trying to do a zpool replace, things went wrong. The old drive
was sdh, and the new drive that I tried to replace it with was also sdh.
For some reason, the ZFS code decided it should be sdh1. So it started
the resilver process. It seems that at the end of the resilver process
it never does any cleanup: the next time the array is brought up, it
still says the drive is being replaced and starts the resilver all over
again. Here is an example of what I see with zpool status:

    replacing-7  DEGRADED     0     0     0
      sdh        OFFLINE      0     0     0
      sdh1       ONLINE       0     0     0  (resilvering)
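
For reference, this is roughly how the replace was invoked (a sketch from memory; <pool> stands in for my actual pool name):

zpool replace <pool> sdh     # the new drive showed up under the same sdh name
zpool status -v <pool>       # afterwards shows the replacing-7 entry above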

How can I fix this ?

Thanks,
-Rob

Neependra Khare

Mar 9, 2011, 5:55:26 AM
to kqstor-zf...@googlegroups.com, robsee
Hi, 

On Tue, Mar 8, 2011 at 11:03 AM, robsee <r...@rsee.net> wrote:
> Hi,
>
> While trying to do a zpool replace, things went wrong. The old drive
> was sdh, and the new drive that I tried to replace it with was also sdh.
> For some reason, the ZFS code decided it should be sdh1.

I am not sure why this has happened.

> So it started the resilver process. It seems that at the end of the
> resilver process it never does any cleanup: the next time the array is
> brought up, it still says the drive is being replaced and starts the
> resilver all over again. Here is an example of what I see with zpool
> status:

Can you share the steps you followed to replace the disks? Also, please give us the
output of the following command:
zpool history -l -i 
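
For example (replace <poolname> with the name of your pool; redirecting to a file helps if the history is long):

zpool history -l -i <poolname> > zpool-history.txt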


--
Regards,
Neependra 

robsee

Mar 11, 2011, 11:26:15 PM
to KQStor ZFS Discussion
Hi,

I was originally trying to get ZFS to use entries from the /dev/disk/by-id
tree rather than the /dev/sdX entries, which were changing for me
regularly. I did this because I was running into a problem where, on
almost every reboot, I had to delete zpool.cache and reimport the pools
to get them to come up. I've since replaced the eSATA controllers and
that problem seems to have rectified itself. For each disk, I first
offlined the device I was trying to switch, and then ran zpool replace
with the old name and the new name. I did this with 2 disks of my
raidz2 volume before I decided that something was wrong and I should
probably stop.
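
The per-disk commands looked roughly like this (a reconstruction from memory; the by-id name is just a placeholder, and eightbay2 is the pool shown below):

zpool offline eightbay2 <old-sdX-name>                        # e.g. the old sdh
zpool replace eightbay2 <old-sdX-name> /dev/disk/by-id/<id>   # the intended new by-id name
zpool status -v eightbay2                                     # this is where the stray sdh1 entries showed up

(In hindsight, exporting the pool and reimporting it with zpool import -d /dev/disk/by-id would probably have been a simpler way to switch names, but that is not what I did.)
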
Here is what my current zpool status looks like:

  pool: eightbay2
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 11 23:08:36 2011
        45.0G scanned out of 9.64T at 77.4M/s, 36h6m to go
        4.01G resilvered, 0.46% done
config:

        NAME                        STATE     READ WRITE CKSUM
        eightbay2                   DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            sdd                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
            sdg                     ONLINE       0     0     0  (resilvering)
            sdh                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            replacing-6             DEGRADED     0     0  200K
              sde                   ONLINE       0     0     0  (resilvering)
              17131933909712789936  UNAVAIL      0     0     0  was /dev/sdh1
              14275510670410547429  UNAVAIL      0     0     0  was /dev/sda1
            replacing-7             UNAVAIL      0     0     0  insufficient replicas
              475396521707658792    UNAVAIL      0     0     0  was /dev/sdh
              15235888937654799190  UNAVAIL      0     0     0  was /dev/sdh1

errors: No known data errors

Here is the zdb tree:

eightbay2:
    version: 28
    name: 'eightbay2'
    state: 0
    txg: 8177712
    pool_guid: 17389717143361898507
    hostid: 8323329
    hostname: 'eonfbsd'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 17389717143361898507
        children[0]:
            type: 'raidz'
            id: 0
            guid: 8240857012312489910
            nparity: 2
            metaslab_array: 23
            metaslab_shift: 36
            ashift: 9
            asize: 12002376286208
            is_log: 0
            children[0]:
                type: 'disk'
                id: 0
                guid: 7046159470888568936
                path: '/dev/sdd'
                whole_disk: 1
                DTL: 378
            children[1]:
                type: 'disk'
                id: 1
                guid: 8000225162792066463
                path: '/dev/sdf'
                whole_disk: 1
                DTL: 377
            children[2]:
                type: 'disk'
                id: 2
                guid: 5229947180552272116
                path: '/dev/sdg'
                whole_disk: 1
                DTL: 372
            children[3]:
                type: 'disk'
                id: 3
                guid: 16384725385907175025
                path: '/dev/sdh'
                whole_disk: 1
                DTL: 376
            children[4]:
                type: 'disk'
                id: 4
                guid: 504467111533001616
                path: '/dev/sdc'
                whole_disk: 1
                DTL: 329
            children[5]:
                type: 'disk'
                id: 5
                guid: 11937532446586896064
                path: '/dev/sdb'
                whole_disk: 1
                DTL: 375
            children[6]:
                type: 'replacing'
                id: 6
                guid: 14334310672528416367
                whole_disk: 0
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 8750903116896420211
                    path: '/dev/sde'
                    whole_disk: 1
                    DTL: 374
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 17131933909712789936
                    path: '/dev/sdh1'
                    whole_disk: 0
                    not_present: 1
                    DTL: 435
                    resilvering: 1
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 14275510670410547429
                    path: '/dev/sda1'
                    whole_disk: 0
                    not_present: 1
                    DTL: 438
                    resilvering: 1
            children[7]:
                type: 'replacing'
                id: 7
                guid: 8589017169731279389
                whole_disk: 0
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 475396521707658792
                    path: '/dev/sdh'
                    whole_disk: 1
                    not_present: 1
                    DTL: 373
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 15235888937654799190
                    path: '/dev/sdh1'
                    whole_disk: 0
                    not_present: 1
                    DTL: 436
                    resilvering: 1
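
That dump is the cached pool configuration as printed by zdb; something along these lines should reproduce it (the exact invocation may differ on your build):

zdb -C eightbay2    # print the cached configuration, including the vdev_tree above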

I'm also getting this stack dump, but I don't know whether it is
connected to my problem:
[ 352.713234] SPL: Showing stack for process 3109
[ 352.713239] Pid: 3109, comm: txg_sync Tainted: P 2.6.35-22-server #35-Ubuntu
[ 352.713241] Call Trace:
[ 352.713255] [<ffffffffa050b607>] spl_debug_dumpstack+0x27/0x40 [spl]
[ 352.713263] [<ffffffffa050f67d>] kmem_alloc_debug+0x11d/0x130 [spl]
[ 352.713296] [<ffffffffa05b3a21>] dsl_scan_setup_sync+0x1e1/0x210 [zfs]
[ 352.713322] [<ffffffffa05b604c>] dsl_scan_sync+0x1dc/0x3a0 [zfs]
[ 352.713351] [<ffffffffa060b50c>] ? zio_destroy+0xac/0xf0 [zfs]
[ 352.713378] [<ffffffffa05c179a>] spa_sync+0x3fa/0x9a0 [zfs]
[ 352.713384] [<ffffffff8107f096>] ? autoremove_wake_function+0x16/0x40
[ 352.713388] [<ffffffff8104d203>] ? __wake_up+0x53/0x70
[ 352.713416] [<ffffffffa05d2bf5>] txg_sync_thread+0x215/0x3a0 [zfs]
[ 352.713444] [<ffffffffa05d29e0>] ? txg_sync_thread+0x0/0x3a0 [zfs]
[ 352.713452] [<ffffffffa05100f8>] thread_generic_wrapper+0x78/0x90 [spl]
[ 352.713459] [<ffffffffa0510080>] ? thread_generic_wrapper+0x0/0x90 [spl]
[ 352.713462] [<ffffffff8107eb26>] kthread+0x96/0xa0
[ 352.713466] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
[ 352.713469] [<ffffffff8107ea90>] ? kthread+0x0/0xa0
[ 352.713472] [<ffffffff8100aee0>] ? kernel_thread_helper+0x0/0x10

If you still need my zpool history, I can send it to you directly (it
is large). On a side note, when I try to run zpool history on that
array, I get the following stack dump:
[ 1166.957630] SPL: Showing stack for process 4404
[ 1166.957633] Pid: 4404, comm: zpool Tainted: P 2.6.35-22-server #35-Ubuntu
[ 1166.957635] Call Trace:
[ 1166.957643] [<ffffffffa050b607>] spl_debug_dumpstack+0x27/0x40 [spl]
[ 1166.957650] [<ffffffffa050f67d>] kmem_alloc_debug+0x11d/0x130 [spl]
[ 1166.957678] [<ffffffffa05f395f>] zfs_ioc_pool_get_history+0x9f/0x110 [zfs]
[ 1166.957683] [<ffffffffa026dd4e>] ? pool_namecheck+0x5e/0x180 [zcommon]
[ 1166.957711] [<ffffffffa05f425f>] zfsdev_ioctl+0xef/0x1c0 [zfs]
[ 1166.957715] [<ffffffff81162e1d>] vfs_ioctl+0x3d/0xd0
[ 1166.957718] [<ffffffff811635b1>] do_vfs_ioctl+0x81/0x3d0
[ 1166.957721] [<ffffffff815a2569>] ? do_page_fault+0x159/0x350
[ 1166.957724] [<ffffffff81163981>] sys_ioctl+0x81/0xa0
[ 1166.957728] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

Thanks,
-Rob

robsee

Mar 14, 2011, 3:40:41 PM
to KQStor ZFS Discussion
Hi,

Just in case someone stumbles across this problem in the future: I
found a workaround (I hope). The problem I was running into appears to
be related to this bug:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6782540
Using zpool detach <pool> <missing dev> removed the bad entries, and
the pool now reports that everything is good. I will do a full scrub at
some point to confirm this is the case.
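
Concretely, it was something like the following, using the GUIDs of the stale devices from my earlier zpool status output (anyone else hitting this should use the GUIDs that show up as UNAVAIL in their own pool):

zpool detach eightbay2 17131933909712789936    # stale "was /dev/sdh1" entry under replacing-6
zpool detach eightbay2 14275510670410547429    # stale "was /dev/sda1" entry under replacing-6
zpool status -v eightbay2                      # check that the bad entries are gone
zpool scrub eightbay2                          # full scrub to verify the data

I did the same kind of detach for the stale entries under replacing-7.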

-Rob