abort zpool replace

Busty

Oct 15, 2014, 2:32:33 PM
to zfs-...@googlegroups.com
In my pool, I had a disk that got a SMART error (bad block), so I pulled it out, installed a new one and ran
"zpool replace disk5s2 806745480046791602". (That number was shown by "zpool status" as the missing device.)

The resilver process started, but it seems that the new disk is faulty as well: it intermittently disappears from the device list, at least once every 6 hours (I have Temperature Monitor running, which shows me all disks by serial number).

So I want to change it. But zpool detach <poolname> dev/disk5s2 gives the error "no such device in pool".

How can I abort the resilvering process? Or is there another way to restart the resilvering with a new disk?

The original disk with the bad block is already on its way to Western Digital (it was still in warranty).

Bjoern Kahl

Oct 15, 2014, 2:49:09 PM
to zfs-...@googlegroups.com

Hi 'Busty',

Am 15.10.14 um 20:32 schrieb 'Busty' via zfs-macos:
> In my pool, I had a disk that got a smart error (bad block), so I
> pulled it out, installed a new one and made a "zpool replace
> disk5s2 806745480046791602". (That number was shown when typing
> "zpool status" as the missing device.)
>
> The resilver process started, but it seems that the new disk is
> faulty, because it disappears from the device list infrequently,
> but still at least every 6 hours (I have Temperature Monitor
> running which shows me all disks by serial number).
>
> So I want to change it. But zpool detach <poolname> dev/disk5s2
> gives the error "no such device in pool".
>
> How can I abort the resilvering process? Or is there another way to
> restart the resilvering with a new disk?

In this situation, I would usually do exactly what you described:
detach the disk and attach a new one.

"zpool detach" is supposed to detach any disk that can logically be
detached (i.e. does not remove data that is stored only on that disk).
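
For example (the pool name is a placeholder; the device must be given exactly as
"zpool status" prints it, which may be a numeric GUID rather than a /dev path):

    zpool status -v <pool>
    zpool detach <pool> <device-name-exactly-as-shown-by-zpool-status>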

To diagnose further, you would need to show us "zpool status -v".


> The original disk with the bad block is already on its way to
> Western Digital (it was still in warranty).


Generally, it is wiser to do the replace with the faulty disk
still present. If another disk runs into trouble during the resilver,
the old disk still holds most of the data and can provide good blocks
if the resilver process needs them.
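
If you have a spare port, the replace can name both the old and the new device,
for example (pool and device names are placeholders):

    # the old disk stays attached and readable until the resilver completes
    zpool replace <pool> <old-device> <new-device>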


Best regards

Björn

--
| Bjoern Kahl +++ Siegburg +++ Germany |
| "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
| Languages: German, English, Ancient Latin (a bit :-)) |

Busty

Oct 15, 2014, 3:18:49 PM
to zfs-...@googlegroups.com
zpool status -v shows:

Server:~ busty$ zpool status -v
  pool: Collection
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool
        will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0,98% done, 26h14m to go
config:

        NAME                      STATE     READ WRITE CKSUM
        Collection                DEGRADED     0     0     0
          raidz1                  DEGRADED     0     0     0
            disk3s2               ONLINE       0     0     0
            disk5s2               ONLINE       0     0     0
            disk7s2               ONLINE       0     0     0
            disk1s2               ONLINE       0     0     0
            replacing             DEGRADED     0     0     0
              806745480046791602  FAULTED      0     0     0  was /dev/disk5s2
              disk4               ONLINE       0     0     0
            disk2s2               ONLINE       0     0     0
            disk6s2               ONLINE       0     0     0

errors: No known data errors


Good to know about leaving the disk that is being replaced in place until
the resilver is done. My time to send the old disk away was running out,
and it's somewhat easier to just swap the disks, but I do have a spare
SATA port, so I could do it the safer way next time.

Meanwhile, what can I do about the "no such device in pool" error?

Thanks

Busty

Oct 23, 2014, 8:01:15 AM
to zfs-...@googlegroups.com
This was in fact easier than I thought. What did the trick was to
physically swap the faulty disk with a new one and then run
"zpool detach" on the faulty disk.

After that, a "zpool replace" worked like a charm.

Problem solved.
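
In commands, that was roughly the following (with placeholders instead of the
actual device names):

    # after physically swapping in the new disk:
    zpool detach Collection <faulty-disk-as-shown-by-zpool-status>
    zpool replace Collection <device-being-replaced> <new-disk>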

BelecMartin

Oct 23, 2014, 8:07:46 AM
to zfs-...@googlegroups.com
Yeah!

Jason Belec
Sent from my "It's an iPod, a Phone, and an Internet Device..."

Busty

Oct 26, 2014, 7:09:51 AM
to zfs-...@googlegroups.com
This generated a follow-up question:

I did the zpool replace with an unformatted disk, as described in the
Oracle documentation. After that, zpool status showed the disk as part
of the pool, but as "disk2", not as "disk2s2". Accordingly, OS X wanted
to initialize the disk every time it booted.

So I formatted the disk as described in the Getting Started guide on
MacZFS, which resolved the problem of OS X wanting to initialize the
disk, but it still shows as "disk2" (without the s2) in zpool status.
I was prepared to resilver the disk again after that, but it was still
part of the pool.

I started a scrub, got 6 checksum errors on that disk right at the
beginning, but otherwise the scrub seems to consider the data good.
It is at 7 percent right now.

Should I be worried that the data is not intact?

Bjoern Kahl

Oct 26, 2014, 9:43:36 AM
to zfs-...@googlegroups.com

(This is going to be a bit longer, but since this is a recurring topic,
I'd like to provide some background information on what happens
behind the scenes.)


Am 26.10.14 um 12:09 schrieb 'Busty' via zfs-macos:
> This generated a follow up question:
>
> I did the zpool replace with an unformatted disk as described in
> the oracle documentation. After that, zpool status showed the disk
> as part of the pool, but as "disk2", not as "disk2s2". Accordingly,
> OSX wanted to initialize the disk every time upon booting.
>
> So I formatted the disk as described in the getting started guide
> on MacZFS, which resolves the problem of OSX wanting to initialize
> the disk, but still it shows as "disk2" (without the s2) with zpool
> status. I was prepared to resilver the disk again after that, but
> it was still part of the pool.
>
> I started a scrub, had 6 checksum errors on that disk right at the
> beginning, but otherwise the scrub seems to consider the data as
> good. It is at 7 percent right now.
>
> Should I be worried that the data is not integer?

Yes, you should.

You basically did the following:

1)

Gave a whole disk to ZFS, telling it that it is OK to use the whole space
from the first to the last block of the disk.

ZFS did so and started writing data:

a) its vdev labels 0 and 1 in blocks 0 to 1023 (assuming 512-byte blocks)

b) its vdev labels 2 and 3 in blocks N-1024 to N-1 (assuming N blocks on
the disk)

c) your pool data in between, following its somewhat complex
allocation scheme


2)

Told OS X to write a disk label (aka GPT) on the disk.

OS X did so and started writing data:

a) A protective MBR in block 0 -> no damage; ZFS anticipates
that and leaves blocks 0 to 32 (16 kB) of its label alone.

b) The primary GPT structures, from block 1 (byte position 512)
to the end of block 33 (byte position 17408).
This trashed part of the configuration dictionary in vdev label 0.

c) The secondary GPT structures, in the last 17408 bytes of the disk,
overwriting part of the uberblock array in vdev label 3.

d) The Mac OS X EFI area, usually around blocks 40 to 409600 (byte
positions up to 200 MB). This is "/dev/diskXs1".

e) The main partition "/dev/diskXs2", roughly starting at block 409640
and extending until some blocks before the secondary GPT structures.
This is only created, but not written to, if "noformat" has been used.
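
(If you want to look at these structures yourself, something along these lines
should work; the disk number is an example, "gpt" ships with OS X and "zdb"
with the ZFS port:)

    sudo gpt -r show disk2        # print the GPT: protective MBR, EFI slice, main slice
    sudo zdb -l /dev/disk2s2      # dump the four ZFS vdev labels of the slice ZFS uses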



What does this mean?
--------------------


It depends on how ZFS sees the disk. Most likely it will continue to
use "diskX" (no slice). In that case:

The pool keeps functioning, since vdev labels 1 and 2 are undamaged (0
and 3 are overwritten, see above).

ZFS will almost instantly fix its labels, completely overwriting the
secondary GPT. Mac OS X doesn't care; it writes the secondary GPT once and
never looks at it again.

The situation at the start of the disk is more complex.

ZFS will also almost instantly fix its label 0. However, this only writes
from block 32 onwards (byte position 16384), since it
completely ignores the first 16 blocks (supposed to hold a disk
identifier) and doesn't touch the next 16 in normal operation, since
they are supposed to hold ZFS boot code and are unused in current
implementations.

So the rewritten vdev label 0 trashes the last 512 bytes of the primary
GPT. This does concern Mac OS X, and you should see a warning about an
invalid GPT CRC in the system log after boot.


So much for the administrative data structures. What about your data?

ZFS's data area starts after vdev label 1, i.e. at block 1024
(byte position 512 kB). This is somewhere inside the EFI area,
overwriting whatever Mac OS X placed there (this depends on the version;
older Mac OS X versions didn't place anything there, and I don't know
about newer versions). In any case, Mac OS X does not access the EFI area
in normal operation, and so won't notice the damage.

On the other hand, Mac OS X initializes the EFI area when
initializing a disk, placing an empty FAT file system there.

This FAT file system overwrote part of the ZFS pool data and caused the
checksum errors.


What to do now?
---------------

I would detach the disk in question, zap the first and last several MB
of disk space (i.e. of diskX itself, not of the diskXs2 slice) by
writing zero bytes to the disk, for example using "dd", reformat it with
diskutil and reattach it as /dev/diskXs2.

Another approach to zapping the disk content is to format it as HFS+
with diskutil and then select "erase free space" (or
whatever the English button label is).
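
For the dd part, something like the following should do (the disk number is
an example; double-check it first, dd will happily destroy the wrong disk):

    # zero the first ~10 MB of the whole disk (not the slice):
    sudo dd if=/dev/zero of=/dev/rdisk2 bs=1m count=10
    # zero the last ~10 MB as well; look the disk size up with "diskutil info disk2"
    # and seek to (size in MB minus 10):
    sudo dd if=/dev/zero of=/dev/rdisk2 bs=1m seek=<size_in_MB_minus_10>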


Best regards

Björn

> On 23.10.14 14:01, 'Busty' via zfs-macos wrote:
>
>> This was in fact easier than I thought. What did the trick was
>> to physically swap the faulty disk with a new one and then "zpool
>> detach (faulty disk)"
>>
>> After that a "zpool replace" went like a charm.
>>
>> Problem solved.
>>
>> On 15.10.14 20:32, 'Busty' via zfs-macos wrote:
>>> In my pool, I had a disk that got a smart error (bad block), so
>>> I pulled it out, installed a new one and made a "zpool replace
>>> disk5s2 806745480046791602". (That number was shown when typing
>>> "zpool status" as the missing device.)
>>>
>>> The resilver process started, but it seems that the new disk is
>>> faulty, because it disappears from the device list
>>> infrequently, but still at least every 6 hours (I have
>>> Temperature Monitor running which shows me all disks by serial
>>> number).
>>>
>>> So I want to change it. But zpool detach <poolname> dev/disk5s2
>>> gives the error "no such device in pool".
>>>
>>> How can I abort the resilvering process? Or is there another
>>> way to restart the resilvering with a new disk?
>>>
>>> The original disk with the bad block is already on its way to
>>> Western Digital (it was still in warranty).


--
| Bjoern Kahl +++ Siegburg +++ Germany |
| "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
| Languages: German, English, Ancient Latin (a bit :-)) |

BelecMartin

Oct 26, 2014, 10:28:17 AM
to zfs-...@googlegroups.com
Well, that sure is detailed. It should be in the wiki, as it is very useful and a great overall explanation. ;)

Jason Belec
Sent from my "It's an iPod, a Phone, and an Internet Device..."

Busty

Oct 29, 2014, 3:54:48 AM
to zfs-...@googlegroups.com
Wow, thanks Bjoern for that; now I really know what was going on. I
really appreciate the time you took to explain all of it.

The problem I'm facing is that I can't detach the drive. A "zpool
detach pool diskx" gives me the error:
"cannot detach diskx: only applicable to mirror and replacing vdevs."

I managed to format the disk as HFS+, zero the drive completely and
then format it as ZFS, but ZFS still considers this disk part of the pool.

What can I do to get the drive out of the pool?

ilov...@icloud.com

Oct 29, 2014, 4:49:56 AM
to zfs-...@googlegroups.com, junk...@yahoo.de
zpool attach makes a non-mirror into a mirror. zpool detach makes a mirror into a non-mirror.

I believe you are looking for zpool remove.
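
For reference, the syntax of the three commands (pool and device names are
placeholders):

    zpool attach <pool> <existing-device> <new-device>   # turns a single device into a mirror
    zpool detach <pool> <device>                         # removes one side of a mirror (or of a "replacing" vdev)
    zpool remove <pool> <device>                         # removes hot spares, cache and log devices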

Busty

Oct 29, 2014, 6:49:46 AM
to zfs-...@googlegroups.com
Thanks for the input, but:

"only inactive hot spares can be removed", whereas I need to
remove/detach/whatever one disk of a raidz1 pool; no mirrors, no duplicates.

I get the impression there is no way to do that, so I might have to
build the pool from scratch again. Am I right?

Jason Belec

Oct 29, 2014, 7:19:31 AM
to zfs-...@googlegroups.com
If I understand what I'm reading here correctly, you have a disk that is in your pool, and the pool is raidz, so you must always have the same number of devices attached to the pool; this is a raidz law. You can replace a damaged disk with a new one, but you cannot remove the damaged one until the replace/resilver is complete. You cannot stop a resilver once it has begun, so you're going to have to be patient. Once it's done, you can proceed with rectifying the issue. The issues you are running into come from not reading up and testing before committing, and that seems to happen a lot. ZFS seems frustrating to you right now because it is doing everything possible to protect the data you're messing with. ;)


--
Jason Belec
Sent from my iPad

ilov...@icloud.com

Oct 29, 2014, 7:35:11 AM
to zfs-...@googlegroups.com
Yeah, zpool remove won't work on a device in a raidz vdev, nor will zpool detach.

What does your current zpool status look like?

Busty

Oct 29, 2014, 7:49:33 AM
to zfs-...@googlegroups.com
Hey Jason,

Not really that frustrated; I feel I'm working my way towards the
solution with the help of you MacZFS guys.

I clearly didn't think it through when telling ZFS that it is OK to use
the whole disk instead of the s2 slice.

The issue seems to be that I can't tell ZFS that I want to start from
scratch with that disk; ZFS always recognizes the disk as already being
part of the pool. As a whole.

So, the options I see:

- I can either physically replace the disk with a new one, this time
formatting it as ZFS before telling ZFS to replace it

- I can rebuild the pool from scratch

(I would go for rebuilding the pool from scratch, as the disk in question
works when installed properly. Additionally, I wouldn't have to buy
another disk and wait for it.)

What do you guys think: is there another option?


zpool status gives me:

Server:~ busty$ zpool status
  pool: Collection
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        Collection   ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            disk5s2  ONLINE       0     0     0
            disk4s2  ONLINE       0     0     0
            disk7s2  ONLINE       0     0     0
            disk3s2  ONLINE       0     0     0
            disk2    ONLINE       0     0     5
            disk1s2  ONLINE       0     0     0
            disk6s2  ONLINE       0     0     0

errors: No known data errors

But I bet I'll get a pocketful (a big pocket) of errors on disk2 when
doing a scrub, since I zeroed the disk completely.

ilov...@icloud.com

Oct 29, 2014, 8:01:48 AM
to zfs-...@googlegroups.com, junk...@yahoo.de
OpenZFS on OS X has a command called "zpool labelclear" to handle this situation, but it rarely comes up because if you give OpenZFS on OS X a whole device, it will automatically partition it for you.
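
On OpenZFS on OS X that would look roughly like this (the disk number is just an example):

    sudo zpool labelclear -f /dev/disk2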

Since MacZFS does not have the zpool labelclear command, you can achieve the same effect by zeroing out the disk.

1) zpool offline the device
2) zero it out
3) partition it
4) zpool online the device
5) zpool replace the device with itself

You can use Disk Utility.app's "Erase" tab to complete step 2. Be sure to select writing a single pass of zeros in the Security Options.

In reality you only need to zero out the labels, but zeroing the whole device certainly does the job; a rough sketch of the steps follows below.
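
A minimal sketch of the five steps for MacZFS, assuming the whole-disk vdev is
disk2 and the pool is Collection (double-check the disk number before pointing
dd at it):

    sudo zpool offline Collection disk2
    # step 2: zero it out, either the whole disk via Disk Utility's Erase (single
    # pass of zeros) or at least the first and last few MB where the labels live:
    sudo dd if=/dev/zero of=/dev/rdisk2 bs=1m count=10
    sudo dd if=/dev/zero of=/dev/rdisk2 bs=1m seek=<disk_size_in_MB_minus_10>
    # step 3: partition it as in the MacZFS Getting Started guide, so ZFS gets an s2 slice
    sudo diskutil partitionDisk disk2 GPTFormat ZFS %noformat% 100%
    sudo zpool online Collection disk2
    # step 5: replace the old whole-disk entry with the freshly created slice
    sudo zpool replace Collection disk2 /dev/disk2s2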