
Corrupt GPT on ZFS full-disks that shouldn't be using GPT


Chris Stankevitz

Jun 28, 2015, 12:35:42 AM
Hi,

I have 11 drives in a zpool. The zpool uses the whole drives, i.e. I
used "zpool create da1 da2 da3 ... daN". The zpool is running well.

In a former life, each of these drives held two gpart
partitions. Apparently I did not "gpart destroy" these drives before
creating a zpool out of them.

Now when my computer boots,

1. the zpool comes up and is healthy

2. ls /dev/daX* does not show any of the "old" partitions.

3. dmesg reports "the primary GPT table is corrupt or invalid" and
"using the secondary instead -- recovery strongly advised."

Q: Am I in danger of GPT wrestling control of the drive away from ZFS?

Q: How can I remove the secondary GPT table from each of the drives
that are participating in the zpool? I suppose I could offline and
resilver each of them. I'm afraid to dd the secondary GPT header at
the last 512 bytes of the drive. Perhaps there is a way I can ask ZFS
to do that for me?

Thank you,

Chris

Warren Block

Jun 28, 2015, 2:26:30 AM
On Sat, 27 Jun 2015, Chris Stankevitz wrote:

> I have 11 drives in a zpool. The zpool uses the whole drives, i.e. I
> used "zpool create da1 da2 da3 ... daN". The zpool is running well.
>
> In a former life, each of these drives held two gpart
> partitions. Apparently I did not "gpart destroy" these drives before
> creating a zpool out of them.
>
> Now when my computer boots,
>
> 1. the zpool comes up and is healthy
>
> 2. ls /dev/daX* does not show any of the "old" partitions.
>
> 3. dmesg reports "the primary GPT table is corrupt or invalid" and
> "using the secondary instead -- recovery strongly advised."
>
> Q: Am I in danger of GPT wrestling control of the drive away from ZFS?

No, but I would want to fix it so it doesn't surprise me at some
inopportune later date.

> Q: How can I remove the secondary GPT table from each of the drives
> that are participating in the zpool? I suppose I could offline and
> resilver each of them. I'm afraid to dd the secondary GPT header at
> the last 512 bytes of the drive.

Erasing just the last block would probably be enough. Still, the only
reason to be afraid of it is if you do not have a full backup. And if
you don't have a full backup, that is the first thing to do.

> Perhaps there is a way I can ask ZFS
> to do that for me?

Not that I know of. If the backup GPT were in an area that ZFS writes,
it would have been overwritten when the ZFS label was installed.

Here is the forum thread where I show the math about erasing the backup
GPT:
https://forums.freebsd.org/threads/gpt-table-corrupt.52102/#post-292341
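
For what it's worth, here is a minimal sketch of that math, assuming
512-byte sectors; da1 and the sizes are only placeholders, so check the
diskinfo output for your own drives before running any dd (and, as
discussed later in the thread, the disk needs to be taken out of use by
the pool first or the write will be refused):

diskinfo /dev/da1
# example output: da1  512  2000398934016  3907029168 ...
#                 (sectorsize, mediasize in bytes, mediasize in sectors)
# the backup GPT header lives in the very last sector:
#   last LBA = mediasize in sectors - 1 = 3907029167 in this example
dd if=/dev/zero of=/dev/da1 bs=512 seek=3907029167 count=1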

Quartz

Jun 28, 2015, 2:53:03 AM
> 3. dmesg reports "the primary GPT table is corrupt or invalid" and
> "using the secondary instead -- recovery strongly advised."

> Q: How can I remove the secondary GPT table from each of the drives
> that are participating in the zpool?

First off, you should double check what's going on with your layout. You
didn't mention what system you're running or how this array was created.
In several cases, even if you meant to use the whole disk, you can
accidentally or unknowingly end up making gpt headers anyway, either for
labels, for compatibility, or because you did something that ended up
requiring partitions. Also, a lot of zfs-based front ends (e.g. FreeNAS)
always create zfs-on-partitions, so if this array was ported from
another system it's possible it's supposed to have a legit gpt layout.

Additionally, some motherboards and expansion cards that offer raid
services can screw with gpt. I have a motherboard where I have to set
the sata ports to old-style ide compatibility, because turning on ahci
mode automatically reserves/locks off a chunk of the end of the disk for
raid metadata (even if I have the raid options disabled), causing dmesg
to complain about corrupt gpt headers. So double check whether you've
changed anything related to that.

Either way, before you go any further, explain the steps you took to
create this pool and dump out everything that the gpt commands tell you
about the disks. It would especially help to get a dump of either/both
headers to see what's going on there.
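
To be concrete, a sketch of the kind of readout meant here, with da0
standing in for a drive from a pool that was wiped first and da10 for
one from the pool that wasn't (the device names are placeholders):

gpart show da0      # should report no geom if the disk was really wiped
gpart show da10     # may still show the old two-partition FreeNAS layout
gpart list da10     # full detail: scheme, labels, start/end of partitions
gpart status        # flags any tables the kernel considers CORRUPT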


>I suppose I could offline and
> resilver each of them.

Simply resilvering is not guaranteed to fix the problem, depending on
what's going on. If you're feeling adventurous you can always offline a
drive and 'gpart destroy' it, then see what zfs says if you try to bring
it back or reboot.


> I'm afraid to dd the secondary GPT header at
> the last 512 bytes of the drive. Perhaps there is a way I can ask ZFS
> to do that for me?

Zfs doesn't mess with gpt directly like that, so no. If you don't want
to 'gpart destroy' it for some reason, it's not hard to nuke it yourself
with dd; you just need the output from 'diskinfo' and a calculator.

Quartz

Jun 28, 2015, 4:59:28 AM
>Also, a lot of zfs-based front ends (eg; freenas)
> always create zfs-on-partitions, so if this array was ported from
> another system it's possible it's supposed to have a legit gpt layout.

As an aside, I believe the linux implementation of zfs also requires
partitions (or at least it used to).

Chris Stankevitz

Jun 28, 2015, 5:12:29 PM
On Sat, Jun 27, 2015 at 11:52 PM, Quartz <qua...@sneakertech.com> wrote:
> First off, you should double check what's going on with your layout.


Thank you for your help. I have four 11-drive raidz3 pools that, in a
prior life, lived in FreeNAS. Of course, being in FreeNAS, they were
gpart-ed to have two partitions (one for zfs, one for swap).

I took these 4 groups of 11 drives over to my FreeBSD box. For each
group of 11 drives I:
gpart destroy -F /dev/da0
gpart destroy -F /dev/da1
...
gpart destroy -F /dev/da10
zpool create poolname raidz3 /dev/da0 /dev/da1 ... /dev/da10

Unfortunately on one of the 11 groups I forgot to perform the "gpart
destroy" step. I did perform the "zpool create" step. This is the
group of drives that triggers the dmesg "the primary GPT table is
corrupt or invalid" and "using the secondary instead -- recovery
strongly advised."

>> I suppose I could offline and
>> resilver each of them.
>
>
> Simply resilvering is not guaranteed to fix the problem

I agree. What I meant to say was "offline the drive, dd if=/dev/zero
the drive, then resilver it."

>> I'm afraid to dd the secondary GPT header at
>> the last 512 bytes of the drive. Perhaps there is a way I can ask ZFS
>> to do that for me?
>
>
> Zfs doesn't mess with gpt directly like that, so no. If you don't want to

What I meant here was "Perhaps I can politely ask ZFS 'hey, if
you are not using the last 512 bytes of these devices, would you mind
just filling that with zeros?'". I would feel more comfortable if
there was a command like that offered by ZFS rather than me just using
dd and hoping it doesn't interfere with ZFS.

Chris

Chris Stankevitz

Jun 28, 2015, 5:17:22 PM
On Sat, Jun 27, 2015 at 11:26 PM, Warren Block <wbl...@wonkity.com> wrote:
> Erasing just the last block would probably be enough. Still, the only
> reason to be afraid of it is if you do not have a full backup. And if you
> don't have a full backup, that is the first thing to do.

Warren,

Thank you. I do indeed have backups so perhaps I shouldn't be afraid
to just experiment... especially if I experiment only on one of my
raidz3 drives. Do I need to export the pool before using dd on the
raw device?

Chris

Quartz

Jun 28, 2015, 6:10:40 PM
> Unfortunately on one of the 11 groups I forgot to perform the "gpart
> destroy" step.

OK, well you should still get a readout of all the gpt stuff off the
disks just to make sure you're not doing something elsewhere that
actually needs gpt (like labels or something). In this case though you
can compare the 'good' drives to the 'bad' drives.


> What I meant to say was "offline the drive, dd if=/dev/zero
> the drive, then resilver it."

> Do I need to export the pool before using dd on the
> raw device?

Given that this is a raidz3 and you're just going to dd nuke a single
drive, then no. Just offline the drive first so the pool's not trying to
use it. Exporting a pool completely shuts it down and packages
everything up in a machine-independent way so it can be physically moved
to a new box, but that's overkill for your situation. (Also, I'm not
100% sure what zfs does when you try to import a pool with one drive
"missing").


> What I meant here was to say "Perhaps I can politely ask ZFS 'hey if
> you are not using the last 512 bytes of these devices, would you mind
> just filling that with zeros?'". I would feel more comfortable if
> there was a command like that offered by ZFS rather than me just using
> dd and hoping it doesn't interfere with ZFS.

I *think* that zfs is like other filesystems and partitioning schemes in
that it just writes from the beginning of the drive and doesn't care
about the end until it gets there... however don't quote me on it. Again
though, you could always offline the drive, dd the end, then reattach it
and do a scrub or something. If you end up blowing away something zfs
needs, it won't stay silent about it.
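
A sketch of that per-drive sequence, assuming a pool named "tank" and da1
as the placeholder disk being cleaned (the last-LBA value is the example
number from the diskinfo discussion earlier and must be recalculated for
each disk):

zpool offline tank da1     # stop the pool from using the disk
lastlba=3907029167         # mediasize in sectors - 1 for THIS disk
dd if=/dev/zero of=/dev/da1 bs=512 seek=$lastlba count=1
zpool online tank da1      # bring it back into the pool
zpool scrub tank           # verify; then check 'zpool status -v' for errors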

Warren Block

Jun 28, 2015, 7:06:42 PM
On Sun, 28 Jun 2015, Chris Stankevitz wrote:

> On Sat, Jun 27, 2015 at 11:26 PM, Warren Block <wbl...@wonkity.com> wrote:
>> Erasing just the last block would probably be enough. Still, the only
>> reason to be afraid of it is if you do not have a full backup. And if you
>> don't have a full backup, that is the first thing to do.
>
> Warren,
>
> Thank you. I do indeed have backups so perhaps I shouldn't be afraid
> to just experiment... especially if I experiment only on one of my
> raidz3 drives. Do I need to export the pool before using dd on the
> raw device?

It depends on how confident you are in those backups. Remember, ZFS
leaves space unused at the end of a disk to allow for variations in
nominal disk size.

Overwriting even just the last block will destroy the backup GPT header
without touching any ZFS data. In theory, anyway, which is why you have
backups. ZFS ought to notice if there was any problem during the next
zpool scrub. So do one drive, do a scrub, and if red lights don't start
flashing and an urgent resilver does not start... it's good. But still,
keep good backups.

Chris Stankevitz

Jun 28, 2015, 8:46:58 PM
On Sun, Jun 28, 2015 at 4:22 PM, <kpn...@pobox.com> wrote:
> I'm not sure if offlining the disk you are touching will make any difference.
> It may just make recovery more complicated. If it was me I'd just zero the
> last block without the offline/online dance.

Kevin,

Thank you. Can I use dd on a /dev/daX that is currently part of an
imported, mounted, and online zpool? Also, if anyone can answer this
question more generally, I'd appreciate it: are there times (other
than r-x) when I do not have permission/ability to dd if=/dev/zero
of=/dev/daX?

Chris

Quartz

Jun 28, 2015, 10:05:00 PM
> Remember, ZFS
> leaves space unused at the end of a disk to allow for variations in
> nominal disk size.

Holy what the heck, no it doesn't! One big issue with zfs is that you
CANNOT shrink a pool's size once it's been created, for any reason. You
can't remove vdevs, and any replacement disk must be bigger or exactly
equal in size; even a disk with one less sector and you're SOL. This is
my biggest gripe with zfs by far and in fact I just asked freebsd-fs
about this less than a week ago wondering if it had been addressed
finally (it hasn't).

Quartz

Jun 28, 2015, 10:29:51 PM
> When making changes like dd'ing the end of a disk be sure to do a scrub
> after touching _only_ _one_ disk. I suggest doing scrubs until you get a
> clean one. Then move on to the next disk, doing the write/scrub steps on
> each disk in turn.

This is good advice.


> It may just make recovery more complicated. If it was me I'd just zero the
> last block without the offline/online dance.

... but I'm not sure that is.

>Can I use dd on a /dev/daX that is currently part of an
> imported, mounted, and online zpool?

There's nothing technically stopping you, but screwing with it "live" is
not a great idea. At the least, if zfs IS using those blocks or
otherwise notices, it will consider the drive to be throwing errors and
mark it as failing, so you'll have to deal with the issue anyway.

Honestly though, this whole thread is really better suited for
freebsd-fs rather than freebsd-questions. You'll probably get better
answers there.


>Also, if anyone can answer this
> question more generally, I'd appreciate it: are there times (other
> than r-x) when I do not have permission/ability to dd if=/dev/zero
> of=/dev/daX?

It can happen. I've bumped into that once when a dvd was improperly
cloned onto a usb drive and it got stuck in a weird read-only mode and I
had to jump through a couple hoops to erase it (this wasn't freebsd though).

Quartz

Jun 29, 2015, 4:33:48 AM
>Do I need to export the pool before using dd on the
> raw device?

So I think my earlier comment saying export is overkill might have been
wrong. The handbook page explaining how zfs checksums work has an
example that explicitly uses export:

[20.3.8. Self-Healing] "Data corruption is simulated by writing random
data to the beginning of one of the disks in the mirror. To prevent ZFS
from healing the data as soon as it is detected, the pool is exported
before the corruption and imported again afterwards."

https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/zfs-zpool.html
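
So a more conservative sequence along the lines of that handbook example
might look like the sketch below ("tank", da1, and the seek value are
placeholders, and the advice above about doing one disk per scrub still
applies):

zpool export tank          # pool fully closed, no device held open
dd if=/dev/zero of=/dev/da1 bs=512 seek=3907029167 count=1
zpool import tank          # bring the pool back online
zpool scrub tank           # let ZFS verify every block it knows about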

Warren Block

Jun 29, 2015, 10:19:56 AM
On Sun, 28 Jun 2015, Quartz wrote:

>> Remember, ZFS
>> leaves space unused at the end of a disk to allow for variations in
>> nominal disk size.
>
> Holy what the heck, no it doesn't! One big issue with zfs is that you CANNOT
> shrink a pool's size once it's been created, for any reason. You can't remove
> vdevs, and any replacement disk must be bigger or exactly equal in size; even a
> disk with one less sector and you're SOL. This is my biggest gripe with zfs
> by far and in fact I just asked freebsd-fs about this less than a week ago
> wondering if it had been addressed finally (it hasn't).

It's possible I've confused this with something else. The person who I
thought told me about this now denies saying anything like that.
However, there are copies of the ZFS label at the end of the drive that
might explain the GPT backup not being overwritten. I have queries in.
The fact that the backup GPT is still present indicates that ZFS has not
written to that area, at least so far, and it should be safe to
overwrite.
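
One way to check that directly is zdb, which reads the four ZFS vdev
labels (two near the front of the device, two near the end); a sketch,
with da1 as a placeholder:

zdb -l /dev/da1
# Should print labels 0 through 3 with matching pool and vdev GUIDs.
# If all four still unpack after the backup GPT header is zeroed, ZFS's
# own end-of-disk metadata was not touched.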

Paul Kraus

Jun 29, 2015, 10:36:00 AM
On Jun 29, 2015, at 10:19, Warren Block <wbl...@wonkity.com> wrote:

> On Sun, 28 Jun 2015, Quartz wrote:
>
>>> Remember, ZFS
>>> leaves space unused at the end of a disk to allow for variations in
>>> nominal disk size.
>>
>> Holy what the heck, no it doesn't! One big issue with zfs is that you CANNOT shrink a pool's size once it's been created, for any reason. You can't remove vdevs, and any replacement disk must be bigger or exactly equal in size; even a disk with one less sector and you're SOL. This is my biggest gripe with zfs by far and in fact I just asked freebsd-fs about this less than a week ago wondering if it had been addressed finally (it hasn't).

I do recall a change in ZFS behavior to leave a very small amount of space unused at the very end of the drive to account for the differences in real sizes between various vendors' drives that were nominally the same size. This only applied if you used the entire disk and did not use any partitioning. This was in both the Solaris and OpenSolaris versions of ZFS, so it predates the fork of the ZFS code.

I have had no issues using disks of different manufacturers and even models within manufacturers (which sometimes do vary in size by a few blocks) as long as they were all the same nominal size (1 TB or 500 GB in my case) and I had handed the entire disk to ZFS and not a partition.

This is NOT an indication of any sort that you can shrink an existing zpool nor does it imply that any given zpool is not writing to certain blocks at the end of the disk, but that the space allocated by the zpool create, when using an entire disk, leaves a little bit of wiggle room at the end that is NOT part of the zpool.

I will see if I can dig up the documentation on this. Note that it is a very small amount as drives of the same nominal capacity vary very little in real capacity.

--
Paul Kraus
pa...@kraus-haus.org

Quartz

Jun 29, 2015, 7:54:47 PM
> I do recall a change in ZFS behavior to leave a very small amount of
> space unused at the very end of the drive to account for the
> differences in real sizes between various vendors drives that were
> nominally the same size. This only applied if you used the entire
> disk and did not use any partitioning. This was in both the Solaris
> and OpenSolaris versions of ZFS, so it predates the fork of the ZFS
> code.
>
> I have had no issues using disks of different manufacturers and even
> models within manufacturers

That runs counter to everything I've ever heard or read. Many people on
all platforms have complained about this issue over the years and tried
to come up with workarounds; there's no shortage of hits if you search
for it. Here are a few random examples:

https://www.mail-archive.com/zfs-d...@opensolaris.org/msg23070.html

https://lists.freebsd.org/pipermail/freebsd-stable/2010-July/057880.html

http://blog.dest-unreach.be/2012/06/30/create-future-proof-zfs-pools

http://www.freebsddiary.org/zfs-with-gpart.php


> I will see if I can dig up the documentation on this.

Please do, because if zfs does have this ability buried somewhere I'd
love to see how and when you can activate it.


>Note that it is
> a very small amount as drives of the same nominal capacity vary very
> little in real capacity.

The second link of the ones I posted above is from a guy with two 1.5TB
drives that vary by one MB. I'm not sure what you're considering
"nominal capacity" in this context, but any margin smaller than that is
probably not useful in practice.

Chris Stankevitz

Jul 15, 2015, 2:29:23 AM
On Sun, Jun 28, 2015 at 5:46 PM, Chris Stankevitz
<chrisst...@gmail.com> wrote:
> Can I use dd on a /dev/daX that is currently part of an
> imported, mounted, and online zpool?

No. If you try to use dd to overwrite the last sector of the drive
while it (the entire drive) is used by zfs, you will get "dd:
/dev/daX: Operation not permitted".
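
That error is GEOM refusing writes to a provider that is already open for
writing (here, by ZFS). Offlining the disk first, as discussed above, is
the cleaner way around it; for completeness, a sketch of the documented
(and dangerous) override, with da1 and the seek value as placeholders:

sysctl kern.geom.debugflags=0x10    # allow writes to open providers ("foot shooting")
dd if=/dev/zero of=/dev/da1 bs=512 seek=3907029167 count=1
sysctl kern.geom.debugflags=0       # re-enable the protection immediately afterwards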