
Re: Attempting to roll back zfs transactions on a disk to recover a destroyed ZFS filesystem


Alan Somers

Jul 11, 2013, 11:04:33 AM
"zpool export" does not wipe the transaction history. It does,
however, write new labels and some metadata, so there is a very slight
chance that it might overwrite some of the blocks that you're trying
to recover. But it's probably safe. An alternative, much more
complicated, solution would be to have ZFS open the device
non-exclusively. This patch will do that. Caveat programmer: I
haven't tested this patch in isolation.

Change 624068 by willa@willa_SpectraBSD on 2012/08/09 09:28:38

Allow multiple opens of geoms used by vdev_geom.
Also ignore the pool guid for spares when checking to decide whether
it's ok to attach a vdev.

This enables using hotspares to replace other devices, as well as
using a given hotspare in multiple pools.

We need to investigate alternative solutions in order to allow
opening the geoms exclusive.

Affected files ...

... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c#2
edit

Differences ...

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c#2
(text) ====

@@ -179,49 +179,23 @@
gp = g_new_geomf(&zfs_vdev_class, "zfs::vdev");
gp->orphan = vdev_geom_orphan;
gp->attrchanged = vdev_geom_attrchanged;
- cp = g_new_consumer(gp);
- error = g_attach(cp, pp);
- if (error != 0) {
- printf("%s(%d): g_attach failed: %d\n", __func__,
- __LINE__, error);
- g_wither_geom(gp, ENXIO);
- return (NULL);
- }
- error = g_access(cp, 1, 0, 1);
- if (error != 0) {
- printf("%s(%d): g_access failed: %d\n", __func__,
- __LINE__, error);
- g_wither_geom(gp, ENXIO);
- return (NULL);
- }
- ZFS_LOG(1, "Created geom and consumer for %s.", pp->name);
- } else {
- /* Check if we are already connected to this provider. */
- LIST_FOREACH(cp, &gp->consumer, consumer) {
- if (cp->provider == pp) {
- ZFS_LOG(1, "Provider %s already in use by ZFS. "
- "Failing attach.", pp->name);
- return (NULL);
- }
- }
- cp = g_new_consumer(gp);
- error = g_attach(cp, pp);
- if (error != 0) {
- printf("%s(%d): g_attach failed: %d\n",
- __func__, __LINE__, error);
- g_destroy_consumer(cp);
- return (NULL);
- }
- error = g_access(cp, 1, 0, 1);
- if (error != 0) {
- printf("%s(%d): g_access failed: %d\n",
- __func__, __LINE__, error);
- g_detach(cp);
- g_destroy_consumer(cp);
- return (NULL);
- }
- ZFS_LOG(1, "Created consumer for %s.", pp->name);
+ }
+ cp = g_new_consumer(gp);
+ error = g_attach(cp, pp);
+ if (error != 0) {
+ printf("%s(%d): g_attach failed: %d\n", __func__,
+ __LINE__, error);
+ g_wither_geom(gp, ENXIO);
+ return (NULL);
+ }
+ error = g_access(cp, /*r*/1, /*w*/0, /*e*/0);
+ if (error != 0) {
+ printf("%s(%d): g_access failed: %d\n", __func__,
+ __LINE__, error);
+ g_wither_geom(gp, ENXIO);
+ return (NULL);
}
+ ZFS_LOG(1, "Created consumer for %s.", pp->name);

cp->private = vd;
vd->vdev_tsd = cp;
@@ -251,7 +225,7 @@
cp->private = NULL;

gp = cp->geom;
- g_access(cp, -1, 0, -1);
+ g_access(cp, -1, 0, 0);
/* Destroy consumer on last close. */
if (cp->acr == 0 && cp->ace == 0) {
ZFS_LOG(1, "Destroyed consumer to %s.", cp->provider->name);
@@ -384,6 +358,18 @@
cp->provider->name);
}

+static inline boolean_t
+vdev_attach_ok(vdev_t *vd, uint64_t pool_guid, uint64_t vdev_guid)
+{
+ boolean_t pool_ok;
+ boolean_t vdev_ok;
+
+ /* Spares can be assigned to multiple pools. */
+ pool_ok = vd->vdev_isspare || pool_guid == spa_guid(vd->vdev_spa);
+ vdev_ok = vdev_guid == vd->vdev_guid;
+ return (pool_ok && vdev_ok);
+}
+
static struct g_consumer *
vdev_geom_attach_by_guids(vdev_t *vd)
{
@@ -420,8 +406,7 @@
g_topology_lock();
g_access(zcp, -1, 0, 0);
g_detach(zcp);
- if (pguid != spa_guid(vd->vdev_spa) ||
- vguid != vd->vdev_guid)
+ if (!vdev_attach_ok(vd, pguid, vguid))
continue;
cp = vdev_geom_attach(pp, vd);
if (cp == NULL) {
@@ -498,8 +483,10 @@
g_topology_unlock();
vdev_geom_read_guids(cp, &pguid, &vguid);
g_topology_lock();
- if (pguid != spa_guid(vd->vdev_spa) ||
- vguid != vd->vdev_guid) {
+ if (vdev_attach_ok(vd, pguid, vguid)) {
+ ZFS_LOG(1, "guids match for provider %s.",
+ vd->vdev_path);
+ } else {
vdev_geom_close_locked(vd);
cp = NULL;
ZFS_LOG(1, "guid mismatch for provider %s: "
@@ -507,9 +494,6 @@
(uintmax_t)spa_guid(vd->vdev_spa),
(uintmax_t)vd->vdev_guid,
(uintmax_t)pguid, (uintmax_t)vguid);
- } else {
- ZFS_LOG(1, "guids match for provider %s.",
- vd->vdev_path);
}
}
}
@@ -601,8 +585,8 @@
g_topology_lock();
}
if (error != 0) {
- printf("ZFS WARNING: Unable to open %s for
writing (error=%d).\n",
- vd->vdev_path, error);
+ printf("ZFS WARNING: Error %d opening %s for write.\n",
+ error, vd->vdev_path);
vdev_geom_close_locked(vd);
cp = NULL;

On Thu, Jul 11, 2013 at 8:43 AM, Reid Linnemann <linne...@gmail.com> wrote:
> So recently I was trying to transfer a root-on-ZFS zpool from one pair of
> disks to a single, larger disk. As I am wont to do, I botched the transfer
> up and decided to destroy the ZFS filesystems on the destination and start
> again. Naturally I was up late working on this, being sloppy and drowsy
> without any coffee, and lo and behold I issued my 'zfs destroy -R' and
> immediately realized after pressing [ENTER] that I had given it the
> source's zpool name. oops. Fortunately I was able to interrupt the
> procedure with only /usr being destroyed from the pool and I was able to
> send/receive the truly vital data in my /var partition to the new disk and
> re-deploy the base system to /usr on the new disk. The only thing I'm
> really missing at this point is all of the third-party software
> configuration I had in /usr/local/etc and my apache data in /usr/local/www.
>
> After a few minutes on Google I came across this wonderful page:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script
>
> where the author has published information about his python script which
> locates the uberblocks on the raw disk and shows the user the most recent
> transaction IDs, prompts the user for a transaction ID to roll back to, and
> zeroes out all uberblocks beyond that point. Theoretically, I should be
> able to use this script to get back to the transaction prior to my dreaded
> 'zfs destroy -R', then be able to recover the data I need (since no further
> writes have been done to the source disks).
>
> First, I know there's a problem in the script on FreeBSD in which the grep
> pattern for the od output expects a single space between the output
> elements. I've attached a patch that allows the output to be properly
> grepped in FreeBSD, so we can actually get to the transaction log.
>
> But now we are to my current problem. When attempting to roll back with
> this script, it tries to dd zero'd bytes to offsets into the disk device
> (/dev/ada1p3 in my case) where the uberblocks are located. But even
> with kern.geom.debugflags
> set to 0x10 (and I am running this as root) I get 'Operation not permitted'
> when the script tries to zero out the unwanted transactions. I'm fairly
> certain this is because the geom is in use by the ZFS subsystem, as it is
> still recognized as a part of the original pool. I'm hesitant to zfs export
> the pool, as I don't know if that wipes the transaction history on the
> pool. Does anyone have any ideas?
>
> Thanks,
> -Reid
>
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Will Andrews

Jul 11, 2013, 11:59:05 AM
On Thu, Jul 11, 2013 at 9:04 AM, Alan Somers <aso...@freebsd.org> wrote:
> "zpool export" does not wipe the transaction history. It does,
> however, write new labels and some metadata, so there is a very slight
> chance that it might overwrite some of the blocks that you're trying
> to recover. But it's probably safe. An alternative, much more
> complicated, solution would be to have ZFS open the device
> non-exclusively. This patch will do that. Caveat programmer: I
> haven't tested this patch in isolation.

This change is quite a bit more than necessary, and probably wouldn't
apply to FreeBSD given the other changes in the code. Really, to make
non-exclusive opens you just have to change the g_access() calls in
vdev_geom.c so the third argument is always 0.

However, see below.

> On Thu, Jul 11, 2013 at 8:43 AM, Reid Linnemann <linne...@gmail.com> wrote:
>> But now we are to my current problem. When attempting to roll back with
>> this script, it tries to dd zero'd bytes to offsets into the disk device
>> (/dev/ada1p3 in my case) where the uberblocks are located. But even
>> with kern.geom.debugflags
>> set to 0x10 (and I am running this as root) I get 'Operation not permitted'
>> when the script tries to zero out the unwanted transactions. I'm fairly
>> certain this is because the geom is in use by the ZFS subsystem, as it is
>> still recognized as a part of the original pool. I'm hesitant to zfs export
>> the pool, as I don't know if that wipes the transaction history on the
>> pool. Does anyone have any ideas?

You do not have a choice. Changing the on-disk state does not mean
the in-core state will update to match, and the pool could get into a
really bad state if you try to modify the transactions on disk while
it's online, since it may write additional transactions (which rely on
state you're about to destroy), before you export.

Also, rolling back transactions in this manner assumes that the
original blocks (that were COW'd) are still in their original state.
If you're using TRIM or have a pretty full pool, the odds are not in
your favor. It's a roll of the dice, in any case.

--Will.

Reid Linnemann

Jul 11, 2013, 12:05:37 PM
Will,

Thanks, that makes sense. I know this is all a crap shoot, but I've really
got nothing to lose at this point, so this is just a good opportunity to
rummage around the internals of ZFS and learn a few things. I might even
get lucky and recover some data!

Volodymyr Kostyrko

Jul 12, 2013, 8:33:32 AM
On 11.07.2013 17:43, Reid Linnemann wrote:

> So recently I was trying to transfer a root-on-ZFS zpool from one pair of
> disks to a single, larger disk. As I am wont to do, I botched the transfer
> up and decided to destroy the ZFS filesystems on the destination and start
> again. Naturally I was up late working on this, being sloppy and drowsy
> without any coffee, and lo and behold I issued my 'zfs destroy -R' and
> immediately realized after pressing [ENTER] that I had given it the
> source's zpool name. oops. Fortunately I was able to interrupt the
> procedure with only /usr being destroyed from the pool and I was able to
> send/receive the truly vital data in my /var partition to the new disk and
> re-deploy the base system to /usr on the new disk. The only thing I'm
> really missing at this point is all of the third-party software
> configuration I had in /usr/local/etc and my apache data in /usr/local/www.

You can try to experiment with zpool hidden flags. Look at this command:

zpool import -N -o readonly=on -f -R /pool <pool>

It will try to import the pool in read-only mode so no data is written
to it. It also doesn't mount anything on import, so if any fs is damaged
you have less chance of triggering a coredump. Also, zpool import has a
hidden -T switch that gives you the ability to select the transaction
that you want to try to restore. You'll need a list of the available
transactions though:

zdb -ul <vdev>

This one, when given a vdev, lists all uberblocks with their respective
transaction ids. You can take the highest one (it's not necessarily the
last one listed) and try to import the pool with:

zpool import -N -o readonly=on -f -R /pool -F -T <transaction_id> <pool>

Then check available filesystems.
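
Putting those pieces together, a rough sketch of the whole sequence (purely
illustrative, using the same placeholders as above):

# list every uberblock on one of the pool's disks; note the "txg" and
# "timestamp" fields and pick a txg from before the damage was done
zdb -ul <vdev>

# read-only, no-mount import rolled back to that txg
zpool import -N -o readonly=on -f -R /pool -F -T <transaction_id> <pool>

# see which datasets came back
zfs list -r <pool>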

--
Sphinx of black quartz, judge my vow.

Reid Linnemann

Jul 12, 2013, 10:41:56 AM
Hey presto!

/> zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
bucket           485G  1.30T   549M  legacy
bucket/tmp        21K  1.30T    21K  legacy
bucket/usr      29.6G  1.30T  29.6G  /mnt/usr
bucket/var       455G  1.30T  17.7G  /mnt/var
bucket/var/srv   437G  1.30T   437G  /mnt/var/srv

There's my old bucket! Thanks much for the hidden -T argument, Volodymyr!
Now I can get back the remainder of my missing configuration.
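
For the record, since the pool was imported with -N nothing gets mounted
automatically, so what I'm doing next is roughly this (the destination path
is just an example, not my real one):

# non-legacy datasets mount at their recorded mountpoints even on a
# read-only pool
zfs mount bucket/usr

# copy the surviving config and web data over to the new disk
cp -Rp /mnt/usr/local/etc /path/to/new/usr/local/
cp -Rp /mnt/usr/local/www /path/to/new/usr/local/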

Stefan Esser

Jul 14, 2013, 7:46:03 AM
On 12.07.2013 14:33, Volodymyr Kostyrko wrote:
> You can try to experiment with zpool hidden flags. Look at this command:
>
> zpool import -N -o readonly=on -f -R /pool <pool>
>
> It will try to import the pool in read-only mode so no data is written
> to it. It also doesn't mount anything on import, so if any fs is damaged
> you have less chance of triggering a coredump. Also, zpool import has a
> hidden -T switch that gives you the ability to select the transaction
> that you want to try to restore. You'll need a list of the available
> transactions though:
>
> zdb -ul <vdev>
>
> This one, when given a vdev, lists all uberblocks with their respective
> transaction ids. You can take the highest one (it's not necessarily the
> last one listed) and try to import the pool with:
>
> zpool import -N -o readonly=on -f -R /pool -F -T <transaction_id> <pool>

I had good luck with ZFS recovery with the following approach:

1) Use zdb to identify a TXG for which the data structures are intact

2) Select recovery mode by loading the ZFS KLD with "vfs.zfs.recover=1"
set in /boot/loader.conf

3) Import the pool with the above -T option referring to a suitable TXG
found with the help of zdb.

The zdb commands to use are:

# zdb -AAA -L -t <TXG> -bcdmu <POOL>

(Both -AAA and -L reduce the amount of consistency checking performed.
A pool (at TXG) that needs these options to allow zdb to succeed is
damaged, but may still allow recovery of most or all files. Be sure
to only import that pool R/O, or your data will probably be lost!)

A list of TXGs to try can be retrieved with "zdb -hh <POOL>".

You may need to add "-e" to the list of zdb options, since the pool is
exported / not currently mounted.
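
Pulled together, an untested outline of the above (<POOL> and <TXG> are
placeholders, and the -R path is arbitrary):

# /boot/loader.conf -- enable recovery mode before zfs.ko is loaded
vfs.zfs.recover=1

# list candidate TXGs for the exported pool
zdb -e -hh <POOL>

# check the data structures at the chosen TXG, with reduced checking
zdb -e -AAA -L -t <TXG> -bcdmu <POOL>

# if that looks sane, import read-only at the same TXG
zpool import -N -f -o readonly=on -R /recovery -F -T <TXG> <POOL>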

Regards, STefan

yee...@gmail.com

Sep 6, 2013, 6:56:29 AM
thank gawd for this thread! doing an all nighter, i too managed to destroy a filesystem, although my mistake was more sinister:

$ time zfs send -R -i stream@2013-09-04 stream@2013-09-05 | zfs recv -Fduv tank

which then proceeded with ....

umount2: Device or resource busy
umount: /var/lib/mysql: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
umount2: Device or resource busy
cannot unmount '/var/lib/mysql': umount failed
attempting destroy tank/data
success

^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C

!!!

why does destroy finish so quick!! ... tank/data is ~3TB of data! :(

so i immediately exported the entire pool as i slowly saw the free space increasing.... then i found this posting...

so from reading the various posts, is it really as simple as using zdb to determine the last transaction (txg) and importing it with the -T flag?? then simply cp the data over to somewhere else?
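
something like this, i guess (the txg and the paths here are just made up):

# read-only import rolled back to a txg from before the destroy
zpool import -d /dev/disk/by-id -o readonly=on -N -f -T <txg> tank

# mount the dataset i care about and copy whatever is readable elsewhere
zfs mount tank/data
cp -a /path/where/tank/data/mounts/. /some/other/disk/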

i'm currently running:

$ zpool import -d /dev/disk/by-id -o readonly=on -N -f tank

however it's taking a very long time.... i'm happy to wait as long as i can recover (most of) my data back! i'm guessing that it's slowly deleting all the files...? (which, given that it's read only, means that it will repeat this for every import with a new txg?)

assuming this all works: i don't have enough disks around to dump the data from the pool, so would it be okay to just import my pool with a suitable -T <txg> as read-write and just continue as if nothing had happened? or should i really start on a fresh pool?