zpool offline not releasing device?


Maximilian Mehnert

Nov 25, 2009, 8:17:44 AM
to zfs-fuse
Hi!
I tried to "zpool offline" a device from a mirrored pool.
It was created with cryptsetup.
"zpool status" shows the device being OFFLINE.
However, "cryptsetup remove" claims that it's still busy.

Did anyone experience something similar and/or has a suggestion for a solution?

Regards,
Maximilian

Seth Heeren

Nov 25, 2009, 8:21:57 AM
to zfs-...@googlegroups.com
I'm not familiar with cryptsetup (anymore), but perhaps you still have an
intermediate device-mapper layer active (sudo dmsetup ls). I don't
actually recall whether cryptsetup was able to directly expose 'raw'
block devices. If I'm not mistaken, cryptsetup conveniently makes the
corresponding dmsetup calls to set things up, but it might not
automatically remove them (in this scenario).

Anyway, something to check: does stopping zfs-fuse confirm that the
busy-reference was indeed only from zfs-fuse?

Maximilian Mehnert

Nov 25, 2009, 9:46:05 AM
to zfs-fuse
Hi, Seth!
Thanks for the reply!

Well, fuser says:

# fuser -v /dev-initrd/mapper/zfs1
Cannot stat file /proc/8864/fd/29: No such file or directory
USER PID ACCESS COMMAND
/dev-initrd/mapper/zfs1:
root 1190 F.... zfs-fuse

so it's definitely zfs that is occupying the device.

dmsetup says:
Name: zfs1
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 1
Number of targets: 1

So there's no other process apart from zfs-fuse.

So I think it's established that zfs-fuse is responsible for
not freeing the device after "zpool offline".



Seth Heeren

Nov 25, 2009, 4:39:44 PM
to zfs-...@googlegroups.com
Problem confirmed. I reproduced it on my lvm2-based system[1]. It gets weirder:

lvcreate -L5g -n z1 media
lvcreate -L5g -n z2 media
cryptsetup -v luksFormat /dev/media/z1
cryptsetup -v luksFormat /dev/media/z2
dmsetup info media-z1
dmsetup info media-z2
zfs-fuse
zpool create crypt mirror /dev/media/z[12]
zpool status
zpool offline crypt /dev/media/z2
zpool status
fuser -v /dev/media/z[12]
lsof /dev/media/z[12]
# both showed both devs in use by zfs-fuse
zfs umount -a
fuser -v /dev/media/z[12]
lsof /dev/media/z[12]
# no change
zpool export crypt
zpool status
fuser -v /dev/media/z[12]
# WEIRD: only z2 still in use

So it definitely looks like a bug in the offlining code, since once offlined, the device is not even closed properly after exporting the pool.

Export works ok in all instances:
# fuser -v /dev/media/z[12]
# zpool import -d /dev/mapper crypt
# fuser -v /dev/media/z[12]
                     GEBRUIKER   PID SOORT PROGRAMMA
/dev/media/z1:       root      10100 F.... zfs-fuse
# zpool status
  pool: crypt
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
 scrub: none requested
config:

    NAME                 STATE     READ WRITE CKSUM
    crypt                DEGRADED     0     0     0
      mirror-0           DEGRADED     0     0     0
        mapper/media-z1  ONLINE       0     0     0
        mapper/media-z2  OFFLINE      0     0     0

errors: No known data errors
# zpool export crypt
# fuser -v /dev/media/z[12]
I'm getting a suspicion that this may be broken due to symlinked alternative names for the device nodes. Here's why:
# zpool import -d /dev/mapper crypt
# zpool status
  pool: crypt
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
 scrub: none requested
config:

    NAME                 STATE     READ WRITE CKSUM
    crypt                DEGRADED     0     0     0
      mirror-0           DEGRADED     0     0     0
        mapper/media-z1  ONLINE       0     0     0
        mapper/media-z2  OFFLINE      0     0     0

errors: No known data errors
# zpool online crypt /dev/media/z2
cannot online /dev/media/z2: no such device in pool
# zpool online crypt /dev/mapper/media-z2
# zpool status
  pool: crypt
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Nov 25 22:35:58 2009
config:

    NAME                 STATE     READ WRITE CKSUM
    crypt                ONLINE       0     0     0
      mirror-0           ONLINE       0     0     0
        mapper/media-z1  ONLINE       0     0     0
        mapper/media-z2  ONLINE       0     0     0  16,5K resilvered

errors: No known data errors

As you can see, I could not online my 2nd mirror volume _unless_ I used the 'realpath' (canonical?) spelling for it (/dev/mapper/media*).
Now this is just a hunch. Let me know if it works out: (brb, kids calling)

[1] (this is with a dev version based on rev 0f8387bd0b1c016b017a7e4885328695b71a5d6c from Emmanuel's repo)

Rudd-O

Nov 25, 2009, 5:19:47 PM
to zfs-...@googlegroups.com
Emmanuel, can you chime in? It seems like the device file is never
getting closed.

I would like to see some sort of syslog trace/debug info when we do
operations like these, so we can actually see what ZFS-FUSE is doing
behind the scenes, and ask our users for these debug logs to know what
was happening. This is very easy to do in Python (just wrap the
relevant functions/methods with logging decorators); any ideas on how
best to go about this in our code?
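For what it's worth, the decorator approach could look something like this minimal Python sketch (zpool_offline here is a hypothetical stand-in, not a real zfs-fuse entry point):

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("zfs-fuse-trace")

def traced(fn):
    """Log entry and exit of every call, with arguments and result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.debug("enter %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.debug("exit  %s -> %r", fn.__name__, result)
        return result
    return wrapper

@traced
def zpool_offline(pool, device):
    # hypothetical stand-in for a real zfs-fuse operation
    return 0
```

Every call then leaves an enter/exit pair in the log, which is exactly the kind of trace we could ask users to paste into bug reports.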

Seth Heeren

Nov 25, 2009, 5:21:04 PM
to zfs-...@googlegroups.com
Strike that, there is no difference:

root@karmic:~# zpool export crypt; sleep 1; killall zfs-fuse; sleep 5; zfs-fuse; sleep 3; zpool status
root@karmic:~# zpool import -d /dev/media crypt
root@karmic:~# zpool online crypt /dev/media/z2
root@karmic:~# zpool offline crypt /dev/media/z2
root@karmic:~# fuser -v /dev/media/z[12]
GEBRUIKER PID SOORT PROGRAMMA
/dev/media/z1: root 11392 F.... zfs-fuse
/dev/media/z2: root 11392 F.... zfs-fuse


Contrast with:

root@karmic:~# zpool export crypt; sleep 1; killall zfs-fuse; sleep 5; zfs-fuse; sleep 3; zpool status
root@karmic:~# zpool import -d /dev/mapper crypt
root@karmic:~# zpool online crypt /dev/mapper/media-z2
root@karmic:~# zpool offline crypt /dev/mapper/media-z2
root@karmic:~# fuser -v /dev/mapper/media-z[12]
GEBRUIKER PID SOORT PROGRAMMA
/dev/mapper/media-z1: root 11392 F.... zfs-fuse
/dev/mapper/media-z2: root 11392 F.... zfs-fuse

I'm looking on....

Emmanuel Anne

Nov 25, 2009, 6:47:02 PM
to zfs-...@googlegroups.com
I am currently busy trying to make ztest work again with the latest version.
I am making some slow and painful progress, but I am not going to tell the whole story before it's over, and it's not over yet!

Also, I don't have any experience using this setup to encrypt data.
You know, crypt support was added directly to ZFS; it must be a property or something, but maybe it's not documented because it's very recent. Maybe it would be worth investigating the ZFS-native solution for that...

But yes, there is probably a bug somewhere there; I'll look into it after I have finished with ztest, if nobody has found a fix before then...

I'll just say that the new ztest tests are really serious: they start 23 threads in parallel by default to create load (read and write operations at the same time), and then expect the commits to be done at the right time. Obviously, with a solution outside of the kernel we have a problem with this... So it's slow to test, and slow to improve, but I still hope to see the end of it soon!




--
zfs-fuse git repository : http://rainemu.swishparty.co.uk/cgi-bin/gitweb.cgi?p=zfs;a=summary

Seth Heeren

Nov 25, 2009, 6:53:04 PM
to zfs-...@googlegroups.com
Emmanuel Anne wrote:
> + I don't have an experience on using this setup to encrypt data.
> You know crypt support was added directly to zfs, it must be a
> property or something, but maybe it's not documented because it's very
> recent. Maybe it would be worth investigating using the zfs native
> solution for that...
>
> But yes there is probably a bug somewhere there, I'll look into it
> after I have finished with ztest, if nobody has found a fix before that...
Too late! I just published one, see other post

Rudd-O

Nov 25, 2009, 6:53:32 PM
to zfs-...@googlegroups.com
OK, I am filing a bug about the crypt setup.

Rudd-O

Nov 25, 2009, 6:55:16 PM
to zfs-...@googlegroups.com
Done.

http://zfs-fuse.net/issues/4/view

Everyone add more info on the bug.


Rudd-O

Nov 25, 2009, 6:55:55 PM
to zfs-fuse
What other post?

Link?

Seth Heeren

Nov 25, 2009, 6:58:09 PM
to zfs-...@googlegroups.com
So here is my first ever real patch to zfs-fuse (huzah!):

It appears this has been broken by a commit on Nov 9th[1]. That's how I know you're running a fairly unstable version of zfs-fuse (http://git.zfs-fuse.net/official doesn't have it). I've published an updated branch based on Emmanuel's 'stable' branch (as of today) with just this single fix, in case you want to test:

    http://zfs-fuse.sehe.nl/?p=zfs-fuse

You can easily get it by

    git clone git://zfs-fuse.sehe.nl/git/zfs-fuse

or, if you already have your setup, I'd rather you do something like

    git remote add sehe git://zfs-fuse.sehe.nl/git/zfs-fuse
    git remote update
    git checkout -b sehe sehe/stable

Cheers,
Seth  
====================================
[1] hg commit 10850:6840704 osol_0906 PV guests sometimes hang at login prompt
====================================

So here's the run-down:

After a dozen (or so) checks, vdev.c marks the device as vdev_offline (L:2222) and tries to vdev_reopen() the toplevel vdev.

Inside vdev_reopen(), somewhere along the way we end up in vdev_file_close() with:
Breakpoint 3, vdev_file_close (vd=0xb6f6d300) at lib/libzpool/vdev_file.c:119
119        vdev_file_t *vf = vd->vdev_tsd;
2: vd->vdev_path = 0xb7e07ec0 "/dev/mapper/media-z2"
1: vd->vdev_reopening = B_TRUE
4: vd->vdev_offline = 1
Note the reopening flag, which mildly contradicts the offline flag. From the actual onnv-gate code it is pretty clear that even in upstream (i.e. OpenSolaris) ZFS will skip the close on the device, since the toplevel device (the mirror) is actually being reopened. This is the bug.

I found the spot to 'fix' this by tracing the caller of this function:
Breakpoint 6, vdev_close (vd=0xb6f6d300) at lib/libzpool/vdev.c:1376
1376        if (pvd != NULL && pvd->vdev_reopening)
11: vd->vdev_offline = 1
10: pvd->vdev_offline = 0
9: pvd->vdev_reopening = B_TRUE
6: vd->vdev_path = 0xb7e07ec0 "/dev/mapper/media-z2"
5: vd->vdev_reopening = B_FALSE
It seemed obvious to say:

        if (pvd != NULL && pvd->vdev_reopening)
-               vd->vdev_reopening = pvd->vdev_reopening;
+       {
+               /* avoid reopening the vdev if it should go offline */
+               vd->vdev_reopening = vd->vdev_offline ? B_FALSE : pvd->vdev_reopening;
+       }
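To illustrate the intent, here is a toy Python model of the flag bookkeeping (this is not the actual C code; the attribute names mirror the vdev fields, but the structure is heavily simplified):

```python
# Toy model of the vdev open/close bookkeeping during "zpool offline";
# NOT the real C code, just the flag logic the one-line diff changes.

class Vdev:
    def __init__(self, parent=None):
        self.parent = parent       # pvd in the C code
        self.offline = False       # vdev_offline
        self.reopening = False     # vdev_reopening
        self.open = True           # device file held open

def vdev_close(vd, patched):
    pvd = vd.parent
    if pvd is not None and pvd.reopening:
        if patched:
            # the fix: an offlined child must not be treated as "reopening"
            vd.reopening = False if vd.offline else pvd.reopening
        else:
            vd.reopening = pvd.reopening
    if vd.reopening:
        return              # skip the real close; device stays busy
    vd.open = False         # release the underlying device file

def offline_child(child, patched):
    child.offline = True
    child.parent.reopening = True    # the parent mirror is being reopened
    vdev_close(child, patched)
    child.parent.reopening = False
```

Unpatched, the offlined child inherits the parent's reopening flag, skips its close, and keeps the device file busy; with the fix, it actually closes.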


This patch indeed fixes it for me:

root@karmic:~# zpool import -d /dev/mapper/ crypt

root@karmic:~# zpool online crypt /dev/mapper/media-z2
root@karmic:~# fuser -v /dev/mapper/media-z?

                     GEBRUIKER   PID SOORT PROGRAMMA
/dev/mapper/media-z1:
                     root      15249 F.... zfs-fuse
/dev/mapper/media-z2:
                     root      15249 F.... zfs-fuse

root@karmic:~# zpool offline crypt /dev/mapper/media-z2
root@karmic:~# fuser -v /dev/mapper/media-z?

                     GEBRUIKER   PID SOORT PROGRAMMA
/dev/mapper/media-z1:
                     root      15249 F.... zfs-fuse
I'd be very interested in reporting this upstream, because I'm not actually sure whether this is safe to do with all vdev layouts (apparently, vdevs can be nested multiple levels deep, and whole groups of underlying vdevs can be offlined at once; things might break with this simple patch). If someone knows the intricacies of this, please do let me know.

Does anybody know how to send our fixes upstream?

Seth Heeren

Nov 25, 2009, 7:02:59 PM
to zfs-...@googlegroups.com
Rudd-O wrote:
> What other post?
>
> Link?
>
Patience?

Rudd-O

Nov 25, 2009, 7:05:49 PM
to zfs-...@googlegroups.com
Seth,

Let's not close the bug until someone else has actually merged your
changes (logically, I cannot merge your tree into mine since I am way
behind, so until your bug fix has propagated into Emmanuel's tree and/or
the other way around, let's keep the bug outstanding). Or, we can leave
the bug closed AND merge the fix ASAP.

Seth Heeren

Nov 25, 2009, 7:05:56 PM
to zfs-...@googlegroups.com
Rudd-O wrote:
> Done.
>
> http://zfs-fuse.net/issues/4/view
>
> Everyone add more info on the bug.
>
Resolved. Now how's that for turnaround time!

I also added links + steps to reproduce

Cheers,
Seth

Rudd-O

Nov 25, 2009, 7:09:56 PM
to zfs-...@googlegroups.com
Awesome turnaround time. Mad Propz!

Seth Heeren

Nov 25, 2009, 7:11:08 PM
to zfs-...@googlegroups.com
Good point to raise.

In that case, let's _not_ mark it fixed until it hits the (your) official repo. We'd be confusing people both ways (my repo or Emmanuel's... it's the same deal: neither is official).

So if _you_ could merge from my stable, you'd have exactly Emmanuel's stable branch (to date) plus my fix.

$0.02

Rudd-O

Nov 25, 2009, 7:23:48 PM
to zfs-...@googlegroups.com
I'd rather Emmanuel merged your fix into his repo, so you guys can
continue moving forward without much history intermix. Then I can merge
it into main. What we can do is keep the bug open, yes.

Seth Heeren

Nov 25, 2009, 7:28:04 PM
to zfs-...@googlegroups.com
Rudd-O wrote:
> I'd rather Emmanuel merged your fix into his repo, so you guys can
> continue moving forward without much history intermix. Then I can merge
> it into main. What we can do is keep the bug open, yes.
>
The way I recall it, you were asking Emmanuel to arrange a stable branch
for you to merge from.
It is there, and it is the one I based my fix on.

Any history intermix is (a) not relevant to git and (b) a matter of rebase, no?

Anyways, the real issue is whether the OP is content with the
availability of the fix in this way [ <--- invites comment :) ]

TO ALL DEVELOPERS: I'll be making my public fixes available on my
sehe/stable branch. Emmanuel, please do merge (or review) everything I
post on there (it won't be much anyway... still no time).


Maximilian Mehnert

Nov 25, 2009, 8:01:51 PM
to zfs-fuse
This is incredibly cool! Thanks a lot!
Patch applied and tested.
Final test will be tomorrow morning, when resilvering is finished ;-)



Emmanuel Anne

Nov 26, 2009, 10:30:09 AM
to zfs-...@googlegroups.com
Eh, congratulations, Seth, I knew you were going to send some patches one day! ;-)
Well, I promise to look into that asap, but it will probably not be before tomorrow, sorry...

Meanwhile, I have just one question: why is vdev_reopening set to true?
Maybe it's something which works differently in Linux compared to OpenSolaris?
Anyway, I'll look into this in more detail later. Thanks for the patch, it looks interesting!




--~--~---------~--~----~------------~-------~--~----~
To post to this group, send email to zfs-...@googlegroups.com
To visit our Web site, click on http://zfs-fuse.net/
-~----------~----~----~----~------~----~------~--~---

Seth

Nov 26, 2009, 10:39:42 AM
to zfs-...@googlegroups.com
Emmanuel Anne wrote:
> Eh congratulations, Seth, I knew you were going to send some patches
> one day ! ;-)
> Well I promise to look into that asap, but it will probably not be
> before tomorrow, sorry...
>
> Meanwhile, I have just one question : why is vdev_reopening set to true ?
> Maybe it's something which works differently in linux compared to
> opensolaris ?
> Anyway I'll look into this in more detail later, thanks for the patch,
> it looks interesting !
reopening is true by design. When offlining a vdev (e.g. a 'leg' of a
mirror), the entire _parent_ vdev (tree) is reopened. This is in order to
detect any sanity problems (e.g. slog vdevs cannot in all circumstances
be reopened in this way, in which case vdev_close will roll back and
return E_BUSY after all).

So what it does is:

1. mark the vdev as offline
2. reopen the parent vdev(s) (these will make sure that the offlined device is skipped)
3. verify status (sanity check)
4. return

Seth Heeren

Nov 26, 2009, 3:49:17 PM
to zfs-...@googlegroups.com
Ok gents,

I have tested some more with my now-famous dorky script :) It'll concoct multi-terabyte beasts of pools in mere seconds out of thin air. What's more, it'll create the device files for you on the fly. So here goes:
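The attachment itself isn't inlined in this archive, so here's a hypothetical Python sketch of the core trick such a script presumably relies on (sparse backing files; the zA1-style names follow the transcripts below, everything else is made up):

```python
import os

def make_sparse_vdevs(directory, rows, cols, size=512 * 1024 ** 3):
    """Create sparse backing files named zA1, zA2, ..., zB1, ...

    Sparse files allocate no blocks until written, so even huge
    'devices' appear instantly and cost almost no disk space.
    """
    paths = []
    for r in rows:
        for c in cols:
            path = os.path.join(directory, "z%s%s" % (r, c))
            with open(path, "wb") as f:
                f.truncate(size)  # set logical size without writing data
            paths.append(path)
    return paths
```

A pool would then be built over them with something like `zpool create dorky mirror /tmp/dorky_blk/zA*` (an assumption; the real script isn't shown here).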

# ./dorky.sh mirror "A" "1 2"
forced clean relaunch of zfs-fuse
  pool: dorky
 state: ONLINE

 scrub: none requested
config:

    NAME                    STATE     READ WRITE CKSUM
    dorky                   ONLINE       0     0     0

      mirror-0              ONLINE       0     0     0
        /tmp/dorky_blk/zA1  ONLINE       0     0     0
        /tmp/dorky_blk/zA2  ONLINE       0     0     0


errors: No known data errors
NAME    USED  AVAIL  REFER  MOUNTPOINT
dorky  82,5K   488G    21K  /tmp/dorky
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
2
# zpool offline dorky /tmp/dorky_blk/z?1
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
1

===============================================================

# ./dorky.sh mirror "A B C D E F G H I" "1 2 3 4 5 6 7 8"
cannot open 'dorky': no such pool
forced clean relaunch of zfs-fuse
  pool: dorky
 state: ONLINE

 scrub: none requested
config:

    NAME                    STATE     READ WRITE CKSUM
    dorky                   ONLINE       0     0     0

      mirror-0              ONLINE       0     0     0
        /tmp/dorky_blk/zA1  ONLINE       0     0     0
        /tmp/dorky_blk/zA2  ONLINE       0     0     0
        /tmp/dorky_blk/zA3  ONLINE       0     0     0
... snip ...
      mirror-8              ONLINE       0     0     0
        /tmp/dorky_blk/zI1  ONLINE       0     0     0
        /tmp/dorky_blk/zI2  ONLINE       0     0     0
        /tmp/dorky_blk/zI3  ONLINE       0     0     0
        /tmp/dorky_blk/zI4  ONLINE       0     0     0
        /tmp/dorky_blk/zI5  ONLINE       0     0     0
        /tmp/dorky_blk/zI6  ONLINE       0     0     0
        /tmp/dorky_blk/zI7  ONLINE       0     0     0
        /tmp/dorky_blk/zI8  ONLINE       0     0     0


errors: No known data errors
NAME    USED  AVAIL  REFER  MOUNTPOINT
dorky   100K  4,29T    21K  /tmp/dorky
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
72
# zpool offline dorky /tmp/dorky_blk/z?[1345678]
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
9
# zpool online dorky /tmp/dorky_blk/z?[1345678]
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
72

======================================================

# ./dorky.sh raidz3 "A B C D E F G H I" "1 2 3 4 5 6 7 8"
forced clean relaunch of zfs-fuse
  pool: dorky
 state: ONLINE

 scrub: none requested
config:

    NAME                    STATE     READ WRITE CKSUM
    dorky                   ONLINE       0     0     0
      raidz3-0              ONLINE       0     0     0
        /tmp/dorky_blk/zA1  ONLINE       0     0     0
        /tmp/dorky_blk/zA2  ONLINE       0     0     0
        /tmp/dorky_blk/zA3  ONLINE       0     0     0
... snip ...
      raidz3-8              ONLINE       0     0     0
        /tmp/dorky_blk/zI1  ONLINE       0     0     0
        /tmp/dorky_blk/zI2  ONLINE       0     0     0
        /tmp/dorky_blk/zI3  ONLINE       0     0     0
        /tmp/dorky_blk/zI4  ONLINE       0     0     0
        /tmp/dorky_blk/zI5  ONLINE       0     0     0
        /tmp/dorky_blk/zI6  ONLINE       0     0     0
        /tmp/dorky_blk/zI7  ONLINE       0     0     0
        /tmp/dorky_blk/zI8  ONLINE       0     0     0


errors: No known data errors
NAME    USED  AVAIL  REFER  MOUNTPOINT
dorky   186K  21,5T  52,2K  /tmp/dorky
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
72
# zpool offline dorky /tmp/dorky_blk/z?[158]
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
45
# zpool online dorky /tmp/dorky_blk/z?[18]
# fuser /tmp/dorky_blk/z* 2>&1 | wc -l
63

===============================================================
Attachment: dorky.sh