[PATCH] Allow userspace block device implementation

Zachary Amsden

Jul 27, 2009, 6:10:13 AM

Well, it may be a good, bad, idiotic or brilliant idea depending on your
personal philosophy. I went down this route out of pragmatism.
Hopefully I have not fully re-invented the wheel.

The patch included allows one to implement a kernel level block device
in userspace, using an ioctl() based interface to create a sized device
with given properties, and then receive and respond to bio requests
issued to the device. One can poll on the associated control socket to
allow efficient servicing of device requests. So far only strict copy
to/from user memory is supported, there is no fancy page flipping or
mapping operations.

Which there probably should not be. This device is not about
performance; it is about extending the boundaries of the kernel to the
almost improbable. Now one can literally create any kind of device
imaginable and use it as a block device in the kernel, mounting
partitions and such and using them as if they existed natively. I have
attached a very simple dummy program showing how to do this.

To me, the design requirement 'kernel block device in user space'
demanded that the interface be stateless. Userspace can crash, be
killed, or interrupted. Block devices cannot; they must answer all
requests, even if that answer is a failure. Thus there exists no state
between the kernel and the userspace process(es) or threads serving the
device. No establishment of connections, just a queue which can be read
and answered via get and put, the ioctl operators available. This
allows a completely flexible userspace implementation, with multiple
processes, etc, and allows complete recovery via a simple reset command
if those programs fail. I believe this also prevents any possibility of
accidental deadlock. There may of course be some hidden deep deadlock
potential in such a device, especially if one decided to use it as a
swap device, but again, this is a philosophical issue.
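
A minimal userspace sketch of the get/put model described above. It is not the interface from the patch: the names (`abuse_queue`, `queue_submit`, `queue_get`, `queue_put`, `queue_reset`) are hypothetical, a fixed array stands in for the kernel's bio queue, and the assumption that reset fails outstanding requests with -EIO (rather than requeueing them) is mine. What it tries to show is that a request handed out by get is not tied to whoever took it, so any process may answer it and a reset can clean up after a dead server.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define QUEUE_SLOTS 8

enum slot_state { SLOT_FREE = 0, SLOT_PENDING, SLOT_ISSUED, SLOT_DONE };

struct request {
    enum slot_state state;
    unsigned long long sector;   /* first sector of the transfer */
    unsigned int nsectors;       /* number of sectors */
    int result;                  /* 0 or negative errno, valid in SLOT_DONE */
};

struct abuse_queue {
    struct request slots[QUEUE_SLOTS];
};

/* "Kernel" side: queue a request. Returns its id, or -EBUSY if full. */
int queue_submit(struct abuse_queue *q, unsigned long long sector,
                 unsigned int nsectors)
{
    assert(q != NULL);
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        if (q->slots[i].state == SLOT_FREE) {
            q->slots[i] = (struct request){ SLOT_PENDING, sector, nsectors, 0 };
            return i;
        }
    }
    return -EBUSY;
}

/* Server side "get": take any pending request. Nothing records who took it. */
int queue_get(struct abuse_queue *q)
{
    assert(q != NULL);
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        if (q->slots[i].state == SLOT_PENDING) {
            q->slots[i].state = SLOT_ISSUED;
            return i;
        }
    }
    return -EAGAIN;
}

/* Server side "put": answer an issued request. Any process may do this. */
int queue_put(struct abuse_queue *q, int id, int result)
{
    assert(q != NULL);
    if (id < 0 || id >= QUEUE_SLOTS || q->slots[id].state != SLOT_ISSUED)
        return -EINVAL;
    q->slots[id].result = result;
    q->slots[id].state = SLOT_DONE;
    return 0;
}

/* Reset: answer everything still outstanding with -EIO. Returns the count. */
int queue_reset(struct abuse_queue *q)
{
    int failed = 0;
    assert(q != NULL);
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        enum slot_state st = q->slots[i].state;
        if (st == SLOT_PENDING || st == SLOT_ISSUED) {
            q->slots[i].result = -EIO;
            q->slots[i].state = SLOT_DONE;
            failed++;
        }
    }
    return failed;
}
```

Because get and put carry no connection state, the server that calls `queue_put` need not be the one that called `queue_get`, and `queue_reset` leaves no request unanswered.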

Enough talking, let's have at it and see where this goes. Obviously
this is experimental and open to feedback. Considering it turns kernel
interfaces on their head, I have given it what I feel is an appropriate
name.

If there is any person or list you know that I forgot to copy this to,
please forward it on to them.

Thanks,

Zach

abuse-module.patch

abusectl.c

Peter Zijlstra

Jul 27, 2009, 9:00:17 AM

On Sun, 2009-07-26 at 23:57 -1000, Zachary Amsden wrote:
> Well, it may be a good, bad, idiotic or brilliant idea depending on your
> personal philosophy. I went down this route out of pragmatism.
> Hopefully I have not fully re-invented the wheel.
>
> The patch included allows one to implement a kernel level block device
> in userspace, using an ioctl() based interface to create a sized device
> with given properties, and then receive and respond to bio requests
> issued to the device. One can poll on the associated control socket to
> allow efficient servicing of device requests. So far only strict copy
> to/from user memory is supported, there is no fancy page flipping or
> mapping operations.

Somehow this made me think of FUSE/CUSE... should this be named aBUSE?
Oh wait it is :-), what I'm after, I guess, is: can we share some of
the FUSE/CUSE code?

I can only imagine the fun we'll end up with when someone tries swapon
on a user-space block device.. aptly named.


Alan Cox

Jul 27, 2009, 9:30:26 AM

> Somehow this made me think of FUSE/CUSE... should this be named aBUSE?
> Oh wait it is :-), what I'm after, I guess, is: can we share some of
> the FUSE/CUSE code?

It reminds me of the existing and perfectly functional network block
device (nbd) we already have and which has also been present for years.

Alan

Zachary Amsden

Jul 27, 2009, 4:10:16 PM

Alan Cox wrote:
>> Somehow this made me think of FUSE/CUSE... should this be named aBUSE?
>> Oh wait it is :-), what I'm after, I guess, is: can we share some of
>> the FUSE/CUSE code?

Well, it is A Block device in User SpacE :) I don't think there is a
lot of code sharing benefit in some 800 odd lines, but I could be wrong.

> It reminds me of the existing and perfectly functional network block
> device (nbd) we already have and which has also been present for years.

Yes, I agree, in fact I looked at nbd as I was writing this, but I
believe it is different enough to warrant further investigation.

The network block device requires access to a socket, which the code at
least seems to imply brings up the potential for deadlocks when
self-hosting. This was designed to explicitly support self-hosting.

This device can be used without CONFIG_NET (not a big advantage, I
agree), and is completely connectionless, which I would argue is a big
advantage.

NBD is perfectly functional, but it seemed more complicated than
necessary for a purely local implementation. A fully functional null
server (just returns zeros, full error checking and normal whitespace)
can be implemented in about 60 lines of C code, which I don't think is
the case for NBD. Of course, I'm sure it is possible with PERL bindings
as a one-liner, but the fundamental argument isn't about lines, it's
about complexity. NBD requires socket allocation, listening and
connection; this requires only opening of a device node.
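
The core of such a null server might look like the sketch below. It is not the attached abusectl.c: `struct blk_request`, `null_handle` and the 512-byte sector size are assumptions made for illustration, and the ioctl calls that would fetch and answer the request are left out because their names are not given in this mail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512ULL

/* Hypothetical description of one request taken from the device queue. */
struct blk_request {
    int write;                  /* nonzero for a write, zero for a read */
    unsigned long long offset;  /* byte offset into the device */
    size_t len;                 /* transfer length in bytes */
};

/*
 * Serve one request against a device of dev_size bytes.
 * Reads are answered with zeros, writes are accepted and discarded.
 * Returns 0 on success or a negative errno to report back.
 */
int null_handle(const struct blk_request *req, unsigned long long dev_size,
                void *buf, size_t buf_len)
{
    /* The device size itself is server configuration, not request data. */
    assert(dev_size % SECTOR_SIZE == 0);

    if (req == NULL || buf == NULL)
        return -EINVAL;
    if (req->offset % SECTOR_SIZE || req->len % SECTOR_SIZE)
        return -EINVAL;         /* not sector aligned */
    if (req->len > buf_len)
        return -EINVAL;         /* would overflow the bounce buffer */
    if (req->offset > dev_size || req->len > dev_size - req->offset)
        return -EIO;            /* runs past the end of the device */

    if (!req->write)
        memset(buf, 0, req->len);
    return 0;
}
```

A real server would wrap this in a loop that polls the device node, fetches a request, calls the handler, and reports the result; that loop is where the rest of the 60 lines would go.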

Can you swap over NBD? Assuming one had pinned the userspace program
and it pre-allocated all memory so no pagein / alloc was required, would
it be deadlock proof? I believe there are structure allocations
required for the socket implementation that go beyond the basic BIO
allocations, therefore making it impossible. In /theory/, one should be
able to swap over this device. In practice, it's probably a really bad
idea.

It seems then that NBD is a strict subset of the functionality provided
by this type of module.

Zach

Peter Zijlstra

Jul 27, 2009, 4:30:23 PM

On Mon, 2009-07-27 at 09:46 -1000, Zachary Amsden wrote:
> Can you swap over NBD? Assuming one had pinned the userspace program
> and it pre-allocated all memory so no pagein / alloc was required, would
> it be deadlock proof? I believe there are structure allocations
> required for the socket implementation that go beyond the basic BIO
> allocations, therefore making it impossible. In /theory/, one should be
> able to swap over this device. In practice, it's probably a really bad
> idea.

I've got patches to make swap over network work; with those, swap over
NBD works until you lose the connection. NBD's great weakness (aside from
funny code) is that it does the connection management in userspace,
which makes recovering from connection loss when swapping over it
utterly impossible.

Alan Cox

Jul 27, 2009, 5:10:14 PM

> Can you swap over NBD? Assuming one had pinned the userspace program
> and it pre-allocated all memory so no pagein / alloc was required, would
> it be deadlock proof? I believe there are structure allocations
> required for the socket implementation that go beyond the basic BIO
> allocations, therefore making it impossible. In /theory/, one should be
> able to swap over this device. In practice, it's probably a really bad
> idea.

In practice, since you can mmap an object for write, and the only pages
left to free may be dirty mmap-for-write pages that have to go to a file
system, it ought to be possible.

Tejun Heo

Jul 27, 2009, 9:30:14 PM

Hello,

Alan Cox wrote:
>> Somehow this made me think of FUSE/CUSE... should this be named aBUSE?
>> Oh wait it is :-), what I'm after, I guess, is: can we share some of
>> the FUSE/CUSE code?
>
> It reminds me of the existing and perfectly functional network block
> device (nbd) we already have and which has also been present for years.

Yeah, I think this is the biggest hurdle against (a)BUSE. Is it
sufficiently different from nbd? nbd-like functionality can be
implemented via something like FUSE and maybe it can be said that things
are cleaner that way but nbd has been in the kernel for a long time
now and it's definitely much easier to do swap over it when the whole
thing is in kernel.

Thanks.

--
tejun

Zachary Amsden

Jul 28, 2009, 12:10:07 AM

Tejun Heo wrote:
> Hello,
>
> Alan Cox wrote:
>>> Somehow this made me think of FUSE/CUSE... should this be named aBUSE?
>>> Oh wait it is :-), what I'm after, I guess, is: can we share some of
>>> the FUSE/CUSE code?
>> It reminds me of the existing and perfectly functional network block
>> device (nbd) we already have and which has also been present for years.
>
> Yeah, I think this is the biggest hurdle against (a)BUSE. Is it
> sufficiently different from nbd? nbd-like functionality can be
> implemented via something like FUSE and maybe it can be said that things
> are cleaner that way but nbd has been in the kernel for a long time
> now and it's definitely much easier to do swap over it when the whole
> thing is in kernel.

The only real difference between this and nbd is that nbd is
explicitly connection oriented, while this is intentionally
connectionless. That was an interesting property, but it turned out
not to be the best for what I was trying to do.

I'm actually going to go ahead and use nbd instead. All I need is a block
device that supports partitions with a userspace driver.

Maybe someone will find this useful; for now it is preserved in the LKML
archives and the patch should continue to apply for some time.

BTW, implementing something like this via FUSE would be extremely
unpleasant. I'd need another layer on top, probably via the loop
device, to get to the actual partitions of the block devices.

Zach

Alan Cox

Jul 28, 2009, 6:30:13 AM

> BTW, implementing something like this via FUSE would be extremely
> unpleasant. I'd need another layer on top, probably via the loop
> device, to get to the actual partitions of the block devices.

Use device mapper. Really we should shoot all the partition code in the
kernel but the back compatibility is a bit tricky. We don't actually need
the partition code any more.

Linus Torvalds

Jul 28, 2009, 12:10:18 PM

On Tue, 28 Jul 2009, Alan Cox wrote:
>
> Use device mapper. Really we should shoot all the partition code in the
> kernel but the back compatibility is a bit tricky. We don't actually need
> the partition code any more.

Really, we should _not_ "shoot all the partition code in the kernel".
Quite the reverse.

You need the kernel to read the disk anyway, you're _much_ better off
having the kernel know about the partitioning etc. There are absolutely
zero upsides to making the bootup be dependent on yet another user land
tool, and then effectively forcing people to use initrd whether they want
it or not - just in order to find the real root.

The fact that some distributions already go too far, and use DM whether it
makes sense or not is only inconveniencing real users. It makes things
like data portability much harder. I have had real-life cases where I
wanted to move a disk from one machine to another, only to notice that the
crazy default for the distro I had used was to make it impossible, because
all the filesystems crossed disks.

I've since learnt to not use DM (and instead do a very inconvenient
"partition everything by hand because the install tool doesn't allow for
any simple automated way to make a sane install"), and to just put /home
on one disk and / on the other, and then I can way more easily just move
my /home disk around, for example.

Yes, I realize that MD is convenient for a certain class of users, but a
_lot_ of distro people seem to totally miss all the inconveniences.
Possibly because they care more about "enterprise" customers than about
people who tinker.

Linus

Kyle Moffett

Jul 28, 2009, 2:40:21 PM

On Tue, Jul 28, 2009 at 12:00, Linus Torvalds <torv...@linux-foundation.org> wrote:
> The fact that some distributions already go too far, and use DM whether it
> makes sense or not is only inconveniencing real users. It makes things
> like data portability much harder. I have had real-life cases where I
> wanted to move a disk from one machine to another, only to notice that the
> crazy default for the distro I had used was to make it impossible, because
> all the filesystems crossed disks.
>
> I've since learnt to not use DM (and instead do a very inconvenient
> "partition everything by hand because the install tool doesn't allow for
> any simple automated way to make a sane install"), and to just put /home
> on one disk and / on the other, and then I can way more easily just move
> my /home disk around, for example.

That's not so much an argument against LVM as it is an argument for
fixing those distro installer tools... Using device-mapper to map
standard Linux partition-tables has the following benefits:

(1) The ability to rearrange, resize, and restructure
partition-tables on the fly. The existing "re-read partition tables"
infrastructure does not safely and reasonably handle changes to the
partition-table while partitions are mounted. Using device-mapper you
can shrink the mapped space associated with a partition then insert
and map a new partition in that gap... all without rebooting.

(2) If you use DM via LVM and you have a bit of unallocated space,
you can create block-level snapshots. This is useful for *much* more
than just a datacenter, it makes home backup tools much easier and
safer too.

(3) Again, using LVM you can shrink one partition (/) and grow
another (/home), even if you didn't guess right in your initial
allocations.

Personally I am also extremely fond of running commands like "mke2fs
-j /dev/mapper/ares-tempdata" instead of "mke2fs -j /dev/sdb4"... err,
shoot, I meant /dev/sda4, there goes my /home partition...

Even when you are moving hard drives from one computer into another,
it makes it much easier to keep track of them if you use the server name
in the "volume group" name. When I plug both backup drives into my
desktop, they're easily distinguished as /dev/mapper/ares_bkup-home
and /dev/mapper/philyra_bkup-home.

Admittedly there are some pretty crappy tools out there... I've had
problems with a few which could not reliably do partition math. (if
the partitioner tells you that you have a 10240MB disk, and you tell
it to put 5120MB on one partition and 5120MB on the other, it should
not tell you that you over-allocated the disk by 1MB, even if it might
need that for metadata).

Perhaps what we need is a really minimal klibc toolkit built (by
default) as part of the kernel and embedded into the kernel image. If
the bootloader specifies an external initrd then the in-kernel one
would be ignored and discarded; otherwise it would provide
clean backwards-compatibility for any boot-time features and arguments
that have been removed from the kernel proper.

Cheers,
Kyle Moffett

Linus Torvalds

Jul 28, 2009, 3:00:15 PM

On Tue, 28 Jul 2009, Kyle Moffett wrote:

> On Tue, Jul 28, 2009 at 12:00, Linus Torvalds <torv...@linux-foundation.org> wrote:
> >
> > I've since learnt to not use DM (and instead do a very inconvenient
> > "partition everything by hand because the install tool doesn't allow for
> > any simple automated way to make a sane install"), and to just put /home
> > on one disk and / on the other, and then I can way more easily just move
> > my /home disk around, for example.
>
> That's not so much an argument against LVM as it is an argument for
> fixing those distro installer tools...

Oh, I agree. I'd love the distros to not force DM on me.

But that wasn't my point. My point was that people who argue for DM (and
user-space tools for partition detection) always argue without even taking
the disadvantages into account.

> Using device-mapper to map standard Linux partition-tables has the
> following benefits:

You're missing the point.

I _know_ the benefits. I'm pointing out the problems and downsides. Which
too often get ignored, just because people think that the benefits are
so big, and benefits to everybody. They're not.

The whole dynamic resizing etc is totally worthless for many users: the
fact that it is an advantage to _some_ doesn't make it an advantage to
everybody. And some of the advantages you mention (naming by mount-point
or UUID or etc) have nothing to do with DM itself, and work fine without
it.

Linus

Alan Cox

Jul 28, 2009, 3:10:10 PM

> That's not so much an argument against LVM as it is an argument for
> fixing those distro installer tools... Using device-mapper to map
> standard Linux partition-tables has the following benefits:

I'll add another one: If your disk blows up when you touch sector 0 you
can rescue it in Linux without hacking the kernel.

It doesn't mean you have to ditch partition processing out of the kernel,
merely to be able to turn it off for some devices. Actually removing it
would never be practical.

Andi Kleen

Jul 28, 2009, 3:50:11 PM

Kyle Moffett <ky...@moffetthome.net> writes:
>
> (1) The ability to rearrange, resize, and restructure
> partition-tables on the fly. The existing "re-read partition tables"
> infrastructure does not safely and reasonably handle changes to the
> partition-table while partitions are mounted.

It doesn't today (and I really hate it too), but is there a hard reason it
couldn't be fixed to support that properly?

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

dev...@web.de

Jul 28, 2009, 4:40:08 PM

Hello Zach,

this older thread deals with some aspects of that idea: http://communities.vmware.com/message/577841
i have collected some links (added there) quite a while ago and also added a project proposal to http://kernelnewbies.org/KernelProjects.
i don't know if you came across them, but it's nice to see that someone comes up with this stuff again and maybe it's of interest for you.

as we had vmware vmdk image mounter v1 being implemented via nbd and v2 via fuse, i assume both are not optimal solutions?
at least the nbd version sucked big.

regards
roland

ps:
oh, btw - you quit vmware? that's quite a loss for them and for the vmware community, i think. too much conflicting basic attitude concerning opensource/gpl? ;)

> List: linux-kernel
> Subject: [PATCH] Allow userspace block device implementation
> From: Zachary Amsden <zamsden () redhat ! com>
> Date: 2009-07-27 9:57:10
> Message-ID: 4A6D79F6.3050509 () redhat ! com

>
> Well, it may be a good, bad, idiotic or brilliant idea depending on your
> personal philosophy. I went down this route out of pragmatism.
> Hopefully I have not fully re-invented the wheel.
>
> The patch included allows one to implement a kernel level block device
> in userspace, using an ioctl() based interface to create a sized device
> with given properties, and then receive and respond to bio requests
> issued to the device. One can poll on the associated control socket to
> allow efficient servicing of device requests. So far only strict copy
> to/from user memory is supported, there is no fancy page flipping or
> mapping operations.
>


Linus Torvalds

Jul 28, 2009, 5:00:14 PM

On Tue, 28 Jul 2009, Andi Kleen wrote:

> Kyle Moffett <ky...@moffetthome.net> writes:
> >
> > (1) The ability to rearrange, resize, and restructure
> > partition-tables on the fly. The existing "re-read partition tables"
> > infrastructure does not safely and reasonably handle changes to the
> > partition-table while partitions are mounted.
>
> It doesn't today (and I really hate it too), but is there a hard reason it
> couldn't be fixed to support that properly?

If something has a partition open (and it doesn't really even have to be a
mounted filesystem, although that's obviously the most relevant case), how
can you reasonably change the partition from underneath it? So I assume
you mean that partitions that were opened earlier (for a mount) would not be
touched.

And these days, that _should_ just work. The "reread partition table"
operation should just leave the old bdev's around (so a mounted filesystem
simply won't _see_ the new partitions, but will continue to use the old
one), and for all I know that might even work these days.

[ Here "these days" is admittedly only in comparison to the _original_
Linux code, which used block numbers. Many years ago. ]

Filesystems long ago _used_ to index things by device number and block -
and that meant that re-reading partition tables was _really_ dangerous,
because the "device number" would just magically mean something else for a
mounted filesystem. But we've indexed things by bdev for a longish time
now, and most (all?) filesystems use "sb_bread()" instead of bread etc.

So I think re-reading the partition tables should be safe these days. It
definitely didn't _use_ to be the case due to dev_t issues, but that's
really ancient.

It may be that we just have the old check in place ("don't allow
re-reading if something has mounted a partition"), and we could just get
rid of it. I have not looked.

But if you actually meant that re-reading the partition table should
_change_ a "struct block_dev" that is in use, then I think that would be a
bad idea. At the very least, it should involve a re-mount or something.

Linus

Andi Kleen

Jul 28, 2009, 5:10:11 PM

On Tue, Jul 28, 2009 at 01:50:56PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 28 Jul 2009, Andi Kleen wrote:
>
> > Kyle Moffett <ky...@moffetthome.net> writes:
> > >
> > > (1) The ability to rearrange, resize, and restructure
> > > partition-tables on the fly. The existing "re-read partition tables"
> > > infrastructure does not safely and reasonably handle changes to the
> > > partition-table while partitions are mounted.
> >
> > It doesn't today (and I really hate it too), but is there a hard reason it
> > couldn't be fixed to support that properly?
>
> If something has a partition open (and it doesn't really even have to be a
> mounted filesystem, although that's obviously the most relevant case), how
> can you reasonably change the partition from underneath it?

Well LVM can do that, why not standard partitions?

e.g. extending should be totally fine. The file system can continue
using the old size until you run the online fs extender tool which
then does the right magic to sync the file system state. I believe
that is how it works on LVM.

Shrinking is more difficult, but giving root enough rope ...
And we got offline shrinkers at least.

> So I assume
> you mean that partitions that were opened earlier (for a mount) would not be
> touched.

Also right now you can't change any other partition.

I know part of the problem is that I like using fdisk (simply because
I think the person who designed parted's user interface was on
something unholy), and apparently it works better when you use the
right ioctls to add/remove partitions instead of a wholesale reread
as fdisk does. Perhaps the reread table ioctl can just be fixed.
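
For reference, the two approaches contrasted above can be sketched with the Linux ioctls `BLKRRPART` (wholesale reread) and `BLKPG` with `BLKPG_ADD_PARTITION` (add one partition). The wrapper names `reread_table` and `add_partition` are made up for this sketch, the start and length are assumed to be byte values, and both calls normally need root and a real whole-disk node, so treat this as an outline rather than a tested tool.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/blkpg.h>
#include <linux/fs.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Wholesale reread: ask the kernel to re-parse the whole partition table. */
int reread_table(const char *disk)
{
    assert(disk != NULL);
    int fd = open(disk, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = ioctl(fd, BLKRRPART);
    int saved = errno;          /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return rc;
}

/* Targeted change: register a single partition, leaving the others alone. */
int add_partition(const char *disk, int pno, long long start, long long length)
{
    struct blkpg_partition part;
    struct blkpg_ioctl_arg arg;

    assert(disk != NULL);
    memset(&part, 0, sizeof part);
    part.start = start;         /* assumed to be in bytes */
    part.length = length;       /* assumed to be in bytes */
    part.pno = pno;             /* partition number, e.g. 4 for sda4 */

    memset(&arg, 0, sizeof arg);
    arg.op = BLKPG_ADD_PARTITION;
    arg.datalen = sizeof part;
    arg.data = &part;

    int fd = open(disk, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = ioctl(fd, BLKPG, &arg);
    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

Both functions return -1 with `errno` set on failure, so a caller can tell a busy device from a missing one.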

> It may be that we just have the old check in place ("don't allow
> re-reading if something has mounted a partition"), and we could just get
> rid of it. I have not looked.

Yes, I'm sure there's a lot of historical baggage here.

>
> But if you actually meant that re-reading the partition table should
> _change_ a "struct block_dev" that is in use, then I think that would be a
> bad idea. At the very least, it should involve a re-mount or something.

LVM already does it afaik.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Theodore Tso

Jul 28, 2009, 7:00:17 PM

On Tue, Jul 28, 2009 at 01:50:56PM -0700, Linus Torvalds wrote:
>

> Filesystems long ago _used_ to index things by device number and block -
> and that meant that re-reading partition tables was _really_ dangerous,
> because the "device number" would just magically mean something else for a
> mounted filesystem. But we've indexed things by bdev for a longish time
> now, and most (all?) filesystems use "sb_bread()" instead of bread etc.

Filesystems don't, but some userspace programs do depend on the dev_t
returned by stat to uniquely identify a mounted filesystem. (And it's
guaranteed by POSIX). So what this means is that if we're going to
allow re-reading the partition table, we should (a) avoid changing the
dev_t used by any mounted filesystem, and (b) either assign a new dev_t
to any new partition, or disallow mounting a filesystem whose new dev_t
is still in use by a filesystem that was mounted under that dev_t
before the partition table was reorganized.
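
The kind of userspace dependency meant here looks roughly like the sketch below, which uses only `stat()` and the `st_dev`/`st_ino` fields. The helper names `same_filesystem` and `same_file` are made up for illustration. If a mounted filesystem's dev_t changed, or two mounted filesystems ended up sharing one, both checks would give wrong answers.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* 1 if both paths are on the same device, 0 if not, -1 if stat() fails. */
int same_filesystem(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;
}

/*
 * 1 if both paths name the same file, 0 if not, -1 if stat() fails.
 * The (st_dev, st_ino) pair is only a valid identity while st_dev
 * stays unique among mounted filesystems.
 */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```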

- Ted

Pavel Machek

Aug 8, 2009, 11:30:13 AM

Hi!

> Well, it may be a good, bad, idiotic or brilliant idea depending on your
> personal philosophy. I went down this route out of pragmatism.
> Hopefully I have not fully re-invented the wheel.

I did, long ago. I called it nbd... aha and you know about it (from
following mails in thread).

> accidental deadlock. There may of course be some hidden deep deadlock
> potential in such a device, especially if one decided to use it as a
> swap device, but again, this is a philosophical issue.

What's philosophical about 'it does not work for swap or dirty mmap'?

(last time I checked, dirty mmap data behaved very much like swap).

(And yes, nbd has the same problem. It should be safe for r/o access to
localhost, but may deadlock when it is mounted locally...)

And yes, I believe that's a show stopper. OTOH if you _can_ solve
that... then you have some rather significant advantage over nbd.

(But guaranteeing progress for dirty writeout will be tricky even with
mlocked userland, AFAICT...)

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Zachary Amsden

Aug 10, 2009, 7:00:15 PM

Pavel Machek wrote:
> Hi!

Hey!

> And yes, I believe that's a show stopper. OTOH if you _can_ solve
> that... then you have some rather significant advantage over nbd.
>
> (But guaranteeing progress for dirty writeout will be tricky even with
> mlocked userland, AFAICT...)

Actually, impossible, even with mlocked userland (*) which is what led
me to abandon going any further with it. The problem is, to commit any
data, one must make a system call, thus consuming more resources. It's
merely a toy, nothing more. Sometimes it might be a useful toy, as nbd,
but nbd, being in kernel, has at least a better chance of solving the
swap problem.

(*) strictly speaking, it is possible to guarantee progress of the
device for read/write only to a finite region of mlocked memory and an
infinite region (limited only by size of off_t) of read-only data
computable with finite mlocked space. Obviously, neither of these
(swap-to-RAM and swap-over-read-only-media) is actually useful for swap.
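
The first case in the footnote, a finite read/write region of mlocked memory, can be sketched as below. `struct ram_store` and its functions are hypothetical names; the store is allocated and touched once up front, then pinned with `mlock()` on a best-effort basis, so serving a request afterwards needs no allocation and no page-in. It covers only the userspace half of the problem: answering the kernel would still take a system call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

struct ram_store {
    unsigned char *base;
    size_t size;
    int locked;                 /* 1 if mlock() succeeded */
};

/* Allocate, touch and try to pin the backing memory. Returns 0 or -1. */
int ram_store_init(struct ram_store *s, size_t size)
{
    assert(s != NULL);
    s->base = malloc(size);
    if (s->base == NULL)
        return -1;
    memset(s->base, 0, size);   /* fault every page in now, not later */
    s->size = size;
    /* May fail under a low RLIMIT_MEMLOCK; the caller can check s->locked. */
    s->locked = (mlock(s->base, size) == 0);
    return 0;
}

/* Bounds-checked copy in or out of the store; no allocation on this path. */
int ram_store_rw(struct ram_store *s, int write, size_t off,
                 void *buf, size_t len)
{
    assert(s != NULL && buf != NULL);
    if (off > s->size || len > s->size - off)
        return -1;
    if (write)
        memcpy(s->base + off, buf, len);
    else
        memcpy(buf, s->base + off, len);
    return 0;
}
```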

Zach