
[ANNOUNCE] Btrfs: a copy on write, snapshotting FS


Chris Mason

Jun 12, 2007, 12:14:03 PM
to linux-...@vger.kernel.org, linux-...@vger.kernel.org
Hello everyone,

After the last FS summit, I started working on a new filesystem that
maintains checksums of all file data and metadata. Many thanks to Zach
Brown for his ideas, and to Dave Chinner for his help on
benchmarking analysis.

The basic list of features looks like this:

* Extent based file storage (2^64 max file size)
* Space efficient packing of small files
* Space efficient indexed directories
* Dynamic inode allocation
* Writable snapshots
* Subvolumes (separate internal filesystem roots)
- Object level mirroring and striping
* Checksums on data and metadata (multiple algorithms available)
- Strong integration with device mapper for multiple device support
- Online filesystem check
* Very fast offline filesystem check
- Efficient incremental backup and FS mirroring

The ones marked with * are mostly working, and the others are on
my todo list. There are more details on the FS design, some benchmarks
and download links here:

http://oss.oracle.com/~mason/btrfs/

The current status is a very early alpha state, and the kernel code
weighs in at a sparsely commented 10,547 lines. I'm releasing now in
hopes of finding people interested in testing, benchmarking,
documenting, and contributing to the code.

I've gotten this far pretty quickly, and plan on continuing to knock off
the features as fast as I can. Hopefully I'll manage a release every
few weeks or so. The disk format will probably change in some major way
every couple of releases.

The TODO list has some critical stuff:

* Ability to return -ENOSPC instead of oopsing
* mmap()ed writes
* Fault tolerance, (EIO, bad metadata etc)
* Concurrency. I use one mutex for all operations today
* ACLs and extended attributes
* Reclaim dead roots after a crash
* Various other bits from the feature list above

And finally, here's a quick and dirty summary of the FS design points:

* One large Btree per subvolume
* Copy on write logging for all data and metadata
* Reference count snapshots are the basis of the transaction
system. A transaction is just a snapshot where the old root
is immediately deleted on commit
* Subvolumes can be snapshotted any number of times
* Snapshots are read/write and can be snapshotted again
* Directories are doubly indexed to improve readdir speeds

So, please give it a try or a look and let me know what you think.

-chris


Mike Snitzer

Jun 12, 2007, 3:53:32 PM
to Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org

Chris,

Given the substantial work that you've already put into btrfs and the
direction your TODO list details, it feels as though Btrfs will
quickly provide the features that only Sun's ZFS provides.

Looking at your Btrfs benchmark and design pages it is clear that
your motivation is a filesystem that addresses modern concerns
(performance that doesn't degrade over time, online fsck, fast offline
fsck, data/metadata checksums, unlimited snapshots, efficient remote
mirroring, etc). There is still much "Todo" but you've made very
impressive progress for the first announcement!

I have some management oriented questions/comments.

1)
Regarding the direction of Btrfs as it relates to integration with DM.
The allocation policies, the ease of configuring DM-based
striping/mirroring, and management of large pools of storage all seem to
indicate that Btrfs will manage the physical spindles internally.
This is very ZFS-ish (ZFS pools) so I'd like to understand where you
see Btrfs going in this area.

Your initial benchmarks were all done on top of a single disk with an
LVM stack yet your roadmap/todo and design speaks to a tighter
integration of the volume management features. So long term is
traditional LVM/MD functionality to be pulled directly into Btrfs?

2)
The Btrfs notion of subvolumes and snapshots is very elegant and
provides for fluid management of the filesystem data. It
feels as though each subvolume/snapshot is just folded into the parent
Btrfs volume's namespace. Was there any particular reason you elected
to do this? I can see that it lends itself to allowing snapshots of
snapshots. If you could elaborate I'd appreciate it.

In practice subvolumes and/or snapshots appear to be implicitly
mounted upon creation (refcount of parent is incremented). Is this
correct? For snapshots, this runs counter to mapping the snapshots'
data into the namespace of the origin Btrfs (e.g. with a .snapshot
dir, but this is only useful for read-only snaps). Having snapshot
namespaces in terms of monolithic subvolumes puts a less intuitive
face on N Btrfs snapshots. The history of a given file/dir feels to
be lost with this model.

Aside from folding snapshot history into the origin's namespace... It
could be possible to have a mount.btrfs that allows subvolumes and/or
snapshot volumes to be mounted as unique roots? I'd imagine a bind
mount _could_ provide this too? Anyway, I'm just interested in
understanding the vision for managing the potentially complex nature
of a Btrfs namespace.

Thanks for doing all this work; I think the Linux community got a much
needed shot in the arm with this Btrfs announcement.

regards,
Mike

Chris Mason

Jun 12, 2007, 4:17:36 PM
to Mike Snitzer, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jun 12, 2007 at 03:53:03PM -0400, Mike Snitzer wrote:
> On 6/12/07, Chris Mason <chris...@oracle.com> wrote:
> >Hello everyone,
> >
> >After the last FS summit, I started working on a new filesystem that
> >maintains checksums of all file data and metadata. Many thanks to Zach
> >Brown for his ideas, and to Dave Chinner for his help on
> >benchmarking analysis.
>
> Chris,
>
> Given the substantial work that you've already put into btrfs and the
> direction your TODO list details, it feels as though Btrfs will
> quickly provide the features that only Sun's ZFS provides.
>
> Looking at your Btrfs benchmark and design pages it is clear that
> your motivation is a filesystem that addresses modern concerns
> (performance that doesn't degrade over time, online fsck, fast offline
> fsck, data/metadata checksums, unlimited snapshots, efficient remote
> mirroring, etc). There is still much "Todo" but you've made very
> impressive progress for the first announcement!
>
> I have some management oriented questions/comments.
>
> 1)
> Regarding the direction of Btrfs as it relates to integration with DM.
> The allocation policies, the ease of configuring DM-based
> striping/mirroring, and management of large pools of storage all seem to
> indicate that Btrfs will manage the physical spindles internally.
> This is very ZFS-ish (ZFS pools) so I'd like to understand where you
> see Btrfs going in this area.

There's quite a lot of hand waving in that section. What I'd like to do
is work closely with the LVM/DM/MD maintainers and come up with
something that leverages what linux already does. I don't want to
rewrite LVM into the FS, but I do want to make better use of info about
the underlying storage.

>
> Your initial benchmarks were all done on top of a single disk with an
> LVM stack yet your roadmap/todo and design speaks to a tighter
> integration of the volume management features. So long term is
> traditional LVM/MD functionality to be pulled directly into Btrfs?
>
> 2)
> The Btrfs notion of subvolumes and snapshots is very elegant and
> provides for fluid management of the filesystem data. It
> feels as though each subvolume/snapshot is just folded into the parent
> Btrfs volume's namespace. Was there any particular reason you elected
> to do this? I can see that it lends itself to allowing snapshots of
> snapshots. If you could elaborate I'd appreciate it.
>

Yes, I wanted snapshots to be writable and resnapshottable. It also
lowers the complexity to keep each snapshot as a subvolume/tree.

subvolumes are only slightly more expensive than a directory. So, even
though a subvolume is a large grained unit for a snapshot, you can get
around this by just making more subvolumes.

> In practice subvolumes and/or snapshots appear to be implicitly
> mounted upon creation (refcount of parent is incremented). Is this
> correct? For snapshots, this runs counter to mapping the snapshots'
> data into the namespace of the origin Btrfs (e.g. with a .snapshot
> dir, but this is only useful for read-only snaps). Having snapshot
> namespaces in terms of monolithic subvolumes puts a less intuitive
> face on N Btrfs snapshots. The history of a given file/dir feels to
> be lost with this model.

That's somewhat true, the disk format does have enough information to
show you that history, but cleanly expressing it to the user is a
daunting task.

>
> Aside from folding snapshot history into the origin's namespace... It
> could be possible to have a mount.btrfs that allows subvolumes and/or
> snapshot volumes to be mounted as unique roots? I'd imagine a bind
> mount _could_ provide this too? Anyway, I'm just interested in
> understanding the vision for managing the potentially complex nature
> of a Btrfs namespace.

One option is to put the real btrfs root into some directory in
(/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind
outside of that. I wanted to wait to get fancy until I had a better
idea of how people would use the feature.


>
> Thanks for doing all this work; I think the Linux community got a much
> needed shot in the arm with this Btrfs announcement.
>

Thanks for the comments.

-chris

Christoph Hellwig

Jun 12, 2007, 11:08:49 PM
to Chris Mason, Mike Snitzer, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jun 12, 2007 at 04:14:39PM -0400, Chris Mason wrote:
> > Aside from folding snapshot history into the origin's namespace... It
> > could be possible to have a mount.btrfs that allows subvolumes and/or
> > snapshot volumes to be mounted as unique roots? I'd imagine a bind
> > mount _could_ provide this too? Anyway, I'm just interested in
> > understanding the vision for managing the potentially complex nature
> > of a Btrfs namespace.
>
> One option is to put the real btrfs root into some directory in
> (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind
> outside of that. I wanted to wait to get fancy until I had a better
> idea of how people would use the feature.

We already support mounting into subdirectories of a filesystem for
nfs connection sharing. The patch below makes use of this to allow
mounting any subdirectory of a btrfs filesystem by specifying it in
the form of /dev/somedevice:directory and when no subdirectory
is specified uses 'default'. To make this more useful btrfs directories
should grow some way to be marked head of a subvolume, and we'd need
a more useful way to actually create subvolumes and snapshots without
fugly ioctls.

Btw, to create a subvolume with my patch you need to escape back to
the global root, which is done by specifying /dev/somedevice:. as the
device on the mount command line.


Index: btrfs-0.2/super.c
===================================================================
--- btrfs-0.2.orig/super.c	2007-06-13 03:44:38.000000000 +0200
+++ btrfs-0.2/super.c	2007-06-13 03:48:35.000000000 +0200
@@ -17,6 +17,7 @@
  */
 
 #include <linux/module.h>
+#include <linux/blkdev.h>
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
@@ -26,6 +27,7 @@
 #include <linux/string.h>
 #include <linux/smp_lock.h>
 #include <linux/backing-dev.h>
+#include <linux/mount.h>
 #include <linux/mpage.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
@@ -135,11 +137,114 @@ static void btrfs_write_super(struct sup
 	sb->s_dirt = 0;
 }
 
+/*
+ * This is almost a copy of get_sb_bdev in fs/super.c.
+ * We need the local copy to allow direct mounting of
+ * subvolumes, but this could be easily integrated back
+ * into the generic version.  --hch
+ */
+
+/* start copy & paste */
+static int set_bdev_super(struct super_block *s, void *data)
+{
+	s->s_bdev = data;
+	s->s_dev = s->s_bdev->bd_dev;
+	return 0;
+}
+
+static int test_bdev_super(struct super_block *s, void *data)
+{
+	return (void *)s->s_bdev == data;
+}
+
+int btrfs_get_sb_bdev(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data,
+	int (*fill_super)(struct super_block *, void *, int),
+	struct vfsmount *mnt, const char *subvol)
+{
+	struct block_device *bdev = NULL;
+	struct super_block *s;
+	struct dentry *root;
+	int error = 0;
+
+	bdev = open_bdev_excl(dev_name, flags, fs_type);
+	if (IS_ERR(bdev))
+		return PTR_ERR(bdev);
+
+	/*
+	 * once the super is inserted into the list by sget, s_umount
+	 * will protect the lockfs code from trying to start a snapshot
+	 * while we are mounting
+	 */
+	down(&bdev->bd_mount_sem);
+	s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
+	up(&bdev->bd_mount_sem);
+	if (IS_ERR(s))
+		goto error_s;
+
+	if (s->s_root) {
+		if ((flags ^ s->s_flags) & MS_RDONLY) {
+			up_write(&s->s_umount);
+			deactivate_super(s);
+			error = -EBUSY;
+			goto error_bdev;
+		}
+
+		close_bdev_excl(bdev);
+		bdev = NULL;
+	} else {
+		char b[BDEVNAME_SIZE];
+
+		s->s_flags = flags;
+		strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
+		sb_set_blocksize(s, block_size(bdev));
+		error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			up_write(&s->s_umount);
+			deactivate_super(s);
+			goto error;
+		}
+
+		s->s_flags |= MS_ACTIVE;
+	}
+
+	if (subvol) {
+		root = lookup_one_len(subvol, s->s_root, strlen(subvol));
+		if (!root) {
+			error = -ENXIO;
+			goto error_bdev;
+		}
+	} else {
+		root = dget(s->s_root);
+	}
+
+	mnt->mnt_sb = s;
+	mnt->mnt_root = root;
+	return 0;
+
+error_s:
+	error = PTR_ERR(s);
+error_bdev:
+	if (bdev)
+		close_bdev_excl(bdev);
+error:
+	return error;
+}
+/* end copy & paste */
+
 static int btrfs_get_sb(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+	int flags, const char *identifier, void *data, struct vfsmount *mnt)
 {
-	return get_sb_bdev(fs_type, flags, dev_name, data,
-			   btrfs_fill_super, mnt);
+	char *_identifier = kstrdup(identifier, GFP_KERNEL);
+	const char *dev_name;
+
+	dev_name = strsep(&_identifier, ":");
+	if (!dev_name)
+		return -ENOMEM;
+
+	return btrfs_get_sb_bdev(fs_type, flags, dev_name, data,
+				 btrfs_fill_super, mnt,
+				 _identifier ? _identifier : "default");
 }
 
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)

John Stoffel

Jun 12, 2007, 11:46:48 PM
to Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
>>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:

Chris> After the last FS summit, I started working on a new filesystem
Chris> that maintains checksums of all file data and metadata. Many
Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
Chris> help on benchmarking analysis.

Chris> The basic list of features looks like this:

Chris> * Extent based file storage (2^64 max file size)
Chris> * Space efficient packing of small files
Chris> * Space efficient indexed directories
Chris> * Dynamic inode allocation
Chris> * Writable snapshots
Chris> * Subvolumes (separate internal filesystem roots)
Chris> - Object level mirroring and striping
Chris> * Checksums on data and metadata (multiple algorithms available)
Chris> - Strong integration with device mapper for multiple device support
Chris> - Online filesystem check
Chris> * Very fast offline filesystem check
Chris> - Efficient incremental backup and FS mirroring

So, can you resize a filesystem both bigger and smaller? Or is that
implicit in the Object level mirroring and striping?

As a user of Netapps, having quotas (if only for reporting purposes)
and some way to migrate non-used files to slower/cheaper storage would
be great.

Ie. being able to setup two pools, one being RAID6, the other being
RAID1, where all currently accessed files are in the RAID1 setup, but
if un-used get migrated to the RAID6 area.

And of course some way for efficient backups and more importantly
RESTORES of data which is segregated like this.

John

Albert Cahalan

Jun 13, 2007, 1:45:49 AM
to linux-...@vger.kernel.org, linux-...@vger.kernel.org, chris...@oracle.com, h...@infradead.org, sni...@gmail.com
Neat! It's great to see somebody else waking up to the idea that
storage media is NOT to be trusted.

Judging by the design paper, it looks like your structs have some
alignment problems.

The usual wishlist:

* inode-to-pathnames mapping
* a subvolume that is a single file (disk image, database, etc.)
* directory indexes to better support Wine and Samba
* secure delete via destruction of per-file or per-block random crypto keys
* fast (seekless) access to normal-sized SE Linux data
* atomic creation of copy-on-write directory trees
* immutable bits like UFS has
* hole punch ability
* insert/delete ability (add/remove a chunk in the middle of a file)

Chris Mason

Jun 13, 2007, 6:21:04 AM
to Christoph Hellwig, Mike Snitzer, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jun 13, 2007 at 04:08:30AM +0100, Christoph Hellwig wrote:
> On Tue, Jun 12, 2007 at 04:14:39PM -0400, Chris Mason wrote:
> > > Aside from folding snapshot history into the origin's namespace... It
> > > could be possible to have a mount.btrfs that allows subvolumes and/or
> > > snapshot volumes to be mounted as unique roots? I'd imagine a bind
> > > mount _could_ provide this too? Anyway, I'm just interested in
> > > understanding the vision for managing the potentially complex nature
> > > of a Btrfs namespace.
> >
> > One option is to put the real btrfs root into some directory in
> > (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind
> > outside of that. I wanted to wait to get fancy until I had a better
> > idea of how people would use the feature.
>
> We already support mounting into subdirectories of a filesystem for
> nfs connection sharing. The patch below makes use of this to allow
> mounting any subdirectory of a btrfs filesystem by specifying it in
> the form of /dev/somedevice:directory and when no subdirectory
> is specified uses 'default'.

Neat, thanks Christoph, this will be much nicer longer term. I'll
integrate it after I finish off -enospc.

> To make this more useful btrfs directories
> should grow some way to be marked head of a subvolume,

They are already different in the btree, but maybe I'm not 100% sure
what you mean by marked as the head of a subvolume?

> and we'd need
> a more useful way to actually create subvolumes and snapshots without
> fugly ioctls.

One way I can think of that doesn't involve an ioctl is to have a
special subdir at the root of the subvolume:

cd /mnt/default/.snaps
mkdir new_snapshot
rmdir old_snapshot

cd /mnt
mkdir new_subvol
rmdir old_subvol

-chris

Chris Mason

Jun 13, 2007, 6:39:22 AM
to John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jun 12, 2007 at 11:46:20PM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:
>
> Chris> After the last FS summit, I started working on a new filesystem
> Chris> that maintains checksums of all file data and metadata. Many
> Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
> Chris> help on benchmarking analysis.
>
> Chris> The basic list of features looks like this:
>
> Chris> * Extent based file storage (2^64 max file size)
> Chris> * Space efficient packing of small files
> Chris> * Space efficient indexed directories
> Chris> * Dynamic inode allocation
> Chris> * Writable snapshots
> Chris> * Subvolumes (separate internal filesystem roots)
> Chris> - Object level mirroring and striping
> Chris> * Checksums on data and metadata (multiple algorithms available)
> Chris> - Strong integration with device mapper for multiple device support
> Chris> - Online filesystem check
> Chris> * Very fast offline filesystem check
> Chris> - Efficient incremental backup and FS mirroring
>
> So, can you resize a filesystem both bigger and smaller? Or is that
> implicit in the Object level mirroring and striping?

Growing the FS is just either extending or adding a new extent tree.
Shrinking is more complex. The extent trees do have back pointers to
the objectids that own the extent, but snapshotting makes that a little
non-deterministic. The good news is there are no fixed locations for
any of the metadata. So it is at least possible to shrink and pop out
arbitrary chunks.

>
> As a user of Netapps, having quotas (if only for reporting purposes)
> and some way to migrate non-used files to slower/cheaper storage would
> be great.

So far, I'm not planning quotas beyond the subvolume level.

>
> Ie. being able to setup two pools, one being RAID6, the other being
> RAID1, where all currently accessed files are in the RAID1 setup, but
> if un-used get migrated to the RAID6 area.

HSM in general is definitely interesting. I'm afraid it is a long ways
off, but it could be integrated into the scrubber that wanders the trees
in the background.

-chris

Chris Mason

Jun 13, 2007, 8:04:26 AM
to Albert Cahalan, linux-...@vger.kernel.org, linux-...@vger.kernel.org, h...@infradead.org, sni...@gmail.com
On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:
> Neat! It's great to see somebody else waking up to the idea that
> storage media is NOT to be trusted.
>
> Judging by the design paper, it looks like your structs have some
> alignment problems.

Actual defs are all packed, but I may still shuffle around the structs
to optimize alignment. The keys are fixed, although I may make the u32
in the middle smaller.
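
As a rough illustration of why the packing matters here (the struct below is a hypothetical stand-in, not the actual btrfs key definition), a u64/u32/u64 key loses four bytes of padding once it is packed:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical key layout: u64, u32, u64, mirroring the "u32 in the
 * middle" mentioned above.  Without packing, the compiler pads the u32
 * out to 8 bytes so the following u64 stays naturally aligned. */
struct demo_key_unpacked {
	uint64_t objectid;
	uint32_t flags;
	uint64_t offset;
};

struct demo_key_packed {
	uint64_t objectid;
	uint32_t flags;
	uint64_t offset;
} __attribute__((packed));

int main(void)
{
	/* Typically prints 24 vs 20 bytes on common ABIs. */
	printf("unpacked: %zu bytes, packed: %zu bytes\n",
	       sizeof(struct demo_key_unpacked),
	       sizeof(struct demo_key_packed));
	return 0;
}

The packed form keeps the on-disk size down at the cost of potentially unaligned loads, which is why shuffling the field order is still on the table.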

>
> The usual wishlist:
>
> * inode-to-pathnames mapping

This one I'll code, it will help with inode link count verification. I
want to be able to detect at run time that an inode with a link count of
zero is still actually in a directory. So there will be back pointers
from the inode to the directory.

Also, the incremental backup code will be able to walk the btree to find
inodes that have changed, and the backpointers will help make a list of
file names that need to be rsync'd or whatever.
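
As a rough sketch of how such back references can be used (an illustrative userspace model only; the names and layout below are hypothetical, not the btrfs disk format), link count verification becomes a comparison between the count stored in the inode and the directory entries that actually reference it:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Each back reference records which directory (and under what name)
 * links to a given inode, so walking an inode's back references yields
 * every path that accounts for its link count. */
struct inode_backref {
	uint64_t dir_objectid;   /* directory that holds the link */
	const char *name;        /* name of the link inside that dir */
};

struct demo_inode {
	uint64_t objectid;
	uint32_t nlink;                    /* link count stored in the inode */
	const struct inode_backref *refs;  /* back references found in the tree */
	size_t nr_refs;
};

/* Run-time check described above: an inode whose link count disagrees
 * with its back references is flagged, e.g. nlink == 0 while the inode
 * is still reachable from a directory. */
static int verify_link_count(const struct demo_inode *inode)
{
	if (inode->nlink != inode->nr_refs) {
		printf("inode %llu: nlink %u but %zu directory entries\n",
		       (unsigned long long)inode->objectid,
		       (unsigned)inode->nlink, inode->nr_refs);
		return -1;
	}
	return 0;
}

int main(void)
{
	const struct inode_backref refs[] = {
		{ 256, "report.txt" },
		{ 301, "report-hardlink.txt" },
	};
	struct demo_inode inode = { 1042, 0, refs, 2 }; /* nlink 0, yet linked */

	return verify_link_count(&inode) ? 1 : 0;
}

The same back references are what would let the incremental backup turn "this inode changed" into a list of file names worth rsync'ing.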

> * a subvolume that is a single file (disk image, database, etc.)

subvolumes can be made that have a single file in them, but they have to
be directories right now. Doing otherwise would complicate mounts and
other management tools (inside the btree, it doesn't really matter).

> * directory indexes to better support Wine and Samba
> * secure delete via destruction of per-file or per-block random crypto keys

I'd rather keep secure delete as a userland problem (or a layered FS
problem). When you take backups and other copies of the file into
account, it's a bigger problem than btrfs wants to tackle right now.

> * fast (seekless) access to normal-sized SE Linux data

acls and xattrs will be adjacent to the inode in the tree. Most of the
time it'll be seekless.

> * atomic creation of copy-on-write directory trees

Do you mean something more fine grained than the current snapshotting
system?

> * immutable bits like UFS has

I'll do the ext2 chattr calls.

> * hole punch ability

Hole punching isn't harder or easier in btrfs than most other
filesystems that support holes. It's largely a VM issue.

> * insert/delete ability (add/remove a chunk in the middle of a file)

The disk format makes this O(extent records past the chunk). It's
possible to code but it would not be optimized.

-chris

John Stoffel

Jun 13, 2007, 10:02:45 AM
to Chris Mason, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
>>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:

>> So, can you resize a filesystem both bigger and smaller? Or is that
>> implicit in the Object level mirroring and striping?

Chris> Growing the FS is just either extending or adding a new extent
Chris> tree. Shrinking is more complex. The extent trees do have
Chris> back pointers to the objectids that own the extent, but
Chris> snapshotting makes that a little non-deterministic. The good
Chris> news is there are no fixed locations for any of the metadata.
Chris> So it is at least possible to shrink and pop out arbitrary
Chris> chunks.

That's good to know. Being able to grow (online of course!) is great,
but being able to shrink would be as well. It makes life so much more flexible
for the SysAdmins, which is my particular focus... since it's my day
job.

>> As a user of Netapps, having quotas (if only for reporting purposes)
>> and some way to migrate non-used files to slower/cheaper storage would
>> be great.

Chris> So far, I'm not planning quotas beyond the subvolume level.

So let me get this straight. Are you saying that quotas would only be
on the volume level, and for the initial level of sub-volumes below
that level? Or would *all* sub-volumes have quota support? And does
that include snapshots as well?

>> Ie. being able to setup two pools, one being RAID6, the other being
>> RAID1, where all currently accessed files are in the RAID1 setup, but
>> if un-used get migrated to the RAID6 area.

Chris> HSM in general is definitely interesting. I'm afraid it is a
Chris> long ways off, but it could be integrated into the scrubber
Chris> that wanders the trees in the background.

Neat. As long as the idea is kept around a bit, that would be nice
for an eventual addition. Or maybe someone needs to come up with a
stackable filesystems to take care of this...

John

Chris Mason

Jun 13, 2007, 10:58:58 AM
to John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:
> >> As a user of Netapps, having quotas (if only for reporting purposes)
> >> and some way to migrate non-used files to slower/cheaper storage would
> >> be great.
>
> Chris> So far, I'm not planning quotas beyond the subvolume level.
>
> So let me get this straight. Are you saying that quotas would only be
> on the volume level, and for the initial level of sub-volumes below
> that level? Or would *all* sub-volumes have quota support? And does
> that include snapshots as well?

On disk, snapshots and subvolumes are identical...the only difference is
their starting state (sorry, it's confusing, and it doesn't help that I
interchange the terms when describing features).

Every subvolume will have a quota on the number of blocks it can
consume. I haven't yet decided on the best way to account for blocks
that are actually shared between snapshots, but it'll be in there
somehow. So if you wanted to make a snapshot readonly, you just set the
quota to 1 block.

But, I'm not planning on adding a way to say user X in subvolume Y has
quota Z. It'll just be: this subvolume can't get bigger than a given
size (at least for version 1.0).

-chris

John Stoffel

Jun 13, 2007, 12:12:48 PM
to Chris Mason, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
>>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:

Chris> On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote:
>> >>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:
>> >> As a user of Netapps, having quotas (if only for reporting purposes)
>> >> and some way to migrate non-used files to slower/cheaper storage would
>> >> be great.
>>
Chris> So far, I'm not planning quotas beyond the subvolume level.
>>
>> So let me get this straight. Are you saying that quotas would only be
>> on the volume level, and for the initial level of sub-volumes below
>> that level? Or would *all* sub-volumes have quota support? And does
>> that include snapshots as well?

Chris> On disk, snapshots and subvolumes are identical...the only
Chris> difference is their starting state (sorry, it's confusing, and
Chris> it doesn't help that I interchange the terms when describing
Chris> features).

Ok, that's fine. A sub-volume is the unit and depending on its
state, it's either a snapshot of an existing volume, or it's a volume
on its own, though it still has a parent (?) below which it is
mounted? Do I have it right now?

Chris> Every subvolume will have a quota on the number of blocks it
Chris> can consume. I haven't yet decided on the best way to account
Chris> for blocks that are actually shared between snapshots, but
Chris> it'll be in there somehow. So if you wanted to make a snapshot
Chris> readonly, you just set the quota to 1 block.

Ok, so you really aren't talking about Quotas here, but space
reservations instead.

Also, I think you're wrong here when you state that making a snapshot
(sub-volume?) RO just requires you to set the quota to 1 block. What
is to stop me from writing 1 block to a random file that already
exists?

Chris> But, I'm not planning on adding a way to say user X in
Chris> subvolume Y has quota Z. It'll just be: this subvolume can't
Chris> get bigger than a given size (at least for version 1.0).

Ok, so version 1.0 isn't as interesting to me in a production
environment, since we pretty much need quotas (or a quick way to
monitor how much space a user has been allocated on a volume).

But for a home system, it's certainly looking interesting as well,
since I could give each home directory its own sub-volume and just
grow/shrink them as needed.

Maybe. :]

Thanks for your work on this.

John

Albert Cahalan

Jun 13, 2007, 12:15:00 PM
to Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org, h...@infradead.org, sni...@gmail.com
On 6/13/07, Chris Mason <chris...@oracle.com> wrote:
> On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:

> > The usual wishlist:
> >
> > * inode-to-pathnames mapping
>
> This one I'll code, it will help with inode link count verification. I
> want to be able to detect at run time that an inode with a link count of
> zero is still actually in a directory. So there will be back pointers
> from the inode to the directory.

Great, but fsck improvement wasn't on my mind. This is
a desirable feature for the NFS server, and for regular users.
Think about a backup program trying to maintain hard links.

> Also, the incremental backup code will be able to walk the btree to find
> inodes that have changed, and the backpointers will help make a list of
> file names that need to be rsync'd or whatever.
>
> > * a subvolume that is a single file (disk image, database, etc.)
>
> subvolumes can be made that have a single file in them, but they have to
> be directories right now. Doing otherwise would complicate mounts and
> other management tools (inside the btree, it doesn't really matter).

Bummer. As I understand it, ZFS provides this. :-)

> > * directory indexes to better support Wine and Samba
> > * secure delete via destruction of per-file or per-block random crypto keys
>
> I'd rather keep secure delete as a userland problem (or a layered FS
> problem). When you take backups and other copies of the file into
> account, it's a bigger problem than btrfs wants to tackle right now.

It can't be a userland problem if you allow disk blocks to move.
Volume resizing, logging/journalling, etc. -- they combine to make
the userland solution essentially impossible. (one could wipe the
whole partition, or maybe fill ALL space on the volume)

I think it needs to be per-extent.

At each level in the btree, you place a randomly generated key
for the more leafward nodes. This means that secure deletion is
merely the act of wiping the key... which can itself occur by
wiping the key of the more rootward node.
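
A toy model of that key hierarchy (illustrative only; XOR stands in for a real cipher, and none of this is btrfs code) shows why wiping a single parent key is enough to make a whole subtree unrecoverable:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Each node holds the random key that protects its children, so the
 * secure-delete operation is just destroying that one key. */
struct node {
	uint8_t child_key[16];  /* random key protecting the children */
	struct node *child;
	uint8_t payload[32];    /* stored encrypted under the parent's key */
};

static void xor_crypt(uint8_t *buf, size_t len, const uint8_t *key)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= key[i % 16];
}

/* "Secure delete" of a subtree: overwrite the one key its parent holds;
 * every descendant stays ciphertext forever. */
static void secure_delete_subtree(struct node *parent)
{
	memset(parent->child_key, 0, sizeof(parent->child_key));
}

int main(void)
{
	struct node leaf = { .child = NULL, .payload = "secret block data" };
	struct node root = { .child = &leaf };

	for (int i = 0; i < 16; i++)
		root.child_key[i] = (uint8_t)rand();

	xor_crypt(leaf.payload, sizeof(leaf.payload), root.child_key);
	secure_delete_subtree(&root);

	/* leaf.payload is now unreadable without the destroyed key. */
	printf("first byte of (unrecoverable) payload: 0x%02x\n",
	       (unsigned)leaf.payload[0]);
	return 0;
}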

> > * atomic creation of copy-on-write directory trees
>
> Do you mean something more fine grained than the current snapshotting
> system?

I believe so. Example: I have a linux-2.6 directory. It's not
a mount point or anything special like that. I want to copy
it to a new directory called wip, without actually copying
all the blocks. To all the normal POSIX API stuff, this copy
should look like the result of "cp -a", not hard links.

> > * insert/delete ability (add/remove a chunk in the middle of a file)
>
> The disk format makes this O(extent records past the chunk). It's
> possible to code but it would not be optimized.

That's understandable, but note that Reiserfs can support this.

Grzegorz Kulewski

Jun 13, 2007, 12:24:51 PM
to Chris Mason, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, 13 Jun 2007, Chris Mason wrote:
> But, I'm not planning on adding a way to say user X in subvolume Y has
> quota Z. It'll just be: this subvolume can't get bigger than a given
> size (at least for version 1.0).

I am afraid that this one is a major stopper for any production usage.
Think about OpenVZ (or similar) VPSes. Of course having each VPS in its own
subvolume on the same device and being able to limit each subvolume is
more than cool, but on the other hand the admin in a VPS really needs to be
able to set normal quotas for his users.

Other than that your project looks really good and interesting.

I also wonder if it is (would be) possible to set per-tree quotas like
this:

/a - 20GB
/a/b - 10GB
/a/b/c - 2GB
/a/d - 5GB
/e - 30GB

meaning that the whole subtree under /a is limited to 20GB, the whole tree under
/a/b is limited both by the 20GB of /a and by the 10GB of /a/b, the tree under
/a/b/c is limited by the 20GB of /a, the 10GB of /a/b and the 2GB of /a/b/c, and
so on? Or could only /a and /e be limited?
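
As a minimal sketch of how such nested limits could behave (hypothetical code, not anything in Btrfs), a block charge would have to fit the limit of every ancestor on the path before being accounted to all of them:

#include <stdint.h>
#include <stdio.h>

/* Models the /a 20GB, /a/b 10GB, /a/b/c 2GB example above. */
struct subvol {
	const char *path;
	uint64_t limit_bytes;   /* 0 means unlimited */
	uint64_t used_bytes;
	struct subvol *parent;
};

/* Returns 0 if 'bytes' can be charged to 'sv' and every ancestor. */
static int charge_blocks(struct subvol *sv, uint64_t bytes)
{
	for (struct subvol *p = sv; p; p = p->parent)
		if (p->limit_bytes && p->used_bytes + bytes > p->limit_bytes)
			return -1;  /* would exceed this ancestor's limit */
	for (struct subvol *p = sv; p; p = p->parent)
		p->used_bytes += bytes;
	return 0;
}

int main(void)
{
	const uint64_t GB = 1024ULL * 1024 * 1024;
	struct subvol a = { "/a",     20 * GB, 0, NULL };
	struct subvol b = { "/a/b",   10 * GB, 0, &a };
	struct subvol c = { "/a/b/c",  2 * GB, 0, &b };

	printf("3GB into /a/b/c: %s\n", charge_blocks(&c, 3 * GB) ? "denied" : "ok");
	printf("3GB into /a/b:   %s\n", charge_blocks(&b, 3 * GB) ? "denied" : "ok");
	return 0;
}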


Thanks,

Grzegorz Kulewski

Chris Mason

Jun 13, 2007, 12:38:34 PM
to John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jun 13, 2007 at 12:12:23PM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason <chris...@oracle.com> writes:

[ nod ]

> Also, I think you're wrong here when you state that making a snapshot
> (sub-volume?) RO just requires you to set the quota to 1 block. What
> is to stop me from writing 1 block to a random file that already
> exists?

It's copy on write, so changing one block means allocating a new one and
putting the new contents there. The old blocks don't become available
for reuse until the transaction commits.
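
A minimal model of that (illustrative only, nothing btrfs-specific): a write never touches the old block; it allocates a new one, and the old block is only queued for release at transaction commit:

#include <stdio.h>
#include <string.h>

#define NBLOCKS 8

static char disk[NBLOCKS][16];
static int  allocated[NBLOCKS];
static int  pending_free[NBLOCKS];  /* released only when the transaction commits */

static int alloc_block(void)
{
	for (int i = 0; i < NBLOCKS; i++)
		if (!allocated[i] && !pending_free[i]) {
			allocated[i] = 1;
			return i;
		}
	return -1;  /* quota or space exhausted: the write is refused */
}

/* Returns the new block number, or -1 if nothing could be allocated. */
static int cow_write(int old_block, const char *data)
{
	int new_block = alloc_block();

	if (new_block < 0)
		return -1;
	strncpy(disk[new_block], data, sizeof(disk[new_block]) - 1);
	allocated[old_block] = 0;
	pending_free[old_block] = 1;  /* reusable only after commit */
	return new_block;
}

static void commit_transaction(void)
{
	memset(pending_free, 0, sizeof(pending_free));
}

int main(void)
{
	int b = alloc_block();
	int nb;

	strncpy(disk[b], "version 1", sizeof(disk[b]) - 1);
	nb = cow_write(b, "version 2");
	printf("old block %d still holds \"%s\", new block %d holds \"%s\"\n",
	       b, disk[b], nb, disk[nb]);
	commit_transaction();
	return 0;
}

So with a 1-block quota there is simply no free block to copy into, and the overwrite fails rather than modifying the snapshot in place.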

-chris

Chris Mason

Jun 13, 2007, 1:00:59 PM
to Albert Cahalan, linux-...@vger.kernel.org, linux-...@vger.kernel.org, h...@infradead.org, sni...@gmail.com
On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote:
> On 6/13/07, Chris Mason <chris...@oracle.com> wrote:
> >On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:
>
> >> The usual wishlist:
> >>
> >> * inode-to-pathnames mapping
> >
> >This one I'll code, it will help with inode link count verification. I
> >want to be able to detect at run time that an inode with a link count of
> >zero is still actually in a directory. So there will be back pointers
> >from the inode to the directory.
>
> Great, but fsck improvement wasn't on my mind. This is
> a desirable feature for the NFS server, and for regular users.
> Think about a backup program trying to maintain hard links.

Sure, it'll be there either way ;)

>
> >Also, the incremental backup code will be able to walk the btree to find
> >inodes that have changed, and the backpointers will help make a list of
> >file names that need to be rsync'd or whatever.
> >
> >> * a subvolume that is a single file (disk image, database, etc.)
> >
> >subvolumes can be made that have a single file in them, but they have to
> >be directories right now. Doing otherwise would complicate mounts and
> >other management tools (inside the btree, it doesn't really matter).
>
> Bummer. As I understand it, ZFS provides this. :-)

Grin, when the pain of typing cd subvol is btrfs' biggest worry, I'll be
doing very well.

>
> >> * directory indexes to better support Wine and Samba
> >> * secure delete via destruction of per-file or per-block random crypto
> >keys
> >
> >I'd rather keep secure delete as a userland problem (or a layered FS
> >problem). When you take backups and other copies of the file into
> >account, it's a bigger problem than btrfs wants to tackle right now.
>
> It can't be a userland problem if you allow disk blocks to move.
> Volume resizing, logging/journalling, etc. -- they combine to make
> the userland solution essentially impossible. (one could wipe the
> whole partition, or maybe fill ALL space on the volume)

Right about here is where I would insert a long story about ecryptfs, or
encryption solutions that happen all in userland. At any rate, it is
outside the scope of v1.0, even though I definitely agree it is an
important problem for some people.

> >> * atomic creation of copy-on-write directory trees
> >
> >Do you mean something more fine grained than the current snapshotting
> >system?
>
> I believe so. Example: I have a linux-2.6 directory. It's not
> a mount point or anything special like that. I want to copy
> it to a new directory called wip, without actually copying
> all the blocks. To all the normal POSIX API stuff, this copy
> should look like the result of "cp -a", not hard links.

This would be a snapshot, which has to be done on a subvolume right now.
It is not as nice as being able to pick a random directory, but I've
only been able to get this far by limiting the feature scope
significantly. What I did do was make subvolumes very cheap...just make
a bunch of them.

Keep in mind that if you implement a cow directory tree without a
snapshot, and you don't want to duplicate any blocks in the cow, you're
going to have fun with inode numbers.

-chris

Albert Cahalan

Jun 14, 2007, 2:59:45 AM
to Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org, h...@infradead.org, sni...@gmail.com
On 6/13/07, Chris Mason <chris...@oracle.com> wrote:
> On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote:
> > On 6/13/07, Chris Mason <chris...@oracle.com> wrote:
> > >On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:

> > >> * secure delete via destruction of per-file or per-block random crypto
> > >keys
> > >
> > >I'd rather keep secure delete as a userland problem (or a layered FS
> > >problem). When you take backups and other copies of the file into
> > >account, it's a bigger problem than btrfs wants to tackle right now.
> >
> > It can't be a userland problem if you allow disk blocks to move.
> > Volume resizing, logging/journalling, etc. -- they combine to make
> > the userland solution essentially impossible. (one could wipe the
> > whole partition, or maybe fill ALL space on the volume)
>
> Right about here is where I would insert a long story about ecryptfs, or
> encryption solutions that happen all in userland. At any rate, it is
> outside the scope of v1.0, even though I definitely agree it is an
> important problem for some people.

I'm sure you do have a nice long story, and I'm sure it seems
correct, but there is something not quite right about the add-on
hacks.

BTW, I'm suggesting that this be about deletion, not protection
of data you wish to keep. It covers more than just file bodies.
It covers inode data, block allocations, etc.

> > >> * atomic creation of copy-on-write directory trees
> > >
> > >Do you mean something more fine grained than the current snapshotting
> > >system?
> >
> > I believe so. Example: I have a linux-2.6 directory. It's not
> > a mount point or anything special like that. I want to copy
> > it to a new directory called wip, without actually copying
> > all the blocks. To all the normal POSIX API stuff, this copy
> > should look like the result of "cp -a", not hard links.
>
> This would be a snapshot, which has to be done on a subvolume right now.
> It is not as nice as being able to pick a random directory, but I've
> only been able to get this far by limiting the feature scope
> significantly. What I did do was make subvolumes very cheap...just make
> a bunch of them.

Can a regular user create and use a subvolume? If not, then
this doesn't work. (if so, then I have other concerns...)

Chris Mason

Jun 14, 2007, 8:34:28 AM
to Albert Cahalan, linux-...@vger.kernel.org, linux-...@vger.kernel.org, h...@infradead.org, sni...@gmail.com
On Thu, Jun 14, 2007 at 02:59:23AM -0400, Albert Cahalan wrote:
> On 6/13/07, Chris Mason <chris...@oracle.com> wrote:

[ secure deletion in btrfs ]

> >
> >Right about here is where I would insert a long story about ecryptfs, or
> >encryption solutions that happen all in userland. At any rate, it is
> >outside the scope of v1.0, even though I definitely agree it is an
> >important problem for some people.
>
> I'm sure you do have a nice long story, and I'm sure it seems
> correct, but there is something not quite right about the add-on
> hacks.
>
> BTW, I'm suggesting that this be about deletion, not protection
> of data you wish to keep. It covers more than just file bodies.
> It covers inode data, block allocations, etc.

Sorry, it's still way outside the scope of v1.0.

>
> >> >> * atomic creation of copy-on-write directory trees
> >> >
> >> >Do you mean something more fine grained than the current snapshotting
> >> >system?
> >>
> >> I believe so. Example: I have a linux-2.6 directory. It's not
> >> a mount point or anything special like that. I want to copy
> >> it to a new directory called wip, without actually copying
> >> all the blocks. To all the normal POSIX API stuff, this copy
> >> should look like the result of "cp -a", not hard links.
> >
> >This would be a snapshot, which has to be done on a subvolume right now.
> >It is not as nice as being able to pick a random directory, but I've
> >only been able to get this far by limiting the feature scope
> >significantly. What I did do was make subvolumes very cheap...just make
> >a bunch of them.
>
> Can a regular user create and use a subvolume? If not, then
> this doesn't work. (if so, then I have other concerns...)

That's the long term goal, but I'll have to reorganize things such that
subvolumes created by a user can all fall under sane accounting.

-chris

Chuck Lever

Jun 14, 2007, 2:22:03 PM
to Chris Mason, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Hi Chris-

John Stoffel wrote:
> As a user of Netapps, having quotas (if only for reporting purposes)
> and some way to migrate non-used files to slower/cheaper storage would
> be great.
>
> Ie. being able to setup two pools, one being RAID6, the other being
> RAID1, where all currently accessed files are in the RAID1 setup, but
> if un-used get migrated to the RAID6 area.
>
> And of course some way for efficient backups and more importantly
> RESTORES of data which is segregated like this.

I like the way dump and restore was handled in AFS (and now ZFS and
NetApp). There is a simple command to flatten a file system and send it
to another system, which can receive it and re-expand it. The
dump/restore process uses snapshots and can easily send incremental
backups which are significantly smaller than 0-level. This is somewhat
better than rsync, because you don't need checksums to discover what
data has changed -- you already have the new data segregated into
copied-on-write blocks.

NetApp happens to use the standard NDMP protocol for sending the
flattened file system. NetApp uses it for synchronous replication,
volume migration, and back up to nearline storage and tape. AFS used
"vol dump" and "vol restore" for migration, replication, and back-up.
ZFS has the "zfs send" and "zfs receive" commands that do basically the
same (Eric Kustarz recently published a blog entry that described how
these work). And of course, all file system objects are able to be sent
this way: streams, xattrs, ACLs, and so on are all supported.

Note also that NFSv4 supports the idea of migrated or replicated file
objects. All that is needed to support it is a mechanism on the servers
to actually move the data.


Chris Mason

Jun 14, 2007, 2:52:27 PM
to Chuck Lever, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Thu, Jun 14, 2007 at 02:20:26PM -0400, Chuck Lever wrote:
> Hi Chris-
>
> John Stoffel wrote:
> >As a user of Netapps, having quotas (if only for reporting purposes)
> >and some way to migrate non-used files to slower/cheaper storage would
> >be great.
> >
> >Ie. being able to setup two pools, one being RAID6, the other being
> >RAID1, where all currently accessed files are in the RAID1 setup, but
> >if un-used get migrated to the RAID6 area.
> >
> >And of course some way for efficient backups and more importantly
> >RESTORES of data which is segregated like this.
>
> I like the way dump and restore was handled in AFS (and now ZFS and
> NetApp). There is a simple command to flatten a file system and send it
> to another system, which can receive it and re-expand it. The
> dump/restore process uses snapshots and can easily send incremental
> backups which are significantly smaller than 0-level. This is somewhat
> better than rsync, because you don't need checksums to discover what
> data has changed -- you already have the new data segregated into
> copied-on-write blocks.

The planned scheme in btrfs involves storing the transaction id in the
header of every btree block, the inodes and the file extents. So, the
only info you need to generate the incremental is the transaction id of
the last time you sync'd. The tree would get walked and everything with
a higher transaction id sent down the pipe.
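
A rough sketch of that walk (an illustrative model only, not btrfs code): because copy on write rewrites every node on the path from a changed block up to the root, each node's stored transaction id is at least as new as anything beneath it, so whole subtrees at or below the last synced transid can be skipped:

#include <stdint.h>
#include <stdio.h>

struct tnode {
	uint64_t transid;       /* transaction that last wrote this node */
	const char *name;
	struct tnode *child[2];
};

/* Emit everything written after 'last_sync', pruning unchanged subtrees. */
static void walk_changed(const struct tnode *n, uint64_t last_sync)
{
	if (!n || n->transid <= last_sync)
		return;         /* nothing below here changed since last sync */
	printf("changed: %s (transid %llu)\n", n->name,
	       (unsigned long long)n->transid);
	walk_changed(n->child[0], last_sync);
	walk_changed(n->child[1], last_sync);
}

int main(void)
{
	struct tnode old_leaf = { 5, "old leaf", { NULL, NULL } };
	struct tnode new_leaf = { 9, "new leaf", { NULL, NULL } };
	struct tnode root     = { 9, "root",     { &old_leaf, &new_leaf } };

	walk_changed(&root, 7);  /* send everything written after transid 7 */
	return 0;
}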

Since there are back pointers from inode to directory, it will also be able
to generate a list of filenames that have changed, along with a list of
extents that were modified.

Between the two schemes, you would be able to do metadata aware (ie
storing exact duplicates) syncs between two boxes, or just have a faster
rsync that doesn't need to walk the directory structure to find changes.

I will probably code the rsync version first, just because it allows
you to backup to anything, not just a btrfs specific target. It is
somewhat safer for an experimental FS (metadata errors are not
duplicated) and will hopefully be easier to code.

>
> NetApp happens to use the standard NDMP protocol for sending the
> flattened file system. NetApp uses it for synchronous replication,
> volume migration, and back up to nearline storage and tape. AFS used
> "vol dump" and "vol restore" for migration, replication, and back-up.
> ZFS has the "zfs send" and "zfs receive" commands that do basically the
> same (Eric Kustarz recently published a blog entry that described how
> these work). And of course, all file system objects are able to be sent
> this way: streams, xattrs, ACLs, and so on are all supported.
>
> Note also that NFSv4 supports the idea of migrated or replicated file
> objects. All that is needed to support it is a mechanism on the servers
> to actually move the data.

Stringing the replication together with the underlying FS would be neat.
Is there a way to deal with a master/slave setup, where the slave may be
out of date?

-chris

Chuck Lever

Jun 15, 2007, 1:18:10 PM
to Chris Mason, John Stoffel, linux-...@vger.kernel.org, linux-...@vger.kernel.org

Among the implementations I'm aware of, there is a varying degree of
integration into the physical file system. In general, it depends on
how far out of date the slave is, and how closely the slave is supposed
to be synchronized to the master.

A hot backup file system, for example, should be data-consistent within
a few seconds of the master. A snapshot is used to initialize a slave,
followed by a live stream of updates to the master being sent to slaves.
Such a mechanism already exists on NetApp filers because they gather
changes in NVRAM before committing them to the local file system.
Simply put, these changes can also be bundled and sent to a local hot
backup filer that is attached via Infiniband, or over the network to a
remote hot backup filer.

For AFS, replication is done by maintaining a rw and ro copy of a volume
on the designated master server. Changes are made to the rw copy over
time. When admins want to push out a new version to replicas on another
server, the ro copy on the master is replaced with a new snapshot, then
this is pushed to the slaves. The replicas are always ro and are used
mostly for load balancing; clients contact the closest or fastest server
containing a replica of the volume they want to access. They always
have a complete copy of the volume (ie no COW on the slaves).

I think you have designed into btrfs a lot of opportunity to implement
this kind of data virtualization and management... I'm excited to see
what can be done.


Chris Mason

Jun 18, 2007, 10:45:42 AM
to linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jun 12, 2007 at 12:10:29PM -0400, Chris Mason wrote:
> Hello everyone,
>
> After the last FS summit, I started working on a new filesystem that
> maintains checksums of all file data and metadata. Many thanks to Zach
> Brown for his ideas, and to Dave Chinner for his help on
> benchmarking analysis.

Thanks to everyone that tried out btrfs. Most of those that did managed
to hit problems with apps that did writes via mmap. So, I added a
page_mkwrite call and worked out cow safe mmap writes.

Terje Røsten sent along a btrfsprogs patch so that it compiles properly
on FC7 (thanks!).

I'll have a site up on oss.oracle.com/projects with mailing lists and
links to my HG trees shortly. Until then:

http://oss.oracle.com/~mason/btrfs/ has the latest tar balls.

Vladislav Bolkhovitin

Jun 18, 2007, 2:07:50 PM
to Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Chris Mason wrote:
> Hello everyone,
>
> After the last FS summit, I started working on a new filesystem that
> maintains checksums of all file data and metadata. Many thanks to Zach
> Brown for his ideas, and to Dave Chinner for his help on
> benchmarking analysis.
>
> The basic list of features looks like this:
>
> * Extent based file storage (2^64 max file size)
> * Space efficient packing of small files
> * Space efficient indexed directories
> * Dynamic inode allocation
> * Writable snapshots
> * Subvolumes (separate internal filesystem roots)
> - Object level mirroring and striping
> * Checksums on data and metadata (multiple algorithms available)
> - Strong integration with device mapper for multiple device support
> - Online filesystem check
> * Very fast offline filesystem check
> - Efficient incremental backup and FS mirroring

I would also suggest one more feature: support for block level
de-duplication. I mean:

1. Ability for Btrfs to have blocks in several files to point to the
same block on disk

2. Support for new syscall or IOCTL to de-duplicate as a single
transaction two or more blocks on disk, i.e. link them to one of them
and free others

3. De-de-duplicate blocks on disk, i.e. copy them on write

I suppose that de-duplication itself would be done by some user space
process that would scan files, determine blocks with the same data and
then de-duplicate them by using syscall or IOCTL (2).
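
For illustration, a userspace-side sketch of what such a request might look like (entirely hypothetical; no such syscall or ioctl exists, and the names below are invented):

#include <stdint.h>
#include <stdio.h>

/* Shape of the request in points (1)-(3): make these ranges of two
 * files point at one shared copy, and copy-on-write them again if
 * either side is later written. */
struct dedup_request {
	int      src_fd;        /* file providing the surviving copy */
	uint64_t src_offset;    /* byte offset of the shared range */
	int      dst_fd;        /* file whose duplicate blocks get freed */
	uint64_t dst_offset;
	uint64_t length;        /* length of the identical range */
};

/* In a real implementation this would be a single atomic ioctl, with
 * the kernel re-verifying that the ranges still match before linking
 * them; here it only prints what would be requested. */
static int dedup_range(const struct dedup_request *req)
{
	printf("would share %llu bytes: fd %d @%llu <- fd %d @%llu\n",
	       (unsigned long long)req->length,
	       req->src_fd, (unsigned long long)req->src_offset,
	       req->dst_fd, (unsigned long long)req->dst_offset);
	return 0;
}

int main(void)
{
	/* A userspace scanner would hash file blocks, find identical ones,
	 * and submit them; these values are placeholders. */
	struct dedup_request req = { 3, 0, 4, 8192, 4096 };

	return dedup_range(&req);
}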

That would be very usable feature, which in most cases would allow to
shrink occupied disk space on 50-90%.

Vlad

John Stoffel

Jun 18, 2007, 4:09:18 PM
to Vladislav Bolkhovitin, Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
>>>>> "Vladislav" == Vladislav Bolkhovitin <v...@vlnb.net> writes:

>> The basic list of features looks like this:
>>
>> * Extent based file storage (2^64 max file size)
>> * Space efficient packing of small files
>> * Space efficient indexed directories
>> * Dynamic inode allocation
>> * Writable snapshots
>> * Subvolumes (separate internal filesystem roots)
>> - Object level mirroring and striping
>> * Checksums on data and metadata (multiple algorithms available)
>> - Strong integration with device mapper for multiple device support
>> - Online filesystem check
>> * Very fast offline filesystem check
>> - Efficient incremental backup and FS mirroring

Vladislav> I would also suggest one more feature: support for block
Vladislav> level de-duplication. I mean:

Vladislav> 1. Ability for Btrfs to have blocks in several files to
Vladislav> point to the same block on disk

Yikes! I'd be *very* wary of this feature. It's going to be
computationally expensive to do, and it's going to make the entire
filesystem more fragile, since now one bit of corruption means that
possibly all files with that block are now toast.

There would need to be some serious parity checking done here to make
me feel comfortable with such a system.

Vladislav> 2. Support for new syscall or IOCTL to de-duplicate as a
Vladislav> single transaction two or more blocks on disk, i.e. link
Vladislav> them to one of them and free others

Do you want it per-block, or per-file?

Vladislav> 3. De-de-duplicate blocks on disk, i.e. copy them on write

You mean that if two files are sharing blocks and one gets written to,
then the changes are copied to a new block and the pointers updated?
Sure, that's an integral part of the concept.

Vladislav> I suppose that de-duplication itself would be done by some
Vladislav> user space process that would scan files, determine blocks
Vladislav> with the same data and then de-duplicate them by using
Vladislav> syscall or IOCTL (2).

Lots of work... where do you keep the metadata then? In the
filesystem?

Vladislav> That would be very usable feature, which in most cases
Vladislav> would allow to shrink occupied disk space on 50-90%.

Maybe. It's reliability I'd be concerned about here.

John

Pádraig Brady

Jun 19, 2007, 5:12:37 AM
to Vladislav Bolkhovitin, Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Vladislav Bolkhovitin wrote:
>
> I would also suggest one more feature: support for block level
> de-duplication. I mean:
>
> 1. Ability for Btrfs to have blocks in several files to point to the
> same block on disk
>
> 2. Support for new syscall or IOCTL to de-duplicate as a single
> transaction two or more blocks on disk, i.e. link them to one of them
> and free others
>
> 3. De-de-duplicate blocks on disk, i.e. copy them on write
>
> I suppose that de-duplication itself would be done by some user space
> process that would scan files, determine blocks with the same data and
> then de-duplicate them by using syscall or IOCTL (2).
>
> That would be very usable feature, which in most cases would allow to
> shrink occupied disk space on 50-90%.

Have you references for this number?
In my experience one gets a lot of benefit from
the much simpler process of "de-duplication" of files.

Note that a checksum stored in file metadata,
automatically invalidated on write, would
speed up user space file de-duplication,
and rsync, etc.

Pádraig.

Vladislav Bolkhovitin

Jun 19, 2007, 6:02:07 AM
to Pádraig Brady, Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Pádraig Brady wrote:
> Vladislav Bolkhovitin wrote:
>
>>I would also suggest one more feature: support for block level
>>de-duplication. I mean:
>>
>>1. Ability for Btrfs to have blocks in several files to point to the
>>same block on disk
>>
>>2. Support for new syscall or IOCTL to de-duplicate as a single
>>transaction two or more blocks on disk, i.e. link them to one of them
>>and free others
>>
>>3. De-de-duplicate blocks on disk, i.e. copy them on write
>>
>>I suppose that de-duplication itself would be done by some user space
>>process that would scan files, determine blocks with the same data and
>>then de-duplicate them by using syscall or IOCTL (2).
>>
>>That would be very usable feature, which in most cases would allow to
>>shrink occupied disk space on 50-90%.
>
> Have you references for this number?

No, I've seen it somewhere and it agrees well with my own observations.

> In my experience one gets a lot of benefit from
> the much simpler process of "de-duplication" of files.

Yes, sure, de-duplication at the file level brings its benefits, but at the
FS block level it would bring even more benefit, because there are many
fairly large files which are different as a whole but share a lot
of the same blocks. A simple example of such files is UNIX-style mail
boxes on a mail server.

> Note that a checksum stored in file metadata,
> automatically invalidated on write, would
> speed up user space file de-duplication,
> and rsync, etc.

Sure, file level deduplication is simpler to implement, but it is
generally less powerful than the FS block level approach. And it seems it
should not be hard to add the above (1)-(3) features to the existing Btrfs
structure, especially at this stage, when the disk format isn't
stabilized yet.

Chris Mason

unread,
Jun 19, 2007, 8:07:52 AM6/19/07
to Pádraig Brady, Vladislav Bolkhovitin, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jun 19, 2007 at 10:11:13AM +0100, Pádraig Brady wrote:
> Vladislav Bolkhovitin wrote:
> >
> > I would also suggest one more feature: support for block level
> > de-duplication. I mean:
> >
> > 1. Ability for Btrfs to have blocks in several files to point to the
> > same block on disk
> >
> > 2. Support for new syscall or IOCTL to de-duplicate as a single
> > transaction two or more blocks on disk, i.e. link them to one of them
> > and free others
> >
> > 3. De-de-duplicate blocks on disk, i.e. copy them on write
> >
> > I suppose that de-duplication itself would be done by some user space
> > process that would scan files, determine blocks with the same data and
> > then de-duplicate them by using syscall or IOCTL (2).
> >
> > That would be very usable feature, which in most cases would allow to
> > shrink occupied disk space on 50-90%.
>
> Have you references for this number?
> In my experience one gets a lot of benefit from
> the much simpler process of "de-duplication" of files.

Yes, I would expect simple hard links to be a better solution for this,
but the feature request is not that out of line. I actually had plans
on implementing auto duplicate block reuse earlier in btrfs.

Snapshots already share duplicate blocks between files, and so all of
the reference counting needed to implement this already exists.
Snapshots are writable, and data mods are copy on write, and in general
things work.

But, to help fsck, the extent allocation tree has a back pointer to the
inode that owns an extent. If you're doing snapshots, all of the owners
of the extent have the same inode number. If you're sharing duplicate
blocks, the owners can have any inode number, and fsck becomes much more
complex.
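
Roughly, and only as an illustration (this is not the actual btrfs
on-disk format), the difference looks like this:

#include <stdint.h>

/* With snapshots, every tree referencing this extent does so from the
 * same (inode, offset) position, so one back pointer lets fsck find all
 * expected owners. */
struct extent_backref_snapshot {
	uint64_t owner_inode;	/* identical in every snapshot */
	uint64_t file_offset;	/* identical in every snapshot */
	uint32_t refs;		/* one per snapshot/subvolume */
};

/* With free-form block sharing, each reference can come from a different
 * inode at a different offset, so fsck would have to chase a variable
 * length list per extent instead of a single back pointer. */
struct extent_backref_dedup {
	uint32_t nr_owners;
	struct {
		uint64_t owner_inode;
		uint64_t file_offset;
	} owners[];		/* one entry per sharing file */
};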

In general, when I have to decide between fsck and a feature, I'm going
to pick fsck. The features are much more fun, but fsck is one of the
main motivations for doing this work.

Thanks for the input,
Chris

Vladislav Bolkhovitin

unread,
Jun 19, 2007, 10:00:54 AM6/19/07
to Chris Mason, Pádraig Brady, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Chris Mason wrote:
> On Tue, Jun 19, 2007 at 10:11:13AM +0100, Pádraig Brady wrote:
>
>>Vladislav Bolkhovitin wrote:
>>
>>>I would also suggest one more feature: support for block level
>>>de-duplication. I mean:
>>>
>>>1. Ability for Btrfs to have blocks in several files to point to the
>>>same block on disk
>>>
>>>2. Support for new syscall or IOCTL to de-duplicate as a single
>>>transaction two or more blocks on disk, i.e. link them to one of them
>>>and free others
>>>
>>>3. De-de-duplicate blocks on disk, i.e. copy them on write
>>>
>>>I suppose that de-duplication itself would be done by some user space
>>>process that would scan files, determine blocks with the same data and
>>>then de-duplicate them by using syscall or IOCTL (2).
>>>
>>>That would be very usable feature, which in most cases would allow to
>>>shrink occupied disk space on 50-90%.
>>
>>Have you references for this number?
>>In my experience one gets a lot of benefit from
>>the much simpler process of "de-duplication" of files.
>
>
> Yes, I would expect simple hard links to be a better solution for this,
> but the feature request is not that out of line.

From an effort POV hard links could be a better solution, but from an
effectiveness POV I can't agree with you.

> I actually had plans
> on implementing auto duplicate block reuse earlier in btrfs.
>
> Snapshots already share duplicate blocks between files, and so all of
> the reference counting needed to implement this already exists.
> Snapshots are writable, and data mods are copy on write, and in general
> things work.
>
> But, to help fsck, the extent allocation tree has a back pointer to the
> inode that owns an extent. If you're doing snapshots, all of the owners
> of the extent have the same inode number. If you're sharing duplicate
> blocks, the owners can have any inode number, and fsck becomes much more
> complex.
>
> In general, when I have to decide between fsck and a feature, I'm going
> to pick fsck. The features are much more fun, but fsck is one of the
> main motivations for doing this work.

I see. Thanks for explaining your position.

Vlad

da...@lang.hm

unread,
Jun 19, 2007, 2:21:02 PM6/19/07
to Vladislav Bolkhovitin, Pádraig Brady, Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, 19 Jun 2007, Vladislav Bolkhovitin wrote:

>> > 3. De-de-duplicate blocks on disk, i.e. copy them on write
>> >
>> > I suppose that de-duplication itself would be done by some user space
>> > process that would scan files, determine blocks with the same data and
>> > then de-duplicate them by using syscall or IOCTL (2).
>> >
>> > That would be very usable feature, which in most cases would allow to
>> > shrink occupied disk space on 50-90%.
>>
>> Have you references for this number?
>
> No, I've seen it somewhere and it well confirms with my own observations.
>
>> In my experience one gets a lot of benefit from
>> the much simpler process of "de-duplication" of files.
>
> Yes, sure, de-duplication on files level brings its benefits, but on FS
> blocks level it would bring ever more benefits, because there are many more
> or less big files, which are different as a whole, but with a lot of the same
> blocks. Simple example of such files is UNIX-style mail boxes on a mail
> server.

Unix-style mail boxes would not be a good example of wins for sector-based
de-duplication, since the duplicate mail is not going to be sector-aligned.

David Lang

da...@lang.hm

unread,
Jun 19, 2007, 2:24:50 PM6/19/07
to Chris Mason, Pádraig Brady, Vladislav Bolkhovitin, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, 19 Jun 2007, Chris Mason wrote:

>>> 3. De-de-duplicate blocks on disk, i.e. copy them on write
>>>
>>> I suppose that de-duplication itself would be done by some user space
>>> process that would scan files, determine blocks with the same data and
>>> then de-duplicate them by using syscall or IOCTL (2).
>>>
>>> That would be very usable feature, which in most cases would allow to
>>> shrink occupied disk space on 50-90%.
>>
>> Have you references for this number?
>> In my experience one gets a lot of benefit from
>> the much simpler process of "de-duplication" of files.
>
> Yes, I would expect simple hard links to be a better solution for this,
> but the feature request is not that out of line. I actually had plans
> on implementing auto duplicate block reuse earlier in btrfs.

With COW de-duplication you can merge things that have vastly different
permissions. Hard links can't be used if different people have write
permission.

David Lang

Philipp Matthias Hahn

unread,
Jun 19, 2007, 2:59:07 PM6/19/07
to Chris Mason, Pádraig Brady, Vladislav Bolkhovitin, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Hello!

On Tue, Jun 19, 2007 at 08:04:57AM -0400, Chris Mason wrote:
> On Tue, Jun 19, 2007 at 10:11:13AM +0100, Pádraig Brady wrote:
> > Vladislav Bolkhovitin wrote:
> > >
> > > I would also suggest one more feature: support for block level
> > > de-duplication. I mean:

..


> > > That would be very usable feature, which in most cases would allow to
> > > shrink occupied disk space on 50-90%.
> >
> > Have you references for this number?
> > In my experience one gets a lot of benefit from
> > the much simpler process of "de-duplication" of files.
>
> Yes, I would expect simple hard links to be a better solution for this,
> but the feature request is not that out of line. I actually had plans
> on implementing auto duplicate block reuse earlier in btrfs.

One problem with hard links, for me, is that they also share the metadata,
especially the file permissions and owners.

Take a Subversion checkout for example: for each file "$A", Subversion
saves a backup under ".svn/text-base/$A.svn-base" for file comparison
and diff generation. The user controls the file permissions of "$A",
while Subversion protects its backup with 0444. You can't hard-link them,
because then "svn diff" no longer works if your editor doesn't
break the hard link, or worse, your permissions can end up wrong.

In previous versions, Subversion also had an extra file for file
attributes (mime-type, permissions, to-be-ignored, etc.). Since most
files had no special attributes, each had a file containing only "END".
Those you could hard-link by hand to save space.

If somebody wants to research this further:

There is this nice little package called "perforate", which contains
"finddup" to find duplicate files. Run it twice, once with "-i" to
ignore permissions while comparing file contents
    finddup -i -d /
and once without "-i" so that content and permissions must match
    finddup -d /
This will give you a hint of how many files you could hard-link versus
how many files merely share their content.

BYtE
Philipp
--
/ / (_)__ __ ____ __ Philipp Hahn
/ /__/ / _ \/ // /\ \/ /
/____/_/_//_/\_,_/ /_/\_\ pmh...@titan.lahn.de

Vladislav Bolkhovitin

unread,
Jun 20, 2007, 4:41:50 AM6/20/07
to da...@lang.hm, Pádraig Brady, Chris Mason, linux-...@vger.kernel.org, linux-...@vger.kernel.org
da...@lang.hm wrote:
>>> > 3. De-de-duplicate blocks on disk, i.e. copy them on write
>>> > > I suppose that de-duplication itself would be done by some user
>>> space
>>> > process that would scan files, determine blocks with the same data and
>>> > then de-duplicate them by using syscall or IOCTL (2).
>>> > > That would be very usable feature, which in most cases would
>>> allow to
>>> > shrink occupied disk space on 50-90%.
>>>
>>> Have you references for this number?
>>
>>
>> No, I've seen it somewhere and it well confirms with my own observations.
>>
>>> In my experience one gets a lot of benefit from
>>> the much simpler process of "de-duplication" of files.
>>
>>
>> Yes, sure, de-duplication on files level brings its benefits, but on
>> FS blocks level it would bring ever more benefits, because there are
>> many more or less big files, which are different as a whole, but with
>> a lot of the same blocks. Simple example of such files is UNIX-style
>> mail boxes on a mail server.
>
>
> unix style mail boxes would not be a good example of wins for
> sector-based de-duplication since the duplicate mail is not going to be
> sector aligned.

Yes, I realized that after I sent the e-mail. Handling identical but
unaligned data in different files would need more complex logic. Maybe
too complex.

Vlad

Vladislav Bolkhovitin

unread,
Jun 20, 2007, 4:45:38 AM6/20/07
to Philipp Matthias Hahn, Chris Mason, Pádraig Brady, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Philipp Matthias Hahn wrote:
>>>>I would also suggest one more feature: support for block level
>>>>de-duplication. I mean:
>
> ...

So it seems that even for file-based de-duplication, some support from
the FS would be needed anyway, including some ability for different
inodes to point to the same data blocks while keeping their own metadata.

Vlad

Ph. Marek

unread,
Jun 20, 2007, 5:18:27 AM6/20/07
to Vladislav Bolkhovitin, Philipp Matthias Hahn, Chris Mason, Pádraig Brady, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On Mittwoch, 20. Juni 2007, Vladislav Bolkhovitin wrote:
> Philipp Matthias Hahn wrote:
> >>>>I would also suggest one more feature: support for block level
> >>>>de-duplication. I mean:
> So, seems ever for file based de-duplication some support from the FS,
> including some kind of ability for different inodes point to the same
> data blocks to store the meta-data, would be needed anyway.
The easy way is to have the inode point to a (shared, reference-counted)
data storage object that lists the data blocks - then inodes can share
the data but have different metadata.
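
In plain C structures, and only as a sketch of that idea (names
invented), it could look like:

#include <stdint.h>
#include <sys/types.h>

/* One per distinct run of data blocks; any number of inodes may point
 * at it.  Freed when the last reference goes away. */
struct shared_data {
	uint32_t refcount;	/* how many inodes reference this data */
	uint64_t nr_blocks;
	uint64_t *blocks;	/* on-disk block addresses */
};

/* Each inode keeps its own metadata but borrows the shared data. */
struct dedup_inode {
	uid_t  uid;
	gid_t  gid;
	mode_t mode;			/* permissions stay per-inode */
	struct shared_data *data;	/* shared, copied on write */
};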


Ever since I read about filesystems using a file's hash as the addressing
mechanism (per some Linus mail on LKML, about 10 years ago) and about Manber
hashes (http://citeseer.ist.psu.edu/manber94finding.html), I've been thinking
about various ways to use both in a filesystem.

Manber hashes allow data to be split into similar chunks, which could be
addressed by some cryptographic checksum and shared by several files.

*But*: the boundaries are not at power-of-2 addresses, so the data for
read()/mmap() would have to be rebuilt somehow. (That would e.g. be
necessary anyway if the data block itself is stored compressed.)
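
A minimal sketch of such content-defined chunking with a
Rabin/Manber-style rolling hash (window size, multiplier and mask are
arbitrary choices for the example, not anything from btrfs):

#include <stdint.h>
#include <stdio.h>

#define WINDOW	48		/* bytes in the sliding window */
#define MASK	0x1fffu		/* cut when low 13 bits are zero: ~8 KB chunks */
#define PRIME	16777619u	/* hash multiplier */

/* Prints the chunk boundaries found in buf.  A de-duplicator would then
 * checksum each chunk and share chunks whose contents match, so the same
 * data shifted by a few bytes still yields mostly identical chunks. */
void find_chunks(const unsigned char *buf, size_t len)
{
	uint32_t hash = 0, pow = 1;
	size_t i, start = 0;

	for (i = 0; i < WINDOW; i++)
		pow *= PRIME;			/* PRIME^WINDOW mod 2^32 */

	for (i = 0; i < len; i++) {
		hash = hash * PRIME + buf[i];		/* slide new byte in */
		if (i >= WINDOW)
			hash -= buf[i - WINDOW] * pow;	/* slide old byte out */

		if (i + 1 >= WINDOW && (hash & MASK) == 0) {
			printf("chunk: %zu..%zu\n", start, i);
			start = i + 1;
		}
	}
	if (start < len)
		printf("chunk: %zu..%zu\n", start, len - 1);
}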

The other question I'm still pondering ... file sizes vary a lot. If I
have a large project, e.g. the kernel, with many thousands of files of
some 10 kB, some data could be shared - GPL licenses in files, #include
lists, and some others.
If I have some other data, with files of several hundred megabytes, sharing
gets more interesting ... but what should the right block size (for Manber
hashes) be? If it's small, we have to concatenate a lot of blocks to
reconstruct data - if it's big, we might lose many chances for sharing.

And getting good performance when blocks in the middle of a file have to
be re-split might be a bit of a problem, too ...

Regards,

Phil
