[RFC] proposal to move off of btrfs

Brandon Philips

Dec 15, 2014, 3:57:09 PM
to coreos-dev
Hello coreos-dev-

This is a proposal and a request for feedback on changing the CoreOS
root filesystem for new images to ext4 in the future. If you depend on
the root filesystem being btrfs please speak up and describe your
usage of btrfs features.

Proposal and Motivation for the change:

Today the root fs for CoreOS machines is btrfs. We chose btrfs because
it was the most straightforward Docker graph driver at the time. But
CoreOS users have regularly reported bugs against btrfs, including
out-of-disk-space errors, metadata rebalancing problems requiring
manual intervention, and generally slow performance compared to other
filesystems[0].

However, in the latest release of the Linux Kernel the overlayfs
filesystem was (finally!) merged upstream[1]. You can think of
overlayfs as providing a mechanism similar to AUFS, but the
implementation is greatly simplified and now in the mainline Kernel.
It also creates new docker containers faster[2] than the
alternatives.
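
To make the AUFS comparison concrete, here is a minimal sketch of an overlay mount using the mainline "overlay" filesystem type. The directory names are hypothetical placeholders, and the script only prints the command unless RUN=1 is set, since a real mount needs root and a kernel with overlay support.

```shell
#!/bin/sh
# Sketch only: all directory names are hypothetical placeholders.

# Build the option string for an overlay mount.
# $1 = read-only lower dir, $2 = writable upper dir, $3 = scratch work dir
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s' "$1" "$2" "$3"
}

# Print the mount command by default; set RUN=1 (as root) to execute it.
# $4 = directory where the merged view appears
overlay_mount() {
    opts=$(overlay_opts "$1" "$2" "$3")
    if [ "${RUN:-0}" = "1" ]; then
        mount -t overlay overlay -o "$opts" "$4"
    else
        echo "mount -t overlay overlay -o $opts $4"
    fi
}

overlay_mount /tmp/lower /tmp/upper /tmp/work /tmp/merged
```

The lower directory stays read-only while writes land in the upper directory; upperdir and workdir need to live on the same filesystem.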

The major changes to CoreOS in this proposal would be a switch to ext4
as the rootfs for all new machines at some point in the future and
making the docker engine overlayfs graph driver[3] the default. If we
make this change, machines using btrfs root filesystems would be
unaffected and continue to receive updates as normal.

If you are interested in trying overlayfs out for yourself and have a
CoreOS SDK[4] on your host you can follow the instructions here:
https://gist.github.com/philips/bf96c263bc1d4b27c3a3

Cheers,

Brandon

[0] https://docs.google.com/a/coreos.com/presentation/d/1npITxQaHSq0QZlZYL7cJkV8Wzuupt0qk_y_K9UfDwsI/edit#slide=id.g4ea4d82cf_00
[1] https://lwn.net/Articles/618141/
[2] http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
[3] https://github.com/docker/docker/pull/7619
[4] https://coreos.com/docs/sdk-distributors/sdk/modifying-coreos/

Seán C. McCord

Dec 15, 2014, 5:17:20 PM
to coreo...@googlegroups.com
As a proponent of BTRFS, I'll be sad to see it go.  I use it on all my workstations, and I rebuilt all my CoreOS boxes when btrfs became the default fs.

As a server administrator, I'll be quite happy to move back to ext4.  The out-of-space / metadata balancing problem has bitten me more times than I care to count.  It's essentially a fact of life that I have to blow away /var/lib/docker and all its subvolumes every few weeks on any given machine, to clear an out-of-space problem (though `df` shows a usage of, say, 30%).  This is particularly true now that I've got a bunch of cloud vms in the mix with incredibly puny drives (m3.medium, here's looking at you).

BTRFS has a number of really excellent features, but even after a year or two (of my using it), this pain-point has had neither resolution nor movement in that direction, as far as I can tell.  Now that I have CoreOS running on virtually all my servers, the fatigue of dealing with those continued problems has begun to outweigh my fascination with BTRFS.  (Yes, I was a Reiser4 fan, too, right up to its MurderFS renaming.)
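
For anyone hitting the same symptom, here is a rough sketch of how the gap between `df` and btrfs's own accounting is usually inspected. btrfs allocates space in chunks separately for data and metadata, so `df` can report free space while nothing is left unallocated for new metadata chunks. The mount point and the 5% usage threshold are assumptions for illustration, and the script only prints the commands unless RUN=1 is set (running them needs root and btrfs-progs).

```shell
#!/bin/sh
# Sketch only: prints each command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Compare what generic tools report with btrfs's chunk-level view.
btrfs_space_check() {
    mnt=${1:-/}
    run df -h "$mnt"                    # usage as seen by generic tools
    run btrfs filesystem show "$mnt"    # space allocated to chunks per device
    run btrfs filesystem df "$mnt"      # data vs. metadata: allocated and used
}

# Rewrite data chunks that are at most $2 percent full so their space
# goes back to the unallocated pool (threshold is an assumed starting point).
btrfs_reclaim() {
    mnt=${1:-/}
    pct=${2:-5}
    run btrfs balance start "-dusage=$pct" "$mnt"
}

btrfs_space_check /
btrfs_reclaim / 5
```

A low usage threshold keeps the balance cheap because only nearly empty chunks are rewritten; whether that is enough depends on how the disk filled up.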

Morgaine Fowle

Dec 15, 2014, 5:19:48 PM
to coreo...@googlegroups.com
I deeply enjoy the file-system taking responsibility for snapshotting. It creates a consistent management interface that's useful for a wide range of tasks. Anything based off overlayfs is going to have to concoct its own unique management layer which will require its own domain knowledge to handle, whereas someone proficient with the filesystem's snapshotting tools is going to have a more general, portable knowledge they'll be able to use to make sense of what CoreOS is doing naturally.

Also, dropping BTRFS drops stupendously exciting capabilities such as using send/receive to push deltas atop snapshots. Losing this would be a colossal loss. It'll obliterate the fast/easy/happy path to atomic updates of images Lennart outlined in Revisiting Putting Together Linux Systems. http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html
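
A rough sketch of the send/receive flow being described: take a read-only snapshot, then send only the difference relative to a parent snapshot the receiver already has. All paths are hypothetical, and the script prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: paths are hypothetical; prints commands unless RUN=1 is set.

# Create a read-only snapshot (send requires read-only snapshots).
# $1 = source subvolume, $2 = snapshot path
snapshot_ro() {
    if [ "${RUN:-0}" = "1" ]; then
        btrfs subvolume snapshot -r "$1" "$2"
    else
        echo "btrfs subvolume snapshot -r $1 $2"
    fi
}

# Send only the delta between parent and new snapshot into a target dir.
# $1 = parent snapshot, $2 = new snapshot, $3 = directory on the receiving fs
send_delta() {
    if [ "${RUN:-0}" = "1" ]; then
        btrfs send -p "$1" "$2" | btrfs receive "$3"
    else
        echo "btrfs send -p $1 $2 | btrfs receive $3"
    fi
}

snapshot_ro /mnt/root /mnt/snap-new
send_delta /mnt/snap-old /mnt/snap-new /backup
```

The parent snapshot has to exist on both sides for the incremental stream to apply; pushing to another machine would typically mean piping the stream over a transport such as ssh, which is left out here.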

I think there are a lot more valuable avenues CoreOS should be addressing, rather than destroying & rebuilding a suite of what I see as underlying capabilities that it's already grown from. Rocket ought to permit users to mount ext4 partitions or directories if they have a pressing desire to use something other than BTRFS, and that gives a path where CoreOS can continue to leverage the file-system capabilities uniquely provided by Btrfs.

Scott Laird

Dec 15, 2014, 5:46:20 PM
to coreo...@googlegroups.com
I have to agree with Seán here--btrfs is nice on paper, and there are lots of impressive things that it'll be able to do in the future, but on a day-to-day basis it makes CoreOS much more difficult to run in any environment where disks persist across reboots.  Filling the disk requires either substantial amounts of manual admin involvement or a total reinstall, which is the opposite of CoreOS's goals.  And CoreOS already has fast/easy/happy updates, although they're not atomic.  I'm all for advancing the state of the art in Unix system design (it's not the early 70s anymore...), but not at the cost of being able to get useful work done.  As things stand, btrfs is barely functional for long-term use with CoreOS.  Try filling up a disk with runaway journal entries and see how long it takes you to get the system back running happily sometime.  Admittedly, it's improved over last year, when filling a Ceph btrfs disk rendered it *unmountable*, but CoreOS should be aiming for admin-less systems, where there's really no reason or need to ever log into individual CoreOS instances, and btrfs doesn't allow that today.

Gabriel Monroy

Dec 16, 2014, 11:34:00 AM
to coreo...@googlegroups.com
+1 from the Deis project.  

btrfs has been an ongoing source of pain for us.  That said, I'd like to see the overlayfs graph driver get a few more miles on it before switching the rootfs.  Maybe wait for the next Docker point release?  Some benchmarks would be helpful too.

Gabriel

Oliver Soell

Dec 16, 2014, 12:58:09 PM
to coreo...@googlegroups.com
+1

For me, btrfs is the only point of unexpected instability from CoreOS at the moment. And it doesn't help that systemd isn't tuned to run on btrfs (https://github.com/coreos/coreos-overlay/pull/983).

It's nice, in theory, to have a modern fs under CoreOS. In practice, I never use any such features. If I want it, I'll make a separate volume.

cheers,
-o

xad...@gmail.com

Dec 16, 2014, 3:47:52 PM
to coreo...@googlegroups.com
I think that CoreOS Alpha isn't Alpha enough. This feature would be a perfect fit for it.

Rimas Mocevicius

Dec 16, 2014, 4:08:14 PM
to coreo...@googlegroups.com
+1

Too much pain to maintain, I got sick of fighting for the disk space :)

Rimas

Michael Marineau

Dec 16, 2014, 4:26:44 PM
to coreos-dev
Alpha releases are essentially release candidates for Beta and
eventually Stable. This is important in order for releases to be well
tested before they reach stable; any difference between alpha/beta and
stable can lead to broken releases. One of the few times we deviated
from this sequence we introduced a regression in stable for something
that had worked in alpha. :(

That said, even if we don't ship btrfs by default as ROOT in our
images, it is possible to wipe ROOT and replace it with the filesystem
of your choice; the bare necessities required in the filesystem will
be initialized on boot. For example, to reformat ROOT in the base
image:

# losetup --find --partscan --show coreos_production_image.bin
/dev/loop0
# wipefs -a /dev/loop0p9
/dev/loop0p9: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
# mkfs.btrfs -L ROOT /dev/loop0p9
fs created label ROOT on /dev/loop0p9
nodesize 16384 leafsize 16384 sectorsize 4096 size 2.11GiB
# mount -t btrfs /dev/loop0p9 /mnt
# btrfs subvol create /mnt/root
Create subvolume '/mnt/root'
# btrfs subvol list /mnt/root
ID 257 gen 6 top level 5 path root
# btrfs subvol set-default 257 /mnt
# umount /mnt
# losetup -d /dev/loop0

That last bit of tediousness is just there to create a subvolume named
'root' which will be mounted to / instead of the top level subvolume.
This isn't required but is how we currently set up btrfs and is super
helpful if you want to take snapshots of the filesystem.

Michael Marineau

Dec 16, 2014, 4:52:00 PM
to coreos-dev
On Tue, Dec 16, 2014 at 1:26 PM, Michael Marineau
<michael....@coreos.com> wrote:
> That said, even if we don't ship btrfs by default as ROOT in our
> images it is possible to wipe ROOT and replace it with the filesystem
> of your choice, the bare necessities required in the filesystem will
> be initialized on boot. For example, to reformat ROOT in the base
> image:
>
> # losetup --find --partscan --show coreos_production_image.bin
> /dev/loop0
> # wipefs -a /dev/loop0p9
> /dev/loop0p9: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
> # mkfs.btrfs -L ROOT /dev/loop0p9
> fs created label ROOT on /dev/loop0p9
> nodesize 16384 leafsize 16384 sectorsize 4096 size 2.11GiB
> # mount -t btrfs /dev/loop0p9 /mnt
> # btrfs subvol create /mnt/root
> Create subvolume '/mnt/root'
> # btrfs subvol list /mnt/root
> ID 257 gen 6 top level 5 path root
> # btrfs subvol set-default 257 /mnt

Minor amendment to these instructions: our initrd has an error in it
and doesn't remount the filesystem read-write until after /usr is
mounted, which means /usr must exist in advance. So just add this
to the sequence:

# mkdir /mnt/root/usr

That will be fixed before the switch to ext4, though, so nothing
beyond mkfs will be required.
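
Putting the original sequence and the /usr amendment together, the full order of steps might look like the sketch below. It only prints the commands for review and runs nothing; the loop device and the subvolume ID (257 in the example output above) are assumptions that should be taken from the actual losetup and `btrfs subvol list` output.

```shell
#!/bin/sh
# Sketch only: prints the command sequence for review, executes nothing.
# $1 = image file
# $2 = loop device reported by losetup --show (assumed /dev/loop0)
# $3 = subvolume ID reported by "btrfs subvol list" (assumed 257)
reformat_root_steps() {
    image=$1
    loopdev=${2:-/dev/loop0}
    subvol_id=${3:-257}
    part="${loopdev}p9"    # ROOT partition, as in the example above
    cat <<EOF
losetup --find --partscan --show $image
wipefs -a $part
mkfs.btrfs -L ROOT $part
mount -t btrfs $part /mnt
btrfs subvol create /mnt/root
mkdir /mnt/root/usr
btrfs subvol list /mnt/root
btrfs subvol set-default $subvol_id /mnt
umount /mnt
losetup -d $loopdev
EOF
}

reformat_root_steps coreos_production_image.bin
```

The mkdir step sits right after the subvolume is created so that /usr exists inside the 'root' subvolume before the first boot.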

Timo Derstappen

Dec 17, 2014, 4:35:41 AM
to coreo...@googlegroups.com
+1 we are also not using any of the btrfs features besides docker in CoreOS.

I've been using a btrfs root on my laptop for quite a while now and rely on its features for development a lot, but this is something that could be replaced by overlayfs as well.

But did you talk to Lennart about this replacement? It would be good to align this with their roadmap. Haven't followed their development lately and wonder if they are also in favor of overlayfs now.

Chris Heng

Dec 17, 2014, 7:27:45 PM
to coreo...@googlegroups.com
It would be lovely to move to a solution that stops my machines from blowing up every few weeks :)

Patrick Reilly

Dec 17, 2014, 8:18:23 PM
to coreo...@googlegroups.com
+1 btrfs seems too unstable for me.

— Patrick


Scott Likens

Dec 17, 2014, 8:46:28 PM
to coreo...@googlegroups.com
+1

Chris Mason

Dec 18, 2014, 8:48:44 AM
to coreo...@googlegroups.com
On Monday, December 15, 2014 3:57:09 PM UTC-5, Brandon Philips wrote:
> Hello coreos-dev-
>
> This is a proposal and a request for feedback on changing the CoreOS
> root filesystem for new images to ext4 in the future. If you depend on
> the root filesystem being btrfs please speak up and describe your
> usage of btrfs features.
>
> Proposal and Motivation for the change:

Hi everyone,

I'm a big fan of what CoreOS is doing, and first and foremost want you to pick the storage configuration that best fits your needs.  It's great to see an overlay FS in the kernel and big projects pick it up.

On our end, many of these Btrfs warts are getting solved.  The 3.19 merge window fixes some very hard to find corruption problems that we've been chasing down, and Josef Bacik has developed a slick power-fail testing target that makes it much easier to prevent similar bugs in the future.  3.19 will also fix rare corruptions with block group removal, making both balance and the new auto-blockgroup cleanup feature much more reliable.

We've hit a few performance problems deploying Btrfs here at Facebook, and fixes for these are making it into upstream kernels.  We've also now caught two storage cards returning either stale or corrupt data, and the Btrfs crcs saved us from replicating the bad copies out across the cluster.

Thanks for the time you and CoreOS users have spent making Btrfs better.  We'll keep improving the filesystem and look forward to seeing what CoreOS does next.

-chris

Brandon Philips

Dec 23, 2014, 12:40:26 AM
to coreos-dev
Hey Chris!

I hope that it came through in this RFC that this move to a different
rootfs is a pragmatic choice based on the state of the current
upstream Kernel and on our users' experiences and needs in the types
of environments they are running in. I will certainly be
keeping an eye on btrfs over time. More stuff inline below:

On Thu, Dec 18, 2014 at 5:48 AM, Chris Mason <mason.c...@gmail.com> wrote:
> On our end, many of these Btrfs warts are getting solved. The 3.19 merge
> window fixes some very hard to find corruption problems that we've been
> chasing down, and Josef Bacik has developed a slick power-fail testing
> target that makes it much easier to prevent similar bugs in the future.
> 3.19 will also fix rare corruptions with block group removal, making both
> balance and the new auto-blockgroup cleanup feature much more reliable.

Nice! This is great news; I know the btrfs team is hard at work
tackling these problems.

> We've hit a few performance problems deploying Btrfs here at Facebook, and
> fixes for these are making it into upstream kernels. We've also now caught
> two storage cards returning either stale or corrupt data, and the Btrfs crcs
> saved us from replicating the bad copies out across the cluster.

This is a critical feature and one of the many things that has me
excited about btrfs. I would be interested in hearing your thoughts on
the filesystem handling crypto validation/attestation features like
how dm-verity is used today.

> Thanks for the time you and CoreOS users have spent making Btrfs better.
> We'll keep improving the filesystem and look forward to seeing what CoreOS
> does next.

Thank you and the btrfs team. Have a great holiday.

Brandon

Chris Mason

Dec 26, 2014, 9:38:03 AM
to coreo...@googlegroups.com


On Tuesday, December 23, 2014 12:40:26 AM UTC-5, Brandon Philips wrote:
> Hey Chris!
>
> I hope that it came through in this RFC that this move to a different
> rootfs is a pragmatic choice based on the state of the current
> upstream Kernel, our users' experiences and needs based on the types of
> environments that our users are running in. I will certainly be
> keeping an eye on btrfs over time. More stuff inline below:


Hi Brandon,

Absolutely, I really want you to pick the best tools to make CoreOS go.  Please don't hesitate to send us reports about how to make Btrfs be that tool.  But, I completely understand that you need to make pragmatic choices.
 
> On Thu, Dec 18, 2014 at 5:48 AM, Chris Mason <mason.c...@gmail.com> wrote:
>> On our end, many of these Btrfs warts are getting solved.  The 3.19 merge
>> window fixes some very hard to find corruption problems that we've been
>> chasing down, and Josef Bacik has developed a slick power-fail testing
>> target that makes it much easier to prevent similar bugs in the future.
>> 3.19 will also fix rare corruptions with block group removal, making both
>> balance and the new auto-blockgroup cleanup feature much more reliable.
>
> Nice! This is great news; I know the btrfs team is hard at work
> tackling these problems.
>
>> We've hit a few performance problems deploying Btrfs here at Facebook, and
>> fixes for these are making it into upstream kernels.  We've also now caught
>> two storage cards returning either stale or corrupt data, and the Btrfs crcs
>> saved us from replicating the bad copies out across the cluster.
>
> This is a critical feature and one of the many things that has me
> excited about btrfs. I would be interested in hearing your thoughts on
> the filesystem handling crypto validation/attestation features like
> how dm-verity is used today.

Our current crcs are great for finding storage problems, but crypto validation is a different and very interesting problem.  I do want to add features so that a signed FS image can be verified at run time.  If you're working in this area we can try to hammer out the use-cases.

>> Thanks for the time you and CoreOS users have spent making Btrfs better.
>> We'll keep improving the filesystem and look forward to seeing what CoreOS
>> does next.
>
> Thank you and the btrfs team. Have a great holiday.


You too!

-chris
 

Spencer Brown

Jan 2, 2015, 8:58:23 PM
to coreo...@googlegroups.com
Do it. The one thing devops-type people I talk to cringe about when I describe CoreOS to them is btrfs. This will remove an adoption barrier.

Spencer Brown

Michael Marineau

Jan 12, 2015, 5:15:38 PM
to coreos-dev
FYI everyone,

I just merged the change to switch to ext4 as our default root
filesystem: https://github.com/coreos/scripts/pull/372

This change should appear in alpha in a day or two. I'll post another
note when that happens.

Michael Marineau

Jan 14, 2015, 6:35:51 PM
to coreos-dev
Just landed in alpha! https://coreos.com/releases/#561.0.0

Rimas Mocevicius

Jan 15, 2015, 6:54:19 AM
to coreo...@googlegroups.com
I tried the latest alpha on a Vagrant box and am getting this error:
core-01 docker[8406]: time="2015-01-15T11:47:35Z" level="fatal" msg="Error pulling image (latest) from davivcgarcia/dockerui, Driver overlay failed to create image rootfs eda83c773872fba9149ee5a6b27bb552b33f1a3fd4bf5308b7efbd7410ae7a32: lstat /var/lib/docker/overlay/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158: no such file or directory"