Copy-on-write filesystem

David Scheidt

Mar 3, 2000, 3:00:00 AM

to Michael Bacarella

On Fri, 3 Mar 2000, Michael Bacarella wrote:

>
>
> Upon reading of Microsoft's fabulous innovations in the filesystem arena,
> I started playing with some ideas of my own (not to be confused with
> ORIGINAL ideas)
>
> Can someone tell me why copy-on-write filesystems would be bad?

It wouldn't be. This is how NetApp do their .snapshot directories. I think
they have some white papers on it on their website. It's very handy.

David

To Unsubscribe: send mail to majo...@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Ronald G. Minnich

Mar 4, 2000, 3:00:00 AM

to freebsd...@freebsd.org

On Fri, 3 Mar 2000, Michael Bacarella wrote:

> Can someone tell me why copy-on-write filesystems would be bad?

It's a good idea. Peter Braam and I have written a device (called memdev)
for linux (sorry!) that implements a virtual-memory-backed copy-on-write
block device (like the loopback device, but uses anon vm pages for store).

It's pretty interesting. It's quite fast, and copy-on-write does seem to
work OK for a filesystem. I'm using this thing as one of two pieces of a
new private name space implementation that would also work quite well on
freebsd.

note it's not really a file system, but a loopback block device which does
copy-on-write for new blocks.
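
As a rough illustration of the idea (this is not memdev's code, which is not shown in this thread), a copy-on-write block device can be modelled as a read-only base plus an in-memory overlay that holds only the blocks written so far. The class and method names are made up for the sketch, and a plain dict stands in for the anonymous VM pages.

```python
class CowBlockDevice:
    """Toy copy-on-write block device: a read-only base plus an in-memory overlay."""

    def __init__(self, base_blocks, block_size):
        self.base = base_blocks        # sequence of bytes objects, never modified
        self.block_size = block_size
        self.overlay = {}              # block number -> private copy (stand-in for VM pages)

    def read_block(self, n):
        # Prefer the private copy; otherwise fall through to the base device.
        return self.overlay.get(n, self.base[n])

    def write_block(self, n, data):
        if len(data) != self.block_size:
            raise ValueError("writes must be exactly one block")
        if not 0 <= n < len(self.base):
            raise IndexError("block number out of range")
        # The base is left untouched; only the overlay grows.
        self.overlay[n] = bytes(data)


base = (b"AAAA", b"BBBB", b"CCCC")
dev = CowBlockDevice(base, block_size=4)
dev.write_block(1, b"bbbb")
```

After the write, block 1 reads back as the new data while the base tuple is unchanged, and the overlay holds exactly one block.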

You can also use it to easily implement translucent file system behaviour.

ron

Michael Bacarella

Mar 4, 2000, 3:00:00 AM

to freebsd...@freebsd.org

Upon reading of Microsoft's fabulous innovations in the filesystem arena,
I started playing with some ideas of my own (not to be confused with
ORIGINAL ideas)

Can someone tell me why copy-on-write filesystems would be bad?

Imagine: cp file file2, file and file2 reference the same exact blocks,
but modified chunks of file2 would be given their own private blocks.
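
A minimal sketch of what that could look like, assuming a block store with per-block reference counts. BlockStore and CowFile are hypothetical names for illustration, not any existing filesystem's structures.

```python
class BlockStore:
    """Toy block store that tracks how many files reference each block."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes
        self.refs = {}       # block id -> reference count
        self.next_id = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def share(self, bid):
        self.refs[bid] += 1
        return bid

    def release(self, bid):
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            del self.blocks[bid], self.refs[bid]


class CowFile:
    def __init__(self, store, block_ids):
        self.store = store
        self.block_ids = list(block_ids)

    @classmethod
    def create(cls, store, chunks):
        return cls(store, [store.alloc(c) for c in chunks])

    def copy(self):
        # "cp file file2": no data moves, only reference counts go up.
        return CowFile(self.store, [self.store.share(b) for b in self.block_ids])

    def write_block(self, index, data):
        old = self.block_ids[index]
        if self.store.refs[old] > 1:
            # Block is shared: give this file its own private block.
            self.block_ids[index] = self.store.alloc(data)
            self.store.release(old)
        else:
            # Block is already private: overwrite in place.
            self.store.blocks[old] = data

    def read(self):
        return b"".join(self.store.blocks[b] for b in self.block_ids)


store = BlockStore()
file1 = CowFile.create(store, [b"aaaa", b"bbbb", b"cccc"])
file2 = file1.copy()
file2.write_block(1, b"XXXX")
```

Here the two files still share their first and last blocks, and the store holds four blocks instead of six.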

This probably won't fit into current filesystems, but is it a sane idea
worth pursuing in a new filesystem? I performed an analysis on a
non-production server and determined that about 66 megs of a typical
FreeBSD install is duplicate files (and yes, I ignored hard links and
symlinks and non-regular files).
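
The analysis itself isn't shown here; one way to take a similar measurement is sketched below. It assumes "duplicate" means byte-identical content, compared by SHA-256, and the function name duplicate_bytes is invented for the example.

```python
import hashlib
import os
import stat
from collections import defaultdict


def duplicate_bytes(root):
    """Return the bytes that identical regular files under root could share."""
    seen_inodes = set()
    by_content = defaultdict(list)    # (size, sha256 hex digest) -> [paths]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)   # lstat: do not follow symlinks
            except OSError:
                continue
            if not stat.S_ISREG(st.st_mode):
                continue              # skips symlinks and other non-regular files
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue              # hard link to a file already counted
            seen_inodes.add(inode)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
            except OSError:
                continue
            by_content[(st.st_size, digest.hexdigest())].append(path)
    # Every copy beyond the first is space that sharing would save.
    return sum(size * (len(paths) - 1) for (size, _), paths in by_content.items())
```

Calling duplicate_bytes("/") as root would give a figure comparable in spirit to the one above, though block-level rounding is ignored.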

This was on a system without a ports tree, also.

I think the benefits would be sexy. Copies are closer to instant. More
cache hits. Space benefits. Copying /etc/skel to a user's home directory
won't take up any blocks at all unless users edit their files, which, if
you're an ISP, you know that 95% of users don't do anyway.

There's probably a stockpile of drawbacks to this as well. Fire away.

/* ----------
Michael Bacarella( mb...@nyct.net ) | (212) 293-2620
Administration / Development / Support | http://nyct.net/
[ N e w Y o r k C o n n e c t . N E T ] | in...@nyct.net
Bringing New York The Internet Service It Deserves!
--------- */

Brooks Davis

Mar 4, 2000, 3:00:00 AM

to Julian Elischer

On Fri, Mar 03, 2000 at 01:39:54PM -0800, Julian Elischer wrote:
> David Scheidt wrote:
> >
> > On Fri, 3 Mar 2000, Michael Bacarella wrote:
> >
> > > Upon reading of Microsoft's fabulous innovations in the filesystem arena,
> > > I started playing with some ideas of my own (not to be confused with
> > > ORIGINAL ideas)
> > >
> > > Can someone tell me why copy-on-write filesystems would be bad?
> >
> > It wouldn't be. This is how NetApp do their .snapshot directories. I think
> > they have some white papers on it on their website. It's very handy.
>
> Kirk McKusick is implementing a Copy-on write functionality
> for UFS. It is used in conjunction with Soft updates to produce
> snapshots. It's not what you asked for, but still relevant
> I think. One problem with "Copy-on-write, when applied to
> file copies is that you need to assign the blocks up front, even if you
> don't copy the data, as otherwise you could run out of space
> when the copy is actually needed.

Don't holes already cause this problem? Admittedly you are much more
likely to run into it if cp doesn't result in block reservations. In
any case, if you do prereserve the storage you should probably just make
copying lazy so given sufficient quiet time the system will no longer
have any COW'd pages. Unfortunately I think this problem is likely to be
a rehash of the memory oversubscription flame wars that pop up periodically
(if you don't know, DON'T ASK, read the archives.) The game has changed
a bit though because disk is quite different from memory in terms of
access characteristics and purpose.

-- Brooks

--
Any statement of the form "X is the one, true Y" is FALSE.

Michael Bacarella

Mar 4, 2000, 3:00:00 AM

to Julian Elischer

> > It wouldn't be. This is how NetApp do their .snapshot directories. I think
> > they have some white papers on it on their website. It's very handy.
>
> Kirk McKusick is implementing a Copy-on write functionality
> for UFS. It is used in conjunction with Soft updates to produce
> snapshots. It's not what you asked for, but still relevant
> I think. One problem with "Copy-on-write, when applied to
> file copies is that you need to assign the blocks up front, even if you
> don't copy the data, as otherwise you could run out of space
> when the copy is actually needed.

That's the only real drawback I've considered.

People accept it (barely) when the OS commits to providing virtual memory
it does not have, killing processes when the system falls into debt.

No one will appreciate that happening to their "permanent" data,
especially if the OS decides that the best way to get out of debt is by
deleting a file :)

-MB

Matthew Dillon

Mar 4, 2000, 3:00:00 AM

to Brian Beattie

:Swap? I thought we were talking about a copy-on-write filesystem
:i.e. disk block, not memory, or did I really miss something
:
:Brian Beattie | The only problem with

Where are you copy-on-writing to? Unbacked memory? No way that
would ever work, at least not for any reasonably-sized filesystem.
The copied data has to go somewhere.

Or are you talking about the copy-on-write softlink business? That's
a whole different ball of wax.

-Matt
Matthew Dillon
<dil...@backplane.com>

Jim Bryant

Mar 4, 2000, 3:00:00 AM

to Michael Bacarella

In reply:

> Imagine: cp file file2, file and file2 reference the same exact blocks,
> but modified chunks of file2 would be given their own private blocks.

This is not a Microsoft innovation; actually, I believe it was a VMS
innovation. It's called a generational filesystem. The original is
stored, and later generations of the file are stored as diffs.

> This probably won't fit into current filesystems, but is it a sane idea
> worth pursuing in a new filesystem? I performed an analysis on a
> non-production server and determined that about 66 megs of a typical
> FreeBSD install is duplicate files (and yes, I ignored hard links and
> symlinks and non-regular files).

It has its advantages and disadvantages. One problem in VMS is
determining the system-wide policy on such things, such as how many
file generations will be kept.

This isn't exactly apples to apples, but it's close enough to be
discussed.

A VMS-style filesystem would be interesting.

jim
--
All opinions expressed are mine, if you | "I will not be pushed, stamped,
think otherwise, then go jump into turbid | briefed, debriefed, indexed, or
radioactive waters and yell WAHOO !!! | numbered!" - #1, "The Prisoner"
------------------------------------------------------------------------------
KC5VDJ - HF to 23cm KC5VDJ@NW0I.#NEKS.KS.USA.NOAM kc5...@swbell.net
IC-706MkII, IC-T81A, HTX-202, HTX-212, HTX-404, KPC3+, PK-232MBX Grid: EM28px
------------------------------------------------------------------------------
ET has one helluva sense of humor, always anal-probing right-wing schizos!

Jim Bryant

Mar 4, 2000, 3:00:00 AM

to Ronald G. Minnich

In reply:

> On Fri, 3 Mar 2000, Michael Bacarella wrote:
>
> > Can someone tell me why copy-on-write filesystems would be bad?
>
> It's a good idea. Peter Braam and I have written a device (called memdev)
> for linux (sorry!) that implements a virtual-memory-backed copy-on-write
> block device (like the loopback device, but uses anon vm pages for store).
>
> It's pretty interesting. It's quite fast, and copy-on-write does seem to
> work OK for a filesystem. I'm using this thing as one of two pieces of a
> new private name space implementation that would also work quite well on
> freebsd.

Ever experience CDC Cyber NOS? Interesting private-space filesystem
with many apps today.

Mike Smith

Mar 4, 2000, 3:00:00 AM

to Brian Beattie

> > No one will appreciate that happening to their "permanent" data,
> > especially if the OS decides that the best way to get out of debt is by
> > deleting a file :)
>
> Actually, since this is copy-on-write, you do not need the block, until
> you write. If you need to make a copy, it will be on a write system call
> (possibly an inode update), just fail the write ENOSPC or whatever. Or am
> I missing something simple here.

Failing a write into the middle of an existing file with ENOSPC is going
to break any application that's not expecting a potentially sparse file...
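
The trade-off in this subthread can be put in a few lines. The sketch below is a toy model that only counts blocks; Volume, CowCopy and the reserve flag are hypothetical names for illustration. With reserve=False a copy costs nothing up front, but overwriting a block in the middle of the file can fail with ENOSPC. With reserve=True the copy itself fails if space is short, and later overwrites cannot.

```python
import errno
import os


class Volume:
    """Toy volume that only tracks a count of free blocks."""

    def __init__(self, free_blocks):
        self.free = free_blocks

    def take(self, n):
        if n > self.free:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.free -= n


class CowCopy:
    """A copy-on-write copy of an nblocks-long file on the given volume."""

    def __init__(self, volume, nblocks, reserve):
        self.volume = volume
        self.shared = set(range(nblocks))   # blocks still shared with the original
        self.reserved = 0
        if reserve:
            volume.take(nblocks)            # may raise ENOSPC at copy time
            self.reserved = nblocks

    def overwrite(self, index):
        if index not in self.shared:
            return                          # already private: rewrite in place
        if self.reserved > 0:
            self.reserved -= 1              # space was set aside at copy time
        else:
            self.volume.take(1)             # may raise ENOSPC in mid-file
        self.shared.discard(index)


vol = Volume(free_blocks=2)
lazy = CowCopy(vol, nblocks=4, reserve=False)   # succeeds, uses no space
lazy.overwrite(0)
lazy.overwrite(1)
# A third overwrite would now raise OSError(ENOSPC), even though the
# file's size has not changed.
```

An application that has only ever seen ENOSPC when extending a file or filling a hole has no reason to check for it on a plain overwrite, which is the breakage being described.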

--
\\ Give a man a fish, and you feed him for a day. \\ Mike Smith
\\ Tell him he should learn how to fish himself, \\ msm...@freebsd.org
\\ and he'll hate you for a lifetime. \\ msm...@cdrom.com

Matthew Dillon

Mar 4, 2000, 3:00:00 AM

to Ronald G. Minnich

:It's a good idea. Peter Braam and I have written a device (called memdev)
:for linux (sorry!) that implements a virtual-memory-backed copy-on-write
:block device (like the loopback device, but uses anon vm pages for store).
:
:It's pretty interesting. It's quite fast, and copy-on-write does seem to
:work OK for a filesystem. I'm using this thing as one of two pieces of a
:new private name space implementation that would also work quite well on
:freebsd.
:
:note it's not really a file system, but a loopback block device which does
:copy-on-write for new blocks.
:
:You can also use it to easily implement translucent file system behaviour.
:
:ron

I think a copy-on-write FS is an excellent idea. Last year I added
swap-backed support to VN and started working on an I/O infrastructure
for vm_object's (i.e. at the vm_object level rather than the VFS level).
It would not be difficult to finish up that work and give the VN
device the ability to stack vm_object layers, which would allow us to
have a copy-on-write-to-swap layer in front of a partition or file
(or even a copy-on-write-to-file layer in front of a partition, giving
us persistence).
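
To make the layering concrete, here is a toy sketch of stacked copy-on-write layers. It is not VN or vm_object code. Layer is a made-up class, and plain dicts stand in for the partition, file and swap backing stores.

```python
class Layer:
    """One copy-on-write layer; reads fall through to the layer below."""

    def __init__(self, lower=None, blocks=None):
        self.lower = lower
        self.blocks = dict(blocks or {})   # block number -> data held by this layer

    def read(self, n):
        layer = self
        while layer is not None:
            if n in layer.blocks:
                return layer.blocks[n]
            layer = layer.lower
        raise KeyError(n)

    def write(self, n, data):
        # Only the layer being written to changes; lower layers stay intact.
        self.blocks[n] = data


partition = Layer(blocks={0: b"boot", 1: b"root"})   # the underlying partition
file_layer = Layer(lower=partition)                  # would persist across reboots
swap_layer = Layer(lower=file_layer)                 # would be discarded on reboot
file_layer.write(1, b"ROOT")
swap_layer.write(0, b"temp")
```

Reading through swap_layer sees both changes, reading through file_layer sees only the persistent one, and the partition itself is never written.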

FreeBSD already has something similar with nullfs and unionfs, but those
operate at the VFS call level and despite all the bugs I've fixed in
them recently they are *still* a broken pile of chunky brown stuff. We
can do a lot of things with unionfs but we still can't buildworld
reliably.

-Matt
Matthew Dillon
<dil...@backplane.com>

Brian Beattie

Mar 4, 2000, 3:00:00 AM

to Matthew Dillon

On Fri, 3 Mar 2000, Matthew Dillon wrote:

> :> > I think. One problem with "Copy-on-write, when applied to
> :> > file copies is that you need to assign the blocks up front, even if you
> :> > don't copy the data, as otherwise you could run out of space
> :> > when the copy is actually needed.
> :>
> :> That's the only real drawback I've considered.
> :>
> :> People accept it (barely) when the OS commits to providing virtual memory
> :> it does not have, killing processes when the system falls into debt.
> :>
> :> No one will appreciate that happening to their "permanent" data,
> :> especially if the OS decides that the best way to get out of debt is by
> :> deleting a file :)
> :>
> :
> :Actually, since this is copy-on-write, you do not need the block, until
> :you write. If you need to make a copy, it will be on a write system call
> :(possibly an inode update), just fail the write ENOSPC or whatever. Or am
> :I missing something simple here.
>
> The issue here is to ensure that you have sufficient swap.

Swap? I thought we were talking about a copy-on-write filesystem
i.e. disk block, not memory, or did I really miss something

Brian Beattie | The only problem with
bea...@aracnet.com | winning the rat race ...
www.aracnet.com/~beattie | in the end you're still a rat

Matthew Dillon

Mar 4, 2000, 3:00:00 AM

to Brian Beattie

:> > I think. One problem with "Copy-on-write, when applied to
:> > file copies is that you need to assign the blocks up front, even if you
:> > don't copy the data, as otherwise you could run out of space
:> > when the copy is actually needed.
:>
:> That's the only real drawback I've considered.
:>
:> People accept it (barely) when the OS commits to providing virtual memory
:> it does not have, killing processes when the system falls into debt.
:>
:> No one will appreciate that happening to their "permanent" data,
:> especially if the OS decides that the best way to get out of debt is by
:> deleting a file :)
:>
:
:Actually, since this is copy-on-write, you do not need the block, until
:you write. If you need to make a copy, it will be on a write system call
:(possibly an inode update), just fail the write ENOSPC or whatever. Or am
:I missing something simple here.

The issue here is to ensure that you have sufficient swap. There are two
ways to do this: One, make sure you have enough swap to cover likely
operations done on the overlay, or Two, pre-reserve the entire partition's
worth of space in swap.

Both these options already exist for fresh swap-backed VN filesystems
under 4.0 so you'd get them for free if we were to implement overlay
functionality.

Trying to do anything fancier than that will create more problems than
it solves.

-Matt

Julian Elischer

Mar 4, 2000, 3:00:00 AM

to David Scheidt

David Scheidt wrote:
>
> On Fri, 3 Mar 2000, Michael Bacarella wrote:
>
> >
> >
> > Upon reading of Microsoft's fabulous innovations in the filesystem arena,
> > I started playing with some ideas of my own (not to be confused with
> > ORIGINAL ideas)
> >
> > Can someone tell me why copy-on-write filesystems would be bad?
>
> It wouldn't be. This is how NetApp do their .snapshot directories. I think
> they have some white papers on it on their website. It's very handy.

Kirk McKusick is implementing copy-on-write functionality
for UFS. It is used in conjunction with Soft Updates to produce
snapshots. It's not what you asked for, but still relevant,
I think. One problem with copy-on-write, when applied to
file copies, is that you need to assign the blocks up front, even if you
don't copy the data, as otherwise you could run out of space
when the copy is actually needed.

How many files would actually benefit from this? We already symlink and
hardlink quite a few of them..

Julian

--
__--_|\ Julian Elischer
/ \ jul...@elischer.org
( OZ ) World tour 2000
---> X_.---._/ presently in: Perth
v

Louis A. Mamakos

Mar 5, 2000, 3:00:00 AM

to sth...@nethelp.no

> > > Imagine: cp file file2, file and file2 reference the same exact blocks,
> > > but modified chunks of file2 would be given their own private blocks.
> >
> > This is not a microsoft innovation, actually, I believe it was a VMS
> > innovation. It's called a generational filesystem. the original is
> > stored, and later generations of the file are stored as diffs.
>
> As far as I know, VMS simply stores whole files - no diffs involved. Now
> if you go back to for instance Univac 1100 and the Exec-8 OS (I suppose
> it is OS-1100 now), you'll find a system that *did* store the diffs. In
> the form of punched card images! :-)

Well, not really. That was mostly an application convention rather than
being done in the OS. And that all the applications wanted to use
SIR$ SDF to read program file elements was just a coincidence :-)

The cool parts of Exec-8 that we still need (we already have sparse
files) are the virtual filesystem bits, e.g., unloaded files. People
have been struggling with multi-level storage architectures on UNIX
for years, while this was pretty much a solved problem on these 1's
complement 36-bit dinosaurs 30 years ago.

(The notion was that if you didn't use a file in a while, the system
would release the data blocks and mark the file as "unloaded." When
you "assigned"/opened one of these files, a system process would cause
the current backup tape to be loaded and the file restored. When you
began to get low on disk space, likewise a system process would start
and sort all files based on their priority for being unloaded, based
on last reference time, whether there was a current backup, who created
it, etc. It would then begin to release the data blocks until you
achieved a configured threshold.)
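
The unload pass described above can be sketched roughly as follows. This is a simplification, not Exec-8's actual policy: it ranks files only by last reference time and refuses to unload anything without a current backup, and FileInfo and unload_until are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    size: int            # blocks currently held on disk
    last_ref: float      # time of last reference (larger means more recent)
    backed_up: bool      # True if a current backup exists
    loaded: bool = True  # False once the data blocks have been released


def unload_until(files, used, threshold):
    """Release least recently referenced files until used <= threshold.

    Returns the resulting number of blocks in use.
    """
    candidates = sorted(
        (f for f in files if f.loaded and f.backed_up),
        key=lambda f: f.last_ref,
    )
    for f in candidates:
        if used <= threshold:
            break
        f.loaded = False     # the catalogue entry stays; only the data goes
        used -= f.size
    return used


files = [
    FileInfo("a", 50, 1.0, True),
    FileInfo("b", 30, 2.0, False),   # no current backup, so it is never unloaded
    FileInfo("c", 40, 3.0, True),
    FileInfo("d", 10, 4.0, True),
]
remaining = unload_until(files, used=130, threshold=60)
```

Opening an unloaded file would then trigger the restore from backup. That half is not modelled here.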

louie

sth...@nethelp.no

Mar 5, 2000, 3:00:00 AM

to kc5...@swbell.net, jbr...@ppp-207-193-2-159.kscymo.swbell.net

> > Imagine: cp file file2, file and file2 reference the same exact blocks,
> > but modified chunks of file2 would be given their own private blocks.
>
> This is not a microsoft innovation, actually, I believe it was a VMS
> innovation. It's called a generational filesystem. the original is
> stored, and later generations of the file are stored as diffs.

As far as I know, VMS simply stores whole files - no diffs involved. Now
if you go back to for instance Univac 1100 and the Exec-8 OS (I suppose
it is OS-1100 now), you'll find a system that *did* store the diffs. In
the form of punched card images! :-)

Steinar Haug, Nethelp consulting, sth...@nethelp.no