Whoops, sorry about the slow reply: I kept remembering to reply to
this, then somehow getting interrupted by meetings etc. Here I am at
last.
> First, one key type used by the format(s) is the vint. A vint is a
> variable length integer. See here for a detailed explanation:
> http://lucene.apache.org/java/1_4_3/fileformats.html. Values less than
> 128 only require one byte.
I like vints :) By the way, the idea for this came from taking a look
at Google protobufs, which are pretty neat, but drag in an overly
gigantic pile of dependencies so aren't really a great choice here.
(It would be great if they would make a really lightweight python
version that just does the basics.)
> We may end up with several top-level formats. For now I'll just
> describe a possible Linux format, and I'll leave aside the question of
> whether we'll eventually need per-platform, per-filesystem, or modular,
> compositional formats.
On this point, I think what I'd like to see is not exactly
per-anything, but just a numbered set of record formats. Each file's
metadata would consist of zero or more records. The record format you
describe here is just record format number 1 (and we might never need
more than one; at least we'd have a usable system with just the one
you proposed).
> I think it may be best to get started, adjust as we find out what's
> required, and just warn people that the formats will not be stable for a
> while (and perhaps also leave metadata disabled by default).
Leaving it disabled by default for a few weeks probably wouldn't hurt.
I'm guessing it won't take too long to shake this one out, at least to
the point where we're reasonably sure we didn't miss something
important.
> file-type - vint representing the file type, initially using
> ord('r') for regular file
> ord('b') for a block file
> ord('l') for a link, etc.
>
> standard-perms - vint representing a bitfield with the permission bits
> in this order: suid sgid sticky ru wu xu rg wg xg ro wo xo
> The vint makes it possible to easily add more bits later if needed.
I'd rather use Unix-style mode bits to define the filetype. That's
what 'struct stat' does, and it's nice, efficient, and standard. Thus
you wouldn't need the separate filetype vint.
> uid - vint (store the string names more globally)
>
> git - vint (store the string names more globally)
I think it's okay to store both the number and the name together each
time we use them. That's the sort of thing gzip should be able to
handle without any trouble.
> atime - vint nanoseconds
> mtime - vint nanoseconds
> ctime - vint nanoseconds
It pains me a little to wastefully store nanosecond-level detail when
most filesystems can barely handle seconds. But you're probably right.
As a size optimization, we could probably store seconds and
nanoseconds as separate vints. Then, on almost all current systems,
the nanoseconds would always be just a single zero byte.
If we stored (atime_s, mtime_s, ctime_s, atime_ns, mtime_ns,
ctime_ns), then we'd have three zeroes in a row, which gzip could
store as a single symbol, probably improving the compression even
further.
> if symlink
> target-name-length - vint
> target-name - bytes
symlinks should be stored the way the Unix filesystem does: the
symlink's target is the "content" (ie. the "data") of the file. In
fact, bup already stores symlinks this way. As long as the
symlinkiness (file type) is stored in the file attributes, we
shouldn't need to store anything more about symlinks.
> if device
> major-number - vint
> minor-number - vint
ok.
> attributes - vint bitfield for chattr/lsattr permissions (like perms)
This is Linux specific, but I think that's okay. On non-linux
systems, the vint would be a single byte, so no big deal.
> acl-text-length - vint
> acl-text - bytes via acl_to_any_text() via posix1e (unless length 0)
> I'm not yet certain the text format is stable/appropriate.
>
> extended-attributes-count - vint
> for each extended attribute: (via pyxattr)
> attribute-length - vint
> attribute - bytes
I haven't learned enough about ACLs (or xattrs) to be able to say much
about these. Since most files won't have xattrs or ACLs anyway,
perhaps this would be a good use for a secondary record format, and we
can just ignore them for now.
> I've written the encoder, and this representation ends up requiring
> about 40 bytes per-file when there are no attributes or acls, and it's
> not a symlink or device node. Roughly 800 files produce about 28k which
> gzip compresses to less than 5k.
That's very reasonable, I think.
By the way, a couple of things that you didn't mention in the proposal
(perhaps because they're obvious to you):
- I think this format would involve one file per git tree (ie. each
subdir would have a single .bupmeta file with the above format in it).
- There's no need to include the filenames in this file. The metadata
would be encoded in the same order as the filenames in the git tree.
The latter point makes this particular approach very appealing to me:
we get plenty of compressibility by grouping files together (as
opposed to my earlier idea of one-blob-per-file); most files in a
particular directory should have near-identical metadata. But we also
have inherent uncompressed efficiency (compared to, say, metastore)
because we don't have to store the paths or even the filenames. And
we can walk through the metadata incrementally in exactly the same way
as we walk incrementally through a tree.
Nice work!
I'd love to see an implementation of this one.
Have fun,
Avery
> On this point, I think what I'd like to see is not exactly
> per-anything, but just a numbered set of record formats. Each file's
> metadata would consist of zero or more records. The record format you
> describe here is just record format number 1 (and we might never need
> more than one; at least we'd have a usable system with just the one
> you proposed).
What problem are you trying to address with multiple records per-file?
I have some ideas, but wasn't sure.
> I'd rather use Unix-style mode bits to define the filetype. That's
> what 'struct stat' does, and it's nice, efficient, and standard. Thus
> you wouldn't need the separate filetype vint.
I don't feel very strongly about it, but if we don't manage the format
ourselves, I'd like to make sure we document the it clearly (i.e. record
whatever the unofficial "standard" is for st_mode). We may also want a
good suite of regression tests.
> I think it's okay to store both the number and the name together each
> time we use them. That's the sort of thing gzip should be able to
> handle without any trouble.
OK that's easy, and I'd prefer it. I was just trying to optimize, given
the concerns you expressed earlier about size. (Although I suppose
keeping the user and group strings in a separate table might be a little
easier on the parser and on RAM use when reading or manipulating a large
number of files.)
> It pains me a little to wastefully store nanosecond-level detail when
> most filesystems can barely handle seconds. But you're probably right.
> As a size optimization, we could probably store seconds and
> nanoseconds as separate vints. Then, on almost all current systems,
> the nanoseconds would always be just a single zero byte.
>
> If we stored (atime_s, mtime_s, ctime_s, atime_ns, mtime_ns,
> ctime_ns), then we'd have three zeroes in a row, which gzip could
> store as a single symbol, probably improving the compression even
> further.
We might want to test before adding any complexity here. I wouldn't be
surprised if gzip makes the straightforward approach more than efficient
enough.
Though OTOH, at least on Linux, during a restore I think we'll be using
sec and nsec separately, and handling the values separately might be
easier in some languages.
> symlinks should be stored the way the Unix filesystem does: the
> symlink's target is the "content" (ie. the "data") of the file. In
> fact, bup already stores symlinks this way. As long as the
> symlinkiness (file type) is stored in the file attributes, we
> shouldn't need to store anything more about symlinks.
Oh, right, of course.
>> attributes - vint bitfield for chattr/lsattr permissions (like perms)
>
> This is Linux specific, but I think that's okay. On non-linux
> systems, the vint would be a single byte, so no big deal.
So this is one of the complexities that made me wonder about the
"per-anything" issue. Linux has chattr flags, OS X has chflags flags,
and I'd imagine that other OSes have something else.
I also suspect that while flags like "immutable" may be somewhat common
across platforms, there will be other flags that aren't, and I imagine
there might also be flags with the same names on two platforms, but
different behaviors. Some we can map, some we can't.
Also, as an example with respect to the "per-anything" issue, imagine
that OS X chflags attributes really don't match chattr attributes. Then
we have to decide how to handle backups and restores, not just with
respect to the encoding(s), but also with respect to the organization
and structure of the code that handles backups and restores from one
platform to another.
> I haven't learned enough about ACLs (or xattrs) to be able to say much
> about these. Since most files won't have xattrs or ACLs anyway,
> perhaps this would be a good use for a secondary record format, and we
> can just ignore them for now.
If we can, I'd like to try to capture them from the start. Otherwise I
believe we might break restores of systems using selinux (for example).
(Though I'm also not an expert here.)
Note that xattrs seem fairly straightforward -- they just appear to be
string key to string (or binary?) value pairs.
ACLs are easy if the text format from acl_to_any_text() intended to be
stable/reliable. Otherwise we may have to design a representation.
> - I think this format would involve one file per git tree (ie. each
> subdir would have a single .bupmeta file with the above format in it).
Right. I had assumed that the metadata file would contain:
metadata-for-file-1
metadata-for-file-2
metadata-for-file-3
...
> - There's no need to include the filenames in this file. The metadata
> would be encoded in the same order as the filenames in the git tree.
Right.
> I'd love to see an implementation of this one.
The encoder already works (excluding the changes you've suggested), but
I haven't done anything about restoration. Once we settle some of these
details, I'll rework the encoder, and send it on.
Thanks
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
Extensibility, backward compatibility, and future-proofness. The idea
is that if we add new kinds of attribute support to a future version
of bup, they would be in a record *in addition* to the standard record
type 1. New backups would still be readable and restorable with old
versions of bup (albeit without the newly-introduced attributes).
>> I'd rather use Unix-style mode bits to define the filetype. That's
>> what 'struct stat' does, and it's nice, efficient, and standard. Thus
>> you wouldn't need the separate filetype vint.
>
> I don't feel very strongly about it, but if we don't manage the format
> ourselves, I'd like to make sure we document the it clearly (i.e. record
> whatever the unofficial "standard" is for st_mode). We may also want a
> good suite of regression tests.
Yes, I agree with that.
>> I think it's okay to store both the number and the name together each
>> time we use them. That's the sort of thing gzip should be able to
>> handle without any trouble.
>
> OK that's easy, and I'd prefer it. I was just trying to optimize, given
> the concerns you expressed earlier about size. (Although I suppose
> keeping the user and group strings in a separate table might be a little
> easier on the parser and on RAM use when reading or manipulating a large
> number of files.)
Yeah, a separate table might be slightly easier on the parser, but
probably not importantly so. And it would make the parser more
*complex*, which is probably not worth it.
What I would expect is that for most directories, all the files in the
directory will have the same uid/gid. So the entire
(uid,username,gid,groupname) blob will be compressed by zlib down to a
single token, thus no less efficient than just having the uid/gid or
using a lookup table.
It would be interesting to try it both ways to see how small the
compressed data is.
>> It pains me a little to wastefully store nanosecond-level detail when
>> most filesystems can barely handle seconds. But you're probably right.
>> As a size optimization, we could probably store seconds and
>> nanoseconds as separate vints. Then, on almost all current systems,
>> the nanoseconds would always be just a single zero byte.
>>
>> If we stored (atime_s, mtime_s, ctime_s, atime_ns, mtime_ns,
>> ctime_ns), then we'd have three zeroes in a row, which gzip could
>> store as a single symbol, probably improving the compression even
>> further.
>
> We might want to test before adding any complexity here. I wouldn't be
> surprised if gzip makes the straightforward approach more than efficient
> enough.
>
> Though OTOH, at least on Linux, during a restore I think we'll be using
> sec and nsec separately, and handling the values separately might be
> easier in some languages.
A test would definitely be good. I have a feeling we'd notice a
fairly significant effect. I spent a little too much time reading the
zlib file format spec a while ago :)
>>> attributes - vint bitfield for chattr/lsattr permissions (like perms)
>>
>> This is Linux specific, but I think that's okay. On non-linux
>> systems, the vint would be a single byte, so no big deal.
>
> So this is one of the complexities that made me wonder about the
> "per-anything" issue. Linux has chattr flags, OS X has chflags flags,
> and I'd imagine that other OSes have something else.
>
> I also suspect that while flags like "immutable" may be somewhat common
> across platforms, there will be other flags that aren't, and I imagine
> there might also be flags with the same names on two platforms, but
> different behaviors. Some we can map, some we can't.
>
> Also, as an example with respect to the "per-anything" issue, imagine
> that OS X chflags attributes really don't match chattr attributes. Then
> we have to decide how to handle backups and restores, not just with
> respect to the encoding(s), but also with respect to the organization
> and structure of the code that handles backups and restores from one
> platform to another.
Yeah. I'm not sure there's any particular value to trying to map the
attributes. We should probably just have separate "linuxattrs" and
"macattrs" and (if applicable) "windowsattrs" vints. For backups made
on system that doesn't support the attributes in question, it only
costs one byte; no big deal.
>> I haven't learned enough about ACLs (or xattrs) to be able to say much
>> about these. Since most files won't have xattrs or ACLs anyway,
>> perhaps this would be a good use for a secondary record format, and we
>> can just ignore them for now.
>
> If we can, I'd like to try to capture them from the start. Otherwise I
> believe we might break restores of systems using selinux (for example).
> (Though I'm also not an expert here.)
>
> Note that xattrs seem fairly straightforward -- they just appear to be
> string key to string (or binary?) value pairs.
Right, selinux depends on xattrs. However, if we don't support them
from the start, then we simply don't support systems that rely on
selinux at first. I don't have any such systems, so I won't be
affected :)
> ACLs are easy if the text format from acl_to_any_text() intended to be
> stable/reliable. Otherwise we may have to design a representation.
Yeah, I haven't looked too closely at ACLs. Probably the text format
is at least a bit inefficient, and a binary one would be mildly
better. This would probably be a very good use for one of the
additional record types.
>> I'd love to see an implementation of this one.
>
> The encoder already works (excluding the changes you've suggested), but
> I haven't done anything about restoration. Once we settle some of these
> details, I'll rework the encoder, and send it on.
Does it also include the resulting file as part of the tree object?
That would at least make our backups complete.
But doesn't do much about restoration at the moment as it is, so
that's no particular flaw of your metadata implementation :)
Have fun,
Avery
>> What problem are you trying to address with multiple records per-file?
>> I have some ideas, but wasn't sure.
>
> Extensibility, backward compatibility, and future-proofness. The idea
> is that if we add new kinds of attribute support to a future version
> of bup, they would be in a record *in addition* to the standard record
> type 1. New backups would still be readable and restorable with old
> versions of bup (albeit without the newly-introduced attributes).
That's what I suspected. So did you have a particular approach in mind?
Given variable length metadata per file, one approach would be to just
length prefix everything, i.e. the metadata for one file would look like
this:
metadata-component-count
component-length-vint component-type-vint data...
component-length-vint component-type-vint data...
component-length-vint component-type-vint data...
...
An older version of bup would be able to tell from the type-vint whether
or not it knows about the component, and if not, just skip
component-length bytes.
> A test would definitely be good. I have a feeling we'd notice a
> fairly significant effect. I spent a little too much time reading the
> zlib file format spec a while ago :)
OK, I'll plan to test the effect of including user and group names, and
the effect of the alternate time encodings for some larger set of files.
> Does it also include the resulting file as part of the tree object?
> That would at least make our backups complete.
No. At least for the moment, I was leaving the "tree side" to you.
Right now, I'm just working on code to encode the metadata for a given
file. After that I was planning to work on the code to decode and apply
the metadata.
I've been lurking here for the past couple of days as bup looks really really interesting. I haven't tried it out yet though. (I run OpenSolaris for ZFS so I may have to fix things to get it going.)
However, if you are looking for an extensible, compact, typed, hierarchical encoding for data you might want to check out something like googles ProtoBufs or Facebooks Thrift.
Not clear taking a dependency is worth it though.
Joe
> --
> To unsubscribe, reply using "remove me" as the subject.
That sounds about right. I think I would probably prefer to skip the
component-count and just have a zero component length as a terminator
(since the component-type-vint would always make the length >= 1). I
don't know why I'd prefer that; it just seems slightly nicer to me.
Maybe I'm just a null-terminator kind of guy.
>> Does it also include the resulting file as part of the tree object?
>> That would at least make our backups complete.
>
> No. At least for the moment, I was leaving the "tree side" to you.
> Right now, I'm just working on code to encode the metadata for a given
> file. After that I was planning to work on the code to decode and apply
> the metadata.
Okay, fair enough. I can handle that part :)
Have fun,
Avery
It'll be interesting to see how OpenSolaris works out. bup doesn't
contain much unportable stuff, as long as you have a C compiler and
python installed. Please let us know how it turns out.
> However, if you are looking for an extensible, compact, typed,
> hierarchical encoding for data you might want to check out
> something like googles ProtoBufs or Facebooks Thrift.
I looked at these earlier. Unfortunately Thrift has lots of
non-optional RPC stuff, and the Protobufs source is 1.8 megs
compressed (about 8.3x the entire bup source .tar.gz), which is a bit
unreasonable (and since protobufs isn't available preinstalled on most
platforms, we'd probably have to bundle it). It also requires a
C++-based "protocol compiler" (generating python with C++?? That
*can't* be the easy way to solve my problem). And as a bonus, the
README in protobuf's python subdir says:
"The Python implementation of Protocol Buffers is not as mature as the C++
and Java implementations. It may be more buggy, and it is known to be
pretty slow at this time."
I can make slow, buggy code by myself and it won't be 1.8 megs :)
That said, the binary format we're talking about for the bup metadata
file is inspired by (though not compatible with) protobufs.
Particularly our use of varints and my suggestion for using a type
identifier for subrecords.
Have fun,
Avery
It's not surprising since I think the python protobufs implementation
was the last one to be added. But it's pretty obvious that it's a
second-class citizen. Maybe someday it won't be.
> I'll play with OpenSolaris some time this week and report back. I have some
> friends who are running FreeBSD for ZFS also. That should be an easy port
This would be great. Feel free to submit a patch to the README (in
the "reasons you might want to avoid bup" section) even if you find
that no patches are required :)
Have fun,
Avery
> On Thu, Apr 1, 2010 at 2:57 AM, Rob Browning <r...@defaultvalue.org> wrote:
>> Avery Pennarun <apen...@gmail.com> writes:
>> Though OTOH, at least on Linux, during a restore I think we'll be using
>> sec and nsec separately, and handling the values separately might be
>> easier in some languages.
>
> A test would definitely be good. I have a feeling we'd notice a
> fairly significant effect. I spent a little too much time reading the
> zlib file format spec a while ago :)
OK, I ran some tests, and although I had one case where the "single ns
integer" approach was better, it wasn't a lot better, and in the other
cases, including some synthetic cases where I forced ns to 0, your
suggestion was better, sometimes *much* better, so I'll adjust the
encoding.
> ACLs are easy if the text format from acl_to_any_text() intended to be
> stable/reliable. Otherwise we may have to design a representation.
It looks like you can encode/decode this format with acl_to_any_text(3)
and acl_from_text(3), and the format may be stable, but it also looks
like it only provides either user/group names or user/group ids, but not
both (depending on the options).
I assume we need to record both so that we can handle the same cases as
tar's --numeric-owner, i.e. if you're restoring a root filesystem for
some other system (to a chroot perhaps) and you want to preserve the
original system's user and group ids.
Also apparently the to/from text functions are a "non-portable, Linux
specific extension", so if we use the format, we may have to write our
own decoder for other systems, unless the posix1e module is availble
there, and handles the translation.
For those curious, here's what the text format looks like for a file
that only has the traditional mode perms:
u::rw-,g::r--,o::r--
and for files that have more (see acl(5)):
u::rw-,u:lisa:rw-,g::r--,g:toolies:rw-,m::r--,o::r--
g:toolies:rw,u:lisa:rw,u::wr,g::r,o::r,m::r
Note that the encoder will currently only create a posix1e acl record if
the file in question is on an fs that supports acls. We might also want
to add an optimization to omit any acl elements that match the mode
perms, assuming that's feasible.
So that's what I know so far. Right now the encoder just uses the
to/from text functions, so at a minimum, we'll need to do something to
accomodate the id/name issue.
Also note that all of this only applies to posix1e (retracted standard)
ACLs, which I think are what's in common use, at least on Linux.
They've actually been superceded by an offical NFSv4 ACL standard, which
will probably grow in use.
To handle that I think we should just have two different record formats.
For now we'll record posix1e ACLs, and later we can add a new record
type for NFSv4 ACLs.
Yup, that would make sense.
> Also apparently the to/from text functions are a "non-portable, Linux
> specific extension", so if we use the format, we may have to write our
> own decoder for other systems, unless the posix1e module is availble
> there, and handles the translation.
The Linux-specific extension doesn't make me too excited...
> For those curious, here's what the text format looks like for a file
> that only has the traditional mode perms:
>
> u::rw-,g::r--,o::r--
>
> and for files that have more (see acl(5)):
>
> u::rw-,u:lisa:rw-,g::r--,g:toolies:rw-,m::r--,o::r--
> g:toolies:rw,u:lisa:rw,u::wr,g::r,o::r,m::r
...and this format seems really wasteful to me anyway. At the very
least, the "rw-" type stuff could be encoded as a single integer. It
might be best to just have a list of
(code,uid/gid,username/groupname), where "code" represents the u/g/o/m
and the rwx bits.
I would rather take full responsibility for the data format than using
a known-nonportable one, especially a text-based known-nonportable
one, which will inevitably be much slower than a non-text-based
format.
Is there a standardized *binary* representation already? Presumably
you must be converting something to/from text...
> Note that the encoder will currently only create a posix1e acl record if
> the file in question is on an fs that supports acls. We might also want
> to add an optimization to omit any acl elements that match the mode
> perms, assuming that's feasible.
Not sure that's a good idea; we should try to represent exactly what's
on disk so that we can put it back exactly like that upon restore.
> Also note that all of this only applies to posix1e (retracted standard)
> ACLs, which I think are what's in common use, at least on Linux.
> They've actually been superceded by an offical NFSv4 ACL standard, which
> will probably grow in use.
>
> To handle that I think we should just have two different record formats.
> For now we'll record posix1e ACLs, and later we can add a new record
> type for NFSv4 ACLs.
That makes sense to me. I don't know anything about NFSv4 ACLs
though; are they almost identical? Maybe a single format that can
represent both would make the most sense. Or not.
Have fun,
Avery
> I would rather take full responsibility for the data format than using
> a known-nonportable one, especially a text-based known-nonportable
> one, which will inevitably be much slower than a non-text-based
> format.
Sounds good. I'll look in to it.
> Is there a standardized *binary* representation already? Presumably
> you must be converting something to/from text...
There's acl_type_t, but I don't think it's provided at the python level,
and AFAIK, it's opaque.
> Not sure that's a good idea; we should try to represent exactly what's
> on disk so that we can put it back exactly like that upon restore.
Actually, in these cases, I suspect the ACL might just be redundant
(i.e. what's returned when the fs supports acls but there are no acl
entries -- there's just the mode), but it doesn't really cost much to
keep both, and it'd be easier. And at the moment, I don't know for sure
that there can't be some kind of legitimate mismatch.
> That makes sense to me. I don't know anything about NFSv4 ACLs
> though; are they almost identical? Maybe a single format that can
> represent both would make the most sense. Or not.
I believe that NFSv4 ACLs are effectively a superset physically, but
there may be some slightly different semantics. So for now, I think I'd
rather just spend a tag value to preserve the distinction in the stored
data.
Ah, that does sound vaguely familiar. There might always be exactly
one ACL in the text representation that just represents the Unix file
mode. In that case, there's no need to store it in the ACL entries,
since changing the file mode would change that entry anyway, so
storing it twice (and thus risking getting the two out of sync) would
add to the confusion.
>> That makes sense to me. I don't know anything about NFSv4 ACLs
>> though; are they almost identical? Maybe a single format that can
>> represent both would make the most sense. Or not.
>
> I believe that NFSv4 ACLs are effectively a superset physically, but
> there may be some slightly different semantics. So for now, I think I'd
> rather just spend a tag value to preserve the distinction in the stored
> data.
Ok.
Avery