Issue 140 in zumastor: Allow using more compact btree nodes for < 16TB snapstores


codesite...@google.com

Jun 2, 2008, 6:24:24 PM
to zuma...@googlegroups.com
Issue 140: Allow using more compact btree nodes for < 16TB snapstores
http://code.google.com/p/zumastor/issues/detail?id=140

Comment #1 by vandebo:
I would appreciate feedback on this approach to resolving issue 140.

Instead of using an additional format that uses 32bit values (which could
only be used on small volumes), I suggest an alternate format that will
support both 32bit and 64bit fields in a space efficient manner.

B-tree internal nodes must encode an array of pairs (keys and disk
addresses of other nodes). Either may need to be >32bits for sufficiently
large backing devices or origin volumes. If we steal 2 bits from each
32bit word, we can encode a pair in 8 to 16 bytes, depending on the values
encoded. For each 32bit word, call the MSB bit the start of record bit,
and the next MSB bit the long field bit. Using these 2 bits, we can encode
any pair of 30bit and 60bit values as such:

pair       bit 0    bit 32   bit 64   bit 96
u32 u32    10...    00...
u32 u64    10...    01...    00...
u64 u32    11...    00...    00...
u64 u64    11...    00...    01...    00...

The values stored in these fields are sector and chunk offsets, but as far
as I can tell, ddsnapd has to represent these values as byte offsets at
some point. So at most 55bits are useful. This encoding provides 60bits
of storage, so does not impose additional limits.
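
To make the decoding concrete, here is a rough sketch of a decoder
(illustrative only, not proposed as final code; it assumes the key is stored
before the disk address and that the high 30 bits of a long field sit in its
first word):

#include <stdint.h>

#define START_BIT  (1u << 31)           /* set on the first word of a record */
#define LONG_BIT   (1u << 30)           /* field continues into the next word */
#define PAYLOAD(w) ((w) & 0x3fffffffu)  /* low 30 data bits of a word */

/* Decode one (key, address) pair; returns the number of words consumed (2-4). */
static int decode_pair(const uint32_t *w, uint64_t *key, uint64_t *addr)
{
    int i = 0;

    if (w[i] & LONG_BIT) {              /* 60 bit key spread over two words */
        *key = ((uint64_t)PAYLOAD(w[i]) << 30) | PAYLOAD(w[i + 1]);
        i += 2;
    } else
        *key = PAYLOAD(w[i++]);         /* 30 bit key in a single word */

    if (w[i] & LONG_BIT) {              /* 60 bit disk address */
        *addr = ((uint64_t)PAYLOAD(w[i]) << 30) | PAYLOAD(w[i + 1]);
        i += 2;
    } else
        *addr = PAYLOAD(w[i++]);        /* 30 bit disk address */

    return i;
}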

Of course there are a couple drawbacks to this approach, but I believe they
are outweighed by the space savings. It is an on-disk format after all.
Specifically, values can no longer simply be copied from disk and used;
they must be decoded. Also, values are not necessarily aligned to their
size, but that will be corrected in the decode phase. Note, however, that this
format still allows for a binary search over the data - instead of dividing
the records in half, divide the memory in half and search for the record
(word with MSB set) closest to the center.
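
And a sketch of the binary search just described, reusing decode_pair() from
above (again only to illustrate the idea, not tuned or final):

/* Find the last record whose key is <= target and return its decoded
 * address; returns 0 if no such record exists.  Probes the raw word array
 * and snaps each probe forward to the next word with the start-of-record
 * bit set. */
static int search_entries(const uint32_t *words, int nwords, uint64_t target,
                          uint64_t *addr)
{
    int lo = 0, hi = nwords, found = 0;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2, pos = mid;
        uint64_t key, val;

        while (pos < hi && !(words[pos] & START_BIT))
            pos++;                       /* snap to a record boundary */
        if (pos == hi) {                 /* no record starts in [mid, hi) */
            hi = mid;
            continue;
        }
        decode_pair(&words[pos], &key, &val);
        if (key <= target) {
            *addr = val;                 /* best candidate so far */
            found = 1;
            lo = pos + 1;                /* look for a later match */
        } else
            hi = pos;                    /* record at pos is past the target */
    }
    return found;
}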

With this in mind I propose the following structure for b-tree nodes:

struct enode {
    u64 first_sector;
    u32 free_bytes;
    u32 child_entries[];
};

Where child_entries are encoded using the above scheme.

The array of struct exception's in the leaf chunk can also be updated to
use the encoding.

--
Steve


--
You received this message because you are listed in the owner
or CC fields of this issue, or because you starred this issue.
You may adjust your issue notification preferences at:
http://code.google.com/hosting/settings

Jiaying Zhang

Jun 2, 2008, 7:11:22 PM
to codesite...@google.com, zuma...@googlegroups.com

Sounds like a solid approach to me.

> With this in mind I propose the following structure for b-tree nodes:
>
> struct enode {
>     u64 first_sector;
>     u32 free_bytes;
>     u32 child_entries[];
> };
>
> Where child_entries are encoded using the above scheme.

Most btree space is taken by leaf nodes, see eleaf structure. Frank
once created a graph showing btree node and leaf structures. That may
be a good reference. I attached it in this email.

Jiaying

BTree.pdf

Steve VanDeBogart

Jun 3, 2008, 4:36:53 PM
to zuma...@googlegroups.com

>> <snip>


>
> Most btree space is taken by leaf nodes, see eleaf structure. Frank
> once created a graph showing btree node and leaf structures. That may
> be a good reference. I attached it in this email.

This is a good point. Are there any numbers about the average number of
leaf vs internal nodes? I would guess that it's in the neighborhood of
15:1?

Trying to apply the same technique to the leaf structure would yield
a reduction of at most a 1/4. However, we could increase the number of
supported snapshots to 128 and reduce the storage of leaf nodes by
1/2 to 3/4 by using this encoding + removing the share field.

Since 1) the chunk field is a chunk offset, 2) chunk offsets must be
converted to byte offsets within a 64bit variable, and 3) the minimum
chunk size is 512 bytes, we have 9 bits to play with in the chunk offset.
If we use one bit per word for a 32bit/64bit selector, we can use the
remaining 7 bits to store a relative snapshot number.

Alternatively, changing the share field to a 32bit relative snapshot
field would reduce the storage needs by 1/4 to 1/2 and limit the
number of snapshots to 1 billion.

RFC
--
Steve

Daniel Phillips

Jun 3, 2008, 7:13:18 PM
to Steve VanDeBogart, zuma...@googlegroups.com
Steve VanDeBogart wrote:
> On Mon, 2 Jun 2008, Jiaying Zhang wrote:
>> Most btree space is taken by leaf nodes, see eleaf structure. Frank
>> once created a graph showing btree node and leaf structures. That may
>> be a good reference. I attached it in this email.
>
> This is a good point. Are there any numbers about the average number of
> leaf vs internal nodes? I would guess that it's in the neighborhood of
> 15:1?

Closer to 128:1 for 4K metadata blocks. Each leaf entry is 16 bytes for
the physical chunk address plus sharing bitmap and 8 bytes for internal
leaf index roughly speaking (depends on how many physical exceptions per
logical address) so say 24 bytes/entry. Average leaf fullness is 75%
(due to splitting) so 4096/24 * .75 ~= 128.

So as Jiaying points out, we don't save much snapshot store space by
making index nodes smaller. The btree does get flatter. A rough
calculation says cutting the size of index node entries in half increases the
size of snapshot store addressable by a two level btree from
19 GB to 77 GB (a factor of two per btree level). This difference could
be noticeable and is maybe enough of an argument for some effort into
including index nodes in the "small format btree" option.

Daniel

Steve VanDeBogart

Jun 3, 2008, 7:59:19 PM
to zuma...@googlegroups.com
On Tue, 3 Jun 2008, Steve VanDeBogart wrote:

> <snip>


>
> Trying to apply the same technique to the leaf structure would yield
> a reduction of at most a 1/4. However, we could increase the number of
> supported snapshots to 128 and reduce the storage of leaf nodes by
> 1/2 to 3/4 by using this encoding + removing the share field.
>
> Since 1) the chunk field is a chunk offset, 2) chunk offsets must be
> converted to byte offsets within a 64bit variable, and 3) the minimum
> chunk size is 512 bytes, we have 9 bits to play with in the chunk offset.
> If we use one bit per word for a 32bit/64bit selector, we can use the
> remaining 7 bits to store a relative snapshot number.
>
> Alternatively, changing the share field to a 32bit relative snapshot
> field would reduce the storage needs by 1/4 to 1/2 and limit the
> number of snapshots to 1 billion.

So I realized that it's not obvious how relative snapshotting would
work. Below is a brief description. This method assumes that you only
snapshot the origin.

Whenever a snapshot is created, a variable, "the current snapshot
number", is incremented. The super block would store the current
snapshot number, and a list of snapshots along with the snapshot number
at which each was created.

- When there is a write to the origin volume, we add an exception that
points to the copy of the origin chunk and the current snapshot
number.

- When reading a snapshot block, the exception list is searched for the
newest exception that has a snapshot number less than the snapshot
number of the snapshot. This shares exception blocks with all
snapshots created since the last exception block by default.

- When writing a snapshot block, a new snapshot number is not created,
but we may need to add two entries to the exception list: one that
indicates this exception (and forward) should use a new copy of the
block, and another that indicates that snapshots after this one
should use the original block.

You can determine which snapshots are using a particular exception entry
by looking at all the exceptions in an exception list for a particular
block together with the set of snapshots.

When the snapshot number is going to wrap, we may have to walk the btree
and renumber (compact) the existing snapshots to remove gaps between
them. However, in the common case, people often delete older snapshots
first, so there may always be a gap between the current snapshot and the
oldest snapshot.

--
Steve

Daniel Phillips

Jun 4, 2008, 12:09:28 PM
to zuma...@googlegroups.com, Steve VanDeBogart

Somewhere in there is an intriguing and workable idea I think. Handling
that wrapping corner case sounds like a significant amount of subtle,
rarely used code, the worst kind to debug.

I used a similar idea in a prototype version control system. Each data
element (which could be a directory entry or a region of file data) has
a birth and a death version number. Arbitrary branching of versions is
permitted. Any given data element is "live" for a given version if the
version lies anywhere on the ancestry path from the version in which
the data element died to the version in which it was born. I _think_
this strategy maps onto your proposal above, and that for any chunk of
a snapshot that has not been written since the snapshot was created,
the "death" version does not need to be recorded, and this is by far the
most common case.

So I think you are on the track of something elegant. I doubt that it
is really necessary ever to "compress" the snapshot sequence numbers;
instead we examine the ancestry graph, which in the case of ddsnap is a
linear list, not a tree, a simpler situation than my version control
scheme. At the same time, it is nice to know that the idea does not
preclude snapshots of snapshots.

We should keep discussing this, work out the algorithms, analyze it and
search for holes in the logic, but it is a rather bigger change than I
originally had in mind for a "small format" snapshot store option. I
don't think we should attempt it for the next couple of releases at
least.

Daniel

Steve VanDeBogart

Jun 5, 2008, 9:06:07 PM
to zuma...@googlegroups.com
More comments and RFC below...

I've pushed a little harder on this idea and have a more concrete
description for feedback.

First, storing the serial number in the leaves and then having to deal
with wrapping and compaction issues is a bummer. If we introduce a
level of indirection, we can make those problems become much easier to
deal with: we can store a snapshot id in the leaf, and then store a
table in the superblock ordering the snapshot ids. This means that
wrapping and compaction can be done solely in the superblock. It turns
out that this information is already present in the superblock,
snaplist[].tag is a unique id for each snapshot and snaplist[].ctime
gives us ordering.

With this in place, how do we implement the functions:
    chunk_t map(snapshot, offset)
    make_unique(snapshot, offset)
    delete(snapshot, offset)

map:
- The origin always maps to the origin volume, otherwise:
- find a set of exceptions for the address, as is currently done
(each exception will be a pair: snapshot tag, chunk offset)
- Using the data in the super block, order the exception list by time
of snapshot creation
- Search the list for the newest exception that is older than the
passed snapshot number
- The corresponding chunk offset is the answer unless there is no
such exception in which case the origin chunk is used.

make_unique:
- If the branch of interest is the origin then we need to make a new
exception with the tag of the newest snapshot unless there are no
snapshots or an exception already exists with the newest snapshot
tag.
- There are two conditions for a snapshot to be unique.
1) There must be an exception with the same snapshot tag as the
snapshot of interest. If it doesn't exist, then we create it
with the chunk of the newest exception that's older than the
snapshot of interest.
2) There must be an exception with a snapshot tag of the immediately
older snapshot (this means we need a fake "first" snapshot tag).
(Note, if an exception is created to meet condition one, it may
share the chunk with another exception and therefore, we must
consider all exceptions in an exception list when freeing a chunk)

delete:
- If the snapshot is not unique for the offset, there is nothing to do
- If the snapshot is unique, then we can delete the newest exception
that is older than the passed snapshot tag.
- If the snapshot is the oldest snapshot in the superblock, then we can
update the tag of the oldest remaining exception to be the "first"
snapshot tag. (This allows the tag of the most recently deleted
snapshot to be reused.)

--
Steve

Steve VanDeBogart

Jun 7, 2008, 12:55:29 AM
to zuma...@googlegroups.com
Even more...

On Thu, 5 Jun 2008, Steve VanDeBogart wrote:
> <snip>
>

Actually, this is broken... Hopefully I get it right this time.

Basically, we are trying to generate a version graph that is encoded
similarly to run length encoding. Since we disallow snapshots of
snapshots, the graph can be easily represented in a list.

So, when we are trying to find a chunk, we need to search within the
run length encoding to find the correct neighbor. My previous
description walked the wrong direction along the version graph and
failed to fully consider chunk sharing. Another try:

map_offset:
- The methodology of choice is exceptions from the origin, so the origin
"snapshot" always maps to the origin volume. Otherwise
- find a set of exceptions for the address, as is currently done
(each exception will be a pair: snapshot tag, chunk offset)
- Using the data in the super block, we transform the set of exceptions
into an RLE encoded version graph for the offset. Because we limit
branching, it condenses to a list of exceptions, where chunks are
shared among all snapshots between exceptions. For example, an
exception at snapshot n for chunk c plus an exception at snapshot m
for chunk k means that snapshots n+1 to m share chunk k.
- So to find the chunk for the passed snapshot, we search the exception
list for the first exception tag that is greater than the passed
snapshot tag.
- If there is no exception, then the snapshot is sharing with the
origin.
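
To pin the lookup down, a rough sketch in C (illustrative only; it assumes the
exception list has already been put into snapshot creation order via the
superblock, and, per the example above, treats an exception's tag as naming
the last snapshot of its span, so the comparison is "at or after"):

#include <stdint.h>

/* Names are made up for the sketch, not taken from ddsnap. */
struct span_exception {
    uint32_t order;   /* creation-order position of the exception's snapshot tag */
    uint64_t chunk;   /* physical chunk shared by every snapshot in the span */
};

/* Return the chunk the given snapshot should read, or 0 for "share with origin". */
static uint64_t map_offset(const struct span_exception *list, int count,
                           uint32_t snap_order)
{
    int i;

    for (i = 0; i < count; i++)
        if (list[i].order >= snap_order)   /* first span at or after us */
            return list[i].chunk;
    return 0;   /* no covering exception: still sharing with the origin */
}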

Uniqueness -- The basic idea to determine if the version graph is unique
for an offset is to determine if the RLE encoding of exceptions is doing
any compression for the given version node.
- The origin is unique if the last snapshot has an exception.
- A snapshot is unique if both
1) It has an exception and
2) The previous snapshot has an exception.

make_unique_offset:
- In both cases (origin & snapshot) there may be multiple exceptions with
the same chunk, so we must scan all exceptions to ensure that the
chunk is unique. If it is not, then we must copy the chunk.
Additionally, we have to update the version graph structure as below.
- If the branch of interest is the origin then we need to make a new
exception with the tag of the newest snapshot unless there are no
snapshots or an exception already exists with the newest snapshot tag.
- There are two conditions to make a snapshot chunk unique.
1) There must be an exception with the same snapshot tag as the
snapshot of interest. If it doesn't exist, then we create it with
a copy of the chunk in the next exception, or of the origin if
there is no next exception.
2) There must be an exception with a snapshot tag of the immediately
older snapshot unless this is the first snapshot. If we need to
create the exception, we use the chunk that map would return for
this snapshot.
- If an exception for the newest tag already exists, we have to search
all of the exception chunks to see if the origin chunk is shared.

delete:
- If the version graph is not unique for the snapshot (i.e. there is not
an exception for this snapshot or there is not an exception for the
previous snapshot), then we are done. Otherwise,
- We can delete the exception for this snapshot.
- We should then consider the exception for the previous snapshot and
the next exception. If they map the same chunk, we can delete the
exception for the previous snapshot.
- We have to examine all of the exceptions to see if the chunk was
unique. If so, we can free it.

Note: The RLE version graph starts out maximally shared in terms of
chunks and we only copy a chunk when it will change, so we never
unnecessarily copy a chunk.

--
Steve

Daniel Phillips

Jun 9, 2008, 9:45:49 PM
to Steve VanDeBogart, zuma...@googlegroups.com
Steve VanDeBogart wrote:
> Basically, we are trying to generate a version graph that is encoded
> similarly to run length encoding. Since we disallow snapshots of
> snapshots, the graph can be easily represented in a list.

To clarify, what I meant by "version graph" is the data structure that
allows us to determine membership of a physical entry in a snapshot. I
think you are now applying the term "version graph" to the list of
physical entries, not what I intended.

When we restrict ourselves to the case where we have only a single
linear chain of snapshots, the graph is a simple list and is in fact
just our snapshot list with the bit offset no longer used (instead we
use the snapshot tag directly).

So I will read "physical entry list" where you have written "version
graph" below, and we will introduce a more elaborate concept of version
graph when we consider how to generalize this strategy to handle snapshots
of snapshots.

> So, when we are trying to find a chunk, we need to search within the
> run length encoding to find the correct neighbor. My previous
> description walked the wrong direction along the version graph and
> failed to fully consider chunk sharing. Another try:
>
> map_offset:
> - The methodology of choice is exceptions from the origin, so the origin
> "snapshot" always maps to the origin volume. Otherwise
> - find a set of exceptions for the address, as is currently done
> (each exception will be a pair: snapshot tag, chunk offset)

I think you just want to observe in your preamble that the algorithm acts
strictly locally to the physical entry list. Finding the physical entry
list is not part of your algorithm.

> - Using the data in the super block,

...what I call the version graph...

> we transform the set of exception
> into an RLE encoded version graph for the offset.

Actually, it's just a "run" encoding, not a "run length" encoding because
we don't encode the length explicitly. How about we call the range of
snapshots a particular exception covers a "span"? Then an exception list
contains some number of spans, where each span gives the tag of the
last snapshot in the span to which the physical block of the span belongs.

> Because we limit
> branching, it condenses to a list of exceptions, where chunks are
> shared among all snapshots between exceptions. For example an
> exception at snapshot n for chunk c plus an exception at snapshot m
> for chunk k means that snapshots n+1 to m share chunk k.

...like that.

> - So to find the chunk for the passed snapshot, we search the exception
> list for the first exception tag that is greater than the passed
> snapshot tag.

I think you want to say "follows" here instead of "greater" to capture the
idea that the ordering is given by mapping through the snapshot graph, not
the literal snapshot tag.

> - If there is no exception, then the snapshot is sharing with the
> origin.

...and is only ever the case when the logical chunk has never been written.

> Uniqueness -- The basic idea to determine if the version graph is unique
> for an offset is to determine if the RLE encoding of exceptions is doing
> any compression for the given version node.
> - The origin is unique if the last snapshot has an exception.
> - A snapshot is unique if both
> 1) It has an exception and
> 2) The previous snapshot has an exception.

Terminology skew aside, that seems correct and concise.

> make_unique_offset:
> - In both cases (origin & snapshot) there may be multiple exceptions with
> the same chunk, so we must scan all exceptions to ensure that the
> chunk is unique. If it is not, then we must copy the chunk.

Which would require we store the same physical block number more than once
in the same exception list, making the data structure larger for writes to
snapshots (a common case for some users) and also requiring the extra step of
searching for other physical pointers having the same value as the one
just found. I think we can improve this.

> Additionally, we have to update the version graph structure as below.
> - If the branch of interest is the origin then we need to make a new
> exception with the tag of the newest snapshot unless there are no
> snapshots or an exception already exists with the newest snapshot tag.

Yes, we only need to consider the last snapshot in the list to determine
origin uniqueness. That makes it kind of tempting to store the exception
list in reverse order, doesn't it? Just a thought.

> - There are two conditions to make a snapshot chunk unique.
> 1) There must be an exception with the same snapshot tag as the
> snapshot of interest. If it doesn't exist, then we create it with
> a copy of the chunk in the next exception, or of the origin if
> there is no next exception.
> 2) There must be an exception with a snapshot tag of the immediately
> older snapshot unless this is the first snapshot. If we need to
> create the exception we use the chunk that map for this snapshot
> would return.

Yes, except that it relies on the multiple representation of the same
physical block address.

> - If an exception for the newest tag already exists, we have to search
> all of the exception chunks to see if the origin chunk is shared.

Why?

> delete:
> - If the version graph is not unique for the snapshot (i.e. there is not
> an exception for this snapshot or there is not an exception for the
> previous snapshot), then we are done.

Right.

> Otherwise,
> - We can delete the exception for this snapshot.
> - We should then consider the exception for the previous snapshot and
> the next exception. If they map the same chunk, we can delete the
> exception for the previous snapshot.

A step we may be able to omit if we find a nice representation that does
not require "splitting" a span when a snapshot is written (that is, does
not encode the same physical address more than once).

> - We have to examine all of the exceptions to see if the chunk was
> unique. If so, we can free it.
>
> Note: The RLE version graph starts out maximally shared in terms of
> chunks and we only copy a chunk when it will change, so we never
> unnecessarily copy a chunk.

So this iteration of the proposal produced a proposal that could actually
work, but we can do a little better. We should put our thinking caps on
and see what we come up with. For example, we could reserve a bit in each
physical pointer to mean that the physical pointer only applies to the
snapshot mentioned, not an entire span.

We also need to consider the generalization to representing snapshots of
snapshots, though I would not let that get in the way of finishing up
the analysis work for the simpler case.

Regards,

Daniel

Daniel Phillips

Jun 12, 2008, 5:14:29 PM
to Steve VanDeBogart, zuma...@googlegroups.com
Steve VanDeBogart wrote:
>> First, storing the serial number in the leaves and then having to deal
>> with wrapping and compaction issues is a bummer. If we introduce a
>> level of indirection, we can make those problems become much easier to
>> deal with: we can store a snapshot id in the leaf, and then store a
>> table in the superblock ordering the snapshot ids. This means that
>> wrapping and compaction can be done solely in the superblock. It turns
>> out that this information is already present in the superblock,
>> snaplist[].tag is a unique id for each snapshot and snaplist[].ctime
>> gives us ordering.

We don't need 32 bits for the internal snapshot id. Sixteen bits would
give a limit of 65536 snapshots, with the remaining 48 bits of a 64 bit
field limiting the snapshot store size to an even exabyte for 4K blocks,
while 32 bits for the internal id would leave us with a 16 TB limit, not
big enough. So we want to map the snapshot tag to an internal id.

It is attractive to end up with a fixed-size field for the exception
span because that means we can leave the leaf editing code almost
exactly as it is. This is hard-to-write, hard-to-debug code that has
stood the test of time, so I would rather leave it alone given the
choice. Ending up with a single 64 bit field per snapshot store data
block would be a 50% compression vs the two 64 bit fields we have
now, and that is good enough for me for now. I don't think it is worth
the effort or interface impact to go further and define "small" and
"large" size snapshot store types. I would like to keep the current
effort as simple as possible, mainly due to allergy to debugging subtle
corner cases. The fewer variables in a data design, the fewer the
corner cases.
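
For concreteness, the single 64 bit field could be packed something like this
(just a sketch; which end holds the id is arbitrary and the names are made up):

#include <stdint.h>

#define CHUNK_BITS  48
#define CHUNK_MASK  ((1ULL << CHUNK_BITS) - 1)   /* 2^48 chunks * 4K = 1 exabyte */

static inline uint64_t pack_exception(uint16_t snapid, uint64_t chunk)
{
    return ((uint64_t)snapid << CHUNK_BITS) | (chunk & CHUNK_MASK);
}

static inline uint16_t exception_snapid(uint64_t field)
{
    return field >> CHUNK_BITS;
}

static inline uint64_t exception_chunk(uint64_t field)
{
    return field & CHUNK_MASK;
}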

I am deeply averse to relying on timestamps for anything important, and
it is not necessary. We represent the snapshot ordering by the position
in our snapshot graph. If we do not support snapshots of snapshots then
our graph is simply the snapshot list from the superblock, which is
already ordered by creation time.

At this point I am pretty sure we can define a leaf format that supports
snapshots of snapshots even if we do not initially support that.

To turn our superblock snapshot list into a snapshot tree we just need
to add a parent snapshot id field. Thus, we are in a position to make
our snapshot store format entirely "recursion ready", while having the
luxury of being able to concentrate on the simpler algorithms for the
nonrecursive case for the time being.
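
Something like this for the superblock list entry (sketch only, not the
current snaplist layout):

#include <stdint.h>

struct snap_entry {
    uint32_t tag;      /* user-visible snapshot tag */
    uint16_t snapid;   /* compact internal id stored in btree leaves */
    uint16_t parent;   /* internal id of the parent snapshot; always the
                          origin until snapshots of snapshots are supported */
};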

Regards,

Daniel


Steve VanDeBogart

Jun 13, 2008, 3:32:43 PM
to Daniel Phillips, zuma...@googlegroups.com
On Mon, 9 Jun 2008, Daniel Phillips wrote:
> Steve VanDeBogart wrote:
>> - If there is no exception, then the snapshot is sharing with the
>> origin.
>
> ...and is only ever the case when the logical chunk has never been written.

Almost. This is also the case when the logical chunk has not been written
since the last time there were no snapshots. i.e. writes before snapshots
exist, or when all snapshots have been deleted.

>> - If the branch of interest is the origin then we need to make a new
>> exception with the tag of the newest snapshot unless there are no
>> snapshots or an exception already exists with the newest snapshot tag.
>
> Yes, we only need to consider the last snapshot in the list to determine
> origin uniqueness. That makes it kind of tempting to store the exception
> list in reverse order, doesn't it? Just a thought.

No, as proposed, we have to consider all exceptions because one of them
may point to the origin. Consider the case when three snapshots are
created and they all point to the origin. Then someone writes to the
last snapshot. There will be an exception for the last snapshot, but
the origin is not unique.

This is a complication due to repeating physical chunks in the exception
list. As you have suggested, maybe we can get away from doing this.
Even if we can, the table of exception lists at the beginning of the
leaf makes it easy to access the last exception in the exception list,
so I don't see any benefit (or harm) to storing them in reverse order.

>> - If an exception for the newest tag already exists, we have to search
>> all of the exception chunks to see if the origin chunk is shared.
>
> Why?

This is a similar case to the one I described above. Consider three
snapshots with a write to the origin. Now the three snapshots share the
exception. If the second to last snapshot is written, then the last
snapshot will be unique in terms of the exception list, but it shares a
chunk with the first snapshot.

> So this iteration of the proposal produced a proposal that could actually
> work, but we can do a little better. We should put our thinking caps on
> and see what we come up with. For example, we could reserve a bit in each
> physical pointer to mean that the physical pointer only applies to the
> snapshot mentioned, not an entire span.

The things that I've pointed out as different than you expected have to
do with physical chunks shared among exceptions. I'm not sure if that's
because it's complicated, or because you prefer a system where physical
chunks aren't shared among exceptions.

Indicating written snapshots in a different way as you've suggested
seems like it removes the shared chunk complication, but introduces the
separate complication of written vs non-written snapshots. I have not
yet convinced myself that tagging written snapshots always works out,
but it seems that it probably does.

--
Steve

Steve VanDeBogart

Jun 13, 2008, 3:45:37 PM
to Daniel Phillips, zuma...@googlegroups.com
On Thu, 12 Jun 2008, Daniel Phillips wrote:

> choice. Ending up with a single 64 bit field per snapshot store data
> block would be a 50% compression vs the two 64 bit fields we have
> now, and that is good enough for me for now. I don't think it is worth
> the effort or interface impact to go further and define "small" and
> "large" size snapshot store types. I would like to keep the current
> effort as simple as possible, mainly due to allergy to debugging subtle
> corner cases. The fewer variables in a data design, the fewer the
> corner cases.

Supporting two exception types, "small" and "large", could yield another
factor of two in the common case. But changing that part of the format
is a logically different change from converting to version trees instead
of share masks. For that reason, my effort to implement version trees
has left the physical exception format alone. Once the kinks in the
version tree change have been worked out, a small incremental change
to the physical exception layout should be easy to implement
and much easier to debug vs changing it all at once.

> I am deeply averse to relying on timestamps for anything important, and
> it is not necessary. We represent the snapshot ordering by the position
> in our snapshot graph. If we do not support snapshots of snapshots then
> our graph is simply the snapshot list from the superblock, which is
> already ordered by creation time.

Agreed. What I wrote before with regard to snapshot versions was simply a
description of the concept and not necessarily meant to be implemented
directly. My implementation effort is using the ordering of the
snapshot list and has, for the moment, repurposed the u8 bit field to
hold the snapshot id.

--
Steve

Steve VanDeBogart

Jun 17, 2008, 8:33:49 PM
to Daniel Phillips, zuma...@googlegroups.com
On Mon, 9 Jun 2008, Daniel Phillips wrote:
> So this iteration of the proposal produced a proposal that could actually
> work, but we can do a little better. We should put our thinking caps on
> and see what we come up with. For example, we could reserve a bit in each
> physical pointer to mean that the physical pointer only applies to the
> snapshot mentioned, not an entire span.
>
> We also need to consider the generalization to representing snapshots of
> snapshots, though I would not let that get in the way of finishing up
> the analysis work for the simpler case.

When trying to make an exception unique (i.e. for writing a snapshot),
the current proposal suggests splitting the exception span that the
snapshot currently resides in and creating a sandwich with the existing
chunk surrounding the newly allocated chunk. For example if snapshot 8
is between the exception for 1 and the exception for 10, then we
introduce two new exceptions, one at 7 with the same chunk as exception
10, and one at 8 with a new chunk.

This has some inelegances and the potential to consume more space than
necessary. An alternate proposal is to mark exceptions as having
an explicit span length of one. If the exception for 8 was marked in
this way, then the exception would only be considered when dealing with
snapshot 8. This always costs a bit in each exception. Another way
of thinking about this is that it cuts the id space for exceptions in
half where one half are multi-snapshot spans, and the other half are
single-snapshot spans. Instead, we can just store two ids in the
superblock for each snapshot. One for exceptions with a span of one,
and one for exceptions with a larger span. In terms of the version
graph these two ids are equal. In terms of uniqueness, the previous
rules apply for multi-exception spans, whereas single-exception spans
are always unique. This seems to work in general.
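
A sketch of the two-id bookkeeping (illustrative only; the 16 bit width just
follows the earlier internal id discussion, and all names are made up):

#include <stdint.h>

struct snap_ids {
    uint16_t span_id;   /* id used by exceptions covering a span of snapshots */
    uint16_t excl_id;   /* id used by exceptions that apply to this snapshot only */
};

/* Map the internal id found in an exception back to a snapshot list index,
 * reporting whether the exception is exclusive (span of one, always unique).
 * Returns -1 if the id no longer appears in the superblock list. */
static int lookup_exception_id(const struct snap_ids *snaplist, int nsnaps,
                               uint16_t id, int *exclusive)
{
    int i;

    for (i = 0; i < nsnaps; i++) {
        if (id == snaplist[i].span_id) {
            *exclusive = 0;
            return i;
        }
        if (id == snaplist[i].excl_id) {
            *exclusive = 1;
            return i;
        }
    }
    return -1;
}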

The other issue to consider is snapshots of snapshots. A simple
approach (that has higher space overhead than necessary) is to simply
apply the current algorithm recursively on snapshots. Doing this may
have extra overhead in some cases because the "head" of multiple
branches may be shared and writes to one of those branches require
more book-keeping to keep it straight. (This is not so for the
origin.) The simplest example I can come up with that demonstrates
this is: snapshot 1 of the origin, snapshot 2 of snapshot 1,
snapshot 3 of snapshot 1, snapshot 4 of snapshot 2. Writing snapshot 2
requires adding multiple exceptions with a shared chunk.

Effectively snapshots of snapshots are "redirect on write" and
snapshots of origin are "preserve on write." Consequently, changing
the direction that exceptions apply for the snapshot of snapshot case
will remove the inefficiency. For example, if snapshots 2-8 are of
snapshot 1, and a chunk has exceptions 3 and 6, snapshots 2 and 3
refer to whatever snapshot 1 would refer to in the origin, snapshots
4-6 refer to the chunk in exception 3, and snapshots 7 and 8 refer to
the chunk in exception 6.

From the other direction, when writing snapshot A, if there has been
a snapshot of A since the last write (last exception), we add an
exception and modify the chunk in the new exception.

This approach seems to work, but I am not completely sure of it.

--
Steve

Steve VanDeBogart

Jun 17, 2008, 8:47:08 PM
to zuma...@googlegroups.com, Daniel Phillips
On Mon, 9 Jun 2008, Daniel Phillips wrote:
> So this iteration of the proposal produced a proposal that could actually
> work, but we can do a little better. We should put our thinking caps on
> and see what we come up with. For example, we could reserve a bit in each
> physical pointer to mean that the physical pointer only applies to the
> snapshot mentioned, not an entire span.

In the spirit of release early and often... attached is a patch that
uses a version graph to support snapshots and writable snapshots
(but not snapshots of snapshots). It does not use the bit per
exception / two ids per snapshot optimization to reduce the number of
exceptions. Also, delete is not correct in this patch, but it will
change with the optimization.

Also, attached is a mini testsuite for snapshots/exceptions. This
patch passes all but the delete tests.

I plan to update the patch to use the two id per snapshot optimization
and then to use a smaller exception format.

--
Steve

almost-vergraph.patch
snaptest.sh

Dan Kegel

Jun 17, 2008, 8:54:47 PM
to Steve VanDeBogart, zuma...@googlegroups.com, Daniel Phillips
On Tue, Jun 17, 2008 at 5:47 PM, Steve VanDeBogart
<vandebo-...@nerdbox.net> wrote:
> Also, attached is a mini testsuite for snapshots/exceptions. This
> patch passes all but the delete tests.

Cool! Can you get that into our automated test suite?

Steve VanDeBogart

Jul 1, 2008, 3:52:34 PM
to zuma...@googlegroups.com
On Tue, 17 Jun 2008, Steve VanDeBogart wrote:

> In the spirit of release early and often... attached is a patch that
> uses a version graph to support snapshots and writable snapshots
> (but not snapshots of snapshots). It does not use the bit per
> exception / two ids per snapshot optimization to reduce the number of
> exceptions. Also, delete is not correct in this patch, but it will
> change with the optimization.
>

> I plan to update the patch to use the two id per snapshot optimization
> and then to use a smaller exception format.

Attached is an updated patch which uses two ids per snapshot in order
to support exclusive exceptions - exceptions that only apply to
a single snapshot. It also uses 8 bytes per exception in the leaf.
I also removed the get/set exception abstraction which Daniel Phillips
did not like.

I have a patch that uses 4 or 8 bytes per exception, depending on the
magnitude of the chunk it needs to store. However, after I did the
work, I did the math and realized that only chunks within the first 16G
of the snapstore would be able to take advantage of the 4 byte format,
saving at most 16 megs of btree space.

The attached patch passes all of the tests that stock passes.
Unfortunately, it also seems to run three times slower than the tree
head. I will be looking into that a bit now.

--
Steve

versioned-origin.diff

Jiaying Zhang

Jul 7, 2008, 3:09:53 AM
to Steve VanDeBogart, zuma...@googlegroups.com

I had a quick look at your patch. It looks very promising. I am thinking maybe we want to
schedule a formal code review for your work since it touches many major parts of ddsnap code.
(Hopefully, I will be back then).
 

> Unfortunately, it also seems to run three times slower than the tree
> head. I will be looking into that a bit now.

What does 'tree head' refer to? Any progress on it?

Jiaying

Steve VanDeBogart

Jul 7, 2008, 1:45:17 PM
to Jiaying Zhang, zuma...@googlegroups.com

I meant svn head. -- It turns out that an independent benchmark script
shows that there is little difference between the two versions. I
suspect that the script in benchmarks/nsnaps has some hidden brokenness.

--
Steve

Daniel Phillips

Jul 7, 2008, 6:19:50 PM
to Steve VanDeBogart, zuma...@googlegroups.com
Hi Steve,

The patch looks pretty good; you have done a nice job of picking up the local
style whether or not it is your favorite :-)

This patch is less heavily abstracted than earlier versions, and is stronger
for that in my opinion. There is the occasional wart remaining. You define
the symbol ORIGIN_SNAPID to let you overload the internal snapid so it can
also specify the origin. But why not write a wrapper function instead, like
is_origin(snapid)? Then you would not need to be passing that extra status
variable around where you already pass the snapid.

Cutting down the number of parameters in an interface is one of the biggest
things you can do to improve readability and maintainability. Part of this
is taking care to avoid ever passing redundant information.

Regards,

Daniel

Steve VanDeBogart

Jul 7, 2008, 7:47:38 PM
to Daniel Phillips, zuma...@googlegroups.com
On Mon, 7 Jul 2008, Daniel Phillips wrote:
> The patch looks pretty good, you have done a nice job of picking up the local
> style whether or not it is your favorite :-)
>
> This patch is less heavily abstracted than earlier versions, and is stronger
> for that in my opinion. There is the occasional wart remaining. You define
> the symbol ORIGIN_SNAPID to let you overload the internal snapid so it can
> also specify the origin. But why not write a wrapper function instead, like
> is_origin(snapid)? Then you would not need to be passing that extra status
> variable around where you already pass the snapid.

As far as I see, none of my changes add an extra status variable of
origin vs. non-origin. I view the introduction of ORIGIN_SNAPID as a
cleanup; previously a literal -1 was used to indicate the origin.
is_origin() is a further improvement that could help
readability.

> Cutting down the number of parameters in an interface is one of the biggest
> things you can do to improve readability and maintainability. Part of this
> is taking care to avoid ever passing redundant information.

Functions that take a large number of arguments suggest that the
function division may be suboptimal. As far as I can tell, I am not
passing arguments that could be derived from other arguments (redundant
information).

Where I have increased the number of arguments a function takes, it is by
one or two at the most. This is due to a need for additional
information in the lower level functions. I have tried to increase the
readability by using more descriptive (though slightly longer) parameter
names.

I would be interested in the code that triggered your comments, because
I generally do not see it.

--
Steve

Jiaying Zhang

Jul 7, 2008, 9:23:27 PM
to Steve VanDeBogart, zuma...@googlegroups.com

Have you looked at the metadata usage with the nsnaps benchmark?
It is really strange that your prototype runs three times slower with the nsnaps benchmark.
I am not aware of any brokenness in the benchmark. The script is quite simple. You can
have a look at it and see where the problem comes from.

Jiaying

Jiaying Zhang

Jul 8, 2008, 12:55:48 AM
to Steve VanDeBogart, zuma...@googlegroups.com
btw, did you turn off debug printing during the benchmark? The nsnaps benchmark
outputs the server log in /tmp/server.log by default. Does it say anything interesting?
I think with the nsnaps benchmark, your prototype should have comparable performance
with the current code. But you can put 'ddsnap status' after run_test (line 191) and
see if metadata usage drops.

Jiaying

Steve VanDeBogart

Jul 8, 2008, 8:03:31 PM
to Jiaying Zhang, zuma...@googlegroups.com

I just confirmed that the nsnaps.sh script is inaccurate. If you simply
insert a date command before the tar starts and another after the umount
finishes, you will see that a large amount of time is unaccounted for.
Nsnaps reported a time of ~42 and ~45s for the 1st and 2nd snapshot, whereas
the date output shows that it was in fact ~62 and ~63s respectively. It
also does not correctly accumulate data from multiple runs.

However, the initial factor of three that I saw may have been due to
operator error, as I am not able to get nsnaps to reproduce it now.

--
Steve

Jiaying Zhang

Jul 8, 2008, 10:06:32 PM
to Steve VanDeBogart, zuma...@googlegroups.com
On 7/8/08, Steve VanDeBogart <vandebo-...@nerdbox.net> wrote:
> On Mon, 7 Jul 2008, Jiaying Zhang wrote:
>> On 7/7/08, Steve VanDeBogart <vandebo-...@nerdbox.net> wrote:
>>> I meant svn head. -- It turns out that an independent benchmark script
>>> shows that there is little difference between the two versions. I
>>> suspect that the script in benchmarks/nsnaps has some hidden brokenness.
>>
>> Have you looked at the metadata usage with the nsnaps benchmark?
>> It is really strange that your prototype runs three times slower with the
>> nsnaps benchmark.
>> I am not aware of any brokenness in the benchmark. The script is quite
>> simple. You can have a look at it and see where the problem comes from.
>
> I just confirmed that the nsnaps.sh script is inaccurate. If you simply
> insert a date command before the tar starts and another after the umount
> finishes, you will see that a large amount of time is unaccounted for.

Do you mean the umount of the TAR device or the ORIGIN device? You might not
take into account the sync time if you put a date right after umounting the
TAR device.

> Nsnaps reported a time of ~42 and ~45s for the 1st and 2nd snapshot,
> whereas the date output shows that it was in fact ~62 and ~63s
> respectively. It also does not correctly accumulate data from multiple
> runs.

I saw some variance before, but not this much. Are you using the 2.6.24.2 kernel?
You may want to try the 2.6.22.18 kernel or enable bio_throttling and see if the
performance is more consistent. Currently, nsnaps reports performance from
a single run only. You are welcome to extend that to report the average from
multiple runs.

> However, the initial factor of three that I saw may have been due to
> operator error, as I am not able to get nsnaps to reproduce it now.

Good to hear that. Did you see the drop in metadata memory usage, i.e.,
the last line of 'ddsnap status' output?

Jiaying