Incremental updates and better deduplication


James Pharaoh

Jan 17, 2016, 11:34:27 AM
to ZBackup general discussion
Hi,

I have a couple of proposals for improvements for zbackup, and the resources to make it happen myself if necessary.

Some brief background: I discovered this tool the other day while brainstorming what a good solution for my backups would be, and when I checked whether it already existed, it popped straight up. It works great, and will easily serve most of my purposes for a few years, but I think there are some optimizations which would be essential for it to meet some of my more complex requirements.

At the moment I'm doing a tar of the entire filesystem I want to back up, and piping this directly to zbackup. This works really well, and the deduplication is far, far better than I ever hoped for. It works extremely well for backups of the same filesystem over time, but I also have multiple versions of the same system, for example live/staging/test/dev versions of a project, and each one of these adds a considerable overhead to the backup storage requirements.

It seems to me that this is probably because tar stores file metadata right next to the file data. If I tar the same filesystem once and then again 24 hours later then, for the most part, the metadata will be the same, so large chunks can be deduplicated spanning many files and their contents. The different environments, however, are likely to have different timestamps, and so the only opportunity for deduplication would be inside files and therefore, of course, only in large files.

It seems that a fairly simple modification to the tar format would alleviate this, by storing all the file metadata in one chunk and the file data in another. In fact, I think I could write a filter, designed to be used in a pipeline, to do this very easily. So my first question is, does this make sense, and is there existing work on these lines? I will probably attempt to implement this, but would appreciate some feedback first.
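To make the idea concrete, here is a minimal sketch in Python of such a filter. It assumes a plain ustar archive, where every entry is a 512-byte header followed by data padded to a multiple of 512 bytes. The names `split_tar`, `join_tar` and `make_tar` are illustrative only, and the sketch ignores pax/GNU extension records and base-256 size fields, which a real filter would have to handle.

```python
import io
import tarfile

BLOCK = 512  # tar stores headers and file data in 512-byte blocks


def _data_length(header: bytes) -> int:
    """Padded length of the data that follows a header (0 for a zero block)."""
    if header == bytes(BLOCK):  # end-of-archive marker or trailing padding
        return 0
    # The size field is 12 bytes at offset 124, encoded as octal ASCII.
    size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
    return (size + BLOCK - 1) // BLOCK * BLOCK


def split_tar(raw: bytes) -> tuple[bytes, bytes]:
    """Split an archive into (all headers, all file data)."""
    headers, data, pos = bytearray(), bytearray(), 0
    while pos + BLOCK <= len(raw):
        header = raw[pos:pos + BLOCK]
        headers += header
        pos += BLOCK
        padded = _data_length(header)
        data += raw[pos:pos + padded]
        pos += padded
    return bytes(headers), bytes(data)


def join_tar(headers: bytes, data: bytes) -> bytes:
    """Inverse of split_tar: interleave headers and data again."""
    out, dpos = bytearray(), 0
    for hpos in range(0, len(headers), BLOCK):
        header = headers[hpos:hpos + BLOCK]
        out += header
        padded = _data_length(header)
        out += data[dpos:dpos + padded]
        dpos += padded
    return bytes(out)


def make_tar(files: dict[str, bytes], mtime: int) -> bytes:
    """Build a small ustar archive in memory (helper for the demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()
```

With two archives of the same files that differ only in timestamps, the data streams come out byte-identical while only the (much smaller) header streams differ, which is the property a deduplicator can exploit.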

The other issue is that I have to create a tar of the entire filesystem every time I do a backup, which will be daily in my case. Most of the data I back up is fairly small, say 2 to 200 gigs, and is stored on very powerful servers with lots of memory and fast SSD storage. I also manage large RAID storage servers, with many terabytes of data, on relatively cheap spinning disks, and it is not feasible to create a tar of all of this data with an acceptable frequency.

I am almost exclusively using BTRFS as a filesystem, which allows me to take consistent snapshots to feed into zbackup, which is extremely useful. This would also allow me to efficiently generate an incremental update, for example using the "btrfs send" function, or by accessing the BTRFS tree data in the same way that this command does. I could pipe these incremental updates to zbackup, but that seems much less clean than having an intact and independent tar for every backup, which is, in my opinion, one of the main strengths of zbackup.

It doesn't seem like it would be too difficult to use BTRFS's ability to generate an incremental update efficiently to create some kind of patch which would convert one tar into another. I am fairly sure I could code this, but the interface to zbackup presently only works with a full copy of the data.

My proposal would be to add an alternative API to zbackup, which would allow me to specify an existing backup, and provide a diff to it. This would presumably consist of a series of blocks of two types: a reference to a byte range in the original backup, and a block of data which forms part of the differential update. It would probably make sense to allow references to several backups, and include the name of the reference in the block, so that a more sophisticated algorithm could be supported in the future.
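As a rough illustration of what such a diff could look like, here is a sketch in Python. The `CopyRange` and `Literal` types and the `apply_diff` function are hypothetical names for this example, not part of zbackup, and backups are modelled as in-memory byte strings purely for brevity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CopyRange:
    """Reference to a byte range in an existing backup."""
    backup: str  # name of the referenced backup
    offset: int
    length: int


@dataclass
class Literal:
    """New data that forms part of the differential update."""
    data: bytes


Block = Union[CopyRange, Literal]


def apply_diff(blocks: list[Block], backups: dict[str, bytes]) -> bytes:
    """Reconstruct the full new backup from a list of diff blocks."""
    out = bytearray()
    for block in blocks:
        if isinstance(block, CopyRange):
            source = backups[block.backup]
            end = block.offset + block.length
            if block.offset < 0 or block.length < 0 or end > len(source):
                raise ValueError(f"range outside backup {block.backup!r}")
            out += source[block.offset:end]
        else:
            out += block.data
    return bytes(out)
```

Because each `CopyRange` carries the name of the backup it refers to, one diff can draw on several earlier backups.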

So, my second question is: does this make sense, has anyone else considered anything like this, is this something the zbackup developer(s) and/or community would be interested in?

I'm prepared to do a lot of this work myself, either directly or paying for it to be done, so really this is just a request for comments and feedback.

James

Konstantin Isakov

Jan 22, 2016, 10:14:46 PM
to ZBackup general discussion
Hi James,

Yes, what you wrote makes perfect sense. My story is similar to yours, except that I did not find the right tool at the time and had to write one :) The idea was to implement only what I needed, and let others do the same afterwards. Right now there's only one person besides me (Vladimir) who has shown interest and had the time and skills to keep the development going. You're certainly welcome to join in with the features you want!

I'm going to address your two main points here, which are:

1. Use of tar and its deficiencies

Well, yes, tar sucks. Its main advantage is simply that it's a tool which is familiar to all of us, it works, and it's of production quality. However, its format is not very dedup-friendly: 1) it stores the full path name and other metadata right before the data, which effectively kills dedup for small files when the only difference between the older and the newer trees is the name of one of the upper-level path components (e.g., name1/foo/bar and name2/foo/bar), and 2) it doesn't have any index, as file entries are simply concatenated, so you can't get a particular file or sub-tree you want without unpacking everything before it (well, you could skip over file data, which would help when it's large, but that's hardly ideal). The lack of an index doesn't matter much right now, as zbackup doesn't allow seeking at the moment, but that's simply a lack of implementation, and once seeking is implemented (which the repo architecture perfectly permits), the lack of an index in tar would become a major downside.
The use of a filtering tool sounds like a nice approach, since it still allows one to use tar with all of its options. If all metadata is stored together, this would help with both deduplication and indexing. I have thought about that before, but never sat down to actually implement it, since I hardly have time to do things nowadays unless it's really crucial for my production. I would certainly welcome you to write such a tool - in fact, I would switch to it right away.

2. Referring to data in previous backups

Makes sense! I see two approaches there. The first one is the one you've just described: add a new instruction (see BackupInstruction in zbackup.proto) to refer to a range of bytes in a previous backup. While those are simple to emit, restoring such a backup would not be trivial. In fact, it would require implementing the seeking I mentioned previously. It would also break the current model, where any particular backup file, being simply a leaf, can be deleted without affecting the integrity of other backups; deleting one just creates some garbage (unreferenced chunks), which can be GCed later. While seeking is a desirable feature in its own right, the ability to delete any backup file is a very nice and fundamental property to hang on to.

So there's one other option. You have probably noticed that the stream of backup instructions produced during the backup process is itself deduplicated, and the result is deduplicated again, up to the point where doing so no longer makes the resulting data smaller. Using the higher-level deduplication results makes it easy to refer to huge ranges of bytes with a small number of instructions. So references to ranges of bytes in previous backups could be handled during the creation of a new backup by opening the respective backups, seeking within them, and extracting and copying the instructions used there directly into the new backup. The simplest approach would be to unpack the previous backup data down to the instructions for the backup itself, skip over them until the required offset is reached, emit the range of the following instructions as is to the new backup, and then simply allow the dedup logic to deduplicate those instructions afterwards, just the way it already happens now.

A more sophisticated approach could try using the higher-level data (the one deduping the dedup result), but that would only make the backup process faster, so it would make sense to try implementing the simplest approach first - it would probably work very nicely already. This would keep the repo format completely unchanged, still allow removing any previous backup file the way it works right now, and so on, so I believe that would probably be a very nice solution.
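The simplest approach can be sketched roughly as follows, in Python. This is only a model: the `Instruction` type with an explicit `length` field is a hypothetical stand-in, not the actual BackupInstruction schema, and how to treat instructions that straddle the edges of the requested range is an assumption of this sketch (here they are reported as gaps whose bytes would have to be supplied as new data).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    kind: str       # "chunk" (reference to a stored chunk) or "bytes" (literal)
    payload: bytes  # chunk id, or the literal bytes themselves
    length: int     # number of output bytes this instruction produces


def instructions_for_range(instructions, offset, length):
    """Select the instructions of a previous backup covering [offset, offset+length).

    Returns (head_gap, reused, tail_gap):
      reused   - instructions lying entirely inside the range, copied as is
      head_gap - (start, end) bytes before the first reused instruction
      tail_gap - (start, end) bytes after the last reused instruction
    """
    end = offset + length
    reused = []
    covered_start = covered_end = None
    pos = 0
    for ins in instructions:
        ins_end = pos + ins.length
        # Skip instructions until the offset is reached, then keep whole ones.
        if pos >= offset and ins_end <= end:
            if covered_start is None:
                covered_start = pos
            covered_end = ins_end
            reused.append(ins)
        pos = ins_end
    if offset < 0 or length < 0 or end > pos:
        raise ValueError("requested range lies outside the previous backup")
    if not reused:
        return (offset, end), [], (end, end)
    return (offset, covered_start), reused, (covered_end, end)
```

The reused instructions go into the new backup's own instruction stream, so the new backup refers only to chunks and never to the old backup file, which keeps every backup file independently deletable.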

So if you have time, feel free to try those things, submit pull requests and so on - you are most certainly welcome! I believe those features are of a very significant and universal value, and I for one could definitely use those myself :)

Konstantin

James Pharaoh

Jan 30, 2016, 7:32:53 AM
to zba...@googlegroups.com
Hi,

Thanks for your reply. I've actually gone ahead and implemented a filter
to reorder a tar file, and some tests show that it does make a
significant difference in some cases.

Here's the code, in the "tar-filter" directory:

https://github.com/wellbehavedsoftware

I found that it made a very small difference when storing copies of the
same filesystem over time, but that separate versions of the same
filesystem experience significant savings, 10% with two versions and 22%
with four.

Specifically, I have lots of lightweight containers, with full operating
system images, website source files, databases, etc. These are backed up
daily, and there are generally several versions of each one, for live,
staging, test, dev, etc.

I've designed this tool so that the tar file is recovered exactly as it
went in, and the extraction is not tar-specific, so filters could be
written for other file types in the future.

It's also designed so that an incremental backup could be created by
appending content onto an existing one. Obviously zbackup would happily
deduplicate the entire first file. The extraction would be identical for
the incremental file, since the list of blocks to copy out of the packed
file is stored at the end, so a new list can be appended to the old
file, along with any new content blocks.
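
The append-only property can be illustrated with a small Python sketch. The layout below (a JSON block list followed by an 8-byte length footer) and the names `append_version` and `unpack_latest` are invented for this illustration and are not the actual tar-filter format.

```python
import json
import struct

# Assumed layout: [content ...][block list as JSON][8-byte length of the list]
FOOTER = struct.Struct(">Q")


def append_version(packed: bytes, new_content: bytes,
                   block_list: list[tuple[int, int]]) -> bytes:
    """Append new content and a fresh block list; existing bytes are untouched.

    block_list holds (offset, length) pairs into the packed file, in the
    order in which they must be copied out to rebuild the original stream.
    """
    index = json.dumps(block_list).encode()
    return packed + new_content + index + FOOTER.pack(len(index))


def unpack_latest(packed: bytes) -> bytes:
    """Rebuild the most recent version using the block list stored at the end."""
    (index_len,) = FOOTER.unpack(packed[-FOOTER.size:])
    index_start = len(packed) - FOOTER.size - index_len
    block_list = json.loads(packed[index_start:index_start + index_len])
    return b"".join(packed[off:off + length] for off, length in block_list)
```

Since the old packed file is a byte-for-byte prefix of the new one, a deduplicating store only has to keep the appended content and the new list, and the same extraction routine works for both versions.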

The biggest issue is that, at the moment, the tool must have access to
the uncompressed pack file on disk in order to unpack it. This is not
necessarily a huge issue, since restoring a backup would presumably not
be a common task, but I would like to be able to compare them frequently.

I will probably look at some kind of random access, like you've
described, to work around this. A FUSE version would of course make this
work out-of-the-box. However, I'll probably have a go at reading the
repository directly in Rust.

For reference, here are the repository sizes, in megabytes, comparing
regular tar and my packed version. This test was performed for one of my
small production websites.

The series shows the total size after adding each image to the
repository. The numbers at the start are the dates, in January 2015, of
the images which were stored. The next two numbers are the sizes of the
repository, in megabytes, for regular tar and for packed tar. The percentage
at the end is the reduction in size.

live only:

30 only: 309 308 <1%
29 - 30: 315 314 <1%
28 - 30: 326 324 <1%
27 - 30: 348 346 <1%

live and staging:

30 only: 465 416 10%
29 - 30: 466 417 10%
28 - 30: 471 421 10%
27 - 30: 480 427 11%

live, staging and test:

30 only: 596 489 17%
29 - 30: 598 490 18%
28 - 30: 603 495 17%
27 - 30: 611 501 18%

live, staging, test and dev:

30 only: 727 563 22%
29 - 30: 728 564 22%
28 - 30: 731 566 22%
27 - 30: 736 568 22%
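
For anyone wanting to check the last column, the reduction is simply the saved size divided by the regular tar size. A short Python sketch using the final row of each series above (the function name `reduction_percent` is just for illustration):

```python
def reduction_percent(plain: int, packed: int) -> float:
    """Size reduction of the packed repository relative to regular tar."""
    return (plain - packed) / plain * 100


# Final row ("27 - 30") of each series: (regular tar, packed tar)
final_sizes = {
    "live only": (348, 346),
    "live and staging": (480, 427),
    "live, staging and test": (611, 501),
    "live, staging, test and dev": (736, 568),
}

for label, (plain, packed) in final_sizes.items():
    print(f"{label}: {reduction_percent(plain, packed):.1f}%")
```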

James
