Hi,
Thanks for your reply. I've gone ahead and implemented a filter to
reorder a tar file, and some tests show that it makes a significant
difference in some cases.
Here's the code, in the "tar-filter" directory:
https://github.com/wellbehavedsoftware
I found that it made very little difference when storing successive
copies of the same filesystem over time, but that storing separate
versions of the same filesystem gives significant savings: 10% with two
versions and 22% with four.
Specifically, I have lots of lightweight containers, with full operating
system images, website source files, databases, etc. These are backed up
daily, and there are generally several versions of each one, for live,
staging, test, dev, etc.
I've designed this tool so that the tar file is recovered exactly as it
went in, and the extraction is not tar-specific, so filters could be
written for other file types in the future.
It's also designed so that an incremental backup could be created by
appending content onto an existing one. Obviously zbackup would happily
deduplicate the entire first file. The extraction would be identical for
the incremental file, since the list of blocks to copy out of the packed
file is stored at the end, so a new list can be appended to the old
file, along with any new content blocks.
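To illustrate the append-only layout, here's a rough sketch in Python.
This is not the actual on-disk format of the tool: the fixed-size footer,
the JSON-encoded copy list, and the function names are all made up for
illustration, and it only reuses blocks referenced by the most recent
list.

```python
import io
import json
import struct

# Hypothetical footer: offset and length of the most recent copy list.
FOOTER = struct.Struct("<QQ")


def read_copy_list(pack):
    """Return the last copy list in the pack as [(offset, length), ...]."""
    size = pack.seek(0, io.SEEK_END)
    if size == 0:
        return []
    pack.seek(size - FOOTER.size)
    list_offset, list_length = FOOTER.unpack(pack.read(FOOTER.size))
    pack.seek(list_offset)
    return [tuple(entry) for entry in json.loads(pack.read(list_length))]


def append_version(pack, blocks):
    """Append a new version: unseen blocks, then its copy list and footer."""
    # Blocks referenced by the previous copy list can be reused in place.
    known = {}
    for offset, length in read_copy_list(pack):
        pack.seek(offset)
        known[pack.read(length)] = (offset, length)

    copy_list = []
    pack.seek(0, io.SEEK_END)
    for block in blocks:
        if block not in known:
            known[block] = (pack.tell(), len(block))
            pack.write(block)
        copy_list.append(known[block])

    # The new list goes at the end; nothing already in the file is rewritten.
    encoded = json.dumps(copy_list).encode()
    list_offset = pack.tell()
    pack.write(encoded)
    pack.write(FOOTER.pack(list_offset, len(encoded)))


def extract(pack):
    """Rebuild the original stream by copying blocks in copy-list order."""
    out = bytearray()
    for offset, length in read_copy_list(pack):
        pack.seek(offset)
        out += pack.read(length)
    return bytes(out)
```

Because a second call to append_version only ever writes past the end of
the existing data, the first file survives as an unchanged prefix of the
incremental one, and extract works the same way on both.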
The biggest issue is that, at the moment, the tool must have access to
the uncompressed pack file on disk in order to unpack it. This is not
necessarily a huge issue, since restoring a backup would presumably not
be a common task, but I would like to be able to compare them frequently.
I will probably look at some kind of random access, like you've
described, to work around this. A fuse version would of course make this
work out of the box. However, I'll probably have a go at reading the
repository directly in Rust.
For reference, here are the repository sizes, in megabytes, comparing
regular tar and my packed version. This test was performed for one of my
small production websites.
The series shows the total size after adding each image to the
repository. The numbers at the start are the dates, in January 2015, of
the images which were stored. The next two numbers are the sizes of the
repository, in megabytes, for regular tar and for packed tar. The
percentage at the end is the reduction in size.
live only:
30 only: 309 308 <1%
29 - 30: 315 314 <1%
28 - 30: 326 324 <1%
27 - 30: 348 346 <1%
live and staging:
30 only: 465 416 10%
29 - 30: 466 417 10%
28 - 30: 471 421 10%
27 - 30: 480 427 11%
live, staging and test:
30 only: 596 489 17%
29 - 30: 598 490 18%
28 - 30: 603 495 17%
27 - 30: 611 501 18%
live, staging, test and dev:
30 only: 727 563 22%
29 - 30: 728 564 22%
28 - 30: 731 566 22%
27 - 30: 736 568 22%
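The percentage is the saving relative to the regular tar repository. As
a minimal sketch in Python, assuming the figure is rounded down to a
whole percent (my assumption, though it agrees with the megabyte figures
above; the function names are just for illustration):

```python
def reduction_percent(plain_mb, packed_mb):
    """Saving of packed tar over regular tar, rounded down to a whole percent."""
    return (plain_mb - packed_mb) * 100 // plain_mb


def format_reduction(plain_mb, packed_mb):
    """Format the saving the way the table does, with '<1%' for tiny savings."""
    percent = reduction_percent(plain_mb, packed_mb)
    return "<1%" if percent < 1 else f"{percent}%"


print(format_reduction(309, 308))  # live only, 30 only
print(format_reduction(465, 416))  # live and staging, 30 only
print(format_reduction(727, 563))  # all four versions, 30 only
```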
James