Question about how it splits large files...

Nathan

Jul 13, 2012, 2:19:33 PM

to bup-...@googlegroups.com

Hi, I am curious how bup splits large files when a portion near the beginning may change and/or grow or shrink while most of the file stays the same.

An example would be a database dump where most of the dump is unchanged, but the first table dumped has a few rows added. I am worried that bup will see the first portion as the same but generate new hashes for the rest of the file because the beginning changed.

To explain my concern in a visual representation...

If I am backing up a file like:

1234567890

and bup splits it after every second character (the pipes mark the chunks that would be hashed):

12|34|56|78|90

and then the file has a 2 inserted after the first 2:

12|23|45|67|89|0

it will generate a new hash for every section except the first.
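The concern above can be reproduced with a minimal sketch of fixed-size chunking. This is only an illustration of the scenario in the question, not how bup works; the helper name `fixed_chunks` and the two-byte chunk size are made up for the example.

```python
def fixed_chunks(data: bytes, size: int = 2) -> list[bytes]:
    """Cut data into pieces of exactly `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


before = fixed_chunks(b"1234567890")
after = fixed_chunks(b"12234567890")  # a "2" inserted after the first "2"

# Chunks of the new file that were already present in the old one.
known = set(before)
reused = [chunk for chunk in after if chunk in known]

print(before)  # [b'12', b'34', b'56', b'78', b'90']
print(after)   # [b'12', b'23', b'45', b'67', b'89', b'0']
print(reused)  # [b'12']
```

With fixed offsets, one inserted byte shifts every later boundary, so only the chunk before the insertion is reused.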

My question is whether bup uses some sort of "look for me" text as markers, acting as checkpoints where a new hashed section begins. Similar to how Git/SVN split on (\n|\r)+

I am sorry if I did not explain my question well; if I need to elaborate on a specific section, let me know.

Thanks,

-Nathan

Zandr Milewski

Jul 13, 2012, 2:22:32 PM

to bup-...@googlegroups.com

On 7/13/12 11:19 , Nathan wrote:
> Hi, I am curious how bup splits large files when a portion near the
> beginning may change and/or grow or shrink while most of the file
> stays the same.

See the "Handling Large Files" section of the design document:

https://github.com/apenwarr/bup/blob/master/DESIGN

A detailed explanation of the mechanism bup uses, called
"hashsplitting", starts at line 121.
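As a rough illustration of the general idea behind content-defined splitting, here is a simplified sketch. It is not bup's code: the plain byte-sum checksum, the `WINDOW` and `MASK` values, and the function name `split_by_content` are all assumptions chosen for the example, and the DESIGN file remains the authoritative description. The point is only that a boundary depends on the last few bytes seen, not on the offset in the file.

```python
import hashlib
import random

WINDOW = 16          # bytes covered by the rolling sum (illustrative value)
MASK = (1 << 6) - 1  # split when the low 6 bits are all ones (illustrative)


def split_by_content(data: bytes) -> list[bytes]:
    """Cut wherever a rolling sum of the last WINDOW bytes matches MASK."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte leaving the window
        if (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks


def digest(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
modified = original[:100] + b"extra" + original[100:]  # insert near the start

stored = {digest(c) for c in split_by_content(original)}
new_chunks = split_by_content(modified)
unseen = [c for c in new_chunks if digest(c) not in stored]

print(len(new_chunks), "chunks,", len(unseen), "not already stored")
```

Because each split decision looks only at the preceding `WINDOW` bytes, the boundaries line up with the old ones again shortly after the inserted bytes, and only the chunks around the insertion need to be stored anew.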

Rob Browning

Jul 13, 2012, 2:26:21 PM

to Zandr Milewski, bup-...@googlegroups.com

Zandr Milewski <za...@mozilla.com> writes:

> See the "Handling Large Files" section of the design document:
>
> https://github.com/apenwarr/bup/blob/master/DESIGN
>
> A detailed explanation of the mechanism bup uses, called
> "hashsplitting", starts at line 121.

Right, that's definitely worth a read, and in this case, the short
answer is bup should "do the right thing", but of course, if it's
critical to you -- test.

Thanks
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Nathan

Jul 13, 2012, 2:53:17 PM

to bup-...@googlegroups.com

Thank you, this explained exactly what I was looking for.... I wanted an in-depth answer and apparently only read the README not the DESIGN.

Thanks!

Gabriel Filion

Jul 14, 2012, 4:35:42 PM

to Nathan, bup-...@googlegroups.com, za...@mozilla.com, Rob Browning

[brought back the CC list. Nathan: please use "reply all" on this list:
it ensures that people not subscribed to it can receive answers]

On 12-07-13 02:53 PM, Nathan wrote:
> Thank you, this explained exactly what I was looking for.... I wanted an
> in-depth answer and apparently only read the README not the DESIGN.

we should add a quick note about the DESIGN file in the README to say
that there's more details there.

--
Gabriel Filion

Rob Browning

Jul 27, 2012, 1:35:23 PM

to Gabriel Filion, Nathan, bup-...@googlegroups.com, za...@mozilla.com

Gabriel Filion <lel...@gmail.com> writes:

> we should add a quick note about the DESIGN file in the README to say
> that there's more details there.

Fixed in tmp/pu/master.

Gabriel Filion

Jul 27, 2012, 4:06:45 PM

to Rob Browning, Nathan, bup-...@googlegroups.com, za...@mozilla.com

On 12-07-27 01:35 PM, Rob Browning wrote:
> Gabriel Filion <lel...@gmail.com> writes:
>
>> we should add a quick note about the DESIGN file in the README to say
>> that there's more details there.
>
> Fixed in tmp/pu/master.

nice, thanks.

--
Gabriel Filion
