Quoting Bob Glickstein (2020-08-11 11:00:33)
> as you pointed out the existing implementations are all slightly
> mutually incompatible, so what is there to preserve, exactly?
The difference between go4.org/rollsum's OnSplit() and bup's split
condition can be factored out into a choice of bitmask and expected
value easily enough, and computed as:

    digest() & mask == value

For go4.org/rollsum, you have:

    mask = (1 << 13) - 1
    value = 0

And for bup you have:

    mask = (1 << 13) - 1
    value = mask
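For illustration, here's a minimal Go sketch of that parameterized
check against go4.org/rollsum (the helper name onSplit and the random
demo input are mine, not from either project):

    package main

    import (
        "fmt"
        "math/rand"

        "go4.org/rollsum"
    )

    // onSplit reports whether a digest satisfies the split condition
    // for a given (mask, value) pair; both conventions above reduce
    // to this one comparison.
    func onSplit(digest, mask, value uint32) bool {
        return digest&mask == value
    }

    func main() {
        const mask = (1 << 13) - 1

        rng := rand.New(rand.NewSource(1))
        rs := rollsum.New()

        splits := 0
        for i := 0; i < 1<<20; i++ {
            rs.Roll(byte(rng.Intn(256)))
            if onSplit(rs.Digest(), mask, 0) { // value = 0, per go4.org/rollsum
                splits++
            }
        }
        // With a 13-bit mask we expect roughly 2^20 / 2^13 = 128
        // splits per MiB of random input.
        fmt.Println("splits in 1MiB:", splits)
    }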
I modified your benchmark (after applying the RNG PR I sent you) to
compute an `onsplit` this way, and the performance difference was
small: less than 3% vs. calling rs.OnSplit(). For most applications
this function is fast enough that its cost is going to be completely
swallowed by other overheads.
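A benchmark of that comparison might look something like this (a
sketch of the shape of the measurement, not the actual benchmark from
the thread):

    package rollsum_test

    import (
        "testing"

        "go4.org/rollsum"
    )

    // Baseline: the method as shipped.
    func BenchmarkOnSplit(b *testing.B) {
        rs := rollsum.New()
        splits := 0
        for i := 0; i < b.N; i++ {
            rs.Roll(byte(i))
            if rs.OnSplit() {
                splits++
            }
        }
        _ = splits
    }

    // The same check expressed as digest() & mask == value.
    func BenchmarkMaskValue(b *testing.B) {
        const mask = (1 << 13) - 1
        rs := rollsum.New()
        splits := 0
        for i := 0; i < b.N; i++ {
            rs.Roll(byte(i))
            if rs.Digest()&mask == 0 { // value = 0, the go4.org/rollsum convention
                splits++
            }
        }
        _ = splits
    }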
All of this is to say I think it is reasonable to capture the
difference between bup & perkeep as slightly different parameters to
the same algorithm. We could add a third parameter to also capture the
rsync variant, though I haven't measured the overhead of making that a
run-time value -- but I expect it to be small enough not to matter
much.
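Concretely, the parameterization could look something like this (the
names are hypothetical, just to illustrate; they aren't from any
existing spec or codebase):

    // Params captures the split condition as plain data.
    type Params struct {
        Mask  uint32 // bitmask applied to the rolling digest
        Value uint32 // expected value of digest & Mask
    }

    // The two conventions discussed above, as presets:
    var (
        PerkeepParams = Params{Mask: (1 << 13) - 1, Value: 0}
        BupParams     = Params{Mask: (1 << 13) - 1, Value: (1 << 13) - 1}
    )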
Unfortunately, making the rolling checksum itself an interface
approximately doubles the running time. But I'm not sure whether
that actually matters for any real applications. We're still talking 6
ns/op; for perkeep & bup I expect that difference is still going to be
completely swallowed by the computation of a cryptographic hash for the
block as a whole.
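For concreteness, the interface in question would be something along
these lines (hypothetical; the per-byte dynamic dispatch on Roll is
where the roughly 2x slowdown comes from):

    // RollingHash abstracts the rolling checksum. Calling Roll
    // through an interface costs a dynamic dispatch on every input
    // byte.
    type RollingHash interface {
        // Roll updates the checksum with one input byte.
        Roll(b byte)
        // Digest returns the current checksum value.
        Digest() uint32
    }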
But I think a good spec would document the hash function orthogonally
to the rest of the algorithm, so this isn't a blocker on getting
started on the other parts.
> The highest few bits do not exhibit good randomness.
>
> [...]
>
> You might argue we shouldn't care about the highest few bits for
> hashsplitting purposes - we seldom care about more than 20 or so of
> the lowest bits - but I don't feel ready to settle for that until we've
> evaluated some other candidate algorithms.
I'm open to alternative suggestions, but (1) I don't see a problem:
with a k-bit mask the expected block size is 2^k bytes, so even 32MiB
average blocks only depend on the low 25 bits, and you're getting to
32MiB blocks before you start to have any differences in variance.
Realistically, +/-5% probably isn't a big deal either -- so you're
looking at a max average block size of 128MiB, which seems like
plenty.
(2) I do think it would be nice to arrange things such that some
existing systems' algorithms fall out of the spec as choices of
particular parameters. For that to work for perkeep & bup, we'd need to
keep some version of this hash function.
If we decide it's OK to put the function behind an interface in the
code, we could explore multiple possible hash functions, but (1) I
think it's valuable for this function to remain one of the options,
for compatibility, and (2) I don't want to offer too many options; a
different hash function is much more implementation work than a
numeric parameter, and if we add too many we've sort of missed the
point of standardization. So I'd only want to do this if there are
clear, compelling use cases for each of the functions we include.
Whatever parameters we decide to add, we should pick a
default/recommended set of values for them.