Performance on 0.9.7 was terrible. I had a multi-gigabyte file I was trying to delta, and it ran for four days before I killed it. I found the following strategies were successful:
- When reading in the signature file, size the signature table based on the size of the signature file; growing it with realloc is a painfully slow way to do it. I see readsums in 2.0.1 is no different from 0.9.7 in this respect.
- Expand the number of lookup hash table buckets based on the number of signatures; keep it to about 50 collisions per tag.
- heapsort vs. qsort: qsort is very fast, taking much less than a second to sort the signatures for a 100 GB (or so) file.
- Collect the weaksums, sort them, and hang them off the hash table for use on hash collisions. Each weaksum indexes a lo and hi in the sorted targets table for a binary search.
- Sort on all interesting fields, not just the tag, and do a binary search.
- Instead of returning a single token, return a first and last index into the sorted targets table covering the blocks with the same strong signature but at different offsets in the file.
This brought the processing time down to around an hour instead of weeks. In addition, the delta file was smaller due to improved detection of runs of repeated blocks: 25 MB on a 100 GB input file, so not a big difference.
I found that BLAKE2 was much slower than the MD4 previously used. Naturally, it also doubles the size of the signature files, but apparently without any benefit. Has anybody else tried to benchmark MD4 vs. BLAKE2 on GB-sized deltas? Is there any benefit that offsets the loss of speed?