performance of rdiff on 20 GB files with lots of collisions


bro....@gmail.com

Aug 7, 2018, 10:18:29 AM
to librsync
any idea how the delta stuff performs on really large files?


with earlier versions of librsync it was really bad, like exponentially bad.  a 1 GB file could be done in maybe an hour, but a 20 GB file would take a couple of days, and a 40 GB file ... i killed it after 10 days, so i don't know how long it would have taken.



i'm looking at the hashtable code, and it doesn't look to me like it will handle collisions well.  but what do i know.  i can't figure this out:


/* Loop macro for probing table t for key k, setting hk to the hash for k
 * reserving zero for empty buckets, and iterating with index i and entry hash
 * h, terminating at an empty bucket. */
#define _for_probe(t, k, hk, i, h) \
    const unsigned mask = t->size - 1;\
    unsigned hk = KEY_HASH((KEY_T *)k), i, s, h;\
    hk = hk ? hk : -1;\
    for (i = mix32(hk) & mask, s = 0; (h = t->ktable[i]); i = (i + ++s) & mask)

i think when there are collisions it just puts the entries into the table in the order they arrive, duplicates too.  if the hashtable is full, it seems to me like the add will loop forever, because i think it just probes for an empty entry.  lookups appear sequential.

but i don't understand the consequences of the incrementer in the for loop: i = (i + ++s) & mask.  if i starts at zero we get 0, 1, 3, 6, 10, 15, 21, etc.  other starting values of i give the same sequence, shifted.

is it possible this method of skipping through the hashtable is a way of dealing with collisions?