Re: Hashing/Collisions

42 views
Skip to first unread message

Mark Ruijter

unread,
Jul 4, 2014, 3:42:07 AM7/4/14
to lessfs

Hi Eric,

The file you want to look at is lib_common.c

 unsigned char *thash(unsigned char *buf, int size)
 222 {
 223     MHASH td;
 224     unsigned char *hash;
 225
 226     td = mhash_init(config->selected_hash);
 227     if (td == MHASH_FAILED)
 228         exit(1);
 229
 230     mhash(td, buf, size);
 231     hash = mhash_end(td);
 232     return hash;
 233 }

selected_hash is determined in lessfs.cfg, the configuration file.
Lessfs can therefore use almost any hash that is supported by mhash.
This routine is called thash, since in the beginning lessfs did not rely on mhash and included native support for the tiger hash only.

The mhash library is starting to show it's age.
I should switch to something else but have been reluctant to do so, since I would like to spend time on a deduplicating and thin kernel blockdevice (device-mapper) instead.

The actual work is done in lib_common.c here:

2242 unsigned int db_commit_block(unsigned char *dbdata,
2243                              INOBNO inobno, unsigned long dsize)
2244 {
2245     unsigned char *stiger = NULL;
2246     DAT *compressed;
2247     unsigned long long inuse;
2248     unsigned int ret = 0;
2249
2250     FUNC;
2251     LDEBUG("db_commit_block");
2252     stiger = thash(dbdata, dsize);
2253     create_hash_note(stiger);
2254     inuse = getInUse(stiger);
2255     if (0 == inuse) {
2256         compressed = lfscompress((unsigned char *) dbdata, dsize);
2257         ret = compressed->size;
2258         bin_write_dbdata(DBDTA, stiger, config->hashlen, compressed->data,
2259                          compressed->size);
2260         DATfree(compressed);
2261     } else {
2262         loghash("commit_block : only updated inuse for hash ", stiger);
2263     }
2264     inuse++;
2265     update_inuse(stiger, inuse);
2266     LDEBUG("db_commit_block : dbb %llu-%llu", inobno.inode,
2267            inobno.blocknr);
2268     bin_write_dbdata(DBB, (char *) &inobno, sizeof(INOBNO), stiger,
2269                      config->hashlen);
2270     delete_hash_note(stiger);
2271     s_free(stiger);
2272     return (ret);
2273 }

So line 2252 calculates the hash of the block that is about to be stored.
2254 then checks if an identical chunk of data is already present.
When present, inuse >0 then we do not store the block.
The reference counter (inuse) is incremented and the database DBU is updated by the call : update_inuse
Then the files inode->block number table is updated so that the current block number points to the hash that we calculated

I hope this provides you with enough information to get an idea.

Please let me know should you have additional questions.

Mark.


------
Op 3 jul. 2014 om 00:48 heeft Eric <> het volgende geschreven:

Hello,

From what I have read on online posts, lessfs does not implement byte-by-byte comparisons (as most dedup file systems); however, do you think you could let me know which part of the files deal with the mhash files? I would like to implement lessfs's hashing algorithms to create a more efficient way of checking for collisions and/or and ensuring the data block are in fact redundant data

(My technique will replace the need of byte-by-byte comparison, and eliminate the amount of overhead)

Yes, I understand that the probability of collisions occurring is very low, but this is a project I am very interested in






Reply all
Reply to author
Forward
0 new messages