Collision in Sparse Hash Map


hcha...@gmail.com

Dec 4, 2012, 10:00:55 PM
to google-s...@googlegroups.com
Hi,

Is there a way to ask sparse hash map to resize until there is no collision?

Does bucket_count() < size() mean that there are some collisions?

Just trying to see how I can improve the look-up speed.

Thanks in advance.


Cheers,
Hei

Donovan Hide

Dec 5, 2012, 4:12:26 AM
to google-sparsehash
Hi again,

> Is there a way to ask sparse hash map to resize until there is no collision?

The number of collisions depends on the number of keys and your choice of hashing algorithm. Testing with a typical workload is the only way to be sure of the collision rate. Collisions aren't inherently bad, unless many keys pile into a single bucket because of a poor choice of hashing algorithm. The way to test is to fill your map with typical keys, then iterate over each bucket and call bucket_size() on it. If you get a bad distribution, your hashing algorithm can be improved. If your average bucket size is much above 1, choose a more aggressive resize threshold.

> Does bucket_count() < size() mean that there are some collisions?

Yes :-) With fewer buckets than elements, the pigeonhole principle guarantees at least one bucket holds more than one element.

Cheers,
Donovan.

hcha...@gmail.com

Dec 5, 2012, 1:12:24 PM
to google-s...@googlegroups.com
Hi Donovan,

Yes, the occurrence of collisions depends on the combination of key pattern and hash function.  Given that there is no key pattern, I think it is impossible to tune the hash function.

So I was wondering whether sparse hash map provides a way to keep resizing (as you suggested) in the hope that increasing the number of buckets will reduce collisions.  Hmm...but it will likely make the cache locality worse...

Thanks for your input.

Donovan Hide

Dec 5, 2012, 2:50:56 PM
to google-sparsehash
Hi again,

> Yes, the occurrence of collisions depends on the combination of key pattern and hash function.  Given that there is no key pattern, I think it is impossible to tune the hash function.

Pointers do have patterns, just ones that are hard to predict! What I forgot to say this morning is that it is probably better to come up with a custom hash function which is related to the contents of your class. If you've read the archives of this mailing list you may have seen this before, but here's an example:


Is the map controlling the lifetime of the class instances you have, or do they outlive the map? If the map is the controller, then it is probably better to do away with pointers and implement a custom hash function. Without seeing your code it is hard to say. It's often easier to help if others can see what you are trying to achieve.

See also here for how sparsehash deals with pointer keys:


> So I was wondering whether sparse hash map provides a way to keep resizing (as you suggested) in the hope that increasing the number of buckets will reduce collisions.  Hmm...but it will likely make the cache locality worse...

Yes, hash tables generally do this automatically. Have a read through the code:


But this formula only checks the occupancy rate across all buckets, not whether any single bucket holds an exceptionally high number of keys. Writing a test case along the lines of the bucket_size suggestions in the previous email will show you whether collisions are harming your performance or not. I'd worry more about collisions than cache locality at this stage. Your access patterns, hashing algorithm and the activity of other threads all affect which cache lines get hit.

Hope that helps,
Donovan.

1971 powerChina

May 6, 2015, 4:03:46 AM
to google-s...@googlegroups.com, hcha...@gmail.com
Title:       The core of the big data solutions -- Map
Author:      pengwenwei
Email:       wenwei19710430
Language:    c++
Platform:    Windows, linux
Technology:  Perfect hash algorithm
Level:       Advanced
Description: A high performance map algorithm
Section      MFC c++ map stl
SubSection   c++ algorithm
License:     (GPLv3)

    Download demo project - 1070 Kb
    Download source - 1070 Kb

Introduction:
In C++ programs, maps are used everywhere, and map performance is often the program's bottleneck. It becomes the most critical factor in two scenarios:
    - Big data
    - Strong business coupling that prevents distributing the data and processing it in parallel

I have been working in the telecommunications and information-security industries for a few years and have a lot of experience with low-level big-data processing. From my perspective, the information-security industry's data is the most complicated, and all of it relies on maps: IP tables, MAC tables, telephone-number lists, domain-name resolution tables, ID-number lookups, cloud scanning of trojan/virus signatures, and so on.

The STL map is a binary search tree, which gives relatively poor performance. Google's hash map has the best performance at present, but it still carries a probability of collisions. Big-data systems rarely accept a map with collision risk; in some scenarios (e.g. charging/billing), any mistake is unacceptable.

Now I would like to publish my algorithms here. They include three kinds of map; you obtain the hash map after a build step. Comparing the test results, you will find that my algorithm has zero collision probability, yet its performance is better than the hash algorithm's, and even my general map's performance differs little from Google's.

My algorithm is a perfect hash algorithm. Its key-index algorithm and its compression theory are out of the ordinary. Most importantly, it uses a completely different structure, so the key-index compression is fundamentally different. The most direct benefit for a program is that a single server can support a solution that previously needed ten.

Note: the code cannot be used for commercial purposes. For commercial applications, please contact me via QQ 75293192.

Download:
https://sourceforge.net/projects/pwwhashmap/files

You can find my design ideas from following article:
http://blog.csdn.net/chixinmuzi/article/details/1727195


Development Manual:
There are three different map algorithms, each for a different application scenario:
1. memMap:  memory-based; no hard-disk consumption.
2. diskMap: disk-based; no memory consumption.
3. hashMap: no delete function, but the best performance.
memMap and diskMap can be converted to hashMap via memMap2HashMap and diskMap2HashMap.

In addition, Google's hash map is provided for comparative testing. Google's algorithm does not perform as well, and it also has a collision probability.
