Google sparse hash slow for map of int to vector


Nitish Gupta

Oct 5, 2015, 7:37:58 PM
to google-sparsehash

I have a map of uint64_t to vector. The map takes an unusually long time (~70 s) to build compared to unordered_map (~4 s) for a string input 4 MB in size. Similarly, ~2400 s vs. ~120 s for a string input of ~150 MB.

unordered_map is generally faster than Google sparse hash, but here the difference is huge: for a map of int to int, sparse hash is only about 2x slower than unordered_map.

I fail to understand the reason behind this time difference.


Code snippet:


#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include <sparsehash/sparse_hash_map>
//#include <unordered_map>

using namespace std;

int main() {
    string input, reference;

    // Concatenate all of stdin into one string.
    while (getline(cin, input)) {
        reference += input;
        input.clear();
    }

    cout << "length of reference = " << reference.length() << endl;

    google::sparse_hash_map<uint64_t, vector<uint32_t>> m;
    //unordered_map<uint64_t, vector<uint32_t>> m;
    for (auto it = reference.begin(); it != reference.end(); it++) {
        m[it - reference.begin()].push_back(it - reference.begin());
    }
    return 0;
}

This is the result of /usr/bin/time on the program:

length of reference = 4641652
	Command being timed: "./a.out"
	User time (seconds): 76.06
	System time (seconds): 0.31
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.43
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 308228
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 230585
	Voluntary context switches: 1
	Involuntary context switches: 2316
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Craig Silverstein

Oct 7, 2015, 12:32:30 PM
to google-s...@googlegroups.com
> I fail to understand the reason behind this time difference.

It's because sparse_hash_map has to copy the values every time it does a
hashtable resize. sparse_hash_map stores values in the hashtable directly,
unlike unordered_map, which stores a pointer to the values.

There are various solutions:

1) Store a vector<uint32_t>* in the map, rather than a vector<uint32_t>.
2) Use sparse_hash_multimap<> instead, and get rid of the vector entirely.
3) Call m.resize(reference.length()) (or use the m(reference.length())
map constructor) before your first insert.

You can combine (3) with either (1) or (2) if you want, though it works
fine on its own. (3) works best with (2), since otherwise it allocates a
bit too much memory (unless there are no duplicate keys in the input).
For sparsehash that extra memory is probably small; for unordered_map it
might be more of an issue.

craig

1971 powerChina

Oct 14, 2015, 2:34:30 AM
to google-sparsehash
Title:       The core of the big data solutions -- Map
Author:      pengwenwei
Address:     No.17-18 of XiangGangbatang Community, Xiangtan City of Hunan Province, China
Language:    C++
Platform:    Windows, Linux
Technology:  Perfect hash algorithm
Level:       Advanced
Description: A high-performance map algorithm
Section:     MFC, C++, map, STL
SubSection:  C++ algorithm
License:     GPLv3


Map is widely used in C++ programs, and its performance is critical to theirs, especially with big data and in scenarios where data distribution and parallel processing are not feasible.

I have been working on big data analysis for many years in the telecommunication and information security industries. Such analysis is complicated enough that it cannot be done without maps. The data in information security is especially complex: IP tables, MAC tables, telephone number tables, DNS tables, etc.


Currently, the STL map and Google's hash map are the most popular maps, but both have disadvantages. The STL map is based on binary search, which makes it comparatively slow. Google's hash map has the best performance at present, but it has a nonzero probability of collision. For big data analysis, that collision probability is unacceptable.

Now I would like to publish pwwMap. It includes three different maps for different scenarios:
1. Memory Map (memMap): good access speed, but its size is limited by available memory.
2. Hard-disk Map (diskMap): stores data on disk, so it can hold much more data than memMap.
3. Hash Map (hashMap): the best performance and lookup speed, but no 'insert' or 'delete' functionality.

memMap and diskMap can be converted to hashMap via the functions memMap2HashMap and diskMap2HashMap. According to my test results, the algorithms' collision probability is zero. As for performance, memMap is comparable to Google's hash map, and hashMap is 100 times faster than Google's hash map.

In summary, pwwHash is a family of perfect hash algorithms with zero collision probability. The key-index and compression algorithm theory is described in the following article:
http://blog.csdn.net/chixinmuzi/article/details/1727195

Source code and documents:
https://sourceforge.net/projects/pwwhashmap/files/?source=navbar

I would like to transfer my technique, with my immigration to the United States as the condition.

Please do not contact me by email, QQ, or telephone. We can have a face-to-face talk.

My address is as follows:
No.17-18 of XiangGangbatang Community, Xiangtan City of Hunan Province, China.