Is map a hash table?


mon...@gmail.com

Mar 19, 2012, 3:29:54 AM
to golang-nuts, golang...@googlegroups.com
Dear all,

I've just written some code to test the performance of the map type. I
am quite curious about its performance because, according to an article
on the Go blog[1], the hash_lookup function may be time-consuming under
some circumstances.

However, according to the textbook, the time complexity of a search
operation on a hash table should be O(1) (assuming a good hash
function). At first I thought the constant factor in the complexity
function might simply be large. But according to my recent test, the
map type behaves more like a tree structure. NOTE: I didn't read the
code (which should be in src/pkg/runtime/hashmap.c).

The test code is here: http://pastebin.com/W9me7bvc

It is very simple: allocate a map, write N key-value pairs, each key
having the same size, then read back all the values in insertion order.
Divide the total time spent getting items by the number of items to get
the average time per get operation.
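
In case the pastebin goes away, here is a minimal sketch of the same
harness (reconstructed from the description above rather than copied
from the pastebin; the keys are fixed-width, zero-padded strings):

package main

import (
        "fmt"
        "time"
)

func bench(n int) {
        // Keys are fixed-width, zero-padded strings: "000000000", "000000001", ...
        keys := make([]string, n)
        m := make(map[string]int, n)
        for i := range keys {
                keys[i] = fmt.Sprintf("%09d", i)
                m[keys[i]] = i
        }
        // Read every value back in insertion order and time it.
        start := time.Now()
        sum := 0
        for _, k := range keys {
                sum += m[k]
        }
        us := float64(time.Since(start).Nanoseconds()) / 1e3
        fmt.Printf("Map Seq. Access. iterations: %d. %v us/op (sum %d)\n",
                n, us/float64(n), sum)
}

func main() {
        for _, n := range []int{1e3, 1e4, 1e5, 1e6, 1e7} {
                bench(n)
        }
}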

Here is my result:

Map Seq. Access. iterations: 1000. 0.08400000000000002 us/op
Map Seq. Access. iterations: 10000. 0.1177 us/op
Map Seq. Access. iterations: 100000. 0.26981 us/op
Map Seq. Access. iterations: 1000000. 0.390944 us/op
Map Seq. Access. iterations: 10000000. 0.4742399 us/op


The average access time is actually an O(log(N)) function, not the
constant O(1). It looks like a tree structure, not a hash table.

For comparison, I wrote similar Python code; here are the results:

Dict Seq. Access. iterations: 1000 . 0.123 us/op
Dict Seq. Access. iterations: 10000 . 0.1137 us/op
Dict Seq. Access. iterations: 100000 . 0.14328 us/op
Dict Seq. Access. iterations: 1000000 . 0.163299 us/op
Dict Seq. Access. iterations: 10000000 . 0.1822816 us/op

That's more like a hash table.

So I wrote another test program in C++, using map from the STL. I got
results similar to Go's map (using g++ 4.6.2, with -O3):

Map Seq. Access. iterations: 1000. 0.143 us/op
Map Seq. Access. iterations: 10000. 0.2117 us/op
Map Seq. Access. iterations: 100000. 0.33286 us/op
Map Seq. Access. iterations: 1000000. 0.559903 us/op
Map Seq. Access. iterations: 10000000. 0.731576 us/op

According to the Go language specification:

The comparison operators == and != ... must be fully
defined for operands of the key type

If it were a hash table, then rather than == and != we would need to
define a hash function on the key type. This evidence led me to believe
that map is more like a tree than a hash table.

However, in the source code we have hashmap.c, which, from the name,
looks like a hash table. I apologize for not reading the source code,
but the test results are a very strong argument (to me) that map is
actually a tree structure, not a hash table. Yet Go for C++
Programmers[2] says: "Hash tables are provided by the language. They
are called maps."

All these pieces of evidence are, to me, inconsistent. Is it simply a
naming convention in Go (and among its developers) that any key-value
store is called a "hash table"?

Regards,
-Monnand


1. http://blog.golang.org/2011/06/profiling-go-programs.html
2. http://golang.org/doc/go_for_cpp_programmers.html

⚛

Mar 19, 2012, 4:08:47 AM
to golan...@googlegroups.com, golang...@googlegroups.com
You should take CPU cache misses into account:

http://pastebin.com/pfz8zvFp

Map Seq. Access. iterations: 1000. 0.19 us/op, IPC 1.26
Map Seq. Access. iterations: 10000. 0.20 us/op, IPC 1.19
Map Seq. Access. iterations: 100000. 0.31 us/op, IPC 0.91
Map Seq. Access. iterations: 1000000. 0.44 us/op, IPC 0.60
Map Seq. Access. iterations: 10000000. 0.49 us/op, IPC 0.52

Dave Cheney

Mar 19, 2012, 4:14:19 AM
to ⚛, golan...@googlegroups.com, golang...@googlegroups.com
Atom makes a good point: by iterating through the map in key order
rather than natural order, you're taking a lot of cache misses.

If you test again using the natural map iterator

        for k := range a {
                sum += a[k]
        }

then you get results like this:

before:
odessa(~/devel/src) % go run -v hmaptest.go
command-line-arguments
Map Seq. Access. iterations: 1000. 0.087 us/op
Map Seq. Access. iterations: 10000. 0.0863 us/op
Map Seq. Access. iterations: 100000. 0.19273000000000004 us/op
Map Seq. Access. iterations: 1000000. 0.270274 us/op
Map Seq. Access. iterations: 10000000. 0.32927910000000005 us/op

after:
odessa(~/devel/src) % go run -v hmaptest.go
command-line-arguments
Map Seq. Access. iterations: 1000. 0.12800000000000003 us/op
Map Seq. Access. iterations: 10000. 0.1097 us/op
Map Seq. Access. iterations: 100000. 0.12813000000000002 us/op
Map Seq. Access. iterations: 1000000. 0.145759 us/op
Map Seq. Access. iterations: 10000000. 0.20803340000000003 us/op

I think the last value is probably outside the L2/L3 (?) cache on this
host (Core i5 Mac mini), so there is a lot of cache thrashing going on.

Monnand

Mar 19, 2012, 5:03:03 AM
to golan...@googlegroups.com, golang...@googlegroups.com


On Monday, March 19, 2012 4:08:47 AM UTC-4, ⚛ wrote:
You should take CPU cache misses into account:

http://pastebin.com/pfz8zvFp

Map Seq. Access. iterations: 1000. 0.19 us/op, IPC 1.26
Map Seq. Access. iterations: 10000. 0.20 us/op, IPC 1.19
Map Seq. Access. iterations: 100000. 0.31 us/op, IPC 0.91
Map Seq. Access. iterations: 1000000. 0.44 us/op, IPC 0.60
Map Seq. Access. iterations: 10000000. 0.49 us/op, IPC 0.52

Thank you!

Then let's take an analytic approach:

average memory access time = hit time + miss rate * miss penalty

Let T = ns per cycle and M = instructions per iteration.

Then the time for each iteration is (1/IPC) * T * M.

Let's use CPI instead of IPC for convenience.

If it is a good hash table, in which the time complexity is O(1), i.e. a constant C:
CPI * T * M = hit time + miss rate * miss penalty + C

Suppose miss penalty >> hit time and miss penalty >> C, then we have:
CPI * T * M ~= miss rate * miss penalty

From the measurements, the time for each iteration is proportional to log(N):
CPI * T * M ~ log(N)
=>
miss rate * miss penalty ~ log(N)

then we have:
miss rate ~ log(N).
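
(For rough, purely illustrative numbers - not measured: with a hit time
of ~1 ns and a miss penalty of ~100 ns, a miss rate growing from 0.1 to
0.4 alone moves the average memory access time from ~11 ns to ~41 ns.
With a handful of memory accesses per lookup, that is the same order as
the observed growth from 0.08 to 0.47 us/op.)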

I am trying to figure out why we get such a high miss rate. Remember that in Python, we see a relatively constant CPI and hence a constant miss rate.

Again, thanks for pointing out the cache misses.

Regards,
-Nan

⚛

Mar 19, 2012, 5:20:23 AM
to golan...@googlegroups.com, golang...@googlegroups.com
On Monday, March 19, 2012 10:03:03 AM UTC+1, Monnand wrote: 
Remember that in Python, we see a relatively constant CPI and hence a constant miss rate.

Did you measure the miss rate?

mon...@gmail.com

Mar 19, 2012, 5:26:33 AM
to golan...@googlegroups.com

Not directly. But it can be derived from my previous analysis:

hash search time is a constant C;
average memory access time = hit time + miss rate * miss penalty

Then the time in each iteration is:

hit time + miss rate * miss penalty + C

Here is the python benchmark result:

Dict Seq. Access. iterations: 1000 . 0.123 us/op
Dict Seq. Access. iterations: 10000 . 0.1137 us/op
Dict Seq. Access. iterations: 100000 . 0.14328 us/op
Dict Seq. Access. iterations: 1000000 . 0.163299 us/op
Dict Seq. Access. iterations: 10000000 . 0.1822816 us/op

The us/op stays (nearly) the same (~0.1x).

This means hit time + miss rate * miss penalty + C stays the same, so
the miss rate should be roughly constant.

Regards,
-Monnand

Mar 19, 2012, 5:44:03 AM
to golan...@googlegroups.com, ⚛, golang...@googlegroups.com
On Monday, March 19, 2012 9:14:19 AM UTC+1, Dave Cheney wrote:
Atom makes a good point: by iterating through the map in key order
rather than natural order, you're taking a lot of cache misses.

If you test again using the natural map iterator

        for k := range a {
                sum += a[k]
        }


The keys are of type 'string'. A string requires a pointer indirection. If the map contents do not fit in the CPU cache and the hash function is good (and independent of the memory allocation algorithm), then each map access will cause a cache miss irrespective of iteration order. Finding the iteration order which minimizes cache misses is hard.

A map[int]int should be faster than map[string]int.
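
A quick way to see the difference (my own throwaway sketch, not one of
the thread's pastebins; same fixed-width string keys as the original
benchmark):

package main

import (
        "fmt"
        "time"
)

func main() {
        const n = 1000000
        si := make(map[string]int, n)
        ii := make(map[int]int, n)
        keys := make([]string, n)
        for i := 0; i < n; i++ {
                keys[i] = fmt.Sprintf("%09d", i)
                si[keys[i]] = i
                ii[i] = i
        }
        sum := 0

        // String keys: hashing and comparing follow a pointer to the bytes.
        start := time.Now()
        for _, k := range keys {
                sum += si[k]
        }
        fmt.Printf("map[string]int: %v\n", time.Since(start))

        // Int keys: the key is hashed and compared in place, no indirection.
        start = time.Now()
        for i := 0; i < n; i++ {
                sum += ii[i]
        }
        fmt.Printf("map[int]int:    %v (sum %d)\n", time.Since(start), sum)
}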

mon...@gmail.com

Mar 19, 2012, 5:43:40 AM
to golan...@googlegroups.com
mon...@gmail.com wrote, On 03/19/2012 05:26 AM:
> ⚛ wrote, On 03/19/2012 05:20 AM:
>> On Monday, March 19, 2012 10:03:03 AM UTC+1, Monnand wrote:
>>
>> Remember that in Python, we see a relatively constant CPI and hence
>> a constant miss rate.
>>
>>
>> Did you measure the miss rate?
>
> Not directly. But it can be derived from my previous analysis:
>
> hash search time is a constant C;
> average memory access time = hit time + miss rate * miss penalty
>
> Then the time in each iteration is:
> hit time + miss rate * miss penalty + C
>
> Here is the python benchmark result:
>
> Dict Seq. Access. iterations: 1000 . 0.123 us/op
> Dict Seq. Access. iterations: 10000 . 0.1137 us/op
> Dict Seq. Access. iterations: 100000 . 0.14328 us/op
> Dict Seq. Access. iterations: 1000000 . 0.163299 us/op
> Dict Seq. Access. iterations: 10000000 . 0.1822816 us/op
>
> The us/op stays (nearly) the same (~0.1x).

In case you need, here is the python benchmark:

http://pastebin.com/QMysjR47

mon...@gmail.com

Mar 19, 2012, 5:54:19 AM
to golan...@googlegroups.com

Thanks for your reply!

It is true that map[int]int is faster. But still, it shows an O(log(N))
complexity:

Map Seq. Access. iterations: 1000. 0.06 us/op
Map Seq. Access. iterations: 10000. 0.0643 us/op
Map Seq. Access. iterations: 100000. 0.13352000000000003 us/op
Map Seq. Access. iterations: 1000000. 0.297422 us/op
Map Seq. Access. iterations: 10000000. 0.3801063 us/op

here is the code:
http://pastebin.com/96Gh4P5k

Regards,
-Monnand

Malcolm

Mar 19, 2012, 7:12:27 AM
to golan...@googlegroups.com, golang...@googlegroups.com
On Monday, 19 March 2012 07:29:54 UTC, Monnand wrote:

According to the Go language specification:

The comparison operators == and != ... must be fully
defined for operands of the key type

If it were a hash table, then rather than == and != we would need to
define a hash function on the key type. This evidence led me to believe
that map is more like a tree than a hash table.

 
Hash tables do need to compare keys as well as hash them: the correct behaviour for two keys that hash to the same value depends on whether they are actually equal or merely a hash collision.
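
Schematically, with a toy chained table (a sketch of the general
technique, not Go's actual runtime code):

package main

import (
        "fmt"
        "hash/fnv"
)

type entry struct {
        hash uint64
        key  string
        val  int
        next *entry
}

func hashString(s string) uint64 {
        h := fnv.New64a()
        h.Write([]byte(s))
        return h.Sum64()
}

func insert(buckets []*entry, k string, v int) {
        h := hashString(k)
        i := h % uint64(len(buckets))
        buckets[i] = &entry{hash: h, key: k, val: v, next: buckets[i]}
}

func lookup(buckets []*entry, k string) (int, bool) {
        h := hashString(k)
        for e := buckets[h%uint64(len(buckets))]; e != nil; e = e.next {
                // The hash only narrows the search to one bucket; the ==
                // the spec requires is what distinguishes a real match
                // from a key that merely collided into the same bucket.
                if e.hash == h && e.key == k {
                        return e.val, true
                }
        }
        return 0, false
}

func main() {
        buckets := make([]*entry, 8)
        insert(buckets, "a", 1)
        insert(buckets, "b", 2)
        fmt.Println(lookup(buckets, "b")) // 2 true
}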

Thanks,
Malcolm.

Volker Dobler

Mar 19, 2012, 7:45:29 AM
to golang-nuts


On Mar 19, 10:54 am, "monn...@gmail.com" <monn...@gmail.com> wrote:
> It is true that map[int]int is faster. But still, it shows an O(log(N))
> complexity:
>
> Map Seq. Access. iterations: 1000. 0.06 us/op
> Map Seq. Access. iterations: 10000. 0.0643 us/op
> Map Seq. Access. iterations: 100000. 0.13352000000000003 us/op
> Map Seq. Access. iterations: 1000000. 0.297422 us/op
> Map Seq. Access. iterations: 10000000. 0.3801063 us/op

The data you measured do _not_ indicate O(log(N)) behavior:
plot the data and you'll see it. Maps in Go are hash maps, but
the "textbook" O(1) is just that: a "textbook complexity", i.e. valid
under assumptions the real world does not fulfill, such as handling
dynamic growth while avoiding allocations, space/time trade-offs,
general-purpose hash functions, and cache misses.

Volker

John Asmuth

Mar 19, 2012, 7:58:35 AM
to golan...@googlegroups.com


On Monday, March 19, 2012 7:45:29 AM UTC-4, Volker Dobler wrote:


On Mar 19, 10:54 am, "monn...@gmail.com" <monn...@gmail.com> wrote:
> It is true that map[int]int is faster. But still, it shows an O(log(N))
> complexity:
>
> Map Seq. Access. iterations: 1000. 0.06 us/op
> Map Seq. Access. iterations: 10000. 0.0643 us/op
> Map Seq. Access. iterations: 100000. 0.13352000000000003 us/op
> Map Seq. Access. iterations: 1000000. 0.297422 us/op
> Map Seq. Access. iterations: 10000000. 0.3801063 us/op

The data you measured do _not_ indicate O(log(N)) behavior:
plot the data and you'll see it.

Really? It looks logarithmic to me. Except for the step from 1000 to 10000, the time per op increases by ~0.1 us/op at each step, while the number of iterations grows exponentially. Exponential growth in the domain with linear growth in the range is logarithmic.

Volker Dobler

Mar 19, 2012, 9:02:17 AM
to golang-nuts
Extending the range a bit, with some more measurement points:
100. 0.26 us/op
1000. 0.179 us/op
10000. 0.2278 us/op
100000. 0.38376 us/op
1000000. 0.728292 us/op
10000000. 0.8977112 us/op
15000000. 0.9741257333333333 us/op
20000000. 1.19961005 us/op
25000000. 1.309732 us/op

This is not logarithmic (see http://pastebin.com/fGDxSxiq).
To demonstrate logarithmic growth, the points would have to fall pretty
well on a single, reasonably straight line in a log/lin plot.
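
A plot-free way to check: if t = a + b*log(N), each tenfold step in N
should add the same increment b. A quick throwaway program over the
numbers above:

package main

import "fmt"

func main() {
        // Measurements from above (N, us/op), one point per decade.
        n := []int{100, 1000, 10000, 100000, 1000000, 10000000}
        t := []float64{0.26, 0.179, 0.2278, 0.38376, 0.728292, 0.8977112}
        // Logarithmic growth would make these increments constant.
        for i := 1; i < len(t); i++ {
                fmt.Printf("%8d -> %8d: %+.4f us/op\n", n[i-1], n[i], t[i]-t[i-1])
        }
}

The increments come out as -0.081, +0.049, +0.156, +0.345, +0.169:
anything but constant.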

Volker

⚛

Mar 19, 2012, 9:55:58 AM
to Volker Dobler, golang-nuts

The O(1) in textbooks corresponds to how long it takes to compute the
hashcode of a key and to compare two keys for equality. The O(1)
includes any cache misses that may happen during the computation.
Comparing two strings for equality also corresponds to O(1),
irrespective of the length of the strings. Thus, O(1) is the average
time it takes to compute the hashcode and to compare two keys. You
can think of it as O(keyType) instead of O(1).

It is physically impossible to select/choose 1 element out of N
elements in less than O(log N) time. The best case is O(log N), so in
any case O(1) is completely invalid.

The O(1) mentioned in the textbook should be replaced with the actual
time complexity of hashing and comparing keys. So, in case of this
golang-nuts discussion about Go maps, the first approximation would be
to replace O(1) with O(log N). Ultimately, O(1) should be replaced
with a formula that closely matches the actual measured performance of
Go/Python/C++ maps.


Volker Dobler

Mar 19, 2012, 10:06:07 AM
to Michael Jones, golang-nuts
Nice test! Thanks!
We are seeing an overall log N behavior with clear 
deviations, at least over a range of five orders of 
magnitude.

V.

On Mon, Mar 19, 2012 at 2:56 PM, Michael Jones <m...@google.com> wrote:
sorry, incomplete post.

that was hashing both ways into a full table as in the original.

here it is in natural order (range over the hash table and using the values):
1024 0.0850
2048 0.0483
4096 0.0469
8192 0.0483
16384 0.0485
32768 0.0466
65536 0.0449
131072 0.0432
262144 0.0432
524288 0.0453
1048576 0.0484
2097152 0.0502
4194304 0.0515
8388608 0.0509
16777216 0.0537

the difference is the growth rate under discussion. can do no more
now. on the way to the airport.

On Mon, Mar 19, 2012 at 6:40 AM, Michael Jones <m...@google.com> wrote:
> 1024 0.1045
> 2048 0.1191
> 4096 0.1074
> 8192 0.1056
> 16384 0.1258
> 32768 0.1428
> 65536 0.1790
> 131072 0.2497
> 262144 0.3111
> 524288 0.3344
> 1048576 0.3434
> 2097152 0.3695
> 4194304 0.3819
> 8388608 0.4103
> 16777216 0.5523
> 33554432 0.5905
> 67108864 0.6256
> --
> Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1
> 650-335-5765






--
Dr. Volker Dobler

Jan Mercl

Mar 19, 2012, 10:10:56 AM
to golan...@googlegroups.com
On Monday, March 19, 2012 2:55:58 PM UTC+1, ⚛ wrote:

The O(1) in textbooks corresponds to how long it takes to compute the
hashcode of a key and to compare two keys for equality. The O(1)
includes any cache misses that may happen during the computation.
Comparing two strings for equality also corresponds to O(1),
irrespective of the length of the strings. Thus, O(1) is the average
time it takes to compute the hashcode and to compare two keys.  You
can think of it as O(keyType) instead of O(1).

It is physically impossible to select/choose 1 element out of N
elements in less than O(log N) time. The best case is O(log N), so in
any case O(1) is completely invalid.

If you're still talking about hash maps then the above doesn't hold.

⚛

Mar 19, 2012, 10:15:54 AM
to golan...@googlegroups.com
Why?

Hotei

Mar 19, 2012, 10:25:23 AM
to golan...@googlegroups.com
<nitpick>You may need to edit your graph if you wish to portray it as log/lin. The Y axis is "erratic", neither log nor linear.</nitpick>

Jan Mercl

Mar 19, 2012, 10:27:44 AM
to golan...@googlegroups.com
Supposing "select/choose" means "access the hash map value associated with a key", then O(log N) (N being len(map)) is characteristic of binary search, B-trees, and friends. On the contrary, for hash maps it is the textbook example of O(1) access in the first approximation (fixed hash table size, negligible collision rate) - and that is still a real-world case wherever time matters more than memory. Approaches that are more conservative with memory get more complex to analyze, e.g.: http://en.wikipedia.org/wiki/Hash_table#Performance_analysis
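
(For instance, in the textbook chaining analysis a successful lookup at
load factor alpha costs about 1 + alpha/2 key comparisons on average -
at alpha = 0.75 that is ~1.4 comparisons, with no N in the formula at
all; that constant is the O(1).)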

 

Michael Jones

Mar 19, 2012, 9:40:20 AM
to Volker Dobler, golang-nuts
1024 0.1045
2048 0.1191
4096 0.1074
8192 0.1056
16384 0.1258
32768 0.1428
65536 0.1790
131072 0.2497
262144 0.3111
524288 0.3344
1048576 0.3434
2097152 0.3695
4194304 0.3819
8388608 0.4103
16777216 0.5523
33554432 0.5905
67108864 0.6256


Michael Jones

Mar 19, 2012, 9:56:19 AM
to Volker Dobler, golang-nuts
sorry, incomplete post.

that was hashing both ways into a full table as in the original.

here it is in natural order (range over the hash table and using the values):
1024 0.0850
2048 0.0483
4096 0.0469
8192 0.0483
16384 0.0485
32768 0.0466
65536 0.0449
131072 0.0432
262144 0.0432
524288 0.0453
1048576 0.0484
2097152 0.0502
4194304 0.0515
8388608 0.0509
16777216 0.0537

the difference is the growth rate under discussion. can do no more
now. on the way to the airport.


mon...@gmail.com

Mar 19, 2012, 3:42:25 PM
to golan...@googlegroups.com, golang...@googlegroups.com
Dear all,

Thank you all for your wonderful replies. I read them carefully and
reproduced the results with modified benchmarks.

On the cache-miss argument mentioned by ⚛ (I'm sorry, I don't know how
to type your name/symbol here) and Dave: I totally buy it. Yes, from
the IPC it is quite clear that cache misses make the IPC drop.

⚛, your perf package is really helpful. Thanks a lot!

Dave also mentioned that we can keep the access time constant by
accessing the map in its natural order, i.e. using for k := range a.
In that case, sequential accesses mitigate the cache-miss rate.

Dave's argument made me think of another question: why does Python have
a constant access time, and hence a constant miss rate?

Here is my theory: in my Python code, the elements are accessed in
insertion order, which is also the natural order of the keys themselves
(the first element has key "000000000", the second "000000001", etc.).
If Python's hash function preserves that order, then my code accesses
the hash table in its natural order, which reduces the miss rate and
keeps the access time constant.

Then I changed my Python code to access the elements in a random order.
If my theory holds, the access time should no longer be constant,
because of cache misses.

Here is the code: http://pastebin.com/xrcYDmsw

Here is the result of the Python benchmark (I ran it 3 times):

$ vim test.py
$ python test.py
Dict Random Access. N = 1000 : 0.132 us/op
Dict Random Access. N = 10000 : 0.1317 us/op
Dict Random Access. N = 100000 : 0.41815 us/op
Dict Random Access. N = 1000000 : 0.505788 us/op
Dict Random Access. N = 10000000 : 0.6442865 us/op
$ python test.py
Dict Random Access. N = 1000 : 0.136 us/op
Dict Random Access. N = 10000 : 0.1361 us/op
Dict Random Access. N = 100000 : 0.41679 us/op
Dict Random Access. N = 1000000 : 0.497187 us/op
Dict Random Access. N = 10000000 : 0.6444111 us/op
$ python test.py
Dict Random Access. N = 1000 : 0.133 us/op
Dict Random Access. N = 10000 : 0.1393 us/op
Dict Random Access. N = 100000 : 0.41424 us/op
Dict Random Access. N = 1000000 : 0.501201 us/op
Dict Random Access. N = 10000000 : 0.6417045 us/op

From the result, my question is answered: the reason my previous Python
benchmark behaved better may be attributed to the access sequence.

It seems to me that the hash function in Python does not randomize its
input well. This may be due to the pseudo-random generator on the
system.

Another person sent me results from the same benchmark I used
previously, but run on a Windows machine; there the access time is no
longer constant.

Finally, here are my conclusions:

1. The reason the Python benchmark performs better than Go's is the
access pattern. With random access, the access time increases because
of cache misses (a Go sketch reproducing this follows below).

2. Python's hash function may not be that "good" on some systems,
depending on the pseudo-random generator in the library (maybe the C
library?).
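
To see conclusion 1 on the Go side, here is a quick throwaway sketch of
my own (not one of the pastebins above), comparing the map's natural
iteration order with a shuffled key order:

package main

import (
        "fmt"
        "math/rand"
        "time"
)

func main() {
        const n = 1000000
        m := make(map[int]int, n)
        keys := make([]int, n)
        for i := 0; i < n; i++ {
                m[i] = i
                keys[i] = i
        }
        // Shuffle the key slice so lookups jump around the table.
        rand.Shuffle(n, func(i, j int) { keys[i], keys[j] = keys[j], keys[i] })

        sum := 0

        // Natural order: range walks the table's own memory layout,
        // so consecutive reads tend to touch nearby memory.
        start := time.Now()
        for _, v := range m {
                sum += v
        }
        fmt.Printf("range order:   %v\n", time.Since(start))

        // Shuffled order: each lookup lands in an unrelated bucket,
        // so a large map takes a cache miss on almost every access.
        start = time.Now()
        for _, k := range keys {
                sum += m[k]
        }
        fmt.Printf("shuffled keys: %v (sum %d)\n", time.Since(start), sum)
}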

Thanks again for your replies!

Regards,
-Monnand

Rémy Oudompheng

Mar 19, 2012, 4:33:20 PM
to mon...@gmail.com, golan...@googlegroups.com, golang...@googlegroups.com
On 19 March 2012 at 20:42, mon...@gmail.com <mon...@gmail.com> wrote:
> Dave's argument made me think of another question: why does Python have
> a constant access time, and hence a constant miss rate?

Python's dictionaries are arrays (similar to slices) that are
reallocated on growth.

> It seems to me that the hash function in Python does not randomize its
> input well. This may be due to the pseudo-random generator on the
> system.

Python's hash function is not randomized at all. Integers hash as
themselves. This may change in Python 3.3 or 3.4 as randomization is
introduced.

> 2. Python's hash function may not be that "good" on some systems, depending
> on the pseudo-random generator in the library (maybe the C library?).

As said, Python's hash function is not randomized.

Go hashmaps are represented as very flat tree-like structures: I guess
that minimizes the allocation of giant chunks of memory and possibly
reduces pressure on the garbage collector. The arity of the tree is
very large (~1000) so the logarithmic effect is very slight.

Thus, small hash maps (fewer than 16384 elements) are supposed to
behave like Python's hashmaps; beyond that you start observing the
tree-structure effect. [Maybe runtime authors can correct me.]
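
(For scale: with subtables of 2^12 = 4096 slots, as in the output
below, 10^7 elements sit only about log(10^7)/log(4096), i.e. roughly
2, levels deep, so such a "tree" costs at most an extra indirection or
two per lookup.)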

The attached Go program introspects the top-level data structure that
underlies maps. You should really read the data structure definitions
in src/pkg/runtime/hashmap.c. The code is probably a bit harder to read
(I'm too lazy to delve into the details), but the data structure itself
seems self-explanatory.

% go run map.go
{"Count":1,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":0,"Seed":1546380794,"SubtableLogSize":1}
{"Count":10,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":0,"Seed":1639877174,"SubtableLogSize":4}
{"Count":100,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":0,"Seed":1274387075,"SubtableLogSize":8}
{"Count":1000,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":0,"Seed":526629865,"SubtableLogSize":11}
{"Count":10000,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":0,"Seed":1053259730,"SubtableLogSize":14}
{"Count":100000,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":15185,"Seed":2106519460,"SubtableLogSize":12}
{"Count":1000000,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":32273,"Seed":1716741222,"SubtableLogSize":12}
{"Count":10000000,"DataSize":8,"MaxPower":12,"Indirect":0,"ValOffset":4,"NodeAllocs":47864,"Seed":1119210241,"SubtableLogSize":12}


--
Rémy.

map.go

mon...@gmail.com

Mar 19, 2012, 6:13:35 PM
to Rémy Oudompheng, golan...@googlegroups.com, golang...@googlegroups.com
Thanks, Rémy! I'll comment on your message below:


> Python's dictionaries are arrays (similar to slices) that are
> reallocated on growth.

Thanks for pointing this out. I have read Python's code, and it's a
classic textbook implementation of an open-addressing hash table.

>> It seems to me that the hash function in Python does not randomize
>> its input well. This may be due to the pseudo-random generator on the
>> system.
>
> Python's hash function is not randomized at all. Integers hash as
> themselves. This may change in Python 3.3 or 3.4 as randomization is
> introduced.

Well... I was using fixed-length strings instead of integers. By
randomize, I mean the order of the hash values may differ from the
order of the real input values (treating the input as a big integer or
something).

To make things clear, here is some python code:

>>> hash("000000000")
-1389505745207213127
>>> hash("000000001")
-1389505745207213128
>>> hash("000000002")
-1389505745207213125
>>> hash("000000009")
-1389505745207213136
>>> hash("000000010")
-1389505745206213126
>>> hash("000000011")
-1389505745206213125

The hash values are close to each other, and my code accesses them
sequentially.

>
>> 2. Python's hash function may not be that "good" on some systems, depending
>> on the pseudo-random generator in the library (maybe the C library?).
>
> As said, Python's hash function is not randomized.
>
> Go hashmaps are represented as very flat tree-like structures: I guess
> that minimizes the allocation of giant chunks of memory and possibly
> reduces pressure on the garbage collector. The arity of the tree is
> very large (~1000) so the logarithmic effect is very slight.
>

Dave told me this too. But if it were (only) the tree structure that
leads to the logarithmic effect, the IPC should not change.

To me, the cache effect may contribute more than the tree structure,
considering the change in IPC (or CPI if you like).

Regards,
-Monnand

mon...@gmail.com

Mar 19, 2012, 6:22:55 PM
to Rémy Oudompheng, golan...@googlegroups.com, golang...@googlegroups.com
>>> 2. Python's hash function may not be that "good" on some systems, depending
>>> on the pseudo-random generator in the library (maybe the C library?).
>>
>> As said, Python's hash function is not randomized.
>>
>> Go hashmaps are represented as very flat tree-like structures: I guess
>> that minimizes the allocation of giant chunks of memory and possibly
>> reduces pressure on the garbage collector. The arity of the tree is
>> very large (~1000) so the logarithmic effect is very slight.
>>
>
> Dave told me this too. But if it were (only) the tree structure that
> leads to the logarithmic effect, the IPC should not change.
>
> To me, the cache effect may contribute more than the tree structure,
> considering the change in IPC (or CPI if you like).

Well, I don't think I made this clear.

I think the tree structure may be the reason we have cache misses, and
the cache misses lead to lower IPC.

You are right: the tree structure leads to the logarithmic effect.

Thanks, Rémy!

Regards,
-Monnand

pww19...@gmail.com

Apr 2, 2015, 11:32:50 AM
to golan...@googlegroups.com, golang...@googlegroups.com

Title:       The core of the core of the big data solutions -- Map
Author:      pengwenwei
Email:       
Language:    c++
Platform:    Windows, linux
Technology:  Perfect hash algorithm
Level:       Advanced
Description: Map algorithm with high performance
Section      MFC c++ map stl
SubSection   c++ algorithm
License:     (GPLv3)

    Download demo project - 1070 Kb
    Download source - 1070 Kb

Introduction:
For C++ programs, map is used everywhere, and the bottleneck of program
performance is often the performance of map - especially in the case of
large data, where the business logic is tightly coupled and the data
cannot be distributed and processed in parallel. So the performance of
map becomes the key technology.

In my work experience in the telecommunications and information
security industries, I dealt with big low-level data - especially the
most complex information security data - and none of it can do without
map.

For example: IP tables, MAC tables, telephone number lists, domain name
resolution tables, ID number lookups, Trojan and virus signature codes
for cloud scanning, etc.

The map of the STL library uses binary search, which has the worst
performance. Google's hash map currently has the best performance and
memory use, but it has a probability of repeated collisions. Big-data
systems rarely use a map with a collision probability, especially where
fees are involved and errors are unacceptable.

Now I am putting my algorithms out here. There are three kinds of map;
after the build it is a hash map. You can test the comparison: my
algorithm has zero probability of collision, and its performance is
better than the hash algorithm - even its ordinary performance is not
much different from Google's.

My algorithm is a perfect hash algorithm; its key index and the
principle of its compression algorithm are out of the ordinary, and
most importantly its structure is completely different, so the
key-index compression is fundamentally different. The most direct
benefit for a program: where the original map needed ten servers, I now
need only one.
Declare: the code cannot be used for commercial purposes; for
commercial applications, you can contact me via QQ 75293192.
Download:
https://sourceforge.net/projects/pwwhashmap/files

Applications:
First, modern warfare cannot do without massive information queries; if
a query of enemy target information slows down by a second, it could
delay a fighter and lead to the failure of the entire war. Information
retrieval is inseparable from map; if military products used pwwHashMap
instead of the traditional map, you would be the winner.

Second, the performance of a router determines surfing speed; just
replace an open-source router's map with pwwHashMap and its speed can
increase tenfold. There are many tables to query and set in the
router's DHCP protocol, such as IP and MAC, and all of these are
handled by map. But until now all maps have used the STL library, whose
performance is very low, and using a hash map has an error probability,
so the only option has been to disperse packets across multiple
routers. Using pwwHashMap, you can save at least ten sets of equipment.

Third, Hadoop is recognized as the big-data solution at present, and
its most fundamental element is heavy use of map instead of SQL and
tables. Hadoop assumes such huge amounts of data that the data cannot
be moved at all; the analysis must be carried out locally. But if the
map in the open-source Hadoop code were changed to pwwHashMap, the
performance would increase a hundredfold without any problems.


Background to this article that may be useful, such as an introduction
to the basic ideas presented:
http://blog.csdn.net/chixinmuzi/article/details/1727195



wenwei peng

Apr 2, 2015, 9:13:39 PM
to golan...@googlegroups.com
yes perfect hash map

