Thank you very much for your response. I'll summarize here the main differences with Go's map.
I very easily got faster than Go's map with integers and the like by dropping the swiss groups and reducing the group size to fit a cache line. As key comparison is very fast (no secondary memory access), a simple sequential key comparison is fast. The speed gain might also be due to the hash function, as I rewrote the xxh3 hasher for uint64 values. I didn't spend much time exploring this further as it wasn't my use case.
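As a minimal sketch of that layout (the names and the 8-slot size are my illustration, not the actual code):

```go
package sketch

// One group of keys sized to a 64-byte cache line: 8 uint64 keys.
// Values live in a parallel array so the key scan touches a single line.
type group struct {
	keys [8]uint64 // exactly one cache line on most CPUs
	live uint8     // bitmap of occupied slots
}

// lookup compares the keys sequentially. For integer keys a comparison
// is a single register compare with no secondary memory access, so a
// short linear scan is fast without any control-byte matching.
func (g *group) lookup(key uint64, vals *[8]uint64) (uint64, bool) {
	for i := 0; i < 8; i++ {
		if g.live&(1<<i) != 0 && g.keys[i] == key {
			return vals[i], true
		}
	}
	return 0, false
}
```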
String keys definitely benefit from top hashes (swiss table groups). I use 8-bit top hashes instead of 7, which reduces the number of false positives, and I use a perfect match, which also reduces false positives on arm64; amd64 gets a perfect match anyway as it uses the SIMD instructions.
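For illustration, here is an exact SWAR match over eight 8-bit top hashes, the kind of perfect match that is possible on arm64 without SIMD (the names are mine, not the actual code):

```go
package sketch

const lo7 = 0x7f7f7f7f7f7f7f7f

// matchTag returns a word with 0x80 set in every byte of ctrl that is
// equal to tag, where ctrl packs the eight 8-bit top hashes of a group.
// The return expression is an exact zero-byte detector, so unlike the
// cheaper (x-lo)&^x&hi approximation it produces no false positives.
func matchTag(ctrl uint64, tag uint8) uint64 {
	x := ctrl ^ (0x0101010101010101 * uint64(tag)) // matching byte -> 0x00
	return ^(((x & lo7) + lo7) | x | lo7)          // 0x80 where byte == 0
}
```

With 8-bit tags a mismatching slot passes this filter with probability 1/256 instead of 1/128 with 7-bit tags, which is where the reduction in false positives comes from.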
Another difference is that Go's map uses quadratic probing and my map doesn't. The popular wisdom is that probing is faster because the probability that the data is in cache is slightly higher and the table fills better. Unfortunately it increases the number of required tests, and false positives increase the penalty. The numbers I showed are for a different kind of table, composed of three arrays T1, T2 and T3. T1 is a conventional table of 8-item swiss groups. When a group is full, the item is stored in the overflow hash table T2. When the group in T2 is full, the item is appended to T3, which is treated as a sequence of groups; I do a linear search in this table containing the overflows of overflows (see the sketch below). I also implemented a table using quadratic probing, and it was faster than Go's map on arm64. I assume it is due to the use of 8-bit top hashes reducing false positives, but I'm not sure.
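Here is a sketch of that three-level lookup, with simplified groups and names of my own choosing:

```go
package sketch

const groupSize = 8

type pair struct{ key, val uint64 }

// slot is a simplified group: the real ones also carry the top hashes.
type slot struct {
	used  int
	items [groupSize]pair
}

func (s *slot) find(key uint64) (uint64, bool) {
	for i := 0; i < s.used; i++ {
		if s.items[i].key == key {
			return s.items[i].val, true
		}
	}
	return 0, false
}

// table3 holds the three arrays: T1 is the main group table, T2 the
// overflow table, and T3 the linearly scanned overflows of overflows.
type table3 struct {
	t1 []slot // power-of-two length
	t2 []slot // much smaller than t1 (16:1 performed best, see below)
	t3 []pair
}

// get cascades through the three levels; t2Index, sketched after the
// next paragraph, derives the T2 index from the T1 index.
func (t *table3) get(key, h uint64) (uint64, bool) {
	i1 := h & uint64(len(t.t1)-1)
	if v, ok := t.t1[i1].find(key); ok {
		return v, true
	}
	if v, ok := t.t2[t2Index(i1, len(t.t2))].find(key); ok {
		return v, true
	}
	for _, e := range t.t3 { // rare path
		if e.key == key {
			return e.val, true
		}
	}
	return 0, false
}
```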
I tested independent hashes for T1 and T2, which then require the use of tombstones, and related hashes. By related hash I mean that the index in T2 is, for instance, the Fibonacci hash of the index in T1. The best filling rates of T2 were obtained with a random permutation mapping. The size ratio between T1 and T2 should in theory be 2:1, but the best performance was with 16:1; I assume this is because T2's data stays hotter (in cache) than it would with a bigger table. The benefit of such a deterministic relation between the indexes in T1 and T2 is that we don't need tombstones: we can use a swap-pop strategy. In summary, the idea was to minimize the number of probes required to find a key.
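Continuing the sketch, the Fibonacci mapping and the swap-pop delete look like this (the random permutation that filled T2 best is not shown):

```go
package sketch

import "math/bits"

const fib = 11400714819323198485 // ≈ 2^64 / golden ratio

// t2Index derives the T2 index from the T1 index by Fibonacci hashing:
// multiply and keep the top bits. t2Len must be a power of two.
func t2Index(i1 uint64, t2Len int) uint64 {
	shift := 64 - uint(bits.Len64(uint64(t2Len)-1))
	return (i1 * fib) >> shift
}

// Since every key has exactly one candidate group in T1 and one in T2,
// a delete can never break a probe chain: instead of leaving a
// tombstone, we swap the group's last item into the hole and shrink
// the count.
func swapPop(s *slot, i int) {
	s.used--
	s.items[i] = s.items[s.used]
}
```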
The table also uses an extensible hash, for which my code is simpler than Go's map, though no faster. But this directory indirection introduces a significant performance penalty: in a later benchmark, after the one I published here, I saw a significant performance increase by removing the directory. When the hash function is good, most tables have the same depth anyway.
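For reference, the directory indirection is just this (sketch):

```go
package sketch

// The top globalDepth bits of the hash select a table; several
// directory entries alias the same table while its local depth is
// smaller than the global depth. Removing this one indirection is
// what produced the later speedup.
type directory struct {
	globalDepth uint
	tables      []*table3 // len(tables) == 1 << globalDepth
}

func (d *directory) tableFor(h uint64) *table3 {
	return d.tables[h>>(64-d.globalDepth)]
}
```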
Another optimization compared to Go's map is that my tables in the directory don't grow. Using fixed-size arrays gives some performance advantage, as the bounds and the hash mask are constants. There is a small space-usage penalty, but when a table is split at 80% load, for instance, it becomes 40% full and is then not that far from needing to grow again, so I considered implementing growth a waste of effort. For small numbers of items fitting in one table, it may be worth using a growable table, which I didn't bother to implement.
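A sketch of why the fixed size helps (the constant is hypothetical):

```go
package sketch

const tableGroups = 1 << 12 // fixed table size: the mask is a constant

type fixedTable struct {
	groups [tableGroups]slot
}

func (t *fixedTable) groupFor(h uint64) *slot {
	// With a constant length the compiler folds the mask, proves the
	// index in range, and elides the bounds check.
	return &t.groups[h&(tableGroups-1)]
}
```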
I also saw that function calls are expensive, so I used manual inlining for the Get method. It would be nice to avoid that, though, for code readability and maintenance.
I tried assembly on arm64, where I could experiment with SIMD instructions, but due to the assembly function call overhead it was slower than pure Go. Passing arguments by register wouldn't help; we would need assembly inlining, but I understand that this would open a can of worms and probably create more problems than it would solve.
If you are still interested and want some specific numbers, I would be glad to give them.
Here is the cache-miss benchmark that shows the benefit of reducing the number of probes:
```
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
                      │          sec/op          │    sec/op       vs base    │
Cache2Miss/_______1-8    4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
Cache2Miss/______10-8    5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
Cache2Miss/_____100-8    5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
Cache2Miss/____1000-8    6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
Cache2Miss/___10000-8    9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
Cache2Miss/__100000-8    14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
Cache2Miss/_1000000-8    29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
Cache2Miss/10000000-8    40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
geomean                  10.60n         17.80n         +67.98%
```
Here is the cache-hit benchmark comparing dirstr12 with flat8, which has no directory:
```
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
                     │        sec/op         │    sec/op      vs base   │
Cache2Hit/_______1-8    6.047n ± 0%    6.246n ± 5%   +3.28% (p=0.002 n=6)
Cache2Hit/______10-8    8.172n ± 0%    8.499n ± 0%   +4.00% (p=0.002 n=6)
Cache2Hit/_____100-8    7.893n ± 0%    8.256n ± 5%   +4.61% (p=0.002 n=6)
Cache2Hit/____1000-8    7.585n ± 6%    8.228n ± 2%   +8.48% (p=0.002 n=6)
Cache2Hit/___10000-8    9.191n ± 0%   10.375n ± 1%  +12.88% (p=0.002 n=6)
Cache2Hit/__100000-8    10.03n ± 4%    12.14n ± 1%  +21.15% (p=0.002 n=6)
Cache2Hit/_1000000-8    40.79n ± 1%    42.27n ± 4%   +3.64% (p=0.002 n=6)
Cache2Hit/10000000-8    47.29n ± 1%    56.49n ± 9%  +19.47% (p=0.002 n=6)
geomean                 12.31n         13.47n        +9.48%
```
I currently don't have access to an x86 computer. Thank you for your mail.