Thank you very much for your response. I'll summarize here the main differences with Go's map.
I very easily got faster than Go's map with integers and the like by dropping the swiss groups and reducing the group size to fit a cache line. As key comparison is very fast (no secondary memory access), a simple sequential key comparison is fast. The speed gain might also be due to the hash function, as I rewrote the xxh3 hasher for uint64 values. I didn't spend much time exploring this further as it wasn't my use case.
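As a minimal sketch of that layout (the names and the 8-slot size are my illustration, not the actual code):

```go
package sketch

// One group of keys sized to a 64-byte cache line: 8 uint64 keys.
// Values live in a parallel array so the key scan touches a single line.
type group struct {
	keys [8]uint64 // exactly one cache line on most CPUs
	live uint8     // bitmap of occupied slots
}

// lookup compares the keys sequentially. For integer keys a comparison
// is a single register compare with no secondary memory access, so a
// short linear scan is fast without any control-byte matching.
func (g *group) lookup(key uint64, vals *[8]uint64) (uint64, bool) {
	for i := 0; i < 8; i++ {
		if g.live&(1<<i) != 0 && g.keys[i] == key {
			return vals[i], true
		}
	}
	return 0, false
}
```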
String keys definitely benefit from top hashes (swiss table groups). I use 8-bit top hashes instead of 7, which reduces the number of false positives, and I use a perfect match, which also reduces false positives on arm64; amd64 gets a perfect match anyway as it uses the SIMD instructions.
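For illustration, here is an exact SWAR match over eight 8-bit top hashes, the kind of perfect match that is possible on arm64 without SIMD (the names are mine, not the actual code):

```go
package sketch

const lo7 = 0x7f7f7f7f7f7f7f7f

// matchTag returns a word with 0x80 set in every byte of ctrl that is
// equal to tag, where ctrl packs the eight 8-bit top hashes of a group.
// The return expression is an exact zero-byte detector, so unlike the
// cheaper (x-lo)&^x&hi approximation it produces no false positives.
func matchTag(ctrl uint64, tag uint8) uint64 {
	x := ctrl ^ (0x0101010101010101 * uint64(tag)) // matching byte -> 0x00
	return ^(((x & lo7) + lo7) | x | lo7)          // 0x80 where byte == 0
}
```

With 8-bit tags a mismatching slot passes this filter with probability 1/256 instead of 1/128 with 7-bit tags, which is where the reduction in false positives comes from.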
Another difference is that Go's map uses quadratic probing and my map doesn't. The popular wisdom is that probing is faster because the probability that the data is in cache is slightly higher and the table fills better. Unfortunately it increases the number of required tests, and false positives increase the penalty. The numbers I showed are for a different kind of table, composed of three arrays T1, T2 and T3. T1 is a conventional table of 8-item swiss groups. When a group is full, the item is stored in the overflow hash table T2. When the group in T2 is full, the item is appended to T3, which is treated as a sequence of groups; I do a linear search in this table containing the overflows of overflows (see the sketch below). I also implemented a table using quadratic probing, and it was faster than Go's map on arm64. I assume it is due to the use of 8-bit top hashes reducing false positives, but I'm not sure.
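Here is a sketch of that three-level lookup, with simplified groups and names of my own choosing:

```go
package sketch

const groupSize = 8

type pair struct{ key, val uint64 }

// slot is a simplified group: the real ones also carry the top hashes.
type slot struct {
	used  int
	items [groupSize]pair
}

func (s *slot) find(key uint64) (uint64, bool) {
	for i := 0; i < s.used; i++ {
		if s.items[i].key == key {
			return s.items[i].val, true
		}
	}
	return 0, false
}

// table3 holds the three arrays: T1 is the main group table, T2 the
// overflow table, and T3 the linearly scanned overflows of overflows.
type table3 struct {
	t1 []slot // power-of-two length
	t2 []slot // much smaller than t1 (16:1 performed best, see below)
	t3 []pair
}

// get cascades through the three levels; t2Index, sketched after the
// next paragraph, derives the T2 index from the T1 index.
func (t *table3) get(key, h uint64) (uint64, bool) {
	i1 := h & uint64(len(t.t1)-1)
	if v, ok := t.t1[i1].find(key); ok {
		return v, true
	}
	if v, ok := t.t2[t2Index(i1, len(t.t2))].find(key); ok {
		return v, true
	}
	for _, e := range t.t3 { // rare path
		if e.key == key {
			return e.val, true
		}
	}
	return 0, false
}
```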
I tested independent hashes for T1 and T2, which then require the use of tombstones, and related hashes. By related hash I mean that the index in T2 is, for instance, the Fibonacci hash of the index in T1. The best filling rates of T2 were obtained with a random permutation mapping. The size ratio between T1 and T2 should in theory be 2:1, but the best performance was with 16:1; I assume this is because T2's data stays hotter (in cache) than it would with a bigger table. The benefit of such a deterministic relation between the indexes in T1 and T2 is that we don't need tombstones: we can use a swap-pop strategy. In summary, the idea was to minimize the number of probes required to find a key.
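Continuing the sketch, the Fibonacci mapping and the swap-pop delete look like this (the random permutation that filled T2 best is not shown):

```go
package sketch

import "math/bits"

const fib = 11400714819323198485 // ≈ 2^64 / golden ratio

// t2Index derives the T2 index from the T1 index by Fibonacci hashing:
// multiply and keep the top bits. t2Len must be a power of two.
func t2Index(i1 uint64, t2Len int) uint64 {
	shift := 64 - uint(bits.Len64(uint64(t2Len)-1))
	return (i1 * fib) >> shift
}

// Since every key has exactly one candidate group in T1 and one in T2,
// a delete can never break a probe chain: instead of leaving a
// tombstone, we swap the group's last item into the hole and shrink
// the count.
func swapPop(s *slot, i int) {
	s.used--
	s.items[i] = s.items[s.used]
}
```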
The table also uses an extensible hash, for which my code is simpler than Go's map, though no faster. But this directory indirection introduces a significant performance penalty: in a later benchmark, after the one I published here, I saw a significant performance increase by removing the directory. When the hash function is good, most tables have the same depth anyway.
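For reference, the directory indirection is just this (sketch):

```go
package sketch

// The top globalDepth bits of the hash select a table; several
// directory entries alias the same table while its local depth is
// smaller than the global depth. Removing this one indirection is
// what produced the later speedup.
type directory struct {
	globalDepth uint
	tables      []*table3 // len(tables) == 1 << globalDepth
}

func (d *directory) tableFor(h uint64) *table3 {
	return d.tables[h>>(64-d.globalDepth)]
}
```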
Another optimization compared to Go's map is that my tables in the directory don't grow. Using fixed-size arrays gives some performance advantage, as the bounds and the hash mask are constants. There is a small space-usage penalty, but when a table is split at 80% load, for instance, it becomes 40% full and is then not that far from needing to grow again, so I considered implementing growth a waste of effort. For small numbers of items fitting in one table, it may be worth using a growable table, which I didn't bother to implement.
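A sketch of why the fixed size helps (the constant is hypothetical):

```go
package sketch

const tableGroups = 1 << 12 // fixed table size: the mask is a constant

type fixedTable struct {
	groups [tableGroups]slot
}

func (t *fixedTable) groupFor(h uint64) *slot {
	// With a constant length the compiler folds the mask, proves the
	// index in range, and elides the bounds check.
	return &t.groups[h&(tableGroups-1)]
}
```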
I also saw that function calls are expensive, so I used manual inlining for the Get method. It would be nice to avoid that, though, for code readability and maintenance.
I tried assembly on arm64, where I could experiment with SIMD instructions, but due to the assembly function call overhead it was slower than pure Go. Passing arguments by register wouldn't help; we would need assembly inlining, but I understand that this would open a can of worms and probably create more problems than it would solve.
If you are still interested and want some specific numbers, I would be glad to give them.
Here is the cache-miss benchmark that shows the benefit of reducing the number of probes:
```
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
                      │          sec/op          │    sec/op       vs base    │
Cache2Miss/_______1-8    4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
Cache2Miss/______10-8    5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
Cache2Miss/_____100-8    5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
Cache2Miss/____1000-8    6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
Cache2Miss/___10000-8    9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
Cache2Miss/__100000-8    14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
Cache2Miss/_1000000-8    29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
Cache2Miss/10000000-8    40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
geomean                  10.60n         17.80n         +67.98%
```
Here is the cache-hit benchmark comparing dirstr12 with flat8, which has no directory:
```
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
                     │        sec/op         │    sec/op      vs base   │
Cache2Hit/_______1-8    6.047n ± 0%    6.246n ± 5%   +3.28% (p=0.002 n=6)
Cache2Hit/______10-8    8.172n ± 0%    8.499n ± 0%   +4.00% (p=0.002 n=6)
Cache2Hit/_____100-8    7.893n ± 0%    8.256n ± 5%   +4.61% (p=0.002 n=6)
Cache2Hit/____1000-8    7.585n ± 6%    8.228n ± 2%   +8.48% (p=0.002 n=6)
Cache2Hit/___10000-8    9.191n ± 0%   10.375n ± 1%  +12.88% (p=0.002 n=6)
Cache2Hit/__100000-8    10.03n ± 4%    12.14n ± 1%  +21.15% (p=0.002 n=6)
Cache2Hit/_1000000-8    40.79n ± 1%    42.27n ± 4%   +3.64% (p=0.002 n=6)
Cache2Hit/10000000-8    47.29n ± 1%    56.49n ± 9%  +19.47% (p=0.002 n=6)
geomean                 12.31n         13.47n        +9.48%
```
I currently don't have access to an x86 computer. Thank you for your mail.