sharded rw mutex approaches?


Jason E. Aten

Mar 13, 2025, 11:29:18 PM
to golang-nuts
Is there a common way to do sharded read-write locks now?
I mean faster than sync.RWMutex.

I tried https://github.com/jonhoo/drwmutex, which is from quite a
while back...

...and it was pretty good! Reducing L1 cache misses (measured with perf) by a little over
1% was enough to make the benchmark code run 6x faster on a 48-core machine.

Just wondering if there are any other approaches in use today?

That one needs a regular check of which CPU your code
is running on to assign the read-write lock, and it's not clear to me how best to do that...
apparently the check is somewhat costly (slow), so you'd
rather not do it too often.

I suppose I could sample at runtime, try to determine
the distribution of thread-switching times,
and check after the 2nd-percentile time had passed... but I'd rather have
a more reliable method than such a heuristic.

Thanks for any ideas!  A 6x speedup is worth chasing to me...

Jason

atomly

Mar 13, 2025, 11:52:20 PM
to golang-nuts

Have you read the original thread that spawned this? You might find it pretty informative if not:

Robert Engels

Mar 14, 2025, 12:29:46 AM
to atomly, golang-nuts
I think it is easier to just hash and shard the data set the lock is protecting - i.e. a lock per shard.
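A minimal sketch of this lock-per-shard idea (my own illustration, not Robert's code; the shard count and the FNV hash are arbitrary choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16 // power of two, so the modulo is a cheap mask

// shard pairs one lock with the piece of the data set it protects.
type shard struct {
	mu sync.RWMutex
	m  map[string]int
}

// ShardedMap spreads keys across shards, so readers and writers
// touching different shards never contend on the same lock.
type ShardedMap struct {
	shards [numShards]shard
}

func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]int)
	}
	return sm
}

// shardFor hashes the key and masks it down to a shard index.
func (sm *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &sm.shards[h.Sum32()&(numShards-1)]
}

func (sm *ShardedMap) Store(key string, v int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	s.m[key] = v
	s.mu.Unlock()
}

func (sm *ShardedMap) Load(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	v, ok := s.m[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	sm := NewShardedMap()
	sm.Store("a", 1)
	v, ok := sm.Load("a")
	fmt.Println(v, ok) // 1 true
}
```

With a power-of-two shard count the hash can be masked instead of divided; the win over a single sync.RWMutex is that unrelated keys no longer bounce the same lock's cache line between cores.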



Jason E. Aten

Mar 14, 2025, 1:17:15 AM
to golang-nuts
oh nice. I like the hashing idea to pick a shard lock out of an array... that way
I don't have to stick locks on, and bloat, every node in the btree. I can
just take the hash value modulo the size of a fixed set of locks... Thanks Robert.

p.s. awl, thanks, yes... saw that. thank you.

Robert Engels

Mar 14, 2025, 1:32:08 AM
to Jason E. Aten, golang-nuts
I know some people are put off by stuff like this, but reading the Java JDK concurrent package provides a wealth of information - it is well documented, and almost all of it consists of reference implementations of academic papers. In this case, the Java concurrent hash map would be an applicable design.

You can see some of this - not nearly as good or complete - in my github.com/robaho/go-concurrency-test


Jason E. Aten

Mar 14, 2025, 2:09:56 AM
to golang-nuts
I do keep seeing references to Java concurrent stuff people are porting to Go. I have to check it out.

I need an ordered k/v store (find the next key greater-than)... and I was trying to see how
concurrency could be added to things like https://github.com/google/btree, which, it
turns out, at least in single-goroutine form, soundly wipes the floor with red-black
trees and radix trees. It's even way faster than the built-in Go map, while giving ordered key lookup,
not just hash-table operations. The only trade-off is that it is suuuuper slow for deletes.
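For context, the "next key greater-than" query is exactly what a hash map cannot answer. Over a sorted slice it is just sort.Search; a B-tree answers the same query in O(log n) without the O(n) insertion cost (a stdlib sketch of the query itself, not of google/btree):

```go
package main

import (
	"fmt"
	"sort"
)

// nextGreater returns the smallest key strictly greater than k --
// the core "ordered k/v" operation. keys must already be sorted.
func nextGreater(keys []string, k string) (string, bool) {
	i := sort.Search(len(keys), func(i int) bool { return keys[i] > k })
	if i == len(keys) {
		return "", false // k is >= every key; no successor
	}
	return keys[i], true
}

func main() {
	keys := []string{"apple", "banana", "cherry"}
	next, ok := nextGreater(keys, "banana")
	fmt.Println(next, ok) // cherry true
}
```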

Jason E. Aten

Mar 14, 2025, 2:19:38 AM
to golang-nuts
(If that seems surprising, the reason is mentioned in the sibling C++ library announcement:
"performance in these cases is effectively governed by the number of cache misses, not the number of key-compare operations" -- https://opensource.googleblog.com/2013/01/c-containers-that-save-memory-and-time.html
... so the game turns into how few pointers you can chase. Turns out in-memory B-trees are great for this.)

Robert Engels

Mar 14, 2025, 7:49:36 AM
to Jason E. Aten, golang-nuts
You can look at things like concurrent skip lists and, for large data sets with persistence, log-structured merge (LSM) trees.


Jason E. Aten

Mar 14, 2025, 12:48:59 PM
to golang-nuts
An update... I discovered that the CPUID instruction, despite its name, is
exactly the wrong way to get the CPU ID. It is for reading features, not processor identity,
and it completely serializes everything, flushes your CPU pipeline, does a memory fence, and makes rude
noises to your grandma's face (not sure actually about that last one...).

It takes something like 150 nsec.

What we actually want, to get a CPU ID without stomping on the cache coherency protocol's toes,
is RDPID (1-2 nsec) or RDTSCP (15-20 nsec), with the first, faster
RDPID being preferred if it is available...

If the LLM is to be believed...
"RDPID (Read Processor ID) was introduced with:
AMD: Zen 2 architecture in 2019 (starting with Ryzen 3000 series)
Intel: Ice Lake architecture in 2019 (10th gen Core processors)"

Turns out, I can get those to work, on Linux, on a 2019 Ryzen 3960X chip.

But on a 2020 Intel i7-1068NG7 Mac running macOS? Nope.

Apparently (according to my unreliable LLM source...) Apple would have to help me out here by 
actually setting up the IA32_TSC_AUX MSR... more commentary from the LLM:

"However, I notice something interesting about your earlier message - you mentioned having an Ice Lake i7-1068NG7, which should support RDPID according to Intel's documentation. The fact that we're seeing zeros from both RDTSCP and RDPID on your machine strongly suggests that macOS isn't initializing the IA32_TSC_AUX MSR (0xC0000103) with CPU IDs, rather than any instruction restriction.
This makes sense from Apple's perspective since they were already planning their transition to Apple Silicon around that time (announced in 2020), so they might not have added support for setting up this newer x86 feature in macOS."

A very snarky AI, that.

Does anyone have a clever way to get the fast RDPID to work on Darwin?

Thanks!
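Not an answer to the Darwin question, but a portable fallback sketch (my own, not taken from drwmutex): keep the drwmutex shape -- each reader locks only its own slot, a writer locks all of them -- but hand out reader slots with an atomic counter instead of asking which CPU you are on:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// DRWMutex is a drwmutex-style sharded RW lock. Readers lock a
// single slot; a writer must acquire every slot. Slots are handed
// out round-robin by an atomic counter, which is portable, at the
// cost of losing the per-CPU locality a real CPU id would give.
type DRWMutex struct {
	slots [64]struct {
		mu sync.RWMutex
		_  [64]byte // crude padding to keep slots on separate cache lines
	}
	next atomic.Uint64
}

// Slot returns a slot index; call once per long-lived goroutine
// and reuse the result for subsequent RLock/RUnlock pairs.
func (d *DRWMutex) Slot() int {
	return int(d.next.Add(1) % uint64(len(d.slots)))
}

func (d *DRWMutex) RLock(slot int)   { d.slots[slot].mu.RLock() }
func (d *DRWMutex) RUnlock(slot int) { d.slots[slot].mu.RUnlock() }

// Lock acquires every slot in a fixed order (so concurrent
// writers cannot deadlock against each other), excluding readers.
func (d *DRWMutex) Lock() {
	for i := range d.slots {
		d.slots[i].mu.Lock()
	}
}

func (d *DRWMutex) Unlock() {
	for i := range d.slots {
		d.slots[i].mu.Unlock()
	}
}

func main() {
	var d DRWMutex
	s := d.Slot()
	d.RLock(s)
	fmt.Println("reader holds slot", s)
	d.RUnlock(s)
	d.Lock()
	d.Unlock()
	fmt.Println("writer acquired all slots")
}
```

This loses the cache locality that RDPID buys (two readers on one core may use different slots), but it keeps readers spread across cache lines, needs no OS or ISA support, and behaves identically on Darwin, Linux, and Windows.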