sharded rw mutex approaches?


Jason E. Aten

Mar 13, 2025, 11:29:18 PM
to golang-nuts
Is there a common way to do sharded read-write locks now?
I mean faster than sync.RWMutex.

I tried https://github.com/jonhoo/drwmutex, which is from quite a
while back...

...and it was pretty good! Reducing L1 cache misses (measured with perf) by a little over
1% was enough to make the benchmark code run 6x faster on a 48-core machine.

Just wondering if there are any other approaches in use today?

That one needs a regular check of which CPU your code
is running on to assign the read-write lock, and it's not clear to me how best to do that...
apparently the check is somewhat costly (slow), so you'd
rather not do it too often.

I suppose I could sample at runtime, try to determine
the distribution of thread-switching times,
and check after the 2nd-percentile time had passed... but I'd rather have
a more reliable method than such a heuristic.

Thanks for any ideas!  A 6x speedup is worth chasing to me...

Jason

atomly

Mar 13, 2025, 11:52:20 PM
to golang-nuts

Have you read the original thread that spawned this? You might find it pretty informative if not:

Robert Engels

Mar 14, 2025, 12:29:46 AM
to atomly, golang-nuts
I think it is easier to just hash and shard the data set the lock is protecting - i.e. a lock per shard.
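A minimal sketch of this lock-per-shard idea (my own illustration, not Robert's code; the shard count and the FNV hash are arbitrary choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16 // power of two, so the modulo is a cheap mask

// shard pairs one lock with the piece of the data set it protects.
type shard struct {
	mu sync.RWMutex
	m  map[string]int
}

// ShardedMap spreads keys across shards, so readers and writers
// touching different shards never contend on the same lock.
type ShardedMap struct {
	shards [numShards]shard
}

func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]int)
	}
	return sm
}

// shardFor hashes the key and masks it down to a shard index.
func (sm *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &sm.shards[h.Sum32()&(numShards-1)]
}

func (sm *ShardedMap) Store(key string, v int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	s.m[key] = v
	s.mu.Unlock()
}

func (sm *ShardedMap) Load(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	v, ok := s.m[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	sm := NewShardedMap()
	sm.Store("a", 1)
	v, ok := sm.Load("a")
	fmt.Println(v, ok) // 1 true
}
```

With a power-of-two shard count the hash can be masked instead of divided; the win over a single sync.RWMutex is that unrelated keys no longer bounce the same lock's cache line between cores.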



Jason E. Aten

Mar 14, 2025, 1:17:15 AM
to golang-nuts
oh nice. I like the hashing idea to pick a shard lock out of an array... that way
I don't have to stick locks on, and bloat, every node in the btree. I can
just take the hash value modulo the size of a fixed set of locks... Thanks Robert.

p.s. awl, thanks, yes... saw that. thank you.

Robert Engels

Mar 14, 2025, 1:32:08 AM
to Jason E. Aten, golang-nuts
I know some people are put off by stuff like this, but reading the Java JDK concurrent package provides a wealth of information - it is well documented, and almost all of it consists of reference implementations of academic papers. In this case, the Java concurrent hash map would be an applicable design.

You can see some of this - not nearly as good or complete - in my github.com/robaho/go-concurrency-test


Jason E. Aten

Mar 14, 2025, 2:09:56 AM
to golang-nuts
I do keep seeing references to Java concurrent stuff people are porting to Go. I have to check it out.

I need an ordered k/v store (find the next key greater-than)... and I was trying to see how
concurrency could be added to things like https://github.com/google/btree, which, it
turns out, at least in single-goroutine form, soundly wipes the floor with red-black
trees and radix trees. It's even way faster than the built-in Go map, while giving ordered key lookup,
not just hash-table operations. The only trade-off is that it is suuuuper slow for deletes.
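For context, the "next key greater-than" query is exactly what a hash map cannot answer. Over a sorted slice it is just sort.Search; a B-tree answers the same query in O(log n) without the O(n) insertion cost (a stdlib sketch of the query itself, not of google/btree):

```go
package main

import (
	"fmt"
	"sort"
)

// nextGreater returns the smallest key strictly greater than k --
// the core "ordered k/v" operation. keys must already be sorted.
func nextGreater(keys []string, k string) (string, bool) {
	i := sort.Search(len(keys), func(i int) bool { return keys[i] > k })
	if i == len(keys) {
		return "", false // k is >= every key; no successor
	}
	return keys[i], true
}

func main() {
	keys := []string{"apple", "banana", "cherry"}
	next, ok := nextGreater(keys, "banana")
	fmt.Println(next, ok) // cherry true
}
```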

Jason E. Aten

Mar 14, 2025, 2:19:38 AM
to golang-nuts
(If that seems surprising, the reason is mentioned in the sibling C++ library announcement:
"performance in these cases is effectively governed by the number of cache misses, not the number of key-compare operations" -- https://opensource.googleblog.com/2013/01/c-containers-that-save-memory-and-time.html
... so the game turns into how few pointers you can chase. Turns out in-memory B-trees are great for this.)

Robert Engels

Mar 14, 2025, 7:49:36 AM
to Jason E. Aten, golang-nuts
You can look at things like concurrent skip lists and, for large data sets with persistence, log-structured merge (LSM) trees.


Jason E. Aten

Mar 14, 2025, 12:48:59 PM
to golang-nuts
An update... I discovered that the CPUID instruction, despite its name, is
exactly the wrong way to get the CPU ID. It is for reading features, not processor identity,
and it completely serializes everything, flushes your CPU pipeline, does a memory fence, and makes rude
noises to your grandma's face (not sure actually about that last one...).

It takes something like 150 nsec.

What we actually want, to get a CPU ID without stomping on the cache coherency protocol's toes,
is RDPID (1-2 nsec) or RDTSCP (15-20 nsec), with the first, faster
RDPID being preferred if it is available...

If the LLM is to be believed...
"RDPID (Read Processor ID) was introduced with:
AMD: Zen 2 architecture in 2019 (starting with Ryzen 3000 series)
Intel: Ice Lake architecture in 2019 (10th gen Core processors)"

Turns out, I can get those to work, on Linux, on a 2019 Ryzen 3960X chip.

But on a 2020 Intel i7-1068NG7 Mac running macOS? Nope.

Apparently (according to my unreliable LLM source...) Apple would have to help me out here by 
actually setting up the IA32_TSC_AUX MSR... more commentary from the LLM:

"However, I notice something interesting about your earlier message - you mentioned having an Ice Lake i7-1068NG7, which should support RDPID according to Intel's documentation. The fact that we're seeing zeros from both RDTSCP and RDPID on your machine strongly suggests that macOS isn't initializing the IA32_TSC_AUX MSR (0xC0000103) with CPU IDs, rather than any instruction restriction.
This makes sense from Apple's perspective since they were already planning their transition to Apple Silicon around that time (announced in 2020), so they might not have added support for setting up this newer x86 feature in macOS."

A very snarky AI, that.

Does anyone have a clever way to get the fast RDPID to work on Darwin?

Thanks!
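Not an answer to the Darwin question, but a portable fallback sketch (my own, not taken from drwmutex): keep the drwmutex shape -- each reader locks only its own slot, a writer locks all of them -- but hand out reader slots with an atomic counter instead of asking which CPU you are on:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// DRWMutex is a drwmutex-style sharded RW lock. Readers lock a
// single slot; a writer must acquire every slot. Slots are handed
// out round-robin by an atomic counter, which is portable, at the
// cost of losing the per-CPU locality a real CPU id would give.
type DRWMutex struct {
	slots [64]struct {
		mu sync.RWMutex
		_  [64]byte // crude padding to keep slots on separate cache lines
	}
	next atomic.Uint64
}

// Slot returns a slot index; call once per long-lived goroutine
// and reuse the result for subsequent RLock/RUnlock pairs.
func (d *DRWMutex) Slot() int {
	return int(d.next.Add(1) % uint64(len(d.slots)))
}

func (d *DRWMutex) RLock(slot int)   { d.slots[slot].mu.RLock() }
func (d *DRWMutex) RUnlock(slot int) { d.slots[slot].mu.RUnlock() }

// Lock acquires every slot in a fixed order (so concurrent
// writers cannot deadlock against each other), excluding readers.
func (d *DRWMutex) Lock() {
	for i := range d.slots {
		d.slots[i].mu.Lock()
	}
}

func (d *DRWMutex) Unlock() {
	for i := range d.slots {
		d.slots[i].mu.Unlock()
	}
}

func main() {
	var d DRWMutex
	s := d.Slot()
	d.RLock(s)
	fmt.Println("reader holds slot", s)
	d.RUnlock(s)
	d.Lock()
	d.Unlock()
	fmt.Println("writer acquired all slots")
}
```

This loses the cache locality that RDPID buys (two readers on one core may use different slots), but it keeps readers spread across cache lines, needs no OS or ISA support, and behaves identically on Darwin, Linux, and Windows.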