Benchmarking spinlocks


Elazar Leibovich

Nov 27, 2014, 4:28:44 PM
to mechanica...@googlegroups.com
(x-posted from the linux-il mailing list, as I think this is a more appropriate forum)
I'm trying to write a benchmark for spinlocks.

Two previous benchmark methodologies I found are [0] and [1]. If you know of any other sources, I'll be happy to read about them.

A few preliminary questions that apply to any micro-benchmark:

1) Are there any best practices to measure memory bus traffic?
2) Are there any best practices to measure cache footprint? (see the perf-counter sketch after this list for what I have in mind)
3) Are there any best practices to eliminate outliers caused by, e.g., interrupts or OS context switches?
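
For (1) and (2), the approach I have in mind is to read the hardware cache-miss counter around the measured section with perf_event_open(2). A minimal sketch (error handling omitted; counter availability depends on the CPU/PMU, and perf_event_paranoid may need to be lowered):

    /* Sketch: count hardware cache misses around a measured section
     * using perf_event_open(2). Assumes Linux and a PMU that exposes
     * PERF_COUNT_HW_CACHE_MISSES. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_cache_miss_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        /* measure the calling thread, on any CPU */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd = open_cache_miss_counter();
        uint64_t misses = 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... lock acquire/release under test goes here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &misses, sizeof(misses));
        printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }

Running the same harness with a NOP lock gives the baseline to subtract, as described in step 2 below.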

Design goals:
1) The benchmark should run on a regular OS. Requiring special-purpose hardware would increase accuracy, but would make running the benchmark much more difficult.
2) Running with an OS might distort the benchmark with context switches or interrupts. To compensate, repeat the test multiple times and remove outliers.
3) It should produce a few simple outputs, so that a lock designer can use the results as a proxy for how to optimize the lock for a given use case.
4) The initial design assumes x86 for brevity, but porting to other architectures should be easy.

Implementation:
Input parameters:
LENGTH - the estimated number of cycles the lock will be held.
N - the number of CPUs contending for the lock.
1) Count cycles (e.g., with rdtsc) for an uncontended acquire and release - that would be the first output (a sketch follows this list).
2) Measure the memory footprint of an uncontended acquire and release - that would be the second output. This can be done by invalidating the cache, running the acquire/release, and reading cache misses from the performance counters. To isolate the lock's contribution, do the same with a NOP lock and subtract that overhead.
3) Pin N threads to N different logical CPUs. We can afford to ignore whether logical CPUs share a physical core, since in most use cases you cannot ensure two threads run on different hardware CPUs anyway. Each CPU would hold the lock for roughly LENGTH cycles, by running instructions that do NOT use the memory bus (see the second sketch after this list).
4) Measure the time (e.g., with rdtsc) from the moment the first CPU acquires the lock to the moment the last one does.
Do that by acquiring the lock, keeping the timestamp in a register, releasing the lock, and then, after a ~1ms pause, writing the value to a known memory location, so that storing the timestamp doesn't interfere with the test. Collect the results after ~10ms, when all CPUs are definitely done with the benchmark.
The problem here is that the timestamps of different CPUs might not be synchronized.
Possible solutions are:
4.a. Use the HPET instead; then again, the HPET access overhead might distort the test.
4.b On newer Intel hardware, I understand there's an invariant TSC, which could be a solution. Not sure how portable that would be.
4.c Assuming the time drift is small, averaging over multiple experiments might give a small enough standard deviation.
I'm not sure how problematic this would be in practice.
5?) I'm not sure it's possible, or meaningful, but one could measure the memory bus utilization during lock contention. Maybe by counting the total cache misses across all CPUs while they contend for the lock.
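
A sketch of step 1, the uncontended cycle count, assuming x86 and GCC/Clang builtins. my_lock_acquire()/my_lock_release() are stand-ins (a trivial test-and-set lock) for the implementation under test, and the lfence/rdtscp fencing is a simplification:

    /* Sketch: cycles for one uncontended acquire+release, x86 only.
     * my_lock_acquire()/my_lock_release() wrap a trivial test-and-set
     * lock as a stand-in for the implementation under test. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static volatile int lock_word;

    static inline void my_lock_acquire(void)
    {
        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE))
            _mm_pause();
    }

    static inline void my_lock_release(void)
    {
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    }

    static inline uint64_t cycles_begin(void)
    {
        _mm_lfence();                 /* keep earlier work out of the window */
        return __rdtsc();
    }

    static inline uint64_t cycles_end(void)
    {
        unsigned aux;
        uint64_t t = __rdtscp(&aux);  /* orders against preceding instructions */
        _mm_lfence();
        return t;
    }

    int main(void)
    {
        enum { ITERS = 1000000 };
        uint64_t best = UINT64_MAX;

        for (int i = 0; i < ITERS; i++) {
            uint64_t t0 = cycles_begin();
            my_lock_acquire();
            my_lock_release();
            uint64_t t1 = cycles_end();
            if (t1 - t0 < best)       /* min is one crude way to drop outliers */
                best = t1 - t0;
        }
        printf("uncontended acquire+release: %llu cycles (min of %d runs)\n",
               (unsigned long long)best, (int)ITERS);
        return 0;
    }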
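
And a sketch of steps 3 and 4, again with a trivial test-and-set stand-in lock: pin one thread per CPU, hold the lock for roughly LENGTH cycles without touching the memory bus, stamp the TSC at acquire time, and publish the stamps only well after the contended phase. This assumes an invariant TSC and at least N CPUs:

    /* Sketch: N threads pinned to distinct CPUs contend once for the lock.
     * Each records the TSC at the moment it acquired the lock and publishes
     * it only well after the contended phase; main() then reports the spread
     * between the first and last acquisition. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    #define N       4                 /* contending CPUs */
    #define LENGTH  200               /* approx. cycles the lock is held */

    static volatile int lock_word;
    static volatile int go;
    static uint64_t acquire_tsc[N];   /* written only after the contended phase */

    static void delay_cycles(uint64_t n)   /* burns time without memory traffic */
    {
        uint64_t start = __rdtsc();
        while (__rdtsc() - start < n)
            _mm_pause();
    }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)id, &set);            /* pin to logical CPU `id` */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        while (!go)                        /* crude start barrier */
            _mm_pause();

        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE))
            _mm_pause();
        uint64_t t = __rdtsc();            /* stamp stays in a register for now */
        delay_cycles(LENGTH);              /* simulated critical section */
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);

        delay_cycles(3000000);             /* ~1ms at 3GHz: let everyone finish */
        acquire_tsc[id] = t;               /* publish after the measured window */
        return NULL;
    }

    int main(void)
    {
        pthread_t th[N];
        for (long i = 0; i < N; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        go = 1;
        for (int i = 0; i < N; i++)
            pthread_join(th[i], NULL);

        uint64_t first = acquire_tsc[0], last = acquire_tsc[0];
        for (int i = 1; i < N; i++) {
            if (acquire_tsc[i] < first) first = acquire_tsc[i];
            if (acquire_tsc[i] > last)  last  = acquire_tsc[i];
        }
        printf("first-to-last acquire spread: %llu cycles\n",
               (unsigned long long)(last - first));
        return 0;
    }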

You'll end up with four actionable parameters:
CPU usage for the uncontended case.
Memory footprint for the uncontended case.
CPU overhead for the contended case.
Memory bus usage (or a close proxy of it) in the contended case.

Does that sound like a good idea? How can I improve it?


Gil Tene

Nov 28, 2014, 11:04:41 AM
to mechanica...@googlegroups.com
A couple of notes:

- Note that the typical measurement techniques (and the typical use of spin locks) are done inside an OS kernel, and not in user mode. Inside an OS, spin locks will usually prevent context switching, and will often disable interrupts. Both are done for a reason (things go very bad very fast in the real world when you don't).

- Assuming a user-mode test (per your description): unless you are looking at a use case with something like isolcpus, a single participating thread per isolated core, and all interrupts redirected to not hit the participating cores, I'd include (and bear the cost of) a yield() call in the loop in all measurements, because no practical use case will remain robust without it. Using a non-yielding spin loop without such a setup will come with predictably terrible outliers (as in terrible enough to not use a spin lock). Short micro-benchmarks that "show" otherwise simply haven't run long enough to encounter the common ~1msec+ lock-owner scheduling blips that make the dominoes start to fall over (often resulting in occasional 100s of msec of acquire time when a moderate number of threads are involved).

- Look at using the x86 PAUSE instruction in the spin loop. Expect to see improved acquire latencies with it in there (a sketch follows these notes).

- In latency tests for low throughput situations (i.e. when spin time is expected to be significant), consider/compare the effect of having your spin loop actively (and incrementally, one item per loop) access all other variables used outside of the spin loop. Choosing what to touch in the spin is a bit of "black magic", and is done to keep the active state in the L1/L2/L3 caches by recovering it immediately when some neighboring core's L3 activity ends up evicting your "idle spinning" core's useful cached state.
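
To make the yield() and PAUSE points concrete, a minimal user-mode acquire loop might look like the sketch below. The test-and-test-and-set structure and the 1000-spin threshold before yielding are arbitrary illustration choices, not a recommended implementation:

    /* Sketch: test-and-test-and-set acquire with PAUSE in the spin and
     * a yield() fallback so a preempted lock holder can run. The
     * threshold of 1000 spins before yielding is arbitrary. */
    #include <sched.h>
    #include <immintrin.h>

    static volatile int lock_word;

    static void spin_acquire(void)
    {
        int spins = 0;
        for (;;) {
            if (!lock_word &&
                !__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE))
                return;
            _mm_pause();             /* be polite to the sibling hyperthread */
            if (++spins > 1000) {
                sched_yield();       /* let a preempted owner run */
                spins = 0;
            }
        }
    }

    static void spin_release(void)
    {
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    }

The plain read before the atomic exchange keeps waiters spinning on a shared cache line instead of hammering it with writes.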


There is a good reason for the lack of user mode spinlock and spinunlock primitives in common libraries. They are not robust.

Elazar Leibovich

Nov 29, 2014, 1:31:10 PM
to mechanica...@googlegroups.com
Thanks!

I think I didn't explain myself correctly.

I want to use spinlocks in a kernel environment.
I want to benchmark them in userspace.

The reason I want to benchmark them in userspace is to make running the tests easier. If you have to reboot the computer to run my benchmark, no one will run it. Userspace benchmarking would make the spinlock development cycle much faster.

To remedy the case where a lock holder is preempted during the benchmark, I can recognize those outliers and ignore them: assume a certain distribution for the spinlock performance, and under that assumption detect and eliminate the outliers.

Maybe it's not really possible, but a man can dream.

> - In latency tests for low throughput situations (i.e. when spin time is 
> expected to be significant), consider/compare the effect of having 
> your spin loop actively (and incrementally, one item per loop) access 
> all other variables used outside of the spin loop. Choosing what to 
> touch in the spin is a bit of "black magic", and is done to keep the 
> active state in the L1/L2/L3 caches by recovering it immediately 
> when some neighboring core's L3 activity ends up evicting your "idle 
> spinning" core's useful cached state.

I'm not sure I understand that. Do you suggest making sure the L3 cache is evicted in the benchmark, since in real life useful variables could be evicted from L3 while the threads are spinning?

[A small footnote: in virtualized systems one has the lock-holder preemption problem, which is somewhat similar to spinlocks in userspace.]


Gil Tene

Nov 29, 2014, 4:43:29 PM
to mechanica...@googlegroups.com

On Saturday, November 29, 2014 10:31:10 AM UTC-8, Elazar Leibovich wrote:
> Thanks!
>
> I think I didn't explain myself correctly.
>
> I want to use spinlocks in a kernel environment.
> I want to benchmark them in userspace.

Ah. That makes more sense.
 

> The reason I want to benchmark them in userspace is to make running the tests easier. If you have to reboot the computer to run my benchmark, no one will run it. Userspace benchmarking would make the spinlock development cycle much faster.
>
> To remedy the case where a lock holder is preempted during the benchmark, I can recognize those outliers and ignore them: assume a certain distribution for the spinlock performance, and under that assumption detect and eliminate the outliers.
>
> Maybe it's not really possible, but a man can dream.

>> - In latency tests for low throughput situations (i.e. when spin time is
>> expected to be significant), consider/compare the effect of having
>> your spin loop actively (and incrementally, one item per loop) access
>> all other variables used outside of the spin loop. Choosing what to
>> touch in the spin is a bit of "black magic", and is done to keep the
>> active state in the L1/L2/L3 caches by recovering it immediately
>> when some neighboring core's L3 activity ends up evicting your "idle
>> spinning" core's useful cached state.

> I'm not sure I understand that. Do you suggest making sure the L3 cache is evicted in the benchmark, since in real life useful variables could be evicted from L3 while the threads are spinning?

No, I'm referring to a technique used in dedicated-core, user-mode spinning loops in latency-sensitive apps that are intended to minimize wakeup latency by keeping state warm in the cache even when no activity has occurred for a while. Think of critical latency paths that may end up waiting several seconds with no activity, but still care about every fraction of a usec in the response path when an event does arrive.
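
For illustration only, the shape of such a dedicated-core wait loop is roughly the sketch below, where hot_state[] is a made-up stand-in for whatever application data the wakeup path will need:

    /* Sketch of a dedicated-core wait loop that touches one piece of
     * "hot" application state per iteration so the wakeup path finds
     * it in cache. hot_state[] is a stand-in for real application data. */
    #include <immintrin.h>

    #define HOT_ITEMS 64
    static volatile long hot_state[HOT_ITEMS];   /* data the wakeup path uses */
    static volatile int event_ready;

    static long wait_for_event(void)
    {
        unsigned i = 0;
        long sink = 0;
        while (!event_ready) {
            sink += hot_state[i];                /* re-warm one item per spin */
            i = (i + 1) % HOT_ITEMS;
            _mm_pause();
        }
        return sink;
    }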

This is not very relevant to kernel use of spin locks (which is what you are looking at), since kernels tend to be "sane" in their use of cpu cores and would never spin-wait on a core just for fun.
 
> [A small footnote: in virtualized systems one has the lock-holder preemption problem, which is somewhat similar to spinlocks in userspace.]

Yup. This is why a common hypervisor scheduling trick is to prefer to preempt guests only when they are in user-mode (and therefore are known to not be holding a kernel spinlock), reverting to preempting a guest in non-user-mode only when "you absolutely have to". Still, the problem of taking away a core from kernel code that is holding a spin lock is very real. And so is the problem of taking away a core long enough to make timing sanity checks panic. E.g. some hypervisors will slew and virtualize the TSC register in order to avoid apparent "impossible on hardware" time skips.
 


Elazar Leibovich

Nov 30, 2014, 12:21:50 AM
to mechanica...@googlegroups.com
I'm not sure if you've answered this already, but do you think measuring lock metrics in user space and eliminating outliers could give meaningful results?

This is the part I'm most hesitant about.

Thanks again,


Gil Tene

Dec 2, 2014, 7:29:43 AM
to mechanica...@googlegroups.com
Well, that depends on what you mean by meaningful results. It will probably establish the throughput you can get through the spin lock without much impact. But as you know, that is not the most important behavior to look at in a spin lock. When it comes to corner cases like fairness and near-starvation, outliers driven by the actual spin-lock behavior may not be distinguishable from the ones you would be filtering out as "user-mode related".

"regular" spinlock implementations exhibit serious outlier problems due to fairness and near-starvation problems even with low to mid tens of participants on modern hardware. At those core counts, a spin lock waiter can experience 100s of msec of spin times and can snowball into full watchdog timer panics. These fairness related outlier behavior problem has led to various improved fairness designs like ticketed spinlocks and fairlock variants. The process is still not complete (e.g. readwrite spin locks still don't use ticketing, I think), but I expect that at some point in the near future spin locks that do not include inherent fairness controls will become a thing of the past.

With that in mind, any new spin-lock implementation you would want to measure should probably treat outlier behavior as a primary behavior mode (one that is probably more important than avg. latency or throughput). And measuring that (kernel) behavior emulated in user mode is going to be hard...
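
For reference, the core of a ticketed spin lock is tiny; a user-mode sketch with C11 atomics is below (real kernel versions add PAUSE, proportional backoff, and so on):

    /* Sketch: minimal ticket spin lock. Acquisition order is FIFO, which
     * is what gives the fairness property discussed above. */
    #include <stdatomic.h>
    #include <immintrin.h>

    struct ticket_lock {
        atomic_uint next;     /* next ticket to hand out */
        atomic_uint owner;    /* ticket currently being served */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
        unsigned my = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);
        while (atomic_load_explicit(&l->owner, memory_order_acquire) != my)
            _mm_pause();      /* waiters are served strictly in ticket order */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
        /* only the holder writes owner, so a plain increment is safe */
        unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);
        atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
    }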

Elazar Leibovich

Dec 2, 2014, 3:12:04 PM
to mechanica...@googlegroups.com
Excellent points that I really hadn't thought of. This is obviously a bad approach.

What do you think about tracing events in the kernel, and then disqualifying samples that intersect a "dangerous" event?

Linux ftrace can use the x86 "invariant TSC" feature, which can eliminate time drift. I'm not sure if there are similar features on other architectures, but if there are, it might not even hurt the portability of the benchmark.

Tracing might add overhead to the measured events, but it is not supposed to affect anything beyond that.
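
For anyone who wants to reproduce this, the relevant events can be toggled through tracefs; a rough sketch in C (the /sys/kernel/debug/tracing mount point varies by distro, newer kernels also expose /sys/kernel/tracing, and root is required):

    /* Sketch: enable the sched_switch and irq_handler_{entry,exit}
     * ftrace events by writing to tracefs. Paths assume tracefs is
     * mounted under /sys/kernel/debug/tracing. */
    #include <stdio.h>

    static const char *events[] = {
        "/sys/kernel/debug/tracing/events/sched/sched_switch/enable",
        "/sys/kernel/debug/tracing/events/irq/irq_handler_entry/enable",
        "/sys/kernel/debug/tracing/events/irq/irq_handler_exit/enable",
    };

    int main(void)
    {
        for (unsigned i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
            FILE *f = fopen(events[i], "w");
            if (!f) { perror(events[i]); continue; }
            fputs("1\n", f);       /* "0" disables the event again */
            fclose(f);
        }
        /* The resulting trace can then be read from
         * /sys/kernel/debug/tracing/trace_pipe. */
        return 0;
    }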

I traced a desktop Linux system at rest, looking at context switches and interrupts, and it seems it's common to see up to 5ms of quiescence across all CPUs (the tids in the trace are actually CPU ids):


It looks like we have enough time to get meaningful results for microbenchmarks.

Here is an explanation of how I captured the trace; in short, the irq_handler_{entry,exit} and sched_switch events are marked as 1usec events in the trace:


I'll be happy to hear about the problems and pitfalls in this approach, which I probably haven't thought of.

Thanks!