Multi-threading question

Bonita Montero

unread,

Jan 21, 2021, 3:00:04 PM1/21/21

to

I don't know how to ask for this somewhere else. comp.programming
.threads is occupied by this arabic idiot.
"POSIX"-mutexes can be given an attribute-structure under Linux
and this attribute can be configrured so that the mutex does
adaptive spinning. The spincount can't be implementation-defined
since different hardware-platforms have different latencies when
aquiring the underlying futex through the kernel; further the
latency also depends on the competition among the threads.
So how is the number of spins estimated ?

Scott Lurndal

unread,

Jan 21, 2021, 3:22:07 PM1/21/21

to

Linux, being open source, offers you the opportunity to download
the sources and examine them to see how things, like posix
mutexes, work.

https://www.kernel.org/

Bonita Montero

unread,

Jan 21, 2021, 3:24:08 PM1/21/21

to

Spinning is a userland-feature.

Scott Lurndal

unread,

Jan 21, 2021, 3:40:20 PM1/21/21

to

Bonita Montero <Bonita....@gmail.com> idiotically snipped attributions, the twit:

The kernel spins. A lot. It's the ultimate multithreaded codebase.

The userland stuff is also open source. Feel free to do your
own homework. Hint, glibc may be a useful project to
download the source archive.

Bonita Montero

unread,

Jan 22, 2021, 1:37:15 AM1/22/21

to

>>> Linux, being open source, offers you the opportunity to download
>>> the sources and examine them to see how things, like posix
>>> mutexes, work.
>>> https://www.kernel.org/

>> Spinning is a userland-feature.

> The kernel spins. ...

No, spinning for a mutex is done in userland.
The kernel simply stops a threads relaes it only when the futex
is released - but not by spinning, that could be better done in
userland.

Chris M. Thomasson

unread,

Jan 22, 2021, 1:39:54 AM1/22/21

to

Spinning can happen in both lands, however... Usually, when talking
about spinning it is used to avoid transitions from user to kernel land.
As for a general spin count, well, its too sensitive to system
conditions to give any concrete numbers.

Bonita Montero

unread,

Jan 22, 2021, 2:26:37 AM1/22/21

to

> Spinning can happen in both lands, ...

Spinning for a mutex doesn't make sense in the kernel.
But there are spinlocks in the kernel, but that's a facility
not provided to the userland.

Scott Lurndal

unread,

Jan 22, 2021, 10:43:53 AM1/22/21

to

You are obviously not a kernel engineer. The kernel itself has
spin locks for it's own needs.

I give up. You snip attributions and useful context. Begone troll.

Bonita Montero

unread,

Jan 22, 2021, 10:50:02 AM1/22/21

to

> You are obviously not a kernel engineer. The kernel itself has
> spin locks for it's own needs.

I'm aware of kernel spinlocks. But they don't supply userland-mutexes.
And that's what we're talking about.

Bonita Montero

unread,

Jan 22, 2021, 11:41:14 AM1/22/21

to

> Spinning for a mutex doesn't make sense in the kernel.
> But there are spinlocks in the kernel, but that's a facility
> not provided to the userland.

I just see that there are POSIX userland spinlocks ([*]). But this is
very dangrous since a thread holding spinlock could be scheduled away
while other threads keep spinning. I think that's provided only for a
extremely rare purpose of threads that are pinned to dedicated CPU
-threads that aren't allowed to execute different threads (CPU-bin-
ning). But I don't think some should use that and Linus Torvalds also
said that userland-spinlocks are a bad idea.

[*]
https://pubs.opengroup.org/onlinepubs/009696699/functions/pthread_spin_lock.html

Bonita Montero

unread,

Jan 22, 2021, 12:20:38 PM1/22/21

to

That's the "adaptive" mutex locking code of pthread_mutex_lock
in the glibc 2.31:

| else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
| == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
| {
| if (! __is_smp)
| goto simple;
|
| if (LLL_MUTEX_TRYLOCK (mutex) != 0)
| {
| int cnt = 0;
| int max_cnt = MIN (max_adaptive_count (),
| mutex->__data.__spins * 2 + 10);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| do
| {
| if (cnt++ >= max_cnt)
| {
| LLL_MUTEX_LOCK (mutex);
| break;
| }
| atomic_spin_nop ();
That's just a PAUSE-instruction
| }
| while (LLL_MUTEX_TRYLOCK (mutex) != 0);
|
| mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| }
| assert (mutex->__data.__owner == 0);
| }
| else

Juha Nieminen

unread,

Jan 22, 2021, 12:31:29 PM1/22/21

to

Bonita Montero <Bonita....@gmail.com> wrote:
> I don't know how to ask for this somewhere else.

And I don't know why you bother asking at all, because you seem
completely unwilling to read any answers.

Bonita Montero

unread,

Jan 22, 2021, 1:00:29 PM1/22/21

to

> And I don't know why you bother asking at all, because
> you seem completely unwilling to read any answers.

Your introsort-advice was a bad advice.

Bonita Montero

unread,

Jan 22, 2021, 2:01:39 PM1/22/21

to

> | atomic_spin_nop ();
> That's just a PAUSE-instruction

I just tested the "performance" of the PAUSE-instruction:

#include <iostream>
#include <chrono>
#include <intrin.h>

using namespace std;
using namespace chrono;

int main()
{
using hrc_tp = time_point<high_resolution_clock>;
hrc_tp start = high_resolution_clock::now();
for( size_t i = 0; i != 1'000'000; ++i )
_mm_pause();
double ns = (int64_t)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1'000'000.0;
cout << ns << endl;
}

But for me (Ryzen Threadripper 3990X) each PAUSE-iteration is only
0.7ns. That doesn't make sense to me because the instruction should
delay spin-loops that power is saved and they have a close timing
-relationship to a nearly immediate wakeup from the kernel so that
kernel-calls are likely to be avoided. But glibc has a static upper
limit of 100 iterations while spinning; and I hardly doubt that any
kernel could give a so fast futex-call and -wakeup.
So can anyone here verify this with his Intel-hardware;
maybe Intel -CPUs have more appropriate PAUSE delays.
But there's also TPAUSE as I see which waits until the timestamp
-counter reaches a certain value. But glibc does use PAUSE:
#define atomic_spin_nop() __asm ("pause")

Bonita Montero

unread,

Jan 22, 2021, 2:54:51 PM1/22/21

to

> But there's also TPAUSE as I see which waits until the timestamp
> -counter reaches a certain value. But glibc does use PAUSE:
> #define atomic_spin_nop() __asm ("pause")

I just found that all AMD-CPUs don't support TPAUSE. What a mess.
That's such a useful instruction. I thought I could use it for my
shared lock.

Ian Collins

unread,

Jan 22, 2021, 4:30:36 PM1/22/21

to

As I said a while back, you are arguing with someone who can't even
quote correctly. You are wasting your time.

--
Ian.

Chris M. Thomasson

unread,

Jan 22, 2021, 5:21:20 PM1/22/21

to

On 1/22/2021 8:40 AM, Bonita Montero wrote:
>> Spinning for a mutex doesn't make sense in the kernel.
>> But there are spinlocks in the kernel, but that's a facility
>> not provided to the userland.
>
> I just see that there are POSIX userland spinlocks ([*]).

Then there are adaptive locks that are a hybrid of a spinlock and a
blocking mutex. They can spin a couple of times using, say, a backoff
scheme. This might be a single PAUSE instruction wrt x86, or even some
more exotic exponential backoffs. They spin a while before resorting to
going into kernel land to actually block.

> But this is
> very dangrous since a thread holding spinlock could be scheduled away
> while other threads keep spinning. I think that's provided only for a
> extremely rare purpose of threads that are pinned to dedicated CPU
> -threads that aren't allowed to execute different threads (CPU-bin-
> ning). But I don't think some should use that and Linus Torvalds also
> said that userland-spinlocks are a bad idea.
>
> [*]
> https://pubs.opengroup.org/onlinepubs/009696699/functions/pthread_spin_lock.html
>

100% pure spin locks can use a variety of backoff schemes. Have you ever
experimented with them? I have.

Chris M. Thomasson

unread,

Jan 22, 2021, 7:48:01 PM1/22/21

to

Never heard of TPAUSE before...

https://www.felixcloutier.com/x86/tpause

Sounds very nice!

Bonita Montero

unread,

Jan 23, 2021, 12:43:36 AM1/23/21

to

> Then there are adaptive locks that are a hybrid of a spinlock and a

> blocking mutex. ...

Sorry, didn't you read about what I've been writing her all the time ?
My question was initially related to this.

Paavo Helde

unread,

Jan 23, 2021, 3:22:35 AM1/23/21

to

Except that Juha never advised you to use introsort, it was just for
comparison.

Bonita Montero

unread,

Jan 23, 2021, 4:11:44 AM1/23/21

to

>> I just found that all AMD-CPUs don't support TPAUSE. What a mess.
>> That's such a useful instruction. I thought I could use it for my
>> shared lock.

> Never heard of TPAUSE before...
> https://www.felixcloutier.com/x86/tpause
> Sounds very nice!

From what I've read today TPAUSE is only supportet by the latest mobile
Tremont-architecture Intel-chips.

Chris M. Thomasson

unread,

Jan 23, 2021, 3:49:45 PM1/23/21

to

I did, and cannot give you a concrete answer about how many times to
spin. It depends on a lot of factors. Think about adaptive backoffs.
This can be used in the spin portion of an adaptive mutex.

Actually, one time I wrote a program that measured the number of spins
vs the number of lock acquires vs the number of kernel waits. They were
generally different, on different systems. Heck, they were different on
the same system under various periods of load. I remember having
different ways to measure lock acquires. Was the lock obtained during a
spin? Was the lock obtained after going to kernel land? Iirc my old code
was getting some output like:

[0]: thread[0] lock_acquired after 13 spins

[1]: thread[1] lock_acquired after 128 spins and 1 kernel wait

[2]: thread[0] lock_acquired after 3 spins

[3]: thread[1] lock_acquired after 0 spins

Another factor that can impact the "optimal" spin count is the
complexity of the critical section itself. Adaptive locks tend to
"shine" when the critical section is really short and sweet, so to
speak... ;^)

Bonita Montero

unread,

Jan 24, 2021, 5:37:41 AM1/24/21

to

Can you compile and link this:

#include <iostream>
#include <chrono>
#include <cstdint>

using namespace std;
using namespace chrono;

extern "C"
void pause_loop( uint64_t count );

int main()
{
using hrc_tp = time_point<high_resolution_clock>;

uint64_t TURNS = 1'000'000'000;

hrc_tp start = high_resolution_clock::now();

pause_loop( TURNS );

double ns = (int64_t)duration_cast<nanoseconds>(

high_resolution_clock::now() - start ).count() / (double)TURNS;
cout << ns << "ns" << endl;
}

PUBLIC pause_loop

_TEXT SEGMENT

pause_loop PROC
mov rax, rcx
xor rdx, rdx
mov rcx, 100
div rcx
test rax, rax
jz tail
headLoop:
REPEAT 100
pause
ENDM
sub rax, 1
jnz headLoop
tail:
test rdx, rdx
jz byebye
tailLoop:
pause
sub rdx, 1
jnz tailLoop
byebye:
ret
pause_loop ENDP

_TEXT ENDS

END

And run this on your Intel-CPU ?
I can't beliefe that PAUSE inappropriately fast on an Intel-CPU;
on my TR 3990X, it's about 3 clock cycles. It's used in spin-loops
and should be related to the time cacheline needs to travel from
one cache to another, and that's much more than 3 clock cycles.

Vir Campestris

unread,

Jan 25, 2021, 4:41:54 PM1/25/21

to

As Bonita says (but doesn't get the quoting right) and as you'll find in
the proper manual at

<https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf>

User wait: TPAUSE, UMONITOR, UMWAIT Intel Atom processor based on
Tremont microarchitecture

and another 24 mentions of TPAUSE.

Andy

Chris M. Thomasson

unread,

Jan 25, 2021, 4:53:53 PM1/25/21

to

Very useful. It kind of seems that they were designed to aid in
transactional memory scenarios that can livelock. I need to read:

Programming with Intel® Transactional Synchronization Extensions

Chris M. Thomasson

unread,

Jan 25, 2021, 5:13:52 PM1/25/21

to

On 1/24/2021 2:37 AM, Bonita Montero wrote:
> Can you compile and link this:

[...]

> And run this on your Intel-CPU ?
> I can't beliefe that PAUSE inappropriately fast on an Intel-CPU;
> on my TR 3990X, it's about 3 clock cycles. It's used in spin-loops
> and should be related to the time cacheline needs to travel from
> one cache to another, and that's much more than 3 clock cycles.

When I get some time. I need to use MASM for it, in an externally
assembled file. Its been a while since I used MASM or GAS.

Chris M. Thomasson

unread,

Jan 25, 2021, 5:42:58 PM1/25/21

to

On 1/24/2021 2:37 AM, Bonita Montero wrote:

> Can you compile and link this:

[...]

> And run this on your Intel-CPU ?
> I can't beliefe that PAUSE inappropriately fast on an Intel-CPU;
> on my TR 3990X, it's about 3 clock cycles. It's used in spin-loops
> and should be related to the time cacheline needs to travel from
> one cache to another, and that's much more than 3 clock cycles.

I am getting outputs in release mode using multiple runs:

running without debugging:

3.66028ns
3.65354ns
3.66328ns

With debugging:

4.24173ns
4.32265ns
4.36551ns

Btw, and fwiw... Thanks for making/prompting me get MASM up and running
again! Its been a long time since I used it. Thanks Bonita.