
Understanding the FreeBSD locking mechanism


Yubin Ruan

Apr 6, 2017, 5:17:11 AM
Hi all FreeBSD hackers,
I am reading the FreeBSD source code related to its locking mechanism. I
have done some research but still cannot understand some of the code.

Let's take the `spinlock' as an example. I know there are different kinds
of mutexes in FreeBSD: spin mutexes and others. I tried to locate the
source of the spin mutex, but what I found looks very odd to me. For
example, `spinlock_enter()`:

void
spinlock_enter(void)
{
	struct thread *td;
	register_t flags;

	td = curthread;
	if (td->td_md.md_spinlock_count == 0) {
		flags = intr_disable();
		td->td_md.md_spinlock_count = 1;
		td->td_md.md_saved_flags = flags;
	} else
		td->td_md.md_spinlock_count++;
	critical_enter();
}

Does this function provide the ordinary "spinlock" functionality? There
is no special "test-and-set" instruction, nor any extra locking to
protect the internal data structure manipulation. Isn't this subject to
race conditions?

I also checked `mtx_lock()`, but could not find a seemingly correct
implementation there either.

Am I missing anything? Where is the real implementation of the spin lock
in FreeBSD?

Thanks
Yubin Ruan
_______________________________________________
freebsd...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Ed Schouten

Apr 6, 2017, 5:32:17 AM
Hi Yubin,

2017-04-06 11:16 GMT+02:00 Yubin Ruan <ablack...@gmail.com>:
> Does this function provides the ordinary "spinlock" functionality? There
> is no special "test-and-set" instruction, and neither any extra locking
> to protect internal data structure manipulation. Isn't this subjected to
> race condition?

Locking a spinlock is done through macro mtx_lock_spin(), which
expands to __mtx_lock_spin() in sys/sys/mutex.h. That macro first
calls into the function you looked at, spinlock_enter(), to disable
interrupts. It then calls into the _mtx_obtain_lock_fetch() to do the
test-and-set operation you were looking for.
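
For concreteness, here is a condensed sketch of that path (paraphrased,
not the verbatim macros from sys/sys/mutex.h; atomic_cmpset_acq_ptr(9)
stands in for _mtx_obtain_lock_fetch(), and mtx_lock_spin_sketch() is
just a made-up name for illustration):

static __inline void
mtx_lock_spin_sketch(struct mtx *m)
{
	uintptr_t tid = (uintptr_t)curthread;

	spinlock_enter();	/* disable interrupts on this CPU */
	if (!atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED, tid)) {
		/* Lost the race: fall into the contested (spinning)
		 * path in kern/kern_mutex.c until the lock is free. */
	}
}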

Best regards,
--
Ed Schouten <e...@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717

Yubin Ruan

Apr 9, 2017, 6:14:15 AM
On 2017/4/6 17:31, Ed Schouten wrote:
> Hi Yubin,
>
> 2017-04-06 11:16 GMT+02:00 Yubin Ruan <ablack...@gmail.com>:
>> Does this function provides the ordinary "spinlock" functionality? There
>> is no special "test-and-set" instruction, and neither any extra locking
>> to protect internal data structure manipulation. Isn't this subjected to
>> race condition?
>
> Locking a spinlock is done through macro mtx_lock_spin(), which
> expands to __mtx_lock_spin() in sys/sys/mutex.h. That macro first
> calls into the function you looked at, spinlock_enter(), to disable
> interrupts. It then calls into the _mtx_obtain_lock_fetch() to do the
> test-and-set operation you were looking for.

Thanks for replying. I have read some of that code.

Just a few more questions, if you don't mind:

(1) Why are spinlocks forced to disable interrupts in FreeBSD?

From the book "The Design and Implementation of the FreeBSD Operating
System", the authors say "spinning can result in deadlock if a thread
interrupted the thread that held a mutex and then tried to acquire the
mutex"... (section 4.3, Mutex Synchronization, paragraph 4)

I don't get the point of why a spinlock (or *spin mutex* in the FreeBSD
world) has to disable interrupts. Being interrupted does not necessarily
mean a deadlock. Assume that thread A holding a lock T gets interrupted
by another thread B (a context switch here) and thread B tries to acquire
the lock T. After finding out that lock T has already been acquired,
thread B will just spin until it gets preempted, after which thread A
gets woken up, runs, and releases the lock T. So, you see, there is not
necessarily any deadlock even if thread A gets interrupted.

I can only remember two conditions where using a spinlock without
disabling interrupts will cause deadlock:

#######1, spinlock used in an interrupt handler
If a thread A holding a spinlock T gets interrupted and the interrupt
handler responsible for this interrupt tries to acquire T, then we have
a deadlock, because A will never have a chance to run before the
interrupt handler returns, and the interrupt handler, unfortunately,
will continue to spin ... so in this situation, one has to disable
interrupts before spinning.

As far as I know, in Linux, they provide two kinds of spinlocks:

spin_lock(..); /* spinlock that does not disable interrupts */
spin_lock_irqsave(...); /* spinlock that disables local interrupts */
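
A minimal usage sketch of those two Linux idioms, for concreteness
(kernel context; my_lock and flags are made-up names):

	spinlock_t my_lock;	/* set up with spin_lock_init() */
	unsigned long flags;

	spin_lock(&my_lock);	/* local interrupts stay enabled */
	/* ... data never touched from an interrupt handler ... */
	spin_unlock(&my_lock);

	spin_lock_irqsave(&my_lock, flags);	/* local interrupts off */
	/* ... data shared with an interrupt handler ... */
	spin_unlock_irqrestore(&my_lock, flags);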


#######2, priority inversion problem
If thread B with a higher priority gets in and tries to acquire the lock
that thread A currently holds, then thread B will spin, while at the
same time thread A has no chance to run because it has a lower priority,
and is thus not able to release the lock.
(I haven't investigated the source code enough, so I don't know
how FreeBSD and Linux handle this priority inversion problem. Maybe
they use priority inheritance or random boosting?)

thanks,
Yubin Ruan

vasanth sabavat

Apr 9, 2017, 9:29:17 AM
Assume a single CPU: if thread B spins, where will thread A get to run
and finish up its critical section and release the lock? The one CPU you
have is held by thread B for spinning.

For spin locks on a single core, it does not make sense to spin. We just
disable interrupts: as we are currently the only ones running, we just
need to make sure nothing else will get to preempt us. That's why spin
locks should be held for a short duration.

When you have multiple cores, thread A can spin on cpu1 while thread B,
holding the lock on cpu2, finishes up and releases it. We disable
interrupts only on cpu1 so that thread A does not get preempted. The cost
of preemption is very high compared to a short spin. Note: a short spin.

Look at adaptive spin locks.
--
Thanks,
Vasanth

Yubin Ruan

Apr 9, 2017, 11:57:02 AM
On 2017/4/9 21:28, vasanth sabavat wrote:
>
> On Sun, Apr 9, 2017 at 3:14 AM Yubin Ruan <ablack...@gmail.com
> <mailto:ablack...@gmail.com>> wrote:
>
> On 2017/4/6 17:31, Ed Schouten wrote:
> > Hi Yubin,
> >
> > 2017-04-06 11:16 GMT+02:00 Yubin Ruan <ablack...@gmail.com
> <mailto:ablack...@gmail.com>>:
Can't the scheduler preempt thread B and schedule thread A to run? After
all, we did not disable interrupts.

regards,
Yubin Ruan

Ryan Stone

Apr 9, 2017, 12:25:04 PM
On Sun, Apr 9, 2017 at 6:13 AM, Yubin Ruan <ablack...@gmail.com> wrote:

>
> #######1, spinlock used in an interrupt handler
> If a thread A holding a spinlock T get interrupted and the interrupt
> handler responsible for this interrupt try to acquire T, then we have
> deadlock, because A would never have a chance to run before the
> interrupt handler return, and the interrupt handler, unfortunately,
> will continue to spin ... so in this situation, one has to disable
> interrupt before spinning.
>
> As far as I know, in Linux, they provide two kinds of spinlocks:
>
> spin_lock(..); /* spinlock that does not disable interrupts */
> spin_lock_irqsave(...); /* spinlock that disable local interrupt *


In the FreeBSD locking style, a spinlock is only used in the case where
one needs to synchronize with an interrupt handler. This is why
spinlocks always disable local interrupts in FreeBSD.

FreeBSD's lock for the first case is the MTX_DEF mutex, which is an
adaptively-spinning blocking mutex implementation. In short, the MTX_DEF
mutex will spin waiting for the lock if the owner is running, but will
block if the owner is descheduled. This prevents expensive trips through
the scheduler for the common case where the mutex is only held for short
periods, without wasting CPU cycles spinning in cases where the owner
thread is descheduled and therefore will not be completing soon.
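
To make that adaptive behaviour concrete, a purely conceptual sketch
(this is not the real code from kern/kern_mutex.c; try_acquire(),
owner_of(), owner_is_running() and block_until_released() are made-up
helper names):

	while (!try_acquire(m)) {
		struct thread *owner = owner_of(m);

		if (owner != NULL && owner_is_running(owner))
			cpu_spinwait();	/* owner is on a CPU: spin briefly */
		else
			block_until_released(m);	/* owner descheduled: sleep */
	}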

> #######2, priority inversion problem
> If thread B with a higher priority get in and try to acquire the lock
> that thread A currently holds, then thread B would spin, while at the
> same time thread A has no chance to run because it has lower priority,
> thus not being able to release the lock.
> (I haven't investigate enough into the source code, so I don't know
> how FreeBSD and Linux handle this priority inversion problem. Maybe
> they use priority inheritance or random boosting?)
>

FreeBSD's spin locks prevent priority inversion by preventing the holder
thread from being descheduled.

MTX_DEF locks implement priority inheritance.

vasanth sabavat

Apr 9, 2017, 12:49:15 PM
On Sun, Apr 9, 2017 at 9:24 AM Ryan Stone <rys...@gmail.com> wrote:

> On Sun, Apr 9, 2017 at 6:13 AM, Yubin Ruan <ablack...@gmail.com> wrote:
>
> >
> > #######1, spinlock used in an interrupt handler
> > If a thread A holding a spinlock T get interrupted and the interrupt
> > handler responsible for this interrupt try to acquire T, then we have
> > deadlock, because A would never have a chance to run before the
> > interrupt handler return, and the interrupt handler, unfortunately,
> > will continue to spin ... so in this situation, one has to disable
> > interrupt before spinning.
> >
> > As far as I know, in Linux, they provide two kinds of spinlocks:
> >
> > spin_lock(..); /* spinlock that does not disable interrupts */
> > spin_lock_irqsave(...); /* spinlock that disable local interrupt *
>
>
> In the FreeBSD locking style, a spinlock is only used in the case where one
> needs to
> synchronize with an interrupt handler. This is why spinlocks always
> disable local
> interrupts in FreeBSD.


Isn't it true that interrupt handlers, instead of running on the current
thread's stack, now have their own thread?
--
Thanks,
Vasanth

Ryan Stone

Apr 9, 2017, 2:02:31 PM
On Sun, Apr 9, 2017 at 12:48 PM, vasanth sabavat <vasanth...@gmail.com>
wrote:

> Isn't it true that interrupt handlers instead of running on the current
> thread stack now have their own thread?
>

It depends on what you mean by "interrupt handler" in this context, as that's
ambiguous in FreeBSD. Most driver interrupt handling is done through an
ithread, which does have its own thread context, and MTX_DEF mutexes are
the appropriate locking primitive to use with them.

However, it is possible to handle an interrupt through what FreeBSD calls
an "interrupt filter", which runs on the kernel stack of whatever thread
happened to be running on the CPU, and therefore you must use a spinlock to
synchronize with an interrupt. FreeBSD prefers the use of ithreads and
MTX_DEF mutexes over filters and spinlocks.

Sorry for the use of confusing terminology. I considered referring to
interrupt filters in my last message, but I figured the term would be
unfamiliar to someone not intimately familiar with FreeBSD internals so I
decided to avoid it.

Yubin Ruan

Apr 9, 2017, 9:29:04 PM
On 2017/4/10 0:24, Ryan Stone wrote:
>
>
> On Sun, Apr 9, 2017 at 6:13 AM, Yubin Ruan <ablack...@gmail.com
> <mailto:ablack...@gmail.com>> wrote:
>
>
> #######1, spinlock used in an interrupt handler
> If a thread A holding a spinlock T get interrupted and the interrupt
> handler responsible for this interrupt try to acquire T, then we have
> deadlock, because A would never have a chance to run before the
> interrupt handler return, and the interrupt handler, unfortunately,
> will continue to spin ... so in this situation, one has to disable
> interrupt before spinning.
>
> As far as I know, in Linux, they provide two kinds of spinlocks:
>
> spin_lock(..); /* spinlock that does not disable interrupts */
> spin_lock_irqsave(...); /* spinlock that disable local interrupt *
>
>
> In the FreeBSD locking style, a spinlock is only used in the case where
> one needs to synchronize with an interrupt handler. This is why spinlocks
> always disable local interrupts in FreeBSD.
>
> FreeBSD's lock for the first case is the MTX_DEF mutex, which is
> adaptively-spinning blocking mutex implementation. In short, the MTX_DEF
> mutex will spin waiting for the lock if the owner is running, but will
> block if the owner is deschedules. This prevents expensive trips through
> the scheduler for the common case where the mutex is only held for short
> periods, without wasting CPU cycles spinning in cases where the owner thread
> is descheduled and therefore will not be completing soon.

Great explanation! I read the man page at:

https://www.freebsd.org/cgi/man.cgi?query=mutex&sektion=9&apropos=0&manpath=FreeBSD+11.0-RELEASE+and+Ports

and am now clear about the MTX_DEF and MTX_SPIN mutexes. But still a few
more questions, if you don't mind:

Is it true that a thread holding a MTX_DEF mutex can be descheduled?
(Shouldn't it disable interrupts like a MTX_SPIN mutex?) It is said in
the man page that the MTX_DEF mutex is used by default in FreeBSD, so its
use case must be very common. If a thread holding a MTX_DEF mutex can be
descheduled, which means that it did not disable interrupts, then we may
have lots of deadlocks here, right?

>
> #######2, priority inversion problem
> If thread B with a higher priority get in and try to acquire the lock
> that thread A currently holds, then thread B would spin, while at the
> same time thread A has no chance to run because it has lower priority,
> thus not being able to release the lock.
> (I haven't investigate enough into the source code, so I don't know
> how FreeBSD and Linux handle this priority inversion problem. Maybe
> they use priority inheritance or random boosting?)
>
>
> FreeBSD's spin locks prevent priority inversion by preventing the holder
> thread from being descheduled.
>
> MTX_DEF locks implement priority inheritance.

Nice hints. Thanks!

regards,
Yubin Ruan

Warner Losh

Apr 9, 2017, 9:52:13 PM
Yes, they can be descheduled. But that's not a problem. No other
thread can acquire the MTX_DEF lock. If another thread tries, it will
sleep and wait for the thread that holds the MTX_DEF lock to release
it. Eventually, the thread will get time to run again, and then
release the lock. Threads that hold a MTX_DEF lock may also
migrate from CPU to CPU.

Warner

Chris Torek

Apr 9, 2017, 10:51:03 PM
Ryan Stone is of course correct here. I have not kept up with the
latest terminology shifts, but I can describe a bit of the history
of all of this. (I was somewhat involved with writing the
original MTX_DEF and MTX_SPIN mutex code way back when).

(None of the rest of this should be new to experienced kernel
hackers.)

In the old non-SMP days, BSD, like traditional V6 Unix, divided
the kernel into "top half" and "bottom half" sections. The top
half was anything driven from something other than an interrupt,
such as initial bootstrap or any user-sourced system call. Each
of these had just one (per-process) kernel stack, in the "u.
area", which was UPAGES * NBPG (number of bytes per page) bytes
long, but also had to contain "struct user".

(In other words, the stack space available was actually smaller
than that. The "user" struct was *above* the kernel stack, so
that ksp would not grow down into the structure; there was also
signal trampoline code wedged in there, at least on the VAX and
some of the early other ports. I desperately wanted to move the
trampoline code to libc for the sparc port. It was *in theory*
easy to do this :-) ... practice was another matter.)

When an interrupt arrived, as long as it was not interrupting
another interrupt, the system would get on a separate "interrupt
stack" -- some hardware supports this directly, with a separate
interrupt stack register -- which meant we did not have to provide
enough interrupt-handling space in the per-process kernel stack,
nor take interrupts on some possibly dodgy user stack.
(Interrupts can occur at any time, so the system may be running
user code, not kernel code.)

It also meant that a simple:

s = splfoo();

call in the top half would block any interrupts at priority foo or
lower, so that "top half" code could know that "bottom half" code
for foo would not run at this point.

With prioritized interrupts, *taking* an interrupt at level foo
automatically raised the CPU's priority to foo, so any "bottom
half" code for foo would know that no important "top half" code
was running at the time -- if that had been the case, the top half
would have done an splfoo() to block it -- and of course no other
"bottom half" code for foo could run now.

Meanwhile, no bottom-half code was *ever* allowed to block. When
you took an interrupt, you were committed to finishing all of the
work for that interrupt before issuing a "return from interrupt"
instruction (which would, if the interrupt was not interrupting
another interrupt, switch back to the appropriate stack).

A good way to describe this strategy:

s = splfoo(); /* block all bottom half code at this priority */
...
splx(s); /* resume ability of blocked bottom half code to run */

is that spl (set priority level) provides mutual exclusion to
*code paths*. The top half blocks out the bottom half with an
spl, and the bottom half blocks out the top half by simply *being*
bottom-half code, handling interrupts.

-----

With SMP, this whole strategy is a non-starter. We don't have
just one CPU running code; we cannot block *code* paths at all.
Instead, we switch to mutually exclusive access to *data*.

We then make several observations:

* Most data structures are mostly uncontested. (If not, we need to
redesign them!) "Get lock" should be fast in this usual case.

* If we provide what used to be "bottom half" drivers with *their
own* stacks / interrupt threads ("ithreads"), they can block if
they need to: when the data structure *is* actually contested.

This means we mainly need just one kind of mutex. For read-mostly
data structures, we would like to have shared-read locks and so
on, but we can build them on this base mutex. (As it turns out,
this view is a little simplistic; we want to build them on a base
compare-and-swap, typically, rather than a base full-blown-mutex.
It would also be nice to have CAS-based multi-producer single-
consumer and single-producer multi-consumer queues. These are
particularly useful on systems with hundreds or thousands of
cores.)

Of course, we also have to start dealing with issues like priority
inversion and lock order / possible deadlock, any time we lock
data instead of code paths. But that's mostly a separate issue.

-----

This is all fine for most code, including most device drivers, but
for manipulating the hardware itself, the lowest level interrupt
dispatching code, and also the system scheduler, still must block
interrupts from time to time. We also have some special oddball
cases, such as profiling interrupts, where we want to know: "What
was running when the interrupt fired?" For these cases we *don't*
want to switch to a separate interrupt thread:

* In the hardware interrupt dispatcher, we may not *know which
thread to switch to* yet. We must find the right ithread,
then schedule it to run. (Then we have to manipulate the
hardware based on whether the interrupt is edge or level
triggered, and so on, but that's an in-the-weeds detail also
mostly unrelated to this scheduling.)

For the profiling "what was running" case, we'd like to sample
what was running, which we *can't* do from a separate thread:
we need access to the current stack. (Strictly speaking, we
merely need said access ... but we also need that thread to
remain paused while we sample it.)

And, for some low-cost paths such as gathering entropy, we may
not want or need to *pay* the up-front cost of a separate ithread.

* In the scheduler, we're either in the process of choosing
threads and changing stacks, or setting up data structures to
tell the chooser which threads to choose. We need to block all
scheduling events, including all interrupts, for some of these
super-critical sections.

These use the MTX_SPIN type lock, which is similar to MTX_DEF, but:

* does block interrupts, and

* never *invokes* the scheduler: never tries to put the current
thread out of the running until the locked data are available.

Since then, we have added another special case:

* In a "critical section", we wish to make sure that the current
thread does not migrate from one CPU to another. This does
not, strictly speaking, require blocking interrupts entirely,
but because the scheduler does its thing by blocking interrupts,
we block interrupts for short durations here as well (actually
when *leaving* the critical section, where we check to see if
the scheduler would *like* us to migrate).

This is not really a mutex at all, but it does interact with
them, so it's worth mentioning. Essentially, if you are in a
critical section, you may not switch threads, so if you need
a mutex, you must use a spin mutex.

(This *is* well-documented in "man 9 critical_enter".)
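
For reference, the usage pattern is just:

	critical_enter();
	/*
	 * Short, non-blocking work on per-CPU data: no thread switch
	 * and no CPU migration can happen here, and only spin mutexes
	 * may be taken.
	 */
	critical_exit();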

One might argue that being in a critical section should turn an
MTX_DEF mutex into an MTX_SPIN mutex, but it's not that easy to
do, and if you're taking a possibly slow MTX_DEF lock *while* in a
critical section, "you're doing it wrong" (as with the heavily
contested datum problem, we should rewrite the code so that the
critical section happens *while* holding the MTX_DEF object, not
the other way around).

Anyway, that's how we got here, and why things are the way they
are.

Chris

Yubin Ruan

Apr 10, 2017, 12:02:53 AM
Does that imply that MTX_DEF should not be used in something like an
interrupt handler? Putting an interrupt handler to sleep doesn't
make much sense.

Yubin

Chris Torek

Apr 10, 2017, 12:26:49 AM
>>> Is it true that a thread holding a MTX_DEF mutex can be descheduled?

>> Yes, they can be descheduled. But that's not a problem. No other
>> thread can acquire the MTX_DEF lock. ...

>Does that imply that MTX_DEF should not be used in something like
>interrupt handler? Putting an interrupt handler into sleep doesn't
>make so much sense.

Go back to the old top-half / bottom-half model, and consider that
now that there are interrupt *threads*, your ithread is also in the
"top half". It's therefore OK to suspend. ("Sleep" is not quite
correct here: a mutex wait is not a "sleep" state but instead is
just a waiting, not-scheduled-to-run state. The precise difference
is irrelevant at this level though.)

It's not *great* to suspend here, but all your alternatives are
*also* bad:

* You may grab incoming data and stuff it into a ring buffer, and
schedule some other thread to handle it later. But if the ring
buffer is full you have a problem, and all you have done is push
the actual processing off to another thread, adding more overhead.

* You may put the device itself on hold so that no more data can
come in (if it's that kind of device).

On the other hand, if you are handling an interrupt but not in an
interrupt thread, you are running in the "bottom half". It is
therefore *not OK* to suspend. You must now use one of those
alternatives.

Note that if you suspend on an MTX_DEF mutex, and your priority is
*higher* than the priority of whatever thread actually holds that
mutex now, that other thread gets a priority boost to your level
(priority propagation, to prevent priority inversion). So letting
your ithread suspend, assuming you have an ithread, is probably your
best bet.

Chris

Konstantin Belousov

Apr 10, 2017, 3:12:09 AM
On Sun, Apr 09, 2017 at 07:16:26PM -0700, Chris Torek wrote:
> In the old non-SMP days, BSD, like traditional V6 Unix, divided
> the kernel into "top half" and "bottom half" sections. The top
> half was anything driven from something other than an interrupt,
> such as initial bootstrap or any user-sourced system call. Each
> of these had just one (per-process) kernel stack, in the "u.
> area", which was UPAGES * NBPG (number of bytes per page) bytes
> long, but also had to contain "struct user".
>
> (In other words, the stack space available was actually smaller
> than that. The "user" struct was *above* the kernel stack, so
> that ksp would not grow down into the structure; there was also
> signal trampoline code wedged in there, at least on the VAX and
> some of the early other ports. I desperately wanted to move the
> trampoline code to libc for the sparc port. It was *in theory*
> easy to do this :-) ... practice was another matter.)
Signal trampolines were never put on the kernel stack, simply because
the uarea/kstack is not accessible from user space. They lived on top of
the user mode stack of the main thread. Currently on x86/powerpc/arm,
signal trampolines are mapped from the 'shared page', which was done to
allow marking the user stack as non-executable.

The kstack still contains the remnants of the uarea, renamed to the
(per-thread) pcb. There is not much sense in the split of struct thread
vs. struct pcb, but it has historically survived up to this moment, and
clearing things up requires too much MD work.
My opinion is that the pcb on the kstack indeed only eats space and
would better be put into td_md.

Yet another thing which is shared with the kstack is the usermode FPU
save area for x86 and arm64. At least on x86, the save area is
dynamically sized at boot to support extensions like AVX/AVX256/AVX512
etc., and chomping off part of the kstack saves one more contiguous KVA
allocation and allows reuse of the kstack cache. Again, historically,
pre-AVX kernels put the XMM save area into pcb->kstack.

>
> When an interrupt arrived, as long as it was not interrupting
> another interrupt, the system would get on a separate "interrupt
> stack" -- some hardware supports this directly, with a separate
> interrupt stack register -- which meant we did not have to provide
> enough interrupt-handling space in the per-process kernel stack,
> nor take interrupts on some possibly dodgy user stack.
> (Interrupts can occur at any time, so the system may be running
> user code, not kernel code.)
No, this is not the case, at least on x86. There, 'normal' interrupts
and exceptions reuse the current thread's kstack, thus participating in
the common stack overflow business. On i386, only NMI and double fault
exceptions are routed through task gates in the IDT and are provided
with a separate stack [a double fault almost always indicates a stack
overflow]. On amd64, TSS switching is impossible, but IDT descriptors
may be marked with a non-zero IST, which basically references some
static stack besides the kstack. Only NMI uses an IST.

> Since then, we have added another special case:
>
> * In a "critical section", we wish to make sure that the current
> thread does not migrate from one CPU to another. This does
> not, strictly speaking, require blocking interrupts entirely,
> but because the scheduler does its thing by blocking interrupts,
> we block interrupts for short durations here as well (actually
> when *leaving* the critical section, where we check to see if
> the scheduler would *like* us to migrate).
This is not true, both in explanation of intent, and in the implementation
details.

A critical section prevents de-scheduling of the current thread,
disabling any context switches on the current CPU. It works by
incrementing the current thread's td_critnest counter. Note that
interrupts are still enabled while a critical section is held, so the
flow of control can still be 'preempted' by an interrupt, but after
return from the interrupt the current thread continues to execute. If
any higher-priority thread needs to be scheduled due to the interrupt,
the scheduler and context switch are done after td_critnest returns to
zero.
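
A very simplified sketch of that mechanism (not the real code, which
also uses compiler barriers and a separate preemption helper;
deferred_preempt() is a made-up name here):

	void
	critical_enter(void)
	{
		curthread->td_critnest++;	/* no switch on this CPU */
	}

	void
	critical_exit(void)
	{
		struct thread *td = curthread;

		td->td_critnest--;
		if (td->td_critnest == 0 && td->td_owepreempt)
			deferred_preempt();	/* postponed switch runs now */
	}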

>
> This is not really a mutex at all, but it does interact with
> them, so it's worth mentioning. Essentially, if you are in a
> critical section, you may not switch threads, so if you need
> a mutex, you must use a spin mutex.
You probably mixed up critical_enter() and spinlock_enter() there.
The latter indeed disables interrupts and is intended to be used as part
of the spinlock (spin mutex) implementation.


>
> (This *is* well-documented in "man 9 critical_enter".)
The explanation in critical_enter(9) is somewhat misleading.
The consequences of a spinlock_enter() call include most side-effects of
critical_enter(), because spinlock_enter() disables interrupts and thus
context-switching cannot occur at all. Technically, spinlocks do not
enter a critical section, i.e. td_critnest is not incremented.

Chris Torek

Apr 10, 2017, 4:11:46 AM
>Signal trampolines never were put on the kernel stack ...

Oops, right, not sure why I was thinking that. However I would still
prefer to have libc supply the trampoline address (the underlying
signal system calls can do this, since until you are catching a
signal in the first place, there is no need for a known-in-advance
trampoline address).

>Kstack still contains the remnants of the uarea, renamed to (per-thread)
>pcb. There is no much sense in the split of struct thread vs. struct
>pcb, but it is historically survived up to this moment, and clearing
>things up requires too much MD work.
>My opinion is that pcb on kstack indeed only eats the space and better be
>put into td_md.

That would be good.

>> When an interrupt arrived, as long as it was not interrupting
>> another interrupt, the system would get on a separate "interrupt
>> stack" ...

>No, this is not a case, at least on x86.

On VAX, and (emulated without hardware support) in my SPARC port,
it was. :-)

>There, 'normal' interrupts and exceptions reuse the current thread
>kstack ...

I never liked this very much, but if it's faster on x86, it's not
unreasonable. And without hardware support (or if the TSS switch
is too slow) it's OK.

>> * In a "critical section", we wish to make sure that the current
>> thread does not migrate from one CPU to another.
>> ... we block interrupts for short durations here as well (actually
>> when *leaving* the critical section, where we check to see if
>> the scheduler would *like* us to migrate).

>This is not true, both in explanation of intent, and in the implementation
>details.

Ah, and I see you added a compiler_membar and some comments here
recently. I did indeed misread the micro-optimization.

>You probably mixed critical_enter() and spinlock_enter() there.
>The later indeed disables interrupt and intended to be used as part
>of the spinlock (spin mutexes) implementation.

What I meant was that it's a dreadful error to do, e.g.:

critical_enter();
mtx_lock(mtx);
...
mtx_unlock(mtx);
critical_exit();

but the other order (lock first, then enter/exit) is OK. This
is similar to the prohibition against obtaining a default mutex
while holding a spin mutex.

Chris

Konstantin Belousov

Apr 10, 2017, 4:48:22 AM
On Mon, Apr 10, 2017 at 01:11:25AM -0700, Chris Torek wrote:
> >Signal trampolines never were put on the kernel stack ...
>
> Oops, right, not sure why I was thinking that. However I would still
> prefer to have libc supply the trampoline address (the underlying
> signal system calls can do this, since until you are catching a
> signal in the first place, there is no need for a known-in-advance
> trampoline address).
I considered some variation of this scheme when I worked on the
non-executable stack support. AFAIR the reason why I decided not to do
this was that the kernel-injected signal trampoline is still needed
for backward ABI-compat. In other words, the shared page would be
still needed, and we would end up with both libc trampoline and kernel
trampoline, which felt somewhat excessive.

Selecting one scheme or another based e.g. on the binary osrel was too
fragile, e.g. a new binary might have loaded an old library, and the
kernel trampoline must still be present in this situation.

> What I meant was that it's a dreadful error to do, e.g.:
>
> critical_enter();
> mtx_lock(mtx);
> ...
> mtx_unlock(mtx);
> critical_exit();
>
> but the other order (lock first, then enter/exit) is OK. This
> is similar to the prohibition against obtaining a default mutex
> while holding a spin mutex.

Sure, this is a bug. A debugging kernel would catch this; at least
mi_switch() asserts that td_critnest == 0 (technically it checks that
td_critnest == 1, but the thread lock is owned there). So if such code
tries to lock a contested mutex, the bug causes a panic.

I am sorry my previous mail contained an error: spinlock_enter() also
increments td_critnest. Still, since interrupts are disabled, this is
mostly cosmetic. The more important consequence is that critical_exit()
on spinlock unlock re-checks td_owepreempt and executes potentially
postponed scheduling actions.

Chris Torek

Apr 10, 2017, 4:57:59 AM
>I considered some variation of this scheme when I worked on the
>non-executable stack support. AFAIR the reason why I decided not to do
>this was that the kernel-injected signal trampoline is still needed
>for backward ABI-compat. In other words, the shared page would be
>still needed, and we would end up with both libc trampoline and kernel
>trampoline, which felt somewhat excessive.

Those are pretty much the same reasons I never did it as well.

>Selecting one scheme or another based e.g. on the binary osrel was too
>fragile, e.g. new binary might have loaded old library, and the kernel
>trampoline still must be present in this situation.

The method by which to select the scheme, though, is straightforward:
old vs new signal system call numbers and/or flags. ("Flags" presents
issues if users of existing mechanism are not good about clearing
unknown flag bits.)

Besides non-executable stack / shared-page, this would also be
particularly good for cases where a runtime library (not
necessarily libc itself, perhaps for other languages) wants a
different signal handling method in user space. For instance,
instead of signals being delivered to some existing thread as
interrupts, they might spin off new threads entirely.

I think it's still worth pursuing, but it's one of those "forever in
the future, low priority" ideas. I can't even seem to get back to
my medium-priority ideas these days...

Chris

Julian Elischer

Apr 10, 2017, 10:44:52 AM
On 9/4/17 6:13 pm, Yubin Ruan wrote:
> On 2017/4/6 17:31, Ed Schouten wrote:
>> Hi Yubin,
>>
>> 2017-04-06 11:16 GMT+02:00 Yubin Ruan <ablack...@gmail.com>:
>>> Does this function provides the ordinary "spinlock" functionality?
>>> There
>>> is no special "test-and-set" instruction, and neither any extra
>>> locking
>>> to protect internal data structure manipulation. Isn't this
>>> subjected to
>>> race condition?
>>
>> Locking a spinlock is done through macro mtx_lock_spin(), which
>> expands to __mtx_lock_spin() in sys/sys/mutex.h. That macro first
>> calls into the function you looked at, spinlock_enter(), to disable
>> interrupts. It then calls into the _mtx_obtain_lock_fetch() to do the
>> test-and-set operation you were looking for.
>
> Thanks for replying. I have read some of those codes.
Just in case it somehow slipped your attention or has not yet been
brought up, there is the following overview:

https://www.freebsd.org/cgi/man.cgi?locking(9)

Yubin Ruan

Apr 11, 2017, 1:16:08 PM

Thanks for your reply. I have read your mails and your discussion with
Konstantin Belousov.

On 2017/4/10 12:26, Chris Torek wrote:
>>>> Is it true that a thread holding a MTX_DEF mutex can be descheduled?
>
>>> Yes, they can be descheduled. But that's not a problem. No other
>>> thread can acquire the MTX_DEF lock. ...
>
>> Does that imply that MTX_DEF should not be used in something like
>> interrupt handler? Putting an interrupt handler into sleep doesn't
>> make so much sense.
>
> Go back to the old top-half / bottom-half model, and consider that
> now that there are interrupt *threads*, your ithread is also in the
> "top half". It's therefore OK to suspend. ("Sleep" is not quite
> correct here: a mutex wait is not a "sleep" state but instead is
> just a waiting, not-scheduled-to-run state. The precise difference
> is irrelevant at this level though.)

I don't truly understand the "top-half/bottom-half" model you proposed,
but I think I get the idea of how things work now. Basically, we can
assume that if a thread is in the "bottom-half", then it should never
suspend (or, in other words, be preempted). This is the case of the
"interrupt filter" in FreeBSD. On the other hand, if a thread is in the
"top-half", then it is safe to suspend/block. This is the case of the
"ithread".

The difference between the "ithread" and "interrupt filter" things is
that an ithread has its own thread context, while interrupt handling
through an interrupt filter shares the same kernel stack.

So, for an ithread, we should use MTX_DEF, which doesn't disable
interrupts, and for an "interrupt filter", we should use MTX_SPIN, which
disables interrupts.

What really confuses me is that I don't really see how owning an
"independent" thread context (i.e. an ithread) makes a thread run in the
"top-half" and how sharing the same kernel stack makes a thread run in
the "bottom-half".

I did read your long explanation in the previous mail. For the non-SMP
case, the "top-half/bottom-half" model goes well and I understand how
the *code* path/*data* path things go. But I still cannot fully
understand the model for the SMP case. Maybe you can draw something like

 -----                      -----
|     | <-- top-half       |     | <-- top-half
|     |                    |     |
|     |                    |     |
|     |                    |     |
|     | <-- bottom-half    |     | <-- bottom-half
 -----                      -----
 CPU1                        CPU2

to make things less abstract.

Thanks,
Yubin Ruan

Yubin Ruan

Apr 11, 2017, 1:17:37 PM
Sorry for the ugly format. The mail client sucks.

Yubin Ruan

Chris Torek

Apr 11, 2017, 7:11:27 PM
>The difference between the "ithread" and "interrupt filter" things
>is that ithread has its own thread context, while interrupt handling
>through interrupt filter shares the same kernel stack.

Right -- though rather than "the same" I would just say "shares
a stack", i.e., we're not concerned with *whose* stack and/or
thread we're borrowing, just that we have one borrowed.

>So, for ithread, we should use the MTX_DEF, which don't disable
>interrupt, and for "interrupt filter", we should use the MTX_SPIN, which
>disable interrupt.

Right.

>What really confuses me is that I don't really see how owning an
>"independent" thread context(i.e ithread) makes a thread run in the
>"top-half" and how sharing the same kernel stack makes a thread run in
>the "bottom-half".

It's not that it *makes* it run that way, it's that it *allows* it
to run that way -- and then the scheduler *does* run it that way.

>I did read your long explanation in the previous mail. For the non-SMP
>case, the "top-half/bottom-half" model goes well and I understand how
>the *code* path/*data* path things go. But I cannot still fully
>understand the model for the SMP case.

It's fundamentally fairly tricky, but we start with that same first
notion:

* If you have your own state (i.e., stack), you can be suspended
(stopped in the scheduler, giving the CPU to other threads):
*your* (private) state is preserved on *your* (private) stack.

* If you have borrowed someone else's state, anything that suspends
you, suspends them too. Since this may deadlock, you are not
allowed to do it at all.

Once we block interrupts locally (as for MTX_SPIN, or
automatically inside a filter style or "bottom half" interrupt),
we are in a special state: we may not take *any* MTX_DEF locks at
all (the kernel should panic if we do).

This in turn means that data structures are protected *either* by
a spin mutex *or* by a default (non-spin) mutex, never both. So
if you need to touch a spin-mutex data structure from thread-y
("top half") code, you obtain the spin mutex, and now no interrupts
will occur *on this CPU*, and as a key side effect, you won't move
*off* this CPU either. If an interrupt occurs on another CPU and
it goes to take the spin lock that protects that data, it loops
at that point, not switching tasks, waiting for the MTX_SPIN mutex
to be released:

        CPU 1                          CPU 2
----------------------------|-----------------------------
func() {                    | ... code not involving mtx
  mtx_lock_spin(&mtx);      | ...
  do some work              | mtx_lock_spin(&mtx); /* loops */
  .                         | [stuck]
  .                         | [stuck]
  .                         | [stuck]
  mtx_unlock_spin(&mtx);    | [unstuck]
  ...                       | do some work

If an interrupt occurs on CPU 2, and that interrupt-handling code
wants to touch the data protected by the spin lock, that code
obtains the spin lock as usual. Meanwhile the interrupt *cannot*
occur on CPU 1, as holding the spin lock has blocked interrupts.
So the code path on CPU 2 blocks -- looping in mtx_lock_spin(),
not giving CPU 2 over to the scheduler -- for as long as CPU 1
holds the spin lock. The corresponding code path is already
blocked on CPU 1, the same way it was back in the non-SMP, single-
CPU days.

This means it is unwise to hold spin locks for long periods. In
fact, if CPU 2 waits too long in that [stuck] section, it will
panic, on the assumption that CPU 1 has done something terrible
and the system is now hung.

This is also what gives rise to the constraint that you must take
MTX_SPIN locks "inside" any outer MTX_DEF locks.

Yubin Ruan

Apr 11, 2017, 10:32:52 PM

clear. How can I distinguish these two conditions? I mean, whether I
am using my own state/stack or borrowing others' state.

Things become clearer now. Thanks for your reply.
If I understand correctly, which kind of lock should be used depends on
which thread model (i.e. "interrupt filter" or "ithread") we use. If I
want to use a lock, I must know in advance which kind of thread model I
am in, otherwise the interrupt handling code might cause a deadlock or
kernel panic. The problem is, how can I tell which thread model I am
in? I am not so clear about the thread model and scheduling
code of FreeBSD...

> This means it is unwise to hold spin locks for long periods. In
> fact, if CPU 2 waits too long in that [stuck] section, it will
> panic, on the assumption that CPU 1 has done something terrible
> and the system is now hung.
>
> This is also waht gives rise to the constrant that you must take
> MTX_SPIN locks "inside" any outer MTX_DEF locks.

What do you mean by "must take MTX_SPIN locks 'inside' any outer
MTX_DEF locks"?

Regards,
Yubin Ruan

Chris Torek

Apr 12, 2017, 3:55:59 AM
>clear. How can I distinguish these two conditions? I mean, whether I
>am using my own state/stack or borrowing others' state.

You choose it when you establish your interrupt handler. If you
say you are a filter interrupt, then you *are* one, and the rest
of your code must be written as one. Unless you know what you
are doing, don't do this, and then you *aren't* one and the rest
of your code can be written using the much more relaxed model.

>What do you mean by "must take MTX_SPIN locks 'inside' any outer
>MTX_DEF locks?

This means that any code path that is going to hold a spin-type
lock must obtain it while already holding any applicable non-spin
locks. For instance, if we look at <sys/proc.h> we find these:

#define PROC_STATLOCK(p) mtx_lock_spin(&(p)->p_statmtx)
#define PROC_ITIMLOCK(p) mtx_lock_spin(&(p)->p_itimmtx)
#define PROC_PROFLOCK(p) mtx_lock_spin(&(p)->p_profmtx)

Let's find a bit of code that uses one, such as in kern_time.c:

https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_time.c#L338

(kern_clock_gettime()). This code reads:

case CLOCK_PROF:
PROC_LOCK(p);
PROC_STATLOCK(p);
calcru(p, &user, &sys);
PROC_STATUNLOCK(p);
PROC_UNLOCK(p);
timevaladd(&user, &sys);
TIMEVAL_TO_TIMESPEC(&user, ats);
break;

Note that the call to PROC_LOCK comes first, then the call to
PROC_STATLOCK. This is because PROC_LOCK

https://github.com/freebsd/freebsd/blob/master/sys/sys/proc.h#L825

is defined as:

#define PROC_LOCK(p) mtx_lock(&(p)->p_mtx)

If you obtain the locks in the other order -- i.e., if you grab
the PROC_STATLOCK first, then try to lock PROC_LOCK -- you are
trying to take a spin-type mutex while holding a default mutex,
and this is not allowed (can cause deadlock). But taking the
PROC_LOCK first (which may block), then taking the PROC_STATLOCK
(a spin lock) "inside" the outer PROC_LOCK default mutex, is OK.

(This is one of my mild objections to macros like PROC_LOCK and
PROC_STATLOCK: they hide whether the mutex in question is a spin
lock.)

Incidentally, any time you take *any* lock while holding any
other lock (e.g., lock A, then lock B while holding A), you have
created a "lock order" in which A predeces B. If some other
code path locks B first, then while holding B, attempts to lock
A, you get a deadlock if both code paths are running at the same
time. The WITNESS code dynamically discovers these various orders
and warns you at run time if you have a "lock order reversal"
(a case where one code path does A-then-B while another does
B-then-A).

(This is, in a sense, the same problem as discovering whether
there is a loop in a directed graph, or whether this directed
graph is acyclic. If you can force the graph to take the shape of
a tree, rather than the more general graph, there will never be
any loops in it, and you will never have lock order reversals.
And of course if you have only *one* lock for some data, there is
nothing to be reversed. Not all lock order reversals are
guaranteed to lead to deadlock, but sorting out which ones are
really OK, and which are not, is ... challenging.)

Chris

Yubin Ruan

Apr 12, 2017, 7:57:21 AM

Is this a typo? I guess you mean something like "you are trying
to take a blocking mutex while holding a spin-type mutex".

> and this is not allowed (can cause deadlock). But taking the
> PROC_LOCK first (which may block), then taking the PROC_STATLOCK
> (a spin lock) "inside" the outer PROC_LOCK default mutex, is OK.

I think I get your point: if you take a spin-type mutex, you have
already disabled interrupts, which in effect means that no other
code can preempt you. Under this circumstance, if you continue to
take a blocking mutex, you may get blocked. Since you have already
disabled interrupts and nobody can interrupt/preempt you, you are blocked
on that CPU, not able to do anything, which is pretty much a
"deadlock" (actually this is not a deadlock, but it is similar).

Regards,
Yubin Ruan

Chris Torek

Apr 12, 2017, 2:54:06 PM
>> If you obtain the locks in the other order -- i.e., if you grab
>> the PROC_STATLOCK first, then try to lock PROC_LOCK -- you are
>> trying to take a spin-type mutex while holding a default mutex,

>Is this a typo? I guess you mean something like "you are trying
>to take a blocking mutex while holding spin-type mutex".

Yes, or rather brain-o (swapping words) -- these most often happen
if I am interrupted while composing a message :-)

>I think I get your point: if you take a spin-type mutex, you
>already disable interrupt, which in effect means that no other
>code can preempt you. Under this circumstance, if you continue to
>take a blocking mutex, you may get blocked. Since you already
>disable interrupt and nobody can interrupt/preempt you, you are blocked
>on that CPU, not being able to do anything, which is pretty much a
>"deadlock" (actually this is not a deadlock, but, it is similar)

Right. It *may* deadlock, and it is definitely not good -- and
the INVARIANTS kernel will check and panic.

Yubin Ruan

Apr 13, 2017, 5:28:40 AM
On 2017/04/13 02:53, Chris Torek wrote:
>>> If you obtain the locks in the other order -- i.e., if you grab
>>> the PROC_STATLOCK first, then try to lock PROC_LOCK -- you are
>>> trying to take a spin-type mutex while holding a default mutex,
>
>> Is this a typo? I guess you mean something like "you are trying
>> to take a blocking mutex while holding spin-type mutex".
>
> Yes, or rather brain-o (swapping words) -- these most often happen
> if I am interrupted while composing a message :-)
>
>> I think I get your point: if you take a spin-type mutex, you
>> already disable interrupt, which in effect means that no other
>> code can preempt you. Under this circumstance, if you continue to
>> take a blocking mutex, you may get blocked. Since you already
>> disable interrupt and nobody can interrupt/preempt you, you are blocked
>> on that CPU, not being able to do anything, which is pretty much a
>> "deadlock" (actually this is not a deadlock, but, it is similar)
>
> Right. It *may* deadlock, and it is definitely not good -- and
> the INVARIANTS kernel will check and panic.

I discovered that in the current implementation in FreeBSD, the spinlock
does not disable interrupts entirely:


607	for (;;) {
608		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
609			break;
610		/* Give interrupts a chance while we spin. */
611		spinlock_exit();
612		while (m->mtx_lock != MTX_UNOWNED) {
613			if (i++ < 10000000) {
614				cpu_spinwait();
615				continue;
616			}
617			if (i < 60000000 || kdb_active || panicstr != NULL)
618				DELAY(1);
619			else
620				_mtx_lock_spin_failed(m);
621			cpu_spinwait();
622		}
623		spinlock_enter();
624	}

This is `_mtx_lock_spin_cookie(...)` in kern/kern_mutex.c, which
implements the core logic of spinning. However, as you can see, while
spinning, it will enable interrupts "occasionally" and disable them
again... What is the rationale for that?

Regards,
Yubin Ruan

Chris Torek

Apr 13, 2017, 8:18:39 AM
>I discover that in the current implementation in FreeBSD, spinlock
>does not disable interrupt entirely:
[extra-snipped here]
> 610 /* Give interrupts a chance while we spin. */
> 611 spinlock_exit();
> 612 while (m->mtx_lock != MTX_UNOWNED) {
[more snip]

>This is `_mtx_lock_spin_cookie(...)` in kern/kern_mutex.c, which
>implements the core logic of spinning. However, as you can see, while
>spinning, it would enable interrupt "occasionally" and disable it
>again... What is the rationale for that?

This code snippet is slightly misleading. The full code path runs
from mtx_lock_spin() through __mtx_lock_spin(), which first
invokes spinlock_enter() and then, in the *contested* case (only),
calls _mtx_lock_spin_cookie().

spinlock_enter() is:

	td = curthread;
	if (td->td_md.md_spinlock_count == 0) {
		flags = intr_disable();
		td->td_md.md_spinlock_count = 1;
		td->td_md.md_saved_flags = flags;
	} else
		td->td_md.md_spinlock_count++;
	critical_enter();

so it actually disables interrupts *only* on the transition from
td->td_md.md_spinlock_count = 0 to td->td_md.md_spinlock_count = 1,
i.e., the first time we take a spin lock in this thread, whether
this is a borrowed thread or not. It's possible that interrupts
are actually disabled at this point. If so, td->td_md.md_saved_flags
has interrupts disabled as well. This is all just an optimization
to use a thread-local variable so as to avoid touching hardware.
The details vary widely, but typically, touching the actual hardware
controls requires flushing the CPU's instruction pipeline.

If the compare-and-swap fails, we enter _mtx_lock_spin_cookie()
and loop waiting to see if we can obtain the spin lock in time.
In that case, we don't actually *hold* this particular spin lock
itself yet, so we can call spinlock_exit() to undo the effect
of the outermost spinlock_enter() (in __mtx_lock_spin). That
decrements the counter. *If* it goes to zero, that also calls
intr_restore(td->td_md.md_saved_flags).
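
For reference, the counterpart looks roughly like this (paraphrased
from the machine-dependent code, the mirror image of the
spinlock_enter() shown above):

	void
	spinlock_exit(void)
	{
		struct thread *td;
		register_t flags;

		td = curthread;
		critical_exit();
		flags = td->td_md.md_saved_flags;
		td->td_md.md_spinlock_count--;
		if (td->td_md.md_spinlock_count == 0)
			intr_restore(flags);	/* outermost exit only */
	}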

Hence, if we have failed to obtain our first spin lock, we restore
the interrupt setting to whatever we saved. If interrupts were
already locked out (as in a filter type interrupt handler) this is
a potentially-somewhat-expensive no-op. If interrupts were
enabled previously, this is a somewhat expensive re-enable of
interrupts -- but that's OK, and maybe good, because we have no
spin locks of our own yet. That means we can take hardware
interrupts now, and let them borrow our current thread if they are
that kind of interrupt, or schedule another thread to run if
appropriate. That might even preempt us, since we do not yet hold
any spin locks. (But it won't preempt us if we have done a
critical_enter() before this point.)

(In fact, the spinlock exit/enter calls that you see inside
_mtx_lock_spin_cookie() wrap a loop that does not use compare-and-
swap operations at all, but rather ordinary memory reads. These
are cheaper than CAS operations on a lot of CPUs, but they may
produce wrong answers when two CPUs are racing to write the same
location; only a CAS produces a guaranteed answer, which might
still be "you lost the race". The inner loop you are looking at
occurs after losing a CAS race. Once we think we might *win* a
future CAS race, _mtx_lock_spin_cookie() calls spinlock_enter()
again and tries the actual CAS operation, _mtx_obtain_lock_fetch(),
with interrupts disabled. Note also the calls to cpu_spinwait()
-- the Linux equivalent macro is cpu_relax() -- which translates
to a "pause" instruction on amd64.)

Chris

Yubin Ruan

Apr 13, 2017, 9:46:54 AM

Good explanation. I just missed that "local" interrupt point.

> (In fact, the spinlock exit/enter calls that you see inside
> _mtx_lock_spin_cookie() wrap a loop that does not use compare-and-
> swap operations at all, but rather ordinary memory reads. These
> are cheaper than CAS operations on a lot of CPUs, but they may
> produce wrong answers when two CPUs are racing to write the same

Why would that produce a wrong result? I think what the inner loop wants
to do is to perform some no-ops for a while before it tries again to
acquire the spinlock. So there is no race here.

> location; only a CAS produces a guaranteed answer, which might
> still be "you lost the race". The inner loop you are looking at
> occurs after losing a CAS race. Once we think we might *win* a
> future CAS race, _mtx_lock_spin_cookie() calls spinlock_enter()
> again and tries the actual CAS operation, _mtx_obtain_lock_fetch(),
> with interrupts disabled. Note also the calls to cpu_spinwait()
> -- the Linux equivalent macro is cpu_relax() -- which translates
> to a "pause" instruction on amd64.)

Regards,
Yubin Ruan

Chris Torek

Apr 13, 2017, 6:47:08 PM
(This is getting a bit far afield; let me know if we should
take this off-list.)

>why would [regular read, vs CAS] produce wrong result?

There are both hardware architecture (and sometimes individual
CPU architecture) and compiler reasons for this.

First, compilers may try to optimize load and store operations,
especially on register-rich architectures. What's coded as:

... some code section A ...
x = p->foo;
y = p->bar;
... some code section B ...

might actually move the loads of p->foo and/or p->bar into either
the A or B sections. The same goes for stores. The compiler
makes the (somewhat reasonable for most programming) assumption
that only the instructions the compiler itself emits, actually
access the data -- not some instructions running on some other
CPU.

For any lock, this assumption is automatically wrong.

We can defeat part of this with the "volatile" keyword, but we
need to insert compiler level memory barriers to make sure that
the operations proceed in a temporally-defined manner, i.e.,
so that time appears to be linear.
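
(A compiler-level barrier is typically just an empty asm that claims to
clobber memory -- roughly what a "compiler membar" macro expands to on
GCC/Clang; compiler_barrier() is a made-up name here:

	#define compiler_barrier()	__asm __volatile("" ::: "memory")

	x = p->foo;
	compiler_barrier();	/* loads/stores above cannot be moved below */
	y = p->bar;

This constrains only the compiler; the CPU can still reorder at run
time, which is what the fences below are for.)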

Second, the CPU itself may also have both temporal and non-
temporal loads and stores (with arbitrarily complicated rules
about using them). In this case there may be special instructions
("sfence", "mfence", etc; "membar" on SPARC) for forcing order.
For more about non-temporal operations, see, e.g.:

http://stackoverflow.com/q/37070/1256452
http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/CJACGJJF.html

There are some lock algorithms that work without most of this, but
they tend to be a bit hard to set up. Even then we usually depend
on an atomic compare-and-swap: see
http://wiki.c2.com/?LockFreeSynchronization for instance.

>I think what the inner loop wants to do is to perform some no-op
>for a while before it tries again to acquire the spinlock.

Yes - but the point is that it tries to "gently" read the actual
mutex lock value, and inspect the result to see whether to try the
more-savage (at the hardware level) CAS.

Some of this gets deep into the weeds of hardware implementations.
I had this in my earlier reply (but ripped it out as too much
detail). On Intel dual-socket systems, for instance, there is a
special bus that connects the two sockets called the QPI, and then
there are caches around each core within any one given socket.
These caches come in multiple levels (L1 and higher, details vary
and one should always expect the *next* CPU to do something
different) with some caches physically local to one core and
others shared between multiple cores in one socket.

These caches tend to coordinate using protocols called MESI or
MOESI. The letters stand for cache line states: Modified, Owned,
Exclusive, Shared, or Invalid. A Modified cache line has data not
yet written to the next level out (whether that's a higher level,
larger cache, or main memory). An Excusive line is in this cache
only and can therefore be written-to. (I'm ignoring "owned", it
is kind of a tweak between M and E.) A Shared line is in this
cache *and some other cache* and therefore can *not* be written
to, but *can* be read from; and finally, an Invalid line has no
valid data at all.

As a rule, the closer a cache line is to the CPU, the faster its
access is. (This rule is pretty reliable since otherwise there is
no point in having that cache.) *Writing* to a cache line
requires exclusive access, though, so we must know if the line is
shared. If it *is* shared, we must signal higher level caches
that we intend to write, and wait for them to give up their cached
copies of data. In other words we fire a bullet at them: "I want
exclusive access, kill off your shared copy." Then we must wait
for a reply, or a time delay (whichever is architecturally
appropriate), so that we know that this succeeded, or get a
failure indication ("you may not have that exlusively, not just
yet anyway"). This reply or delay takes up to the *worst case*,
slowest, access time may be. For dual-socket Intel that means
doing a QPI bus transaction to the other socket.

(This is true for any write operation, not just compare-and-swap.
For this reason, we often like our mutexes to fill out a complete
cache line, so that any data *protected by* the mutex is not shot
out of the cache every time we poke at the mutex itself.)

Note that when we *read* the object, however, we're doing a read,
not a write. This does not need exclusive access to the cache line:
shared access suffices. If we do not have the data in cache, we
send out a request: "I would like to read this." Any CPU that has
the item cached, e.g., whoever actually locked the lock, must drop
it back to whatever level accomplishes sharing -- if it's dirty,
writing it out -- and take his core-side cache line status back
from M or E to S. Any other CPU also spinning, waiting for the
lock, must go to this shared state. Now all CPUs interested in
the lock -- the holder, and all waiters -- have it shared. They
can all *read* the line and see whether it's still locked. There
is no traffic over inter-cache or inter-socket busses at this
point. These are the "gentle" spins, that only *read* the lock.

Eventually, whoever owns the lock, unlocks it. This requires a
write (or CAS) operation, which yanks the cache line away from all
the spinners (that part is unavoidable, and slow, and causes all
this bus traffic we were avoiding) and releases the lock. The
spinners then take the shared cache lines back to shared state and
see that they *may* be able to get the lock. At this point they
attempt the expensive operation, and *that* produces a reliable
answer -- which may be "someone else beat us to the lock so we
go back to the gentle spin code".

Note that some architectures do not have an actual compare-and-
swap instruction. For instance, PowerPC and MIPS use a different
technique: there is a load instruction that takes the cache line
to exclusive state *and* sets an internal CPU register to remember
this. If the cache line drops back out of exclusive state, a
subsequent "store conditional" instruction fails the condition,
does *not* store, and lets you branch to a loop that repeats the
load if needed. If it is still exclusive, the write succeeds (and
the cache line goes to M state, if the cache is write-back). This
lets you build a compare-and-swap from the low level cache-line
synchronization operations that are what the hardware uses.

There is more on this at:

http://stackoverflow.com/q/151783/1256452

and specifically for x86 (64 bit Intel and ARM) at:

http://stackoverflow.com/q/151783/1256452

(These are not the only ways to implement synchronization.
Some more exotic architectures have special regions of physical
memory that can act transactionally, or that auto-increment
upon load so that each CPU can "take a ticket" and use a
version of Lamport's Bakery algorithm: see
https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
for details. However, the BSD mtx_lock() is designed around
compare-and-swap plus optimizations for MESI cache implementations.)

Chris

Yubin Ruan

Apr 17, 2017, 1:16:15 PM
Hmm... I agree. I misread your words.
The loop inside _mtx_lock_spin_cookie() might produce a wrong result,
but the overall _mtx_lock_spin_cookie() won't, because it finally uses
a compare-and-swap instruction to test.
(i.e. the loop might produce a "false positive" result, so we have to
check again....)

-- yubinr