[Sbcl-devel] x86/split lock detection


Stas Boukarev

Apr 25, 2022, 10:50:41 AM
to sbcl-devel
Linux now complains about misaligned locked instructions, and there's
one in sb-vm::remove-static-links. It doesn't seem like that one would
matter much for performance. But now there's spam in dmesg, and Linux
can optionally generate a SIGBUS for it.


_______________________________________________
Sbcl-devel mailing list
Sbcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-devel

Stas Boukarev

Apr 25, 2022, 10:54:44 AM
to sbcl-devel
Some more info: https://lwn.net/Articles/806466/ "Developers split
over split-lock detection"

Douglas Katzman via Sbcl-devel

Apr 25, 2022, 11:01:52 AM
to Stas Boukarev, sbcl-devel
I was afraid this day might come.  I could see two modes of operation, "release mode" in which a 'call rel32' form is always used for function call but can never be changed (disallows function redefinition) and "develop mode" in which none of the static linking of fdefns is applied.  Whether dev/release can be entirely policy-based, I don't know.  I don't think people typically associate (declaim (optimize speed)) with an implication that it impedes dynamic redefinitions.

There are of course other possibilities:
1. always insert enough NOPs before every call site to align the CALL/JMP instruction's operand so that an atomic exchange works
2. don't use an atomic.  Maybe that's good enough for this use-case?  I'd worry that instruction decode on another CPU could see half of the old instruction and half the new instruction (torn read)
3. insert a 1-byte trap (0xCC) at the start of the instruction you're modifying so that if some other CPU hits it we can at least complain, then modify the instruction, then write the correct byte

I don't know the general expectation around being "in" a function and also modifying it.  I wonder what tooling such as https://dynamorio.org/ does about this problem?
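Option 1 above can be made concrete with a small alignment computation. Since 4 divides the 64-byte cache line, it is enough to pad so that the CALL's 4-byte rel32 operand is 4-byte aligned; an aligned operand can never straddle a line, so an aligned 32-bit store to it is atomic. An illustrative sketch in C (helper name hypothetical, not SBCL's actual code):

```c
#include <stddef.h>
#include <stdint.h>

#define CALL_OPCODE_LEN 1   /* E8 xx xx xx xx: one opcode byte, then rel32 */

/* NOP bytes (0..3) to emit before a CALL at `pc` so that its rel32
   operand lands on a 4-byte boundary, and therefore within one
   64-byte cache line. */
static size_t call_pad_bytes(uintptr_t pc) {
    size_t pad = 0;
    /* advance until the operand (pc + pad + opcode byte) is 4-byte aligned */
    while ((pc + pad + CALL_OPCODE_LEN) % 4 != 0)
        pad++;
    return pad;
}
```

The cost is up to 3 filler bytes per call site, paid once at compile time rather than at every redefinition.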

Stas Boukarev

Apr 25, 2022, 11:15:45 AM
to Douglas Katzman, sbcl-devel
Yeah, none of these options sound good. At least not better than
[ 2733.215970] x86/split lock detection: #AC: sbcl/11802 took a
split_lock trap at address: 0x52cc4bc0
[ 2735.300232] x86/split lock detection: #AC: sbcl/12335 took a
split_lock trap at address: 0x52cc4bc0
[ 2743.454923] x86/split lock detection: #AC: sbcl/12644 took a
split_lock trap at address: 0x52cc4bc0
[ 2750.396864] x86/split lock detection: #AC: sbcl/12648 took a
split_lock trap at address: 0x53be7990
[ 3389.566296] x86/split lock detection: #AC: repl-thread/13965 took a
split_lock trap at address: 0x52d11ef0

which could also be turned off with a boot option, apparently.

Douglas Katzman via Sbcl-devel

Apr 25, 2022, 1:10:36 PM
to Stas Boukarev, sbcl-devel
How about stopping all threads except the mutator? It's such a rare thing to "demote" call sites from wiring in the #&lt;function&gt; to indirecting through the FDEFN that I don't think it would bother people. Once a function is changed to use its FDEFN it will always stay that way (if that's not what I did, I can make it so).

We can stipulate that functions are only wired in when caller and callee are both in (speed 3). So less than speed 3 never experiences the overhead of remove-static-links.
In fact, we could advertise this last piece thusly in the manual: "Speed 3 removes one indirection from each call site, but there is more overhead on function redefinition. If a caller and callee are both compiled in speed 3, it is not guaranteed to be thread-safe to change definitions of functions that are on the execution stack of any currently running thread"
So strictly speaking this is less flexible than before, but: (a) it's keeping up with the evolution of hardware, and (b) if you weren't optimizing for speed then you likely never cared that FDEFNs impose extra overhead.

Stas Boukarev

Apr 25, 2022, 1:50:30 PM
to Douglas Katzman, sbcl-devel
I'm willing to just ignore this until it becomes a real problem.
Introducing different behavior is less than ideal, and disseminating
new information is always difficult.

Douglas Katzman via Sbcl-devel

Apr 27, 2022, 10:10:31 PM
to Stas Boukarev, sbcl-devel
I think we can replace the cmpxchg with an ordinary mov. It's probably not technically any more correct using cmpxchg. I do not see in the processor manual that it guarantees to fetch instructions atomically across cache lines; only that if an instruction was prefetched and then you store into that address it must invalidate the prefetch. So my comment claiming that there seems to be some atomicity seems misguided.
In any event I want to eliminate FDEFNs for symbols, and eliminate the requirement for code to be in immobile space, so when i go down that road I'm likely to redesign this anyway.
FDEFNs can be dispensed with for symbols because every symbol has 1 word that can hold a raw address. (We already handle symbols differently in GC due to the encoded package so no big deal to decode a raw word). And it's not onerous to require that all functions in a symbol be callable as if it were a SIMPLE-FUN - I need make-simplifying-trampoline to work universally first.
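For reference, the split-lock condition being discussed is easy to state: a locked read-modify-write (such as lock cmpxchg) whose operand crosses a 64-byte cache-line boundary forces the CPU into a bus lock, which is what recent Linux kernels warn about in dmesg (or optionally punish with SIGBUS). A minimal check, illustrative only and not taken from the SBCL sources:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* True if a `len`-byte access at `addr` straddles a cache-line
   boundary - the condition under which a locked RMW becomes a
   "split lock" and the kernel complains. */
static bool is_split_access(uintptr_t addr, size_t len) {
    return (addr / CACHE_LINE) != ((addr + len - 1) / CACHE_LINE);
}
```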

Douglas Katzman via Sbcl-devel

Apr 30, 2022, 10:20:23 PM
to Paul Khuong, sbcl-devel
Incredible.  It looks like Golang has some analysis of the workaround, which in turn links to some analysis by LLVM people. I guess we need to do that.


On Sat, Apr 30, 2022 at 9:36 PM Paul Khuong <pkh...@gmail.com> wrote:
On 4/27/22, Douglas Katzman via Sbcl-devel <sbcl-...@lists.sourceforge.net> wrote:
> I think we can replace the cmpxchg with an ordinary mov. It's probably not
> technically any more correct using cmpxchg. I do not see in the processor
> manual that it guarantees to fetch instructions atomically across cache
> lines; only that if an instruction was prefetched and then you store into
> that address it must invalidate the prefetch. So my comment claiming that
> there seems to be some atomicity seems misguided.

Empirically, Intel (and probably AMD as well) is much more careful
with respect to memory ordering between stores and the instruction
cache than the official documentation would make you think. However,
regular loads definitely observe non-atomic writes when the
destination straddles two cache lines, so I would assume this can
affect instruction decoding as well.

However, I think the real problem is the fact that control flow
instructions may cross cache line boundaries. Intel's erratum SKX 102
(https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html)
tells us that jump instructions that cross a cache line boundary can
result in unpredictable behaviour (i.e., rare and inexplicable bugs),
or a noticeable slowdown on processors with the microcode patch.

The recommended mitigation
(https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf)
is to emit redundant CS segment override bytes (up to 5) in front of
the instruction *immediately before* the jump or fused conditional
jump instructions. Intel recommends such padding before jmp, jcc, and
fused conditional jump; that's the bare minimum, and aligning calls
isn't a bad idea, if only for performance stability.

If SBCL can guarantee that control flow instructions don't cross 32
byte boundaries, that should also render atomicity worries mostly
moot.

Paul Khuong
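The erratum condition and the prefix-padding mitigation Paul describes can be sketched in a few lines: a jump that crosses, or ends on, a 32-byte boundary hits the slow microcode path on patched Skylake-era parts, and the fix is to emit up to 5 CS segment-override bytes (0x2E) of padding in front of it. These helpers are hypothetical, in the spirit of the LLVM/binutils "-mbranches-within-32B-boundaries" work, not SBCL code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a jump of `len` bytes at `pc` crosses or ends on a
   32-byte boundary (the JCC erratum condition). */
static bool jcc_erratum_affected(uintptr_t pc, size_t len) {
    uintptr_t end = pc + len;        /* first byte after the jump */
    return (pc / 32) != (end / 32);
}

/* 0x2E prefix bytes needed to push the jump off the boundary.
   Intel recommends at most 5; a larger return value means prefixes
   alone don't suffice and NOPs before the prior instruction are
   needed instead. */
static size_t jcc_pad_bytes(uintptr_t pc, size_t len) {
    size_t pad = 0;
    while (pad < 32 && jcc_erratum_affected(pc + pad, len))
        pad++;
    return pad;
}
```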

Stas Boukarev

Apr 30, 2022, 10:44:01 PM
to Douglas Katzman, sbcl-devel, Paul Khuong
I wouldn't say we need to, as the microcode patch is supposed to solve
the correctness issue; only the performance issue remains. Presumably
it has received a proper fix in later generations of CPUs, so after
some time it won't be a problem.
We normally don't do any CPU-specific optimizations as they are hard
to keep track of, measure and maintain.

Douglas Katzman via Sbcl-devel

Apr 30, 2022, 10:51:08 PM
to Stas Boukarev, sbcl-devel
On Sat, Apr 30, 2022 at 10:43 PM Stas Boukarev <stas...@gmail.com> wrote:

We normally don't do any CPU-specific optimizations as they are hard
to keep track of, measure and maintain.

In that case I vote for entirely getting rid of fast_bzero() and the popcnt workaround
Or to phrase it differently, what's the dividing line between CPU-specific optimizations that we do or don't do?

Stas Boukarev

Apr 30, 2022, 10:56:05 PM
to Douglas Katzman, sbcl-devel
Ok, what is fast_bzero even doing?

Stas Boukarev

Apr 30, 2022, 10:58:43 PM
to Douglas Katzman, sbcl-devel
> Or to phrase it differently, what's the dividing line between CPU-specific optimizations that we do or don't do?
We don't have anything that behaves differently between, say, Haswell and Zen 3.

Charles Zhang via Sbcl-devel

Apr 30, 2022, 11:09:37 PM
to Stas Boukarev, Douglas Katzman, sbcl-devel
I agree with Stas. I think especially because of the microcode aspect, it's hard to justify adding something so specific. It might make sense for llvm to do such fine grained analysis, but surely not for us. If we put in CPU-model specific workarounds, the threshold for how far-sighted they need to be feels much higher than this.

Douglas Katzman via Sbcl-devel

Jul 10, 2022, 9:51:22 PM
to Stas Boukarev, sbcl-devel
I think I plan to kill several birds (as many as 3) with one stone with regard to the split lock warnings.

When calling from immobile code to immobile code through an immobile fdefn, we can emit it as "call [rip+n]" where rip+n is an fdefn-raw-addr slot.
This call format removes the address range restriction, as long as the immobile code space and immobile fdefns are within +/- 2Gb.  So I'll separate immobile fixed objects into a space of fdefns and a space of symbols ensuring that fdefns remain near enough to pages of code.
This proposal basically re-introduces the level of indirection conferred by fdefns, while still not requiring a load of RAX. It would also eliminate the so-called "static linking", which may not have done much anyway. In fact, I would guess that if there is any slowdown, we can make it up with a different optimization similar in spirit to the unboxed call convention, such as avoiding loading the arg-count-passing registers for known-fixed-arg receivers.
And of course then the split-lock problem goes away because we'll never overwrite an instruction to change an fdefn's function.
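The key property of the "call [rip+n]" scheme can be sketched as follows: redefinition becomes a plain aligned store into the fdefn's raw-addr slot rather than a rewrite of instruction bytes, so no locked instruction (and hence no split lock) is ever involved. Field and function names here are illustrative, not SBCL's actual layout:

```c
#include <stdatomic.h>
#include <stdint.h>

/* The call instruction itself is immutable; it loads its target
   through this slot. The slot is naturally 8-byte aligned, so a
   single aligned store suffices to redirect every caller. */
struct fdefn {
    _Alignas(8) _Atomic uintptr_t raw_addr;   /* target of call [rip+n] */
};

static void set_fdefn_fun(struct fdefn *f, uintptr_t entry) {
    /* one aligned write - atomic without any lock prefix */
    atomic_store_explicit(&f->raw_addr, entry, memory_order_release);
}

static uintptr_t fdefn_fun(struct fdefn *f) {
    return atomic_load_explicit(&f->raw_addr, memory_order_acquire);
}
```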
For elfinated core files, I'll perhaps still change the call instruction from "call [rip+n]" to call rel32 form, but that's a separate question. As it happens, some of our cloud machines in fact disallow writable code in '.text' segments, which is where lisp code goes after conversion to ELF.

Also as a totally unrelated issue, but drawing on a similar principle, it may be possible to relax the requirement for immobile symbols to be mapped sub-4Gb because they could be computed with "LEA Rn, [rip+n]". But this tends to cost 1 extra instruction. Symbols really want to be imm32 operands, which would require that the in-register representation of a symbol be its displacement from the base of a hypothetical "symbol space". This is equivalent to a JVM's compressed pointer representation. We can do the same thing with compact-instance-header layouts being 32-bit pointers but allowing the layouts to be anywhere. It adds one instruction to lots of things that use layouts (to compute an actual address whence to load the inherited layouts) but it doesn't add any instruction for comparison of layouts for EQness.
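The compressed-representation idea can be sketched in a few lines: if the in-register representation of a symbol is its 32-bit displacement from a hypothetical symbol-space base (like the JVM's compressed oops), then EQ comparison works directly on the 32-bit values, and only dereferencing pays the extra LEA/ADD. Illustrative only:

```c
#include <stdint.h>

/* Recover the full address from the 32-bit compressed form;
   on x86-64 this is one LEA Rn, [base + disp]. */
static uint64_t decode_symbol(uint64_t space_base, uint32_t disp) {
    return space_base + disp;
}

/* EQness needs no decoding - compare the compressed values. */
static int symbol_eq(uint32_t a, uint32_t b) {
    return a == b;
}
```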