compare-and-swap

robf...@gmail.com

Jan 5, 2023, 3:07:16 AM
I have been searching for information on the compare-and-swap instruction.
Does it need to be a single instruction? It seems to me that it could be implemented using a sequence of instructions if interrupts were disabled and the bus was locked. A complication may be a page fault or some other fault during the execution of the compare-and-swap.
I would prefer to implement CAS as an atomic sequence of instructions using an ATOM modifier. That way CAS does not need to be added to the instruction set and existing instructions can be used. I could make the atomic sequence restartable since the start address of the sequence is known.

ATOM "LLLLAA"
LOAD a0,[a3]
CMP t0,a0,a1
PEQ t0,"TTF" ; predicate the following instructions
STORE a2,[a3]
LDI a0,1
LDI a0,0
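
For reference, the semantics the sequence above is meant to implement, sketched as a C11 compare-and-swap (a minimal sketch; the names are illustrative, not part of the proposed ISA):

#include <stdatomic.h>
#include <stdint.h>

/* Returns 1 and stores 'desired' if *p equals 'expected', else returns 0.
   The compare and the store must appear indivisible to other cores. */
static int cas64(_Atomic int64_t *p, int64_t expected, int64_t desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}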

Terje Mathisen

Jan 5, 2023, 3:31:50 AM
This is more or less what the HW sequencer has to do on any arch with
CAS; if your ATOM modifier still allows at least a similar chance of
making forward progress, then it would seem to be OK. Speed of execution
is important: assuming [a3] is already in Exclusive state in your local
$L1, the HW version _can_ run in a cycle or two, but if you first
have to wait to acquire the cache line, then it can be much slower, and
in that case the 3-5 cycles your version seems to require would probably
be OK.

If you do the same for XADD, then that primitive has better odds of
actually allowing a high-contention storm to abate.
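
A minimal sketch of the difference, assuming C11 atomics: a fetch-and-add hands every contender a unique value in a single operation, so none of them has to retry, whereas a CAS loop sends all but one contender around again:

#include <stdatomic.h>

static _Atomic long next_ticket;

/* Every caller gets a distinct ticket in one atomic op; on hardware
   with XADD/LDADD this needs no retry loop at all. */
static long take_ticket(void)
{
    return atomic_fetch_add(&next_ticket, 1);
}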

You should however look closely at what Mitch has done with much more
intelligent back-off-retry algorithms, in his favorite implementation
progress is guaranteed and worst case complexity is ~O(N) instead of
O(N^2) or worse.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Scott Lurndal

Jan 5, 2023, 11:33:01 AM
"robf...@gmail.com" <robf...@gmail.com> writes:
>It seems to me that it could be implemented using a sequence of instructions
>if interrupts were disabled and the bus was locked.
>[snip]

Neither disabling interrupts nor locking the bus is performant,
particularly at high core/thread counts. With
atomic compare and exchange, modern systems delegate such operations
to the point of coherency for performance.

robf...@gmail.com

Jan 5, 2023, 12:59:31 PM
Yeah, speed is the drawback to the atomic sequence of instructions. I did not like
the whole lock-then-operate-then-unlock approach from the core side because it would be
very slow. Its appeal is that it is simple to implement and does not require more
opcodes. Delegating to the point of coherency has its own issues. The ISA has only
three opcodes left at the root level and CMPXCHG would consume one, but it is
probably worth it to use one. I suppose an atomic sequence for CMPXCHG could be
fused into a single instruction then sent off. To delegate to the point of coherency,
the instruction would need to be sent through the on-chip network to the system
cache / DRAM controller. Then a read-modify-write cycle would need to take place. I
do not currently have the memory system supporting read-modify-write cycles; I set
it up to support LL / SC, which does not require RMW cycles. But I am getting the
feeling that it would be better to support CMPXCHG than LL / SC. I am not keen on
adding a decoder and state machine to the memory system to handle CMPXCHG and
XADD.
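
For comparison, CMPXCHG can be synthesized on top of LL / SC with a short retry loop, which is roughly what a memory-side implementation would have to do internally anyway. A sketch, where ll() and sc() are hypothetical intrinsics standing in for the load-linked / store-conditional pair (sc() returning nonzero on success):

/* hypothetical intrinsics: ll() loads and arms the reservation,
   sc() stores only if the reservation still holds */
long ll(long *p);
int sc(long *p, long v);

int cmpxchg_via_llsc(long *p, long expected, long desired)
{
    for (;;) {
        if (ll(p) != expected)
            return 0;            /* mismatch: fail without storing */
        if (sc(p, desired))
            return 1;            /* store completed atomically */
        /* reservation lost to another core: retry */
    }
}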

MitchAlsup

Jan 5, 2023, 3:04:57 PM
That is the general idea, however, for My 66000 ISA I used a bit in the 2-operand
memory references to denote "participation" in the ATOMIC event. When an inbound
memory reference has its lock bit set, it is said that the associated cache line is
participating in the event. When an outbound memory reference has the lock bit
set, it is the last instruction in the event. So, the above event would be in asm::
<
LD Ra0,[Rp].lock
CMP Rt,Ra0,Ra1
PEQ Rt,T
ST Ra2,[Rp].lock
MOV Rr,Rt,<1:eq>
<
a) you only have to predicate the store
b) you can extract True/False from the predicate condition
c) however this is subject to the ABA problem during the time since the
....corresponding LD of Ra1 many cycles above (not illustrated)
d) if you add a check for interference the ABA problem vanishes
<
LD Ra0,[Rp].lock
CMP Rt,Ra0,Ra1
PEQ Rt,TT
PINF --,F
ST Ra2,[Rp].lock
MOV Rr,Rt,<1:eq>
<

Scott Lurndal

Jan 5, 2023, 3:13:01 PM
MitchAlsup <Mitch...@aol.com> writes:
>That is the general idea, however, for My 66000 ISA I used a bit in the 2-operand
>memory references to denote "participation" in the ATOMIC event. When an inbound
>memory reference has its lock bit set, it is said that the associated cache line is
>participating in the event. When an outbound memory reference has the lock bit
>set, it is the last instruction in the event.
>[snip]

What prevents the operating system from rescheduling the
thread between the load and the store? What happens to
the lock on the cache line if that is the case?

MitchAlsup

Jan 5, 2023, 3:34:10 PM
The cache line is not locked, but it is monitored for interference.
The LD.lock instruction fetches the line for read-with-intent-to-modify
and sets up one of the miss buffers to monitor that cache line's physical
address. At the end of the LD, the line is present and writeable in the cache.
<
If we have not arrived at PINF and we receive a context switch
then IP is reset to @LD, the context switch happens, and the
ATOMIC event did not happen. When control returns, the event
is played from the LD.
<
If the participating cache line is "writeable" the scheduled context
switch will be deferred until after the ST and the ATOMIC event
will have succeeded (or failed by stepping over the ST). When
control returns, Rr is extracted from the CMP result and SW
can look at the value and do something reasonable.
<
If the participating cache line is not "writeable" then interference
will have been detected, the event fails and Rr will contain 0. SW
looks at Rr and decides the event failed and will go do something
else.

Scott Lurndal

Jan 5, 2023, 5:31:55 PM
Basically the same functionality as ARM64 Load Exclusive/Store Exclusive
without the conditional branch.

Which didn't scale to high processor/thread counts, which begat the ARMv8
Large System Extensions including instructions such as compare-and-swap
and the various load-and-op instructions (e.g. LDADD, LDSET, LDCLR, LDEOR et alia),
which could be forwarded to an external (to the CPU) agent such as a cache,
or to the PCIe Root Complex to encode as a TLP sent to a PCIe endpoint for handling.
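
For what it's worth, software does not have to name those instructions directly; a plain C11 atomic suffices, and, as I understand it, compilers targeting ARMv8.1+ with LSE lower it to LDADD rather than an exclusive-load/store loop:

#include <stdatomic.h>

/* A single far-atomic candidate: the add may be performed at the
   cache or device rather than pulling the line into this core. */
void bump(_Atomic unsigned long *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}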

[snip]
>If the participating cache line is not "writeable" then interference
>will have been detected, the event fails and Rr will contain 0. SW
>looks at Rr and decides the event failed and will go do something
>else.

Seems like a lot of work. Our SoCs forward the entire atomic
operation to the point of coherency (e.g. cache agent) and let that agent
handle the complete operation as well as supporting PCI-express. And
going forward, things like CXL.memory.

MitchAlsup

Jan 5, 2023, 5:51:37 PM
Except that up to 8 unique cache lines may participate in an ATOMIC event.
>
> Which didn't scale to high processor/thread counts, which begat the ARMv8
> Large System Extensions including instructions such as a compare-and-swap,
> and the various load-and-op instructions (e.g. LDADD, LDSET, LDCLR, LDXOR et alia)
> which could be forwarded to an external (to the CPU) agent such as a cache
> or PCIe Root Complex to encode as a TLP sent to an PCIe endpoint for handling.
<
In other words, they gave it a try and when it failed, they punted.
<
[snip]
<
> Seems like a lot of work. Our SoCs forward the entire atomic
> operation to the point of coherency (e.g. cache agent) and let that agent
> handle the complete operation as well as supporting PCI-express. And
> going forward, things like CXL.memory.
<
It is not a lot of work (for example, it is a lot less work than <say> an FP DIV.)
It is more work than a LD or a CAS--but it makes it possible to quit inventing
new ATOMIC instructions each generation. That is, HW can get out of the
game and let SW figure out what kinds of ATOMIC things it wants, needs,
and can use.
>
Most of what goes on is that when an inbound memory reference has the lock
bit set, the translated physical address is placed in a miss buffer where
it can watch whether the line is disturbed in any way malicious to an ATOMIC sequence.
Since you have multiple elements in the miss buffer, you can have multiple
lines participate in the event. {This path is already present in the cache access
pipeline stage(s).}
<
The only real "extra" work is ORing all the miss buffer entries into the
Interference signal and allowing the BC instruction to access that signal.
<
Well, pushing it out to the coherence point requires you to make a big bundle.
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn, *fp, *tn;          // local copies of the links
    esmLOCK( fn = fr->next );       // loads mark their lines as participating
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                  // touch: the pointed-to lines also participate
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )        // has anyone disturbed a participating line?
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCK( fr->next = tn );   // last locked store closes the ATOMIC event
        return TRUE;
    }
    return FALSE;
}
<
Everything but the returns is part of the ATOMIC event.

robf...@gmail.com

Jan 5, 2023, 6:33:56 PM
Okay, now I have gone down the path of pushing instruction sequences to the
coherence point rather than just individual instructions using ATOM to identify
the sequences. The sequences are short.

ATOM a0,"AAA"
LOAD a0,[a2]
ADD t0,a0,a1
STORE t0,[a2]
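
In C11 terms the sequence above is just an atomic fetch-and-add; a minimal sketch of what the bundled event computes (the old value corresponds to a0, the sum to t0):

#include <stdatomic.h>
#include <stdint.h>

static int64_t fetch_add(_Atomic int64_t *p, int64_t v)
{
    return atomic_fetch_add(p, v);  /* returns old value, stores old + v */
}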

"ATOM" would build a cache-line-wide sequence of data and instructions and
send it to the coherence point to be executed as a unit.

The above would encode a Load-add-store sequence. I think only a few bits are
needed for each instruction. Load seems to always be the first operation so
that may not need to be specified. I think there are only three source operands
and one destination. I am using 96-bit operand values so encoding four in a
cache line with ops should be possible.

The coherence point has its own mini-cpu with instruction set.

Scott Lurndal

Jan 5, 2023, 7:09:09 PM
What use case would require atomicity across 8 cache lines?

>In other words, they gave it a try and when it failed, they punted.

No, they started with the old ARMv7 mechanism and realized that it
didn't scale to the large processor counts around which high-core-count
ARMv8 systems were being designed.

I don't see your method as better than delegating to an agent,
and I suspect it would perform just as poorly under contention as
load/store exclusive or LL/SC.

Scott Lurndal

Jan 5, 2023, 7:17:23 PM
"robf...@gmail.com" <robf...@gmail.com> writes:
>Okay, now I have gone down the path of pushing instruction sequences to the
>coherence point rather than just individual instructions using ATOM to identify
>the sequences. The sequences are short.
>
>ATOM a0,"AAA"
>LOAD a0,[a2]
>ADD t0,a0,a1
>STORE t0,[a2]
>
>"ATOM" would build a cache-line-wide sequence of data and instructions and
>send it to the coherence point to be executed as a unit.

Traditionally, it's just done as another opcode on the bus/mesh/switch/fabric
between system components. Various flavors of loads and stores mainly.

It doesn't really make sense to force software to be aware of what
type of memory it is loading from and storing to; it's much more flexible
to simply have an atomic load instruction that sends the appropriate
opcode to the target. If the physical page underlying the program's virtual mapping
is in a PCI express device BAR space, the host bus/mesh/interconnect routes
the opcode to the root complex bridge; otherwise it is routed to the cache complex.

It seems counterintuitive to complicate each possible agent[*] with the logic
necessary to execute a sequence of arbitrary programs (which sound
much like IBM channel programs).

[*] cache agents, root complex agents, onboard coprocessors and functional block
MMIO registers such as hardware work schedulers, etc.


MitchAlsup

Jan 5, 2023, 8:42:21 PM
My routine that consumes 5 cache lines is the biggest ATOMIC primitive
I have "run into", although the cores I am working in and on have 8-entry
miss buffers, and as long as I can use 5 of them, why not be able to use
them all? So, 8 is more of a maximum limit that we expose to software.
>
Also note: the ATOMIC event can manipulate (i.e., modify) several
containers in each cache line. For example, one might modify a pointer
and also modify a status container telling which thread performed the
last state change of that pointer. These are often known as "cursors".
<
> No, they started with the old ARMv7 mechanism and realized that it
> didn't scale to large processor counts in ARMv8 around which high-core
> count systems were being designed.
>
> I don't see your method as better than delegating to an agent,
> and I suspect it would perform just as poorly under contention as
> load/store exclusive or LL/SC.
<
It has a feature (which we have not discussed) that SW can use
proactively to reduce or eliminate interference.
<
Say we have an event (such as a timer going off) and the servicing of
said timer causes <say> 100 threads to contend for core processing
services. Now, let us say we have 64 cores and 100 contenders. So,
64 cores attempt to take a unit of work off the work queues (with 100
entries on the queues). 1 of these 64 (at most) makes forward progress;
63 of them move a large number of cache lines and cache tag state
around only to determine that they did not make forward progress.
These 63 make another attempt, 1 makes forward progress, 62 do not,
etc.
<
This is why multiple cores taking work off a queue is at least quadratic
in "stuff" going on in the system (and often cubic when certain boundary
conditions are met.)
<
Now, imagine a system in which those 64 originating cores attempt the
same work removal from the queues. In pass 1, 1 core makes forward
progress and 63 do not. So, when performing pass 2, each core bundles
its ATOMIC event and sends it to an arbiter. The arbiter chooses who makes
forward progress, AND sends back a unique 'order' to the failing cores.
The cores see that they failed, read their order, and then access queue[order]
instead of queue[head], letting all 62 of them make forward progress simultaneously.
This takes an at least quadratic problem (BigO( n**2 )) and converts it into
a BigO( 3 ) problem: a near-worst-case problem becomes better than linear.
<
If you slog through the math you find, when inserting and removing work from
the queues roughly simultaneously, that the resulting throughput is about
BigO( ln2( contenders ) ), far better than BigO( contenders**2 ).
<
And that is why it scales.

MitchAlsup

Jan 5, 2023, 8:50:24 PM
On Thursday, January 5, 2023 at 6:17:23 PM UTC-6, Scott Lurndal wrote:
> "robf...@gmail.com" <robf...@gmail.com> writes:

> Traditionally, it's just done as another opcode on the bus/mesh/switch/fabric
> between system components. Various flavors of loads and stores mainly.
<
Except we want the case where there are multiple addresses and multiple units of
integer arithmetic on several of those memory addresses. So, it is not "just" an
OpCode.
>
> It doesn't really make sense to force software to be aware of what
> type of memory it is loading and storing to, it's much more flexible
> to simply have an atomic load instruction that sends the appropropriate
> opcode to the target. If the physical page underlying the program virtual mapping
> is in a PCI express device BAR space, the host bus/mesh/interconnect routes
<
The BAR in config space, or the MMIO space mapped by the BAR in config space ??
<
3 points:
a) I do not think config space supports anything other than uncacheable LD and ST.
b) I do not think it can be assumed that MMIO space can deal with other than
.... uncacheable LD and ST.
c) generally only DRAM is capable of performing under the cache coherent command
.... set.
<
> the opcode to the root complex bridge, otherwise it is routed to the cache complex.
<
Now who is sounding complicated.....
>
> It seems counterintuitive to complicate each possible agent[*] with the logic
> necessary to execute a sequence of arbitrary programs (which sound
> much like IBM channel programs).
<
No argument, here.
>
> [*] cache agents, root complex agents, onboard coprocessors and functional block
> MMIO registers such as hardware work schedulers, etc.
<
You forgot cache controllers, memory controllers, DRAM controllers.
>
> >
> >The above would encode a Load-add-store sequence. I think only a few bits a=
> >re
> >needed for each instruction. Load seems to always be the first operation so
> >that may not need to be specified. I think there are only three source oper=
> >ands
> >and one destination. I am using 96-bit operand values so encoding four in a
> >cache line with ops should be possible.
> >
> >The coherence point has its own mini-cpu with instruction set.
> >
<
Probably better to say an integer data path than a mini-CPU.
It certainly is not taking interrupts or allowed to deal with exceptions; no
FP, and likely not even LD, ST, or even branches.

Anton Ertl

Jan 6, 2023, 6:56:07 AM
sc...@slp53.sl.home (Scott Lurndal) writes:
>No, they started with the old ARMv7 mechanism and realized that it
>didn't scale to large processor counts in ARMv8 around which high-core
>count systems were being designed.

Why did it not scale? The only thing that comes to my mind is a
contended memory location that several cores want to change at
the same time, much of the time. Yes, in that case I can see that
doing it at a central place and only sending the results to the
individual cores can result in much higher throughput (say, one change
every few ns) than sending the cache lines involved to one of the
cores (20ns-80ns on current multi-core CPUs), letting it make the
change there, and then sending it onwards to the next core; and that
assumes that the other cores asking for the cache lines do not disturb
the chosen one.

But can such high contention not be avoided by redesigning the
software? Admittedly, solving a problem in hardware is often less
effort, but it's not clear to me why that should be the case here.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Anton Ertl

Jan 6, 2023, 7:03:19 AM
MitchAlsup <Mitch...@aol.com> writes:
>Say we have an event (such as timer goes off) and the servicing of
>said timer causes <say> 100 threads to contend for core processing
>services. Now, let us say we have 64 cores and 100 contenders. So,
>64 cores attempt to take a unit of work off the work queues (with 100
>entries on the queues). 1 of these 64 (at most) makes forward progress
>63 of them move a large number of cache lines and cache tag state
>around only to determine 63 of them did not make forward progress.
>These 63 make another attempt, 1 makes forward progress, 62 do not
>etc.

How about having per-core queues? The thread receiving the event puts
the 100 threads (or the first 64) into the per-core queues, or maybe
it does half of that work itself and delegates half of it to the first
core it wakes up.

Terje Mathisen

Jan 6, 2023, 9:50:09 AM
Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> [snip]
>
> How about having per-core queues? The thread receiving the event puts
> the 100 threads (or the first 64) into the per-core queues, or maybe
> it does half of that work itself and delegates half of it to the first
> core it wakes up.

That was how I designed my multi-threaded ntpd daemon: One core had to
be the designated receiver of network interrupts, since the required
linux kernel version did not allow multiple cores to attach to the same
IRQ signal.

This core did as little as it could possibly get away with, i.e. inspect
2-3 bytes in the packet header to verify that this was a standard client
time request, with no extra options like a peer-to-peer server or
packet authentication: At least 99.9% of all packets passed this ~10
cycle (plus typically one cache miss) gate, at which point the packet was
handed off to the N-1 worker cores, each of them having a single
producer/single consumer queue using XADD for all updates.
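
A minimal sketch of such a handoff queue in C11, assuming a power-of-two ring and exactly one producer and one consumer per queue, so the only cross-core traffic is the two counters:

#include <stdatomic.h>

#define RING 256                 /* power of two, assumed */

struct spsc {
    void *slot[RING];
    _Atomic unsigned head;       /* bumped only by the producer */
    _Atomic unsigned tail;       /* bumped only by the consumer */
};

static int push(struct spsc *q, void *pkt)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h - atomic_load_explicit(&q->tail, memory_order_acquire) == RING)
        return 0;                /* ring is full */
    q->slot[h & (RING - 1)] = pkt;
    /* the XADD: publish the slot with a single atomic increment */
    atomic_fetch_add_explicit(&q->head, 1, memory_order_release);
    return 1;
}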

robf...@gmail.com

Jan 6, 2023, 10:10:35 AM

>Probably better to say an integer-data-path than a mini-CPU.
>It certainly is not taking interrupts, allowed to deal with exceptions, no
>FP and likely not even LD, ST, or even branches.

I think I am going to try implementing a coherence point processor, CPP,
to see how well it works. I am sketching out a CPP with 16 opcodes and 13-bit
instructions. It will be able to load / store and branch according to a counted
loop, treating the instruction buffer as a shift register, so no program
counter even. I am reminded of the additional logic that does raster ops in
EGA video, or something like the COPPER in the AMIGA. The CPP will have
a set of "programs" allowing it to do compare-and-swap and
exchange-and-add and others. My main core will pass arguments and a
program number to execute to the CPP.

EricP

Jan 6, 2023, 10:30:32 AM
Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Thursday, January 5, 2023 at 4:31:55 PM UTC-6, Scott Lurndal wrote:
>> <
>>> Basically the same functionality as ARM64 Load Exclusive/Store Exclusive
>>> without the conditional branch.
>
>>> Which didn't scale to high processor/thread counts, which begat the ARMv8
>>> Large System Extensions including instructions such as a compare-and-swap,
>>> and the various load-and-op instructions (e.g. LDADD, LDSET, LDCLR, LDXOR et alia)
>>> which could be forwarded to an external (to the CPU) agent such as a cache
>>> or PCIe Root Complex to encode as a TLP sent to an PCIe endpoint for handling.
>> <
>> In other words, they gave it a try and when it failed, they punted.
>
> No, they started with the old ARMv7 mechanism and realized that it
> didn't scale to large processor counts in ARMv8 around which high-core
> count systems were being designed.

Is there a write-up of ARMv7 and/or ARMv8 LL/SC scaling issues someplace?
Are they any different from Alpha LL/SC scaling?
(I'm wondering if this is an ARM thing or LL/SC thing.)

Scott Lurndal

Jan 6, 2023, 10:53:05 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, January 5, 2023 at 6:09:09 PM UTC-6, Scott Lurndal wrote:

>> >Except that up to 8 unique cache lines may participate in an ATOMIC event.
>> What use case would require atomicity across 8 cache lines?
><
>My routine that consumes 5 cache lines is the biggest ATOMIC primitive
>I have "run into", although the cores I am working in and on have 8-entry
>miss buffers and as long as I can use 5 of them why not be able to use
>them all. So, 8 is more of a maximum limit that we expose to software.

You haven't elaborated on your use case. What ATOMIC primitive will require
access to 8 cache lines, or even 8 different memory addresses?

>>
>Also note: the ATOMIC event can manipulate (i.e., modify) several
>containers in each cache line. For example, one might modify a pointer
>and also modify a status container telling which thread performed the
>las state change of that pointer. These are often known of as "cursors".

That doesn't seem to be a common enough paradigm to require specific
hardware support.


>> I don't see your method as better than delegating to an agent,
>> and I suspect it would perform just as poorly under contention as
>> load/store exclusive or LL/SC.
><
>It has a feature (which we have not discussed) that SW can use pro
>actively to reduce or eliminate interference.
><
>Say we have an event (such as timer goes off) and the servicing of
>said timer causes <say> 100 threads to contend for core processing
>services. Now, let us say we have 64 cores and 100 contenders. So,
>64 cores attempt to take a unit of work off the work queues (with 100
>entries on the queues). 1 of these 64 (at most) makes forward progress
>63 of them move a large number of cache lines and cache tag state
>around only to determine 63 of them did not make forward progress.
>These 63 make another attempt, 1 makes forward progress, 62 do not
>etc.

Don't do that. Use more queues instead. Or add a hardware agent
to manage/schedule queues outside of the cores (useful when the work items in
the queues are passed between different hardware blocks like crypto,
compression, classification, ingress and egress blocks).

Scott Lurndal

Jan 6, 2023, 11:07:48 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, January 5, 2023 at 6:17:23 PM UTC-6, Scott Lurndal wrote:
>> "robf...@gmail.com" <robf...@gmail.com> writes:
>
>> Traditionally, it's just done as another opcode on the bus/mesh/switch/fabric
>> between system components. Various flavors of loads and stores mainly.
><
>Except we want the case where there are multiple addresses and multiple units of
>integer arithmetic on several of those memory addresses. So, it is not "just" an
>OpCode.

You want it, I can see that. Who is "we" - is there a real-world
use case where this justifies the added cost of the hardware (design, rtl, verif)?

>>
>> It doesn't really make sense to force software to be aware of what
>> type of memory it is loading and storing to, it's much more flexible
>> to simply have an atomic load instruction that sends the appropropriate
>> opcode to the target. If the physical page underlying the program virtual mapping
>> is in a PCI express device BAR space, the host bus/mesh/interconnect routes
><
>The BAR in config space, or the MMIO space mapped by the BAR in config space ??

BAR space in this context describes the address range mapped by the BAR register(s)
in the configuration space. Which could be MMIO space or real external memory.

><
>3 points:
>a) I do not think config space supports anything other than uncacheable LD and ST.

That's specific to the processor implementation. PCI Configuration space supports
the standard PCI Config TLPs for access (albeit perhaps not atomics). But that doesn't matter
since the PCI configuration space, with few exceptions, is only accessed at boot
time (and during hot-plug events) to configure the device for operation.
(An exception is that PCI config space registers are used for function-level reset).

>b) I do not think it can be assumed that MMIO space can deal with other than
>.... uncacheable LD and ST.

The PCI Express specification requires support for a whole host of access
types: loads, stores and atomics.

Our onboard devices also support operations like atomic 128-bit accesses
(synthesized from a LOAD PAIR or STORE PAIR instruction).


>c) generally only DRAM is capable of performing under cache coherent command
>.... set.

Generally processors have a mechanism (MTRR, MAIR) for software to instruct hardware that
certain ranges of memory addresses are uncacheable; MMIO regions where accesses
have side effects are generally mapped uncacheable. MMIO regions where accesses
do not have side effects (e.g. DRAM) can be mapped cacheable.

><
>> the opcode to the root complex bridge, otherwise it is routed to the cache complex.
><
>Now who is sounding complicated.....

Every modern processor from Intel and AMD to ARM has this capability. If your
processor is intended to be competitive with those processors, you'll need
that complexity as well.

How does your processor route MMIO accesses to a PCIe device like a SATA
controller or a network interface card? Fixed physical addresses are
discouraged by all the major OS vendors.

>>
>> It seems counterintuitive to complicate each possible agent[*] with the logic
>> necessary to execute a sequence of arbitrary programs (which sound
>> much like IBM channel programs).
><
>No argument, here.
>>
>> [*] cache agents, root complex agents, onboard coprocessors and functional block
>> MMIO registers such as hardware work schedulers, etc.
><
>You forgot cache controllers, memory controllers, DRAM controllers.

Actually, I didn't. I said "cache agents". In many processors cache agents
are the direct interface to the CPU and accesses to the memory controller
happen from the cache agent (depending on the memory type, the cache agent
may bypass the cache and interface directly to memory).

Scott Lurndal

Jan 6, 2023, 11:17:35 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>sc...@slp53.sl.home (Scott Lurndal) writes:
>>No, they started with the old ARMv7 mechanism and realized that it
>>didn't scale to large processor counts in ARMv8 around which high-core
>>count systems were being designed.
>
>Why did it not scale? The only thing that comes to my mind is that
>it's a contended memory location that several cores want to change at
>the same time, much of the time).

A big problem was fairness, i.e. avoiding starvation, when a large
number of cores were accessing the same object.

> Yes, in that case I can see that
>doing it at a central place and only sending the results to the
>individual cores can result in much higher throughput (say one change
>every few ns) than sending the cache lines involved to one of the
>cores (20ns-80ns on current multi-core CPUs), and letting it make the
>change there, and then sending it onwards to the next core; and that
>assumes that the other cores asking for the cache lines do not disturb
>the chosen one.

Indeed.

>
>But can such high contention not be avoided by redesigning the
>software? Admittedly, solving a problem in hardware is often less
>effort, but it's not clear to me why that should be the case here.

I spent a few weeks at LLNL a bit over a decade ago. One of their
acceptance tests included spinning on a single spinlock from all
cores (256 in our system) at the same time. The system we were testing was built
from AMD Istanbul cores with a custom ASIC on HyperTransport to
extend the processor coherency domain to other nodes over infiniband
(or 10G ethernet, but ethernet switches had double the latency of
IB switches). In Intel/AMD systems, when cache line contention
causes an internal cache timer to fire because the cache couldn't
acquire the line, the core will revert to the system-wide
bus lock, which absolutely killed performance. Removing the
cache contention by handling atomics at the cache rather than
acquiring the cache line is a partial solution for this type of
scaling problem. (Customers aren't generally willing to
rewrite applications if they don't have to....)

Since PCI-Express also supports atomic operations, having
the ability to transfer the operation to the endpoint makes
sense.

EricP

unread,
Jan 6, 2023, 11:24:11 AMJan 6
to
How is this CPP different from a single serializing lock?


EricP

Jan 6, 2023, 11:24:12 AM
MitchAlsup wrote:
> On Thursday, January 5, 2023 at 6:09:09 PM UTC-6, Scott Lurndal wrote:
>>
>> I don't see your method as better than delegating to an agent,
>> and I suspect it would perform just as poorly under contention as
>> load/store exclusive or LL/SC.
> <
> It has a feature (which we have not discussed) that SW can use pro
> actively to reduce or eliminate interference.
> <
> [snip]
> <
> This is why multiple cores taking work of a queue is at least quadratic in
> "stuff" goin on in the system (and often cubic when certain boundary
> conditions are met.)

Yes, the cost of the try-fail-retry approach is the sum of the
series 63 + 62 + 61 + ... = O(N^2).

There is also the "thundering herd" problem, when the 62 waiters
all wake up at once and fight to see who is the next winner.
If the herd all hammer the same single memory location to decide
the next winner, say using an atomic compare-exchange, and the
underlying coherence protocol is *also* using try-fail-retry,
then it can wind up with O(N^4) coherence message cost.

This is what happens with TTS (Test and Test and Set) spinlocks.
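
For concreteness, a TTS acquire looks something like this in C11; the read-only inner spin is the "test" that lets waiters share the line, and the exchange is the contended RMW that sets off the ping-pong:

#include <stdatomic.h>

static void tts_acquire(atomic_int *lock)
{
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed))
            ;                    /* spin read-only on a shared copy */
        if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
            return;              /* won the line exclusively */
    }
}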

For high contention it is also important to manage who retries and when.

> <
> Now, imaging a system in which those 64 originating cores attempt the
> same work removal from the queues. In pass 1, 1 core makes forward
> progress and 63 do not. So, when performing pass 2, each core bundles
> its ATOMIC event and sends it to an arbitor. The arbitor chooses who makes
> forward progress, AND sends back a unique 'order' to the failing cores.
> The cores see that they failed, read their order, and then access queue[order]
> instead of queue[head] making all 62 cores make forward progress simultaneously.
> This has taken an at least quadratic problem (BigO( n**2 )) and converted it into
> a BigO( 3 ) problem. Converting a near worst case problem to better than linear.
> <
> If you slug through the math you find, when inserting and removing work from
> the queues rather simultaneously, that the resulting throughput is about
> BigO( ln2( contenders) ), far better than BigO( contenders**2 ).
> <
> And that is why it scales.

In situations where there is inherent high contention for
a single resource, such as multiple producers and consumers
all hammering the same linked-list queue head,
transactions aren't going to make any difference and
may make things worse by adding more overhead.

The simplest and most efficient solution here is to use an
MCS/CLH spin-lock queue that we have previously discussed here,
as that only requires an atomic exchange on one location
to build a queue of separate cache lines to spin-lock on.

Transactions look most useful when there are multiple low probability
potential points of contention, such as lock-free balancing an
AVL tree for multiple concurrent inserters, deleters, and readers.



Scott Lurndal

Jan 6, 2023, 12:20:42 PM
I'm not sure; most of what I have is under NDA.

>Are they any different from Alpha LL/SC scaling?
>(I'm wondering if this is an ARM thing or LL/SC thing.)

I believe many of the same faults apply to both since the
cache line can be moved to another core between the
load and store.

robf...@gmail.com

Jan 6, 2023, 9:54:47 PM
It could be used to implement a single serializing lock, so it is not different. I had
thought being able to issue more memory operations under the same lock
would be useful. However, once a flag is set to lock an area, ordinary instructions
could be used.
I have set the CPP aside for now as being overly complex. Unless a really
complicated AMO is desired, it is easier just to implement indivisible RMW
cycles at the coherence point. Implementing the RMW cycles bumped up the
size of my memory controller a lot, about 25%: 21,000 to 26,000 LUTs. A CPP
would be worse still. It will be a challenge getting the whole project to fit.
From the core's perspective, how things are implemented at the coherence point
should be irrelevant. It just sends out an opcode to that point. I think I have got
the idea now. And using the Wishbone bus leaves something to be desired, as it does not
support CMPXCHG directly. I must build onto it.

EricP

Jan 7, 2023, 2:25:41 PM
There are two uses for Read-Modify-Write (RMW) memory operations:
as non-interruptible sequences for small communications between priority
interrupt handlers and lower or non-interrupt levels within a single core,
and as atomic operations between cores in an SMP system.

Non-interruptible operations are needed so that an OS does not
have to be continually disabling and enabling interrupts as that
can be very expensive.

Non-interruptible operations are locally coherent because a
single core always sees its own memory operations in order,
and so they do not require any memory barriers.

The set of non-interruptible instructions is the same as the atomic ones:
NFADD Non-interruptible fetch then ADD
NFAND Non-interruptible fetch then AND
NFOR Non-interruptible fetch then OR
NFXOR Non-interruptible fetch then XOR
NSWAP Non-interruptible SWAP
NCAS Non-interruptible Compare-And-Swap

There is also a use for some double-wide swap operations:
NSWAPW Non-interruptible SWAP Wide
NCASW Non-interruptible Compare-And-Swap Wide

Atomic operations have the additional requirement for various
memory barriers to ensure the global coherence rules are followed.
The barriers are applied locally first, such as flushing local caches,
then at the cache coherence level, such as synchronizing with the
inbound and outbound cache comms message queues,
before the memory RMW changes can be applied.

Those barriers can be explicit or implicit.
If those barriers are explicit then to get atomic operations
they would be combined with the non-interruptible instructions:
MEMBAR_RW
NFADD
MEMBAR_RW

If those barriers are implied in the atomic instruction then there
is a second set of RMW instructions which combine the non-interruptible
operations with the implied barriers:
AFADD, AFAND, AFOR, AFXOR, ASWAP, ACAS, ASWAPW, and ACASW.

The advantage to having the barriers explicit is that a
programmer may know more about their use context and may
be able to use a weaker barrier with fewer performance implications.
Also, multiple instructions can occur between two barriers.

With the implied barriers in atomic instructions they must
assume the worst case and use the strongest barrier possible,
applied both before and after each atomic instruction.
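
For what it's worth, C11's memory_order arguments express much the same split: the plain form implies the strongest ordering, and a programmer who knows the context can relax it (the mapping to the NFADD/AFADD naming above is only a rough analogy):

#include <stdatomic.h>

static _Atomic long x;

void orders(void)
{
    atomic_fetch_add(&x, 1);                                 /* implied: seq_cst, ~AFADD */
    atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);  /* atomic, no barriers, ~NFADD */
    atomic_fetch_add_explicit(&x, 1, memory_order_acq_rel);  /* explicit, weaker barrier */
}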

My inclination would be to do all of the above, and implement
both the non-interruptible and atomic RMWs in the cache controller.
Sorting out the barriers and their effects on various things
like the load-store queue, store buffers, and synchronization with
pending inbound and outbound coherence messages
looks to me like a far bigger job than the RMW operations.



EricP

Jan 7, 2023, 4:40:56 PM
robf...@gmail.com wrote:
> I have set the CPP aside for now as being overly complex. Unless a really
> complicated AMO is desired, it is easier just to implement indivisible RMW
> cycles at the coherence point. Implementing the RMW cycles bumped up the
> size of my memory controller a lot, about 25%. 21,000 to 26,000 LUTs.
> [snip]

If implementation cost is a consideration then LL/SC might be
the way to go. It doesn't require putting an ALU and
sequencing logic into the cache controller; it just requires
pinning the cache line for a period of time while the lock is held.
It requires releasing the lock if an interrupt occurs,
or a timer-counter times out, or when an SC is successful.
LL/SC also gives you all the non-interruptible operations for free.
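
A sketch of that "for free" property, again with hypothetical ll()/sc() intrinsics standing in for the pair: every fetch-and-op is the same retry loop with a different ALU operation in the middle, so no per-op hardware is needed:

/* hypothetical intrinsics for the LL/SC pair */
long ll(long *p);
int sc(long *p, long v);

long nfadd(long *p, long v)
{
    long old;
    do {
        old = ll(p);             /* load and arm the reservation */
    } while (!sc(p, old + v));   /* swap '+' for &, |, ^ to get NFAND/NFOR/NFXOR */
    return old;
}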

The performance cost is that the line stays pinned slightly longer than
it would with the explicit atomic RMWs, and that may require
shutting off the cache coherence comms queues,
and maybe that affects system-wide performance.

The memory barrier functions look the same either way.

EricP

Jan 8, 2023, 12:03:29 PM
This is a well known problem for Test and Test and Set (TTS) spinlocks
under high contention - high rates of cache line ping-pong,
many nodes trying to do a read_exclusive at the same instant
causing timeouts and retries at the coherence protocol level, and
floods of messages on the coherence network causing transmission backlogs.

> Removing the
> cache contention by handling atomics at the cache rather than
> acquiring the cache line is a partial solution for this type of
> scaling problem

Moving the atomic operations from the core to the cache controller
should eliminate the window during which a cache line is pinned,
and that may help system performance, but it does not eliminate the
ping-pongs or lessen the number of messages.

> (Customers aren't generally willing to
> rewrite applications if they don't have to....)

TTS spinlocks are used because they are cheap - 1 DWORD per lock.
The problem is that the TTS spinlock method is too simple, and
it should only be used when contention is known to be unlikely.

The solution is a spin-queue, which builds a linked list of cache lines
with each waiter spinning on a separate cache line, so no ping-pongs,
no thundering herds fighting for exclusive access to the same DWORD.
Most major OSes and many user-mode multithreading apps now use such
spin-queues wherever there is high contention for a data structure.
The cost is one cache line per lock, plus one line per contender,
and each lock has a pointer to the list head.
Enqueuing a lock request is just an atomic exchange on the list head,
which sees none to very light contention.
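
A minimal MCS-style sketch in C11, assuming each contender supplies its own node (sized/aligned to a cache line in practice) to spin on:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool             locked;
};

void sq_acquire(_Atomic(struct qnode *) *tail, struct qnode *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    struct qnode *prev = atomic_exchange(tail, me);  /* the one contended op */
    if (prev) {
        atomic_store(&prev->next, me);               /* link in behind prev */
        while (atomic_load(&me->locked))
            ;                                        /* spin on our own line only */
    }
}

void sq_release(_Atomic(struct qnode *) *tail, struct qnode *me)
{
    struct qnode *next = atomic_load(&me->next);
    if (!next) {
        struct qnode *expect = me;
        if (atomic_compare_exchange_strong(tail, &expect, NULL))
            return;                                  /* no waiter: lock is free */
        while (!(next = atomic_load(&me->next)))
            ;                                        /* waiter mid-enqueue: wait for link */
    }
    atomic_store(&next->locked, false);              /* hand the lock to the next waiter */
}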

In terms of code changes, it's a few hours' work replacing SpinlockXxx
functions with SpinqueueXxx functions.

