>> On Thursday, January 5, 2023 at 4:31:55 PM UTC-6, Scott Lurndal wrote:
>> > MitchAlsup <Mitch...@aol.com> writes:
>> > >On Thursday, January 5, 2023 at 2:13:01 PM UTC-6, Scott Lurndal wrote:
>> > >> MitchAlsup <Mitch...@aol.com> writes:
>> > >> >On Thursday, January 5, 2023 at 2:07:16 AM UTC-6, robf...@gmail.com wrote:
>> > >> >> I have been searching for information on the compare-and-swap
>> > >> >> instruction. Does it need to be a single instruction? It seems to
>> > >> >> me that it could be implemented using a sequence of instructions
>> > >> >> if interrupts were disabled and the bus was locked. A complication
>> > >> >> may be a page fault or some other fault during the execution of
>> > >> >> the compare-and-swap.
>> > >> >> I would prefer to implement CAS as an atomic sequence of
>> > >> >> instructions using an ATOM modifier. That way CAS does not need to
>> > >> >> be added to the instruction set and existing instructions can be
>> > >> >> used. I could make the atomic sequence restartable since the start
>> > >> >> address of the sequence is known.
>> > >> >>
>> > >> >> ATOM "LLLLAA"
>> > >> >> LOAD a0,[a3]
>> > >> >> CMP t0,a0,a1
>> > >> >> PEQ t0,"TTF" ; predicate the following instructions
>> > >> >> STORE a2,[a3]
>> > >> >> LDI a0,1
>> > >> >> LDI a0,0
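For reference, the contract that ATOM'd sequence has to deliver is the
standard CAS contract. A minimal C11 sketch of the same semantics
(cas_word is an illustrative name, not part of the proposal):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Semantics of the LOAD/CMP/PEQ/STORE sequence above: if *addr still
   equals expected, store desired and return true (a0=1); otherwise
   leave *addr alone and return false (a0=0). */
static bool cas_word(_Atomic long *addr, long expected, long desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```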
>> > >> ><
>> > >> >That is the general idea; however, for My 66000 ISA I used a bit in
>> > >> >the 2-operand memory references to denote "participation" in the
>> > >> >ATOMIC event. When an inbound memory reference has its lock bit set,
>> > >> >the associated cache line is said to be participating in the event.
>> > >> >When an outbound memory reference has the lock bit set, it is the
>> > >> >last instruction in the event. So, the above event would be, in asm::
>> > >> ><
>> > >> > LD   Ra0,[Rp].lock
>> > >> > CMP  Rt,Ra0,Ra1
>> > >> > PEQ  Rt,T
>> > >> > ST   Ra2,[Rp].lock
>> > >> > MOV  Rr,Rt,<1:eq>
>> > >> ><
>> > >> >a) you only have to predicate the store
>> > >> >b) you can extract True/False from the predicate condition
>> > >> >c) however, this is subject to the ABA problem during the time
>> > >> >....between the corresponding LD Ra1 many cycles above (not
>> > >> >....illustrated) and the store here
>> > >> >d) if you add a check for interference, the ABA problem vanishes
>> > >> ><
>> > >> > LD   Ra0,[Rp].lock
>> > >> > CMP  Rt,Ra0,Ra1
>> > >> > PEQ  Rt,TT
>> > >> > PINF --,F
>> > >> > ST   Ra2,[Rp].lock
>> > >> > MOV  Rr,Rt,<1:eq>
>> > >> ><
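Point (c) is the classic lock-free-stack failure: a node can be popped,
recycled, and pushed back between the original load and the compare, so
the compare succeeds against a stale snapshot. A hedged C11 sketch of
the vulnerable pattern (hypothetical pop() over a single-word CAS, with
no interference check):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative lock-free stack node; shows why a plain CAS on the head
   pointer alone is ABA-vulnerable. */
struct node { struct node *next; };

static struct node *pop(_Atomic(struct node *) *head)
{
    struct node *old, *next;
    do {
        old = atomic_load(head);
        if (old == NULL)
            return NULL;
        next = old->next;  /* another thread may pop old AND its
                              successor here, then push old back:
                              head == old again (A-B-A), but next
                              is now stale */
    } while (!atomic_compare_exchange_weak(head, &old, next));
    return old;            /* under ABA, head now points at a freed
                              or re-linked node */
}
```

An interference check on the line (point (d)) closes exactly this
window, because the intermediate pops dirty the monitored line even
though the final value compares equal.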
>> > >> What prevents the operating system from rescheduling the
>> > >> thread between the load and the store? What happens to
>> > >> the lock on the cache line if that is the case?
>> > ><
>> > >The cache line is not locked, but it is monitored for interference.
>> > >The LD.lock instruction fetches the line for read-with-intent-to-modify
>> > >and sets up one of the miss buffers to monitor that cache line's
>> > >physical address. At the end of the LD, the line is present and
>> > >writeable in the cache.
>> <
>> > Basically the same functionality as ARM64 Load Exclusive/Store
>> > Exclusive without the conditional branch.
>> <
>> Except that up to 8 unique cache lines may participate in an ATOMIC
>> event.
>> >
>> > Which didn't scale to high processor/thread counts, which begat the
>> > ARMv8 Large System Extensions, including instructions such as
>> > compare-and-swap and the various load-and-op instructions (e.g. LDADD,
>> > LDSET, LDCLR, LDEOR, et alia) which could be forwarded to an agent
>> > external to the CPU, such as a cache or PCIe Root Complex, to encode
>> > as a TLP sent to a PCIe endpoint for handling.
>> <
>> In other words, they gave it a try and when it failed, they punted.
>> <
>> > ><
>> > >If we have not arrived at the PINF and we receive a context switch,
>> > >then IP is reset to @LD, the context switch happens, and the
>> > >ATOMIC event did not happen. When control returns, the event
>> > >is replayed from the LD.
>> > ><
>> > >If the participating cache line is "writeable", the scheduled context
>> > >switch will be deferred until after the ST, and the ATOMIC event
>> > >will have succeeded (or failed by stepping over the ST). When
>> > >control returns, Rr is extracted from the CMP result and SW
>> > >can look at the value and do something reasonable.
>> > ><
>> > >If the participating cache line is not "writeable", then interference
>> > >will have been detected, the event fails, and Rr will contain 0. SW
>> > >looks at Rr, decides the event failed, and will go do something
>> > >else.
>> <
>> > Seems like a lot of work. Our SoCs forward the entire atomic
>> > operation to the point of coherency (e.g. cache agent) and let that
>> > agent handle the complete operation, as well as supporting
>> > PCI Express. And, going forward, things like CXL.memory.
>> <
>> It is not a lot of work (for example, it is a lot less work than <say>
>> an FP DIV). It is more work than a LD or a CAS--but it makes it possible
>> to quit inventing new ATOMIC instructions each generation. That is, HW
>> can get out of the game and let SW figure out what kinds of ATOMIC
>> things it wants, needs, and can use.
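Even with only a single-word CAS, software can already synthesize
atomics the ISA never named; a multi-line ATOMIC event just widens the
same idea. A sketch of one such "invented" atomic, an atomic maximum
built from a C11 CAS retry loop (atomic_max is an illustrative name):

```c
#include <stdatomic.h>

/* An atomic primitive with no dedicated instruction: raise *p to at
   least v, atomically, via a CAS retry loop.  Returns the resulting
   maximum. */
static long atomic_max(_Atomic long *p, long v)
{
    long cur = atomic_load(p);
    while (cur < v &&
           !atomic_compare_exchange_weak(p, &cur, v))
        ;                       /* on failure, cur is refreshed with
                                   the current value of *p */
    return cur < v ? v : cur;
}
```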
>> >
>> Most of what goes on is that when an inbound memory reference has the
>> lock bit set, the translated physical address is placed in a miss
>> buffer, where it can watch whether the line is disturbed in any manner
>> malicious to an ATOMIC sequence. Since you have multiple elements in
>> the miss buffer, you can have multiple lines participate in the event.
>> {This path is already present in the cache access pipeline stage(s).}
>> <
>> The only real "extra" work is ORing all the miss buffer entries into
>> the Interference signal and allowing the BC instruction to access that
>> signal.
>> <
>> Well, pushing it out to the coherence point requires you to make a big
>> bundle.
>> <
>> BOOLEAN MoveElement( Element *fr, Element *to )
>> {
>>     esmLOCK( fn = fr->next );
>>     esmLOCK( fp = fr->prev );
>>     esmLOCK( tn = to->next );
>>     esmLOCK( fn );
>>     esmLOCK( fp );
>>     esmLOCK( tn );
>>     if( !esmINTERFERENCE() )
>>     {
>>         fp->next = fn;
>>         fn->prev = fp;
>>         to->next = fr;
>>         tn->prev = fr;
>>         fr->prev = to;
>>         esmLOCK( fr->next = tn );
>>         return TRUE;
>>     }
>>     return FALSE;
>> }
>> <
>> Everything but the returns is part of the ATOMIC event.
>
>Okay, now I have gone down the path of pushing instruction sequences to
>the coherence point, rather than just individual instructions, using ATOM
>to identify the sequences. The sequences are short.
>
>ATOM a0,"AAA"
>LOAD a0,[a2]
>ADD t0,a0,a1
>STORE t0,[a2]
>
>"ATOM" would build a cache-line-wide sequence of data and instructions
>and send it to the coherence point to be executed as a unit.
Traditionally, it's just done as another opcode on the bus/mesh/switch/fabric
between system components. Various flavors of loads and stores, mainly.
It doesn't really make sense to force software to be aware of what type
of memory it is loading from and storing to; it's much more flexible to
simply have an atomic load instruction that sends the appropriate opcode
to the target. If the physical page underlying the program's virtual
mapping is in a PCI Express device BAR space, the host
bus/mesh/interconnect routes the opcode to the root complex bridge;
otherwise it is routed to the cache complex. It seems counterintuitive
to complicate each possible agent[*] with the logic necessary to execute
a sequence of arbitrary programs (which sounds much like IBM channel
programs).

[*] cache agents, root complex agents, onboard coprocessors, and
functional-block MMIO registers such as hardware work schedulers, etc.
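That is also how the compiler builtins already present it to software:
the same source line becomes a load-and-op opcode, and the interconnect
decides from the physical address where it executes. A sketch using the
GCC/Clang __atomic builtins (bump is an illustrative wrapper; built with
-march=armv8.1-a this typically compiles to a single LDADD):

```c
#include <stdint.h>

/* The same fetch-and-add source works whether 'counter' resolves to
   ordinary DRAM behind a cache agent or to a device BAR; the routing
   decision is made by the interconnect, not by software.  Returns the
   value held before the add. */
static uint64_t bump(uint64_t *counter, uint64_t n)
{
    return __atomic_fetch_add(counter, n, __ATOMIC_SEQ_CST);
}
```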
>
>The above would encode a Load-add-store sequence. I think only a few bits
>are needed for each instruction. Load seems to always be the first
>operation so that may not need to be specified. I think there are only
>three source oper