The hardware implementation of atomic instructions in RISC-V and performance differences compared to HTM


jagten leo

May 24, 2023, 9:19:56 AM
to RISC-V HW Dev
1. I'm confused by "registers a reservation set" in the sentence "LR.W loads a word from the address in rs1, places the sign-extended value in rd, and registers a reservation set—a set of bytes that subsumes the bytes in the addressed word" from page 48 of Volume I: RISC-V Unprivileged ISA V20191213.
I want to know whether the reservation set is registered in the cache during the load request issued by the load-reserved instruction, because the CPU cannot be sure whether a store conditional will succeed when it executes one; only the L1 cache knows first whether an ST has occurred after the LR, so the CPU's store conditional cannot be retained in the store buffer and should be sent to the cache. If the store conditional carries the rl attribute, does that mean the entries ahead of it in the store buffer must be sent to the cache, in order, before the store conditional's entry can be sent? The cache also needs special handling when it receives the store conditional request: if any probe of a store matching the address is received between the LR and the SC, the SC should fail, and a feedback signal should tell the CPU to set the store conditional instruction's rd register non-zero and flush the instructions beginning with the LR.
Is there something wrong with my understanding?

2. I'm also confused by "set of bytes that subsumes the bytes in the addressed word" in the sentence above!
From what I understand, if the reservation set is on the cache side, then the granularity of the cache's conflict detection is a cache line; is it necessary to record it at byte granularity?

3. I'm confused by "livelock" in the sentence "The main disadvantage of LR/SC over CAS is livelock, which we avoid, under certain circumstances, with an architected guarantee of eventual forward progress as described below."
As I understand it, the interconnect will serialize stores to the same cache line, and the core that executes its SC first always succeeds while the other core, which executes its SC later, fails! So I can't imagine any scenario that could lead to a livelock. I saw a post saying that a load will also cause an SC failure, but this is not consistent with the stipulation in the spec: "An SC may succeed only if no store from another hart to
the reservation set can be observed to have occurred between the LR and the SC, and if there is no other SC between the LR and itself in program order."

4. Can someone tell me how real RISC-V hardware implements the amoadd instruction, from the perspective of CPU and interconnect collaboration?
All I can think of is a macro-instruction implementation, such as the CPU loading the value and then locking the bus, releasing it only when the write has completed, to ensure the atomic semantics of the read-modify-write! Or, as described in the spec, implementing AMOs at the memory controllers.
[attachment: Snipaste_2023-05-24_21-13-21.jpg]
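One way to realize the second option mentioned above (AMOs at the memory controllers, sometimes called "far atomics") is for the controller that owns the data to apply the read-modify-write itself, serializing AMO requests from all harts so that no bus lock is needed. Here is a toy Python model of that idea; all names (MemoryController, amoadd, the lock) are purely illustrative, not a real interface:

```python
# Toy model of "AMOs at the memory controller": the controller owns the
# memory and applies the read-modify-write itself, serializing requests
# from all harts, so no bus lock or LR/SC retry loop is needed.
from threading import Thread, Lock

class MemoryController:
    def __init__(self):
        self.mem = {}
        self._serial = Lock()    # models the controller handling one AMO at a time

    def amoadd(self, addr, operand):
        with self._serial:       # the whole RMW happens inside the controller
            old = self.mem.get(addr, 0)
            self.mem[addr] = old + operand
            return old           # an AMO returns the original memory value

mc = MemoryController()
threads = [Thread(target=lambda: [mc.amoadd(0x1000, 1) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(mc.mem[0x1000])            # 4000: no updates lost despite 4 concurrent harts
```

The key property modeled here is that the atomicity lives at the point of serialization (the controller), not in the requesting CPU.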

5. Does anyone know about the possible performance differences between HTM and atomic instructions? As the spec says: "More generally, a multi-word atomic primitive is desirable, but there is still considerable debate about what form this should take, and guaranteeing forward progress adds complexity to a system. Our current thoughts are to include a small limited-capacity transactional memory buffer along the lines of the original transactional memory proposals as an optional standard extension “T”"
[attachment: Snipaste_2023-05-24_21-17-17.jpg]
Any help will be appreciated!

Valentin Nechayev

Jun 14, 2023, 5:36:05 AM
to jagten leo, RISC-V HW Dev
hi,

> 1. I'm confused with "*registers a reservation set*" in the sentence "LR.W
> loads a word from the address in rs1, places the sign-extended value in rd,
> and registers a reservation set—a set of bytes that subsumes the bytes in
> the addressed word" from page 48 Volume I: RISC-V Unprivileged ISA
> V20191213.
> I want to know if the reservation set is registered in the cache during the
> load request from the load-reserved instruction,

Yes (provided that the memory region is cacheable).

> Because the CPU cannot be
> sure if it will store conditional successfully when executing store
> conditional, only the L1 Cache first knows if an ST has occurred after LR,

SC?

> and the CPU's store conditional cannot be retained in the store buffer and
> should be sent to the cache. If the store conditional carries the rl attribute,
> does it mean that the entries in front of store buffer need to be sent to
> the cache in order before the entry of store conditional can be sent to
> the cache?

Let's speak not about the cache, but about a generic "memory interface"
which interacts with the CPU through tagged asynchronous message exchange.
With it, "getmem, address=64, len=4, tag=1234" and "getmem,
address=128, len=4, tag=5678" may see the response with tag=5678 arrive much
earlier than the one with tag=1234, because 128 was in the L1 cache while 64 had
to be loaded from a slow RAM module.
This is inevitable given the nature of caches.
In this scheme, LR is translated to "getmem" with "monitor=true" and
SC is translated to "putmem" with "if_monitored=true". It is this
memory interface's responsibility to deal with the cache mode (a memory
range can, BTW, be uncacheable; in that case the LR reservation is
implemented with a memory snooper).
The release (rl) attribute, then, postpones sending the corresponding "putmem"
until all previous unanswered getmem/putmem requests have been answered.
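The tag-matching behaviour described above can be sketched in a few lines of Python. This is a toy model; the class and message names are mine, not any real bus protocol:

```python
# Toy model of a tagged, asynchronous memory interface: requests carry a
# tag, and responses come back tagged but possibly out of order (a cached
# address answers long before one that misses to RAM).
import heapq

class MemoryInterface:
    def __init__(self, cached_addrs):
        self.cached = set(cached_addrs)
        self.pending = []        # min-heap of (completion_time, tag, addr)
        self.now = 0

    def getmem(self, addr, tag):
        latency = 1 if addr in self.cached else 50   # L1 hit vs RAM miss
        heapq.heappush(self.pending, (self.now + latency, tag, addr))

    def next_response(self):
        self.now, tag, addr = heapq.heappop(self.pending)
        return tag               # the CPU matches the response by its tag

mem = MemoryInterface(cached_addrs={128})
mem.getmem(addr=64, tag=1234)    # issued first, but misses to slow RAM
mem.getmem(addr=128, tag=5678)   # issued second, but hits the L1 cache
print(mem.next_response())       # 5678 arrives first despite being issued second
print(mem.next_response())       # 1234
```

The tags are what let the CPU tolerate this reordering: each response is matched to its request, not to issue order.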

> And the cache needs special handling when receiving the store
> conditional request.

Yep. For example, LR can be implemented as a load that requires obtaining the
Exclusive state (I follow the MESI protocol in general) and putting a
special mark on the line. A write by any other hart results in losing this
mark. SC checks the mark and rejects the write if the mark has already been lost.
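A toy Python model of this mark-on-the-line scheme (the names are mine, and real MESI has more states and inter-cache messages):

```python
# Toy cache-line model: LR fetches the line Exclusive and sets a
# reservation mark; a write from another hart clears the mark; SC
# succeeds only if the mark survived.

class CacheLine:
    def __init__(self):
        self.state = "I"         # MESI-ish: I(nvalid), E(xclusive), M(odified)
        self.value = 0
        self.reserved_by = None  # which hart placed the LR mark

    def lr(self, hart):
        self.state = "E"         # obtain Exclusive ownership
        self.reserved_by = hart  # place the special mark
        return self.value

    def remote_store(self, value):
        self.value = value
        self.reserved_by = None  # any other hart's write loses the mark

    def sc(self, hart, value):
        if self.reserved_by != hart:
            return False         # mark lost: reject the write
        self.value, self.state = value, "M"
        self.reserved_by = None
        return True

line = CacheLine()
old = line.lr(hart=0)
print(line.sc(hart=0, value=old + 1))   # True: uncontended SC succeeds

old = line.lr(hart=0)
line.remote_store(99)                   # another hart writes in between
print(line.sc(hart=0, value=old + 1))   # False: the reservation was lost
print(line.value)                       # 99: the other hart's store stands
```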

The real world includes inter-level negotiation, more protocol states,
etc., but the principle can still be applied. How it maps onto a real
implementation is up to its designer. If an L1 cache is shared by
multiple harts, one must also record in the cache-line state which hart
requested the LR. And so on.

> If any probe of a store matching the address is
> received during lr/sc, it should fail, and give a feedback signal to the
> CPU to set rd register of store conditional instruction non-zero, flush the
> instruction beginning with lr.
> Is there something wrong with my understanding?

A separate feedback signal is not needed because (in the traditional cache
style) the cache level already interacts with RAM and the other caches and
knows, via MESI messages or something more advanced, that the
reservation has been lost.
The part of the CPU (hart) that proceeds with the SC just needs an
interaction like:
req: putmem, addr=1000, len=4, is_monitored=1, tag=9012
resp: error=1, tag=9012

> 2. I'm also confused with "*set of bytes that subsumes the bytes in the
> addressed word*" in the sentence above!
> From what I just understood, if the reservation set is on the cache side,
> then the granularity of the cache detection conflict is a cacheline

With the standard approach, yes. LR/SC drops the reservation on any
write to the monitored region, not only one that actually changes something. If
your platform extends the MESI protocol's inter-cache messages with the range
of modified bytes inside a cache line... well, that is possible, but more costly.

At least, most standard software implementations of such synchronization
tend to allocate a whole cache line for a mutex etc.

> and whether it is necessary to record as byte granularity?

Most software implementations don't expect it. But if you want to
improve the performance of the lower-quality ones, consider extending the
inter-cache protocol.
You may compare x86, which allows atomic operations even across a cache-line
boundary (at the cost of a full bus lock), with most RISCs, which
reject unaligned atomic accesses even when they fall entirely within a single
cache line. Which one is better, and why?

> 3. I'm confused with "*livelock*" in the sentence "The main disadvantage of
> LR/SC over CAS is livelock, which we avoid, under certain circumstances,
> with an architected guarantee of eventual forward progress as described
> below.
> As I know, The interconnect will sequence the store to the same cacheline,

Not necessarily. Imagine a NUMA-style system with different domains, where an
inter-domain bridge presents itself as yet another cache to the other
peers in the domain that owns a memory region.
In this case you can't avoid a preference for the closer caches, and if
they contend heavily on a cache line, it may be impossible for a user
from another domain to perform its actions at even the same speed:
a bridge of this style needs additional clock cycles to relay requests
and responses.
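To make the livelock concrete, here is a toy Python model (all names are mine, not from the spec): each hart's LR takes the line exclusively and thereby kills the other hart's reservation, so under an unlucky, strictly alternating schedule no SC ever succeeds:

```python
# Toy model of LR/SC livelock: each hart's LR acquires the cache line
# exclusively, which invalidates the other hart's reservation. With a
# strictly alternating interleaving, every SC fails forever.

class Line:
    def __init__(self):
        self.value = 0
        self.reservation = None  # hart id holding a valid reservation, or None

def lr(line, hart):
    line.reservation = hart      # exclusive fetch kills the other reservation
    return line.value

def sc(line, hart, value):
    if line.reservation != hart:
        return False             # reservation lost -> SC fails
    line.value = value
    line.reservation = None
    return True

line = Line()
failures = 0
for _ in range(8):               # unlucky schedule: LR0, LR1, SC0, LR0, SC1, ...
    v0 = lr(line, 0)
    v1 = lr(line, 1)             # steals the line, kills hart 0's reservation
    failures += not sc(line, 0, v0 + 1)   # fails
    v0 = lr(line, 0)             # steals it back, kills hart 1's reservation
    failures += not sc(line, 1, v1 + 1)   # fails
print(failures, line.value)      # 16 0: every SC failed, the value never advanced
```

Both harts are always making memory requests, yet neither makes forward progress; that is the livelock, as opposed to a deadlock where they would be blocked.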

So one of the simplest approaches to provide guarantees against
livelock is that a hart that issued an LR locks the cache line for a time
period (or, as in the spec, for a constrained sequence of at most 16
instructions, i.e. 64 bytes of code) and delays all requests to seize the
line. Of course, this lock must be bounded to guard against software errors.
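This bounded line-lock can be sketched as a toy Python model (the names and the 16-instruction budget are illustrative, loosely following the spec's constrained LR/SC sequences):

```python
# Toy model of the forward-progress fix: after LR, the owning hart holds
# the line for a bounded instruction budget; competing LRs are deferred
# instead of stealing the line, so the owner's SC is guaranteed to land.

class LockedLine:
    def __init__(self, window=16):
        self.value = 0
        self.owner = None
        self.budget = 0
        self.window = window     # max instructions the lock may be held

    def lr(self, hart):
        if self.owner is not None and self.owner != hart and self.budget > 0:
            return None          # deferred: the requester must retry later
        self.owner, self.budget = hart, self.window
        return self.value

    def tick(self):              # one instruction executed by the owner
        if self.budget:
            self.budget -= 1
            if self.budget == 0:
                self.owner = None   # bound against buggy software holding the line

    def sc(self, hart, value):
        if self.owner != hart:
            return False
        self.value = value
        self.owner, self.budget = None, 0
        return True

line = LockedLine()
v = line.lr(0)                   # hart 0 reserves and locks the line
assert line.lr(1) is None        # hart 1's LR is deferred, reservation survives
line.tick(); line.tick()         # a few instructions inside the LR/SC sequence
ok = line.sc(0, v + 1)
print(ok, line.value)            # True 1: the SC is guaranteed to succeed
```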

> And always the core that executes sc first succeeds, and the other core that
> executes SC later fails! So I can't imagine any scenario that could lead to
> a livelock.

NUMA (see above), and not only NUMA; who knows what approach will
appear next?

> 5. Does anyone know about the possible performance differences *between HTM
> and atomic instructions*,just As spec said that "More generally, a
> multi-word atomic primitive is desirable, but there is still considerable
> debate about what form this should take, and guaranteeing forward progress
> adds complexity to a system. Our current thoughts are to include a small
> limited-capacity transactional memory buffer along the lines of the
> original transactional memory proposals as an optional standard extension
> “T”"

The simplest approach to transactional memory support is for the hart
performing the transaction to lock all affected cache lines until
commit or rollback (explicit, by timeout, or by another event), and to
implement an old-version store for modified lines. This definitely
costs something, but I guess the main cost arises when other harts
contend on these lines, and at the rollback operation. Whether the cost
of the optimistic path shows up as measurable time depends on
implementation quality.
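A toy Python sketch of such a limited-capacity transactional buffer with old-version storage (all names are mine; this is not the proposed "T" extension API):

```python
# Toy sketch of the scheme described above: a tiny transactional buffer
# that locks touched lines, keeps old values for rollback, and aborts
# when its limited capacity overflows.

class TinyHTM:
    def __init__(self, mem, capacity=4):
        self.mem = mem
        self.capacity = capacity # limited-capacity buffer, as in the proposal
        self.undo = {}           # addr -> old value, for rollback
        self.locked = set()      # lines owned by the transaction

    def tx_write(self, addr, value):
        if addr not in self.locked:
            if len(self.locked) >= self.capacity:
                self.abort()     # buffer overflow -> roll back everything
                return False
            self.locked.add(addr)
            self.undo[addr] = self.mem.get(addr, 0)  # save the old version
        self.mem[addr] = value
        return True

    def commit(self):
        self.undo.clear(); self.locked.clear()       # discard old versions

    def abort(self):
        self.mem.update(self.undo)                   # restore old versions
        self.undo.clear(); self.locked.clear()

mem = {0: 10, 1: 20}
tx = TinyHTM(mem)
tx.tx_write(0, 11); tx.tx_write(1, 21)
tx.abort()                       # conflict or timeout: roll everything back
print(mem)                       # {0: 10, 1: 20}

tx.tx_write(0, 11); tx.tx_write(1, 21)
tx.commit()
print(mem)                       # {0: 11, 1: 21}
```

The sketch makes the cost structure visible: writes and commits are cheap, while aborts pay for restoring every modified line, which matches the guess above that contention and rollback dominate.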


-netch-