Thank you, Krste, and others, for your helpful comments.
To continue with this topic, I think it could be helpful to describe what I have now built, consider whether and why it is not a conforming implementation according to the spec and the additional guidance of your last message, and ponder whether or not it should be conforming.
In general, software uses LR/SC to atomically change memory provided some condition holds. It uses LR to load the data that determines whether the condition holds and/or contributes to the new value, and SC to atomically store new data if no store by any agent may have disturbed that condition. The specification must define the minimum functionality for LR/SC and its environment to enable portable atomic software libraries. Thus the specification language constrains portable LR/SC code sequences and guarantees they can eventually succeed in any conforming implementation. I submit it is still an open question what behaviors the specification should guarantee when an implementation runs an LR/SC sequence that is *not* so constrained.
# Background/rationale on LR/SC for GRVI Phalanx.
Recall in a base GRVI Phalanx configuration, each cluster has 8 PEs (i.e. 8 harts) sharing a 4-way interleaved (on addr[3:2]) 32 KB cluster RAM (CRAM) which is also shared with a 32 B/cycle NOC interface and (optionally) an 8-way interleaved accelerator. No caches. No non-local memory accesses -- for now, data between clusters are sent as explicit messages on the NOC.
The multiprocessor needs LR/SC to build locks / critical sections and to perform lock-free atomic actions such as atomic add to memory.
(I am not planning to implement AMOADD etc. at this time. LR/SC + software suffices, and for the anticipated uses here, implementing the other atomics in hardware is not a good throughput-per-area tradeoff.)
Consider a parallel histogram use-case. Each PE may use LR/SC to atomically increment some counter:
atomic_add:                 ; a0 = &counter, a1 = increment
    lr.w  a2, (a0)          ; load-reserve the current count
    add   a2, a2, a1        ; compute the incremented value
    sc.w  a2, a2, (a0)      ; store-conditional: a2 <- 0 on success, nonzero on failure
    bnez  a2, atomic_add    ; retry until the SC succeeds
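For what it's worth, the same retry loop wrapped in C with GCC-style inline asm looks like the following -- my sketch only, assuming an rv32ia-style GNU toolchain (the function name and signature are mine, not from the Phalanx tools):

    #include <stdint.h>

    /* Sketch: atomically add 'increment' to *counter using an LR/SC retry loop. */
    static inline void atomic_add(volatile int32_t *counter, int32_t increment)
    {
        int32_t newval, fail;
        do {
            __asm__ volatile (
                "lr.w  %0, (%2)\n\t"      /* load-reserve the counter             */
                "add   %0, %0, %3\n\t"    /* compute the incremented value        */
                "sc.w  %1, %0, (%2)\n\t"  /* store-conditional; %1 = 0 on success */
                : "=&r"(newval), "=&r"(fail)
                : "r"(counter), "r"(increment)
                : "memory");
        } while (fail != 0);              /* retry until the SC succeeds */
    }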
Many PEs may simultaneously attempt to atomic_add to the same counter. To avoid livelock, some PE might temporarily "own" the reserved memory region (the "LR-domain"). This is the approach envisioned in the draft spec's rationale: upon LR the PE acquires its cache line in (I infer) exclusive or exclusive-modified states E/M and keeps it there for a while -- "holding off remote cache interventions"!
Alternately, multiple PEs might be permitted to "share" a non-exclusive load-reservation, a.k.a. write-monitor, on the same LR-domain. This is a reasonable approach for a non-cached banked memory system and avoids complicating the bank memory arbiters with transitory exclusive ownership considerations.
** Principle: To avoid livelock, either 1) LR gives one PE transitory exclusivity over an LR-domain, or 2) multiple PEs must be able to simultaneously reserve the same LR-domain.
Also -- perhaps this is obvious -- each PE must have its own individual reservation on an LR-domain. It is not correct to coalesce / share a reservation between PEs. Otherwise:
PE[0]: lr.w X ; PE[0] reserves X
PE[1]: sw X ; kills PE[0]'s reservation
PE[2]: lr.w X ; PE[2] reserves X
PE[0]: sc.w X ; must fail -- thus PE[0]'s reservation != PE[2]'s reservation
** Principle: Each PE must have its own reservation state.
To improve concurrency / reduce SC failure, each LR-domain should be as small as possible, so that concurrent non-conflicting LR-SCs succeed. To reduce the cost of implementation, each LR-domain should be as large as possible, to reduce the area/energy required to remember and compare their addresses. Actual use will reveal the right tradeoff. When and if per-bank LR-domains are demonstrated to be too coarse grained, I will revise the (FPGA!) implementation to introduce finer grained LR-domains.
# A Simple LR/SC Implementation for a GRVI cluster
The LR-domain is a memory bank (here, one of the four address-interleaved banks, i.e. the CRAM words sharing the same address bits A[3:2]).
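Concretely, under the addr[3:2] interleave, an address's LR-domain is simply its bank index -- a one-line sketch (the helper name bank_of is mine):

    #include <stdint.h>

    /* Sketch: map an address to its CRAM bank / LR-domain (4 banks, addr[3:2]). */
    static inline unsigned bank_of(uint32_t addr)
    {
        return (addr >> 2) & 0x3;
    }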
Alternative 1: LR acquires an exclusive reservation on a bank for a PE. Other PEs' accesses to the bank are held off for a period of time or until the first PE's SC commits.
Rejected because: as noted, this complicates bank arbitration which is a critical path. (However, this approach may achieve dynamically fewer atomic-try-fail-retry iterations and so there may be an energy efficiency rationale for it. We'll see.)
Alternative 2: An LR reservation is non-exclusive. Multiple PEs may hold a reservation on a bank. The reservation is lost when the bank is written.
Alternative 2a: The state of which PE has reserved (or not) which bank is kept at the respective PEs. PEs monitor all writes in the cluster (e.g. all PEs monitor the four banks' write event buses). If a write is observed to a PE's reserved bank, the reservation is lost. An SC may issue to memory if the PE's prior LR reservation still holds.
Rejected because: it doesn't work. The memory system is pipelined. At the moment an SC issues, the reservation may be held; but by the time it hits the memory bank, the reservation may be lost. Example: Two PEs simultaneously execute lr.w; add; sc.w. Both acquire a reservation; both perform the add; both still hold the reservation; both launch the SC; the SCs serialize at the bank arbiter. One goes first; the other (now invalid) comes second -- must somehow be invalidated.
Alternative 2b: The state of which PEs have reserved which banks is kept at the banks. PEs blindly issue LR and SC requests to memory. These arrive at a bank, which tracks which PEs hold reservations and are permitted to commit SCs, and which don't.
This (2b) is the current GRVI Phalanx implementation.
In particular, each PE treats LR as a load with an additional reservation-request signal, and treats SC as a fused store-load which stores rs2 at 0(rs1) and "loads" the SC outcome from the memory system, as follows (a small C model of this bookkeeping follows the list):
1. In each bank controller there is a resvdmask[] -- a bit vector, one bit per PE, of which PEs currently have a reservation on this bank.
2. When PE[i] performs a LR to bank[b], it sets bank[b].resvdmask[i].
3. When any PE[i] performs a store or an SC to bank[b], it zeroes bank[b].resvdmask[*]. Every PE loses its reservation on that bank.
4. When the NOC writes a message into CRAM (striped across all banks), it zeroes bank[*].resvdmask[*]. Every PE loses its reservations. (NB: at this time, NOC messages are unconditionally 32 bytes wide and write across banks in one cycle.)
5. When PE[i] performs an SC to bank[b], it tests bank[b].resvdmask[i]. If set, the store occurs and the SC result is 0. Otherwise, the store is killed and the result is 1.
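Here is that bookkeeping as a small C model -- my sketch, not the RTL; it ignores the memory pipelining and simply applies rules 1-5 in the order requests reach a bank, and it reads rule 3 literally, so even a failing SC clears the bank's mask:

    #include <stdint.h>
    #include <stdio.h>

    #define NPES   8                    /* PEs (harts) per cluster        */
    #define NBANKS 4                    /* address-interleaved CRAM banks */

    /* Rule 1: per bank, one reservation bit per PE. */
    typedef struct { uint8_t resvdmask; } bank_t;

    bank_t bank[NBANKS];

    /* Rule 2: LR by PE[i] to bank[b] sets that PE's reservation bit. */
    void on_lr(int b, int i)   { bank[b].resvdmask |= (uint8_t)(1u << i); }

    /* Rule 3: a plain store to bank[b] clears every PE's reservation on it. */
    void on_store(int b)       { bank[b].resvdmask = 0; }

    /* Rule 4: a NOC message write (striped across all banks) clears everything. */
    void on_noc_write(void)    { for (int b = 0; b < NBANKS; b++) bank[b].resvdmask = 0; }

    /* Rule 5: SC by PE[i] commits only if its bit is still set; being a store
       request, the SC then clears the bank's mask (rule 3). Returns the value
       written to rd: 0 = success (store committed), 1 = failure (store killed). */
    int on_sc(int b, int i)
    {
        int ok = (bank[b].resvdmask >> i) & 1;
        bank[b].resvdmask = 0;
        return ok ? 0 : 1;
    }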
Notice how this implementation supports atomic_add (above) sans livelock: as the PEs hit the same atomic_add, they gradually acquire load-reservations on the same bank. One of the PEs is first to conditionally store (SC) to the bank. That PE succeeds, clearing all PEs' reservations. The remaining SCs all lack a reservation; all fail, and those PEs retry. Soon another PE will successfully LR...SC. Forward progress continues.
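Replaying that scenario against the sketch above (a model only, not a hardware trace):

    /* All 8 PEs LR the counter's bank, then all attempt their SC; whichever SC
       reaches the bank first wins, and the rest fail and loop back to lr.w. */
    int main(void)
    {
        int b = 0;                                   /* all PEs target one bank */
        for (int i = 0; i < NPES; i++) on_lr(b, i);  /* every PE gets a reservation */
        for (int i = 0; i < NPES; i++)
            printf("PE[%d] sc.w -> %d\n", i, on_sc(b, i));
        /* prints 0 (success) for the first PE and 1 (failure) for the rest */
        return 0;
    }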
# Why this is not "conforming"
Krste wrote: "A LR by a hart reserves an implementation-defined subset of the memory space - *** any earlier LR reservation is cleared ***. An SC on the same hart is allowed to succeed provided no ***load***/stores from other harts to the reserved subset of the memory space can be observed to have occured between an active LR and the SC."
1. In the GRVI cluster implementation of LR/SC, a PE may have multiple LRs to different LR-domains. This fails the "any earlier LR reservation is cleared" constraint.
PE[0]: lr.w X ; first LR-domain
PE[0]: lr.w Y ; different LR-domain
; no intervening *stores* by other agents
PE[0]: sc.w X ; succeeds
PE[0]: sc.w Y ; succeeds
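Replayed against the C sketch above (bank numbers assumed for illustration):

    /* Point 1: one hart (PE[0]) holds reservations in two different banks at
       once, and both SCs commit. */
    void point1_example(void)
    {
        int bx = 1, by = 2;                       /* X and Y in different banks */
        on_lr(bx, 0);                             /* PE[0]: lr.w X */
        on_lr(by, 0);                             /* PE[0]: lr.w Y -- X's reservation survives */
        printf("sc.w X -> %d\n", on_sc(bx, 0));   /* 0: succeeds */
        printf("sc.w Y -> %d\n", on_sc(by, 0));   /* 0: succeeds */
    }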
** Question: why *must* earlier LR reservations be cleared by a conforming implementation?
Of course "portable" constrained atomic sequences must not depend upon this (i.e. must only depend upon the last LR), but it would be unfortunate to forbid implementations that can track more than one reservation, or implementations for which multiple lingering reservations are simpler (!) than a maximum of one reservation. Unless/until a write to a reserved location occurs, it is benign to leave a reservation in place -- even if memory is littered with dozens of such reservations, as is possible in some implementations.
An implementation that supports multiple outstanding LR reservations runs portable constrained atomic sequences just fine. It just may lure you into using unportable unconstrained sequences. (Writing concurrent data structures is advanced systems programming -- if you know enough to use LR/SC correctly you know whether you have written a constrained sequence or not.) If possible let us not spend hardware or energy enforcing that.
2. In the GRVI cluster implementation of LR/SC, a load from another PE (hart) does not disturb a reservation.
PE[0]: lr.w X ; reserves X on PE[0]
PE[1]: lr.w Y ; reserves Y (same bank as X) on PE[1], no effect on PE[0] reservation
PE[2]: lw X ; regular load, no effect on PE[0] or PE[1] reservations
PE[0]: sc.w X ; succeeds, kills PE[1] reservation
PE[1]: sc.w Y ; fails
Here the second and third loads (PE[1]'s lr.w and PE[2]'s lw) occurred between the first LR and its SC. Whether you consider them to have been observed, given that there was no cache line invalidation, seems debatable. But there they are just the same. I hope you do not *require* an implementation to fail a subsequent sc.w because of a load somewhere else in the system.
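Point 2 against the same sketch -- note the model has no on_load() hook at all, so an ordinary load cannot disturb anyone's reservation:

    /* Point 2: another hart's LR shares the bank's mask, and a plain load
       touches no reservation state at all. */
    void point2_example(void)
    {
        int b = 3;                                     /* X and Y share this bank */
        on_lr(b, 0);                                   /* PE[0]: lr.w X */
        on_lr(b, 1);                                   /* PE[1]: lr.w Y */
        /* PE[2]: lw X -- ordinary load; nothing to update in the model */
        printf("PE[0] sc.w X -> %d\n", on_sc(b, 0));   /* 0: succeeds, clears the mask */
        printf("PE[1] sc.w Y -> %d\n", on_sc(b, 1));   /* 1: fails */
    }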
# Non-portable LR/SC sequences with load/store?
In the GRVI cluster design, the following sequence, executed by a single hart, may succeed (and might be useful):
PE[0]: lr.w X ; one LR-domain
PE[0]: lw Y ; a regular load in the same LR-domain
PE[0]: ...
PE[0]: sc.w X ; may succeed!
This is also possible but rather tricky:
PE[0]: lr.w X ; one LR-domain
PE[0]: sw Z ; by design/explicit placement, in a different LR-domain
PE[0]: ...
PE[0]: sc.w X ; may succeed!
# Summary
In summary, I suggest:
* implementation defined LR-region sizes? Yes!
* at most one active LR-region per hart? No thanks.
* during one hart's LR..SC sequence, another hart's load of the same address *must* cause the SC to fail? Please no!
Thank you for any comments.
By the way, what is a good broad survey of best practices for LR/SC (or prior LL/SC implementations)?
Thank you.
Jan.