[riscv-hw] LR/SC, area efficient implementations, edge cases

Jan Gray

Jan 9, 2016, 11:10:24 AM
to hw-...@lists.riscv.org

Greetings,

 

I am planning the implementation of LR/SC for a GRVI multiprocessor cluster with banked, address-interleaved memory. I have read chapter 5.2 of the RISC-V 2.0 spec. I have a few edge-case questions that the spec does not explicitly forbid or permit. I would appreciate your thoughts as to what these code sequences must, must not, or may do. Note, I do not think these code sequences are necessarily desirable, or are supposed to accomplish something useful, or are concurrency safe. Rather, I just want to know which unstated behaviors are permitted in an as-simple-as-possible implementation.

 

In particular I am contemplating reserving an entire bank to the core issuing the LR, i.e. not tracking the specific reserved address nor comparing the reserved address against the conditional store address.

 

So, given:

               li a0,0x100

               li a1,1

               li a2,0x104  # different bank from @0

               li a3,0x110  # same bank as @0

 

Case 1 – LR/SC addresses don’t match – can this succeed?

               lr.w t0,(a0)

               sc.w t1,a1,(a3)

               # is t1 ever zero?

(When LR places a reservation on a particular memory address, is it also permitted to place a reservation on other memory address(es)? A line? Addresses modulo 64K? An entire bank (i.e. 1/4 of address space)? I hope so. In general it is safe to reserve more than you need, although it will worsen potential for livelocks. I can live with that in my setting. If this impacts throughput or forward progress under contention, I can always reconfigure the cluster uncore with something more precise, or exact match.)
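
For concreteness, the sort of coarse match I have in mind looks like this in C (a sketch only; bank_of is a hypothetical helper, not anything from the spec). With four banks interleaved on addr[3:2], an SC would "hit" the reservation whenever it addresses the same bank as the LR:

#include <stdio.h>

/* Hypothetical helper, illustrative only: four banks interleaved on addr[3:2]. */
static unsigned bank_of(unsigned addr) { return (addr >> 2) & 0x3u; }

int main(void) {
    /* The addresses above: 0x100 and 0x110 select the same bank; 0x104 does not. */
    printf("0x100 -> bank %u\n", bank_of(0x100));  /* bank 0 */
    printf("0x104 -> bank %u\n", bank_of(0x104));  /* bank 1 */
    printf("0x110 -> bank %u\n", bank_of(0x110));  /* bank 0 */
    return 0;
}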

 

Case 2 – unbalanced SCs

               lr.w t0,(a0)

               sc.w t1,a1,(a0)

               addi a1,a1,1

               sc.w t2,a1,(a0)

               # is t2 ever zero?

 

Case 3 – multiple LRs, SCs from one core

               lr.w t0,(a0)

               lr.w t1,(a2)

               sc.w t2,a1,(a0)

               # is t2 ever zero?

               sc.w t3,a1,(a2)

               # is t3 ever zero?

 

By the way, the commentary states: “We reserve a failure code of 1 to mean “unspecified” so that simple implementations may return this value using the existing mux required for the SLT/SLTU instructions.” IMO that was the right answer.

 

Thank you,

Jan Gray

Gray Research LLC

 

Christopher Celio

Jan 9, 2016, 12:32:46 PM
to Jan Gray, hw-...@lists.riscv.org
Very interesting questions!

I was looking at the Spike simulator, Rocket, and riscv-tests for lr/sc for guidance on the intention of the designers, and it's not entirely clear what's possible.  I did notice, however, that Spike demands a strict address match (and the wording of Section 6.2, to me, seems to also imply a strict word-address). However, my intuition is (and a reading of Rocket confirms) that the hardware will only bother with cache-line-granularity locking.

val s2_lrsc_addr_match = lrsc_valid && lrsc_addr === (s2_req.addr >> blockOffBits) // reservation valid and cache-line (block) addresses match

I think it's "interesting" that this is a place where Spike is going to naturally differ from hardware implementations, unless the spec demands strict address matching (The Spec has so far tried its best to stay agnostic of cache-line size). I hope this can be reconciled, since I'd hate for this ambiguity to lead to non-portable programs.

Just as an example, the riscv-tests LR/SC test makes the check "# make sure that sc with the wrong reservation fails" by changing the address by 1024. That allows a lot of different implementations to pass the RISC-V validation test, yet it also allows code that runs fine on one CPU to hang on another, since its SC will never succeed in getting the reservation.

-Chris


Samuel Falvo II

Jan 9, 2016, 1:53:24 PM
to Christopher Celio, Jan Gray, hw-...@lists.riscv.org
Without suggesting we mimic our competitors, it should be noted that
PowerPC's reservation loads and stores are free to reserve more than
just a word of memory. IIRC, PowerPC 601's instructions worked on a
cache line basis, and only allowed one reservation per CPU. So,
given:

lr.w .... ; reservation #1
lr.w .... ; reservation #2

Only reservation #2 would work; reservation #1 was considered lost.
Likewise, if you performed two conditional stores, the first store to
succeed would close out the reservation, and all subsequent
conditional stores would fail, even if they addressed the same
reservation.

Disclaimer: I'm operating from memory dating back to when I read the
PowerPC 601 programmer's manual, so my knowledge may not reflect
contemporary Power8 behavior.

--
Samuel A. Falvo II

Andrew Waterman

Jan 10, 2016, 8:21:05 PM
to Samuel Falvo II, Christopher Celio, Jan Gray, hw-dev
As written, it is implementation-defined. M-mode software, which
necessarily knows the implementation details, could conceivably take
advantage of this property.

Since the software models (spike, qemu) are pessimistic and require an
exact address match, software developed against them should be safe to
run anywhere.

It's worth noting that this is not the only implementation-defined
property of LR/SC. The allowed number and type of instructions in the
critical section also vary from system to system.

David Chisnall

Jan 11, 2016, 4:49:12 AM
to Jan Gray, hw-...@lists.riscv.org
On 9 Jan 2016, at 16:10, Jan Gray <jsg...@acm.org> wrote:
>
> So, given:
> li a0,0x100
> li a1,1
> li a2,0x104 # different bank from @0
> li a3,0x110 # same bank as @0
>
> Case 1 – LR/SC addresses don’t match – can this succeed?
> lr.w t0,(a0)
> sc.w t1,a1,(a3)
> # is t1 ever zero?
> (When LR places a reservation on a particular memory address, is it also permitted to place a reservation on other memory address(es)? A line? Addresses modulo 64K? An entire bank (i.e. 1/4 of address space)? I hope so. In general it is safe to reserve more than you need, although it will worsen potential for livelocks. I can live with that in my setting. If this impacts throughput or forward progress under contention, I can always reconfigure the cluster uncore with something more precise, or exact match.)

The RISC-V spec here contains a lot of waffle and very little that qualifies as a specification; however, the usual definition of sc makes it undefined behaviour for an sc that doesn’t match a corresponding ll. This allows implementations to match at the granularity that is most natural. It would be a good idea to clarify this in the next version of the RISC-V spec.

On a software-facing note, it’s very important to ensure that stores from the same core to nearby addresses (without an interrupt occurring in the middle) do not trigger invalidation. Compilers have a habit of occasionally spilling to the stack in the middle of C[++]11 atomic sequences (especially at low optimisation levels). This bit Apple’s early ARM cores, which would invalidate on *any* store to the cache line, causing some compiler-generated code to infinite loop.
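
To make the failure mode concrete, here is a minimal C11 sketch (illustrative only, not taken from any particular compiler's output). The weak compare-exchange is what expands to an lr.w/sc.w retry loop on RISC-V; at -O0 a compiler may also spill 'expected' or 'desired' to the stack between the lr.w and the sc.w, and an implementation that drops the reservation on any store by the reserving hart would then never let the sc.w succeed:

#include <stdatomic.h>
#include <stdint.h>

/* Sketch: a plain C11 increment. The CAS loop below is compiled to
 * lr.w / sc.w on RISC-V; the danger is a compiler-inserted stack spill
 * landing between them. */
void increment(_Atomic uint32_t *ctr)
{
    uint32_t expected = atomic_load_explicit(ctr, memory_order_relaxed);
    uint32_t desired;
    do {
        desired = expected + 1;   /* 'expected' is refreshed on CAS failure */
    } while (!atomic_compare_exchange_weak_explicit(
                 ctr, &expected, desired,
                 memory_order_relaxed, memory_order_relaxed));
}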

David

kr...@berkeley.edu

Feb 17, 2016, 9:57:19 PM
to Jan Gray, hw-...@lists.riscv.org

Hi Jan,

>>>>> On Sat, 9 Jan 2016 08:10:24 -0800, "Jan Gray" <jsg...@acm.org> said:
| Greetings,
| I am planning the implementation of LR/SC for a GRVI multiprocessor cluster
| with banked, address interleaved memory. I have read chapter 5.2 of the RISC-V
| 2.0 spec. I have a few edge case questions that the spec does not explicitly
| forbid or permit. I would appreciate your thoughts as to what these code
| sequences must or must not or may do. Note, I do not think these code sequences
| are necessarily desirable, or are supposed to accomplish something useful, or
| are concurrency safe. Rather I just want to know what the unstated behaviors
| are permitted in an as-simple-as-possible implementation.

We're proposing to refine the definition of LR/SC to cover these
cases.

A LR by a hart reserves an implementation-defined subset of the memory
space - any earlier LR reservation is cleared. An SC on the same hart
is allowed to succeed provided no load/stores from other harts to the
reserved subset of the memory space can be observed to have occurred
between an active LR and the SC.

So, to apply this to your examples, inline below:

| In particular I am contemplating reserving an entire bank to the core issuing
| the LR, i.e. not tracking the specific reserved address nor comparing the
| reserved address against the conditional store address.

This is fine.

| So, given:
| li a0,0x100
| li a1,1
| li a2,0x104 # different bank from @0
| li a3,0x110 # same bank as @0
| Case 1 – LR/SC addresses don’t match – can this succeed?
| lr.w t0,(a0)
| sc.w t1,a1,(a3)
| # is t1 ever zero?

Yes, if no accesses from other harts can be observed to M[a0] and
M[a3] in between the lr.w and the sc.w.

| (When LR places a reservation on a particular memory address, is it also
| permitted to place a reservation on other memory address(es)? A line? Addresses
| module 64K? An entire bank (i.e. 1/4 of address space))? I hope so. In general
| it is safe to reserve more than you need, although it will worsen potential for
| livelocks. I can live with that in my setting. If this impacts throughput or
| forward progress under contention, I can always reconfigure the cluster uncore
| with something more precise, or exact match.)

Yes, as above.

| Case 2 – unbalanced SCs
| lr.w t0,(a0)
| sc.w t1,a1,(a0)
| addi a1,a1,1
| sc.w t2,a1,(a0)
| # is t2 ever zero?

Yes, this is allowed to succeed provided no accesses from other harts
to M[a0] are observed to have occurred in between the initial lr.w and
the final sc.w.

| Case 3 – multiple LRs, SCs from one core
| lr.w t0,(a0)
| lr.w t1,(a2)
| sc.w t2,a1,(a0)
| # is t2 ever zero?

Yes, t2 can be zero if no other harts' accesses to M[a2] and M[a0] can
be observed to have occurred between the _second_ lr.w and the sc.w.

| sc.w t3,a1,(a2)
| # is t3 ever zero?

Yes, provided no other harts' accesses to M[a2] can be observed to
have occurred. M[a0] is not reserved at this point.


Note, implementations only have to ensure success for the constrained
case we describe in the manual; these additional broader rules just
indicate when it's OK to signal success.

| By the way, the commentary states: “We reserve a failure code of 1 to mean
| “unspecified” so that simple implementations may return this value using the
| existing mux required for the SLT/SLTU instructions.” IMO that was the right
| answer.

Glad you noticed.

Krste

Jan Gray

Feb 18, 2016, 7:02:09 PM
to kr...@berkeley.edu, hw-...@lists.riscv.org
Thank you, Krste, and others, for your helpful comments.

To continue with this topic, I think it could be helpful to describe what I have now built, consider whether and why it is not a conforming implementation according to the spec and the additional guidance of your last message, and ponder whether or not it should be conforming.

In general, software uses LR/SC to atomically change memory provided some condition holds. It uses LR to load the data that determines whether the condition holds and/or contributes to the new value, and SC to atomically store new data if no store by any agent may have disturbed that condition. The specification must define the minimum functionality for LR/SC and its environment to enable portable atomic software libraries. Thus the specification language constrains portable LR/SC code sequences and guarantees they can eventually succeed in any conforming implementation. I submit it is still an open question what behaviors the specification should guarantee when an implementation runs an LR/SC sequence that is *not* so constrained.


# Background/rationale on LR/SC for GRVI Phalanx.

Recall that in a base GRVI Phalanx configuration, each cluster has 8 PEs (i.e. 8 harts) sharing a 4-way interleaved (on addr[3:2]) 32 KB cluster RAM (CRAM), which is also shared with a 32 B/cycle NOC interface and (optionally) an 8-way interleaved accelerator. No caches. No non-local memory accesses -- for now, data between clusters are sent as explicit messages on the NOC.

The multiprocessor needs LR/SC to build locks/critical sections and to perform lock-free atomic actions such as atomic add to memory.

(I am not planning to implement AMOADD etc. at this time. LR/SC + software suffices, and for the anticipated uses here, implementing the other atomics in hardware is not a good throughput-per-area tradeoff.)

Consider a parallel histogram use-case. Each PE may use LR/SC to atomically increment some counter:

atomic_add:               ; a0=&counter, a1=increment
    lr.w a2,(a0)          ; load the counter and reserve its LR-domain
    add a2,a2,a1          ; compute the new value
    sc.w a2,a2,(a0)       ; conditional store; a2 <- 0 on success, nonzero on failure
    bgtz a2,atomic_add    ; retry if the reservation was lost
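
(For reference, the same retry pattern expressed portably in C -- a sketch, assuming the GCC/Clang __atomic builtins, whose weak compare-exchange lowers to an lr.w/sc.w loop on RV32IA; the names are illustrative, not from the GRVI code:)

#include <stdint.h>

/* Sketch: atomic add built purely from LR/SC-backed compare-exchange,
 * no AMOADD required. */
static void atomic_add_c(volatile uint32_t *counter, uint32_t increment)
{
    uint32_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);
    uint32_t desired;
    do {
        desired = old + increment;
        /* on failure, 'old' is refreshed with the current counter value */
    } while (!__atomic_compare_exchange_n(counter, &old, desired,
                                          1 /* weak */,
                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}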

Many PEs may simultaneously attempt to atomic_add to the same counter. To avoid livelock, some PE might temporarily "own" the reserved memory region (the "LR-domain".) This is the approach envisioned in the draft spec's rationale: upon LR the PE acquires its cache line in (I infer) exclusive or exclusive-modified states E/M and keeps it there for a while -- "holding off remote cache interventions"!

Alternately, multiple PEs might be permitted to "share" a non-exclusive load-reservation, a.k.a. write-monitor, on the same LR-domain. This is a reasonable approach for a non-cached banked memory system and avoids complicating the bank memory arbiters with transitory exclusive ownership considerations.

** Principle: To avoid livelock, either 1) LR gives one PE transitory exclusivity to an LR-domain, or 2) multiple PEs must be able to simultaneously reserve the same LR-domain.

Also -- perhaps this is obvious -- each PE must have its own individual reservation on an LR-domain. It is not correct to coalesce / share a reservation between PEs. Otherwise:
PE[0]: lr.w X ; PE[0] reserves X
PE[1]: sw X ; kills PE[0]'s reservation
PE[2]: lr.w X ; PE[2] reserves X
PE[0]: sc.w X ; must fail -- thus PE[0]'s reservation != PE[2]'s reservation

** Principle: Each PE must have its own reservation state.

To improve concurrency / reduce SC failure, each LR-domain should be as small as possible, so that concurrent non-conflicting LR-SCs succeed. To reduce the cost of implementation, each LR-domain should be as large as possible, to reduce the area/energy required to remember and compare their addresses. Actual use will reveal the right tradeoff. When and if per-bank LR-domains are demonstrated to be too coarse grained, I will revise the (FPGA!) implementation to introduce finer grained LR-domains.


# A Simple LR/SC Implementation for a GRVI cluster

The LR-domain is a memory bank (here, one of four address-interleaved banks, i.e. all locations sharing the same address bits A[3:2] map to the same bank).

Alternative 1: LR acquires an exclusive reservation on a bank for the issuing PE. Other PEs' accesses to the bank are held off for a period of time or until the first PE's SC commits.

Rejected because: as noted, this complicates bank arbitration which is a critical path. (However, this approach may achieve dynamically fewer atomic-try-fail-retry iterations and so there may be an energy efficiency rationale for it. We'll see.)

Alternative 2: An LR reservation is non-exclusive. Multiple PEs may hold a reservation on a bank. The reservation is lost when the bank is written.
Alternative 2a: The state of which PE has reserved (or not) which bank is kept at the respective PEs. PEs monitor all writes in the cluster (e.g. all PEs monitor the four banks' write event buses). If a write is observed to a PE's reserved bank, the reservation is lost. An SC may issue to memory if the PE's prior LR reservation still holds.

Rejected because: it doesn't work. The memory system is pipelined. At the moment an SC issues, the reservation may be held; but by the time it hits the memory bank, the reservation may be lost. Example: Two PEs simultaneously execute lr.w; add; sc.w. Both acquire a reservation; both perform the add; both still hold the reservation; both launch the SC; the SCs serialize at the bank arbiter. One goes first; the other (now invalid) comes second -- must somehow be invalidated.

Alternative 2b: The state of which PEs have reserved which banks is kept at the banks. PEs blindly issue LR and SC requests to memory. These arrive at a bank, which tracks which PEs hold reservations and are permitted to commit SCs, and which don't.
This (2b) is the current GRVI Phalanx implementation.

In particular, each PE treats LR as a load with an additional reservation request signal, and treats SC as a fused store-load which stores rs2 at 0(rs1) and "loads" the SC outcome from the memory system, as follows (a behavioral sketch in C follows the list):
1. In each bank controller there is a resvdmask[] -- a bit vector, one bit per PE, of which PEs currently have a reservation on this bank.
2. When PE[i] performs a LR to bank[b], it sets bank[b].resvdmask[i].
3. When any PE[i] performs a store or an SC to bank[b], it zeroes bank[b].resvdmask[*]. Every PE loses its reservation on that bank.
4. When the NOC writes a message into CRAM (striped across all banks), it zeroes bank[*].resvdmask[*]. Every PE loses its reservations. (NB: at this time, NOC messages are unconditionally 32 bytes wide and write across banks in one cycle.)
5. When PE[i] performs a SC to bank[b], it tests bank[b].resvdmask[i]. If set, the store occurs and the result is 0. Otherwise, the store is killed and the result is 1.
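
A behavioral sketch of this bank-side state machine, in C (illustrative only -- the names, the 8-bit mask, and the helper functions are mine, not the actual RTL):

#include <stdint.h>

#define NBANKS 4
#define NPES   8

/* One reservation mask per bank: bit i set => PE[i] holds a reservation. */
static uint8_t resvdmask[NBANKS];

static unsigned bank_of(uint32_t addr) { return (addr >> 2) & (NBANKS - 1); }

/* Rule 2: LR from PE[i] performs the load (not shown) and sets PE[i]'s bit. */
static void on_lr(unsigned i, uint32_t addr)
{
    resvdmask[bank_of(addr)] |= (uint8_t)(1u << i);
}

/* Rule 3: a store (or a committed SC) to a bank clears every PE's reservation there. */
static void on_store(uint32_t addr)
{
    resvdmask[bank_of(addr)] = 0;
}

/* Rule 4: a NOC write stripes across all banks and clears everything. */
static void on_noc_write(void)
{
    for (unsigned b = 0; b < NBANKS; b++)
        resvdmask[b] = 0;
}

/* Rule 5: SC from PE[i] returns the rd result (0 = success). On success the
 * store commits and, per rule 3, clears the bank's mask; otherwise the store
 * is killed. */
static uint32_t on_sc(unsigned i, uint32_t addr)
{
    unsigned b = bank_of(addr);
    if (resvdmask[b] & (1u << i)) {
        /* ...commit the store to bank b here... */
        resvdmask[b] = 0;
        return 0;
    }
    return 1;
}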

Notice how this implementation supports atomic_add (above) sans livelock: as all PEs hit the same atomic_add, and gradually acquire load-reservations on the same bank, all make progress. One of the PEs is first to conditionally store (SC) to the bank. That PE succeeds, clearing all PEs' reservations. The remaining SCs all lack the reservation; all fail, and spin. Soon another PE will successfully LR...SC. Forward progress continues.


# Why this is not "conforming"

Krste wrote: "A LR by a hart reserves an implementation-defined subset of the memory space - *** any earlier LR reservation is cleared ***. An SC on the same hart is allowed to succeed provided no ***load***/stores from other harts to the reserved subset of the memory space can be observed to have occurred between an active LR and the SC."

1. In the GRVI cluster implementation of LR/SC, a PE may have multiple LRs to different LR-domains. This fails the "any earlier LR reservation is cleared" constraint.

PE[0]: lr.w X ; first LR-domain
PE[0]: lr.w Y ; different LR-domain
; no intervening *stores* by other agents
PE[0]: sc.w X ; succeeds
PE[0]: sc.w Y ; succeeds

** Question: why *must* earlier LR reservations be cleared by a conforming implementation?

Of course "portable" constrained atomic sequences must not depend upon this (i.e. must only depend upon the last LR), but it would be unfortunate to forbid implementations that can do more than one reservation, or implementations for which multiple lingering reservations are simpler (!) than maximum-one reservation. Unless/until a write to a reserved location occurs, it is benign to leave a reservation in place -- even if memory is littered with dozens of such reservations, as is possible in some implementations.

An implementation that supports multiple outstanding LR reservations runs portable constrained atomic sequences just fine. It just may lure you into using unportable unconstrained sequences. (Writing concurrent data structures is advanced systems programming -- if you know enough to use LR/SC correctly you know whether you have written a constrained sequence or not.) If possible let us not spend hardware or energy enforcing that.

2. In the GRVI cluster implementation of LR/SC, a load from another PE (hart) does not disturb a reservation.

PE[0]: lr.w X ; reserves X on PE[0]
PE[1]: lr.w Y ; reserves Y (same bank as X) on PE[1], no effect on PE[0] reservation
PE[2]: lw X ; regular load, no effect on PE[0] or PE[1] reservations
PE[0]: sc.w X ; succeeds, kills PE[1] reservation
PE[1]: sc.w Y ; fails

Here the 2nd and 3rd loads occurred between the first LR and its SC. Whether you consider them to have been observed given that there was no cache line invalidation seems debatable. But there they are just the same. I hope you do not *require* an implementation to fail a subsequent sc.w because of a load somewhere else in the system.

# Non-portable LR/SC sequences with load/store?

In the GRVI cluster design, the following sequence may succeed (and might be useful):

PE[0]: lr.w X ; one LR-domain
PE[1]: lw Y ; same LR-domain
PE[2]: ...
PE[3]: sc.w X ; may succeed!

This is also possible but rather tricky:

PE[0]: lr.w X ; one LR-domain
PE[1]: sw Z ; by design/explicit placement, in a different LR-domain
PE[2]: ...
PE[3]: sc.w X ; may succeed!


# Summary

In summary, I suggest:
* implementation defined LR-region sizes? Yes!
* at most one active LR-region per hart? No thanks.
* during one hart's LR..SC sequence, another hart's load of the same address *must* cause the SC to fail? Please no!

Thank you for any comments.

By the way, what is a good broad survey of best practices for LR/SC (or prior LL/SC implementations?)

Thank you.
Jan.

kr...@berkeley.edu

Feb 18, 2016, 8:12:57 PM
to Jan Gray, kr...@berkeley.edu, hw-...@lists.riscv.org

Hi Jan,

Thanks for taking the time to push on the spec.

>>>>> On Thu, 18 Feb 2016 16:01:46 -0800, "Jan Gray" <jsg...@acm.org> said:
[...]

| (I am not planning to implement AMOADD etc. at this time. LR/SC +
| software suffices and here in anticipated use implementing the other
| atomics in hardware is not a good throughput-per-area tradeoff.)

Yes - we did assume LR/SC could serve as the microarchitectural support
for implementing AMOs, so you'd just implement AMOs in microcode.

| # Why this is not "conforming"

| Krste wrote: "A LR by a hart reserves an implementation-defined
| subset of the memory space - *** any earlier LR reservation is
| cleared ***. An SC on the same hart is allowed to succeed provided
| no ***load***/stores from other harts to the reserved subset of the
| memory space can be observed to have occurred between an active LR
| and the SC."

| 1. In the GRVI cluster implementation of LR/SC, a PE may have multiple LRs to different LR-domains. This fails the "any earlier LR reservation is cleared" constraint.

| PE[0]: lr.w X ; first LR-domain
| PE[0]: lr.w Y ; different LR-domain
| ; no intervening *stores* by other agents
| PE[0]: sc.w X ; succeeds
| PE[0]: sc.w Y ; succeeds

| ** Question: why *must* earlier LR reservations be cleared by a conforming implementation?

I agree, there is no need to require the reservation is cleared on a
second LR. However, wording the specification might get tricky.
As an attempt:

Each LR has an associated reservation covering some subset of address
space. An SC can succeed if the hart has an active reservation
including that address, and there has been no observable memory
operation by another hart on that address since the reservation was
made.

In the simple case of one LR and one SC, remote ***loads*** do not
necessarily cause the reservation to be lost, as they appear to happen
before the LR if they return the memory location's value prior to the
SC, even if they happen in time between the LR and SC.

With the looser definition of success, it is important to include
loads, as in the sequence:

lr.w X, (a)
sc.w Y, (a)
sc.w Z, (a)

The second sc.w should fail if a remote load saw the Y value.

[...]
| An implementation that supports multiple outstanding LR reservations
| runs portable constrained atomic sequences just fine. It just may
| lure you into using unportable unconstrained sequences. (Writing
| concurrent data structures is advanced systems programming -- if you
| know enough to use LR/SC correctly you know whether you have written
| a constrained sequence or not.) If possible let us not spend
| hardware or energy enforcing that.

Agreed, as long as we can write a clear spec.

| 2. In the GRVI cluster implementation of LR/SC, a load from another PE (hart) does not disturb a reservation.

| PE[0]: lr.w X ; reserves X on PE[0]
| PE[1]: lr.w Y ; reserves Y (same bank as X) on PE[1], no effect on PE[0] reservation
| PE[2]: lw X ; regular load, no effect on PE[0] or PE[1] reservations
| PE[0]: sc.w X ; succeeds, kills PE[1] reservation
| PE[1]: sc.w Y ; fails

| Here the 2nd and 3rd loads occurred between the first LR and its
| SC. Whether you consider them to have been observed given that there
| was no cache line invalidation seems debatable. But there they are
| just the same. I hope you do not *require* an implementation to fail
| a subsequent sc.w because of a load somewhere else in the system.

That's right. There's no need provided the condition on the store is
met and the load values returned by the 2nd and 3rd loads are the same
as would have been returned had they executed before the 1st load.
The presence/absence of a reservation is only observable with the SC.

| # Non-portable LR/SC sequences with load/store?

| In the GRVI cluster design, the following sequence may succeed (and might be useful):

| PE[0]: lr.w X ; one LR-domain
| PE[1]: lw Y ; same LR-domain
| PE[2]: ...
| PE[3]: sc.w X ; may succeed!

This is fine by spec.

| This is also possible but rather tricky:

| PE[0]: lr.w X ; one LR-domain
| PE[1]: sw Z ; by design/explicit placement, in a different LR-domain
| PE[2]: ...
| PE[3]: sc.w X ; may succeed!

This is also fine by spec.

| # Summary

| In summary, I suggest:
| * implementation defined LR-region sizes? Yes!
| * at most one active LR-region per hart? No thanks.

Right - we can allow more as above.

| * during one hart's LR..SC sequence, another hart's load of the same address *must* cause the SC to fail? Please no!

The original spec did not demand this for a single LR/SC as the load
value would be indistinguishable from it having happened before the
LR. When you allow two SCs to the same address, then a load after the
first SC must cause the second SC to fail, as above.

| Thank you for any comments.

| By the way, what is a good broad survey of best practices for LR/SC (or prior LL/SC implementations?)

I haven't seen anything like this. LMK if you run into anything.
Some machines used separate registers; others punned on presence in
the exclusive state in the cache at the SC.

Thanks,
Krste

| Thank you.
| Jan.

Rishi Khan

May 10, 2018, 8:33:00 PM
to RISC-V HW Dev, hw-...@lists.riscv.org
Has there been any progress on the spec with regard to these points Jan brings up? I have some questions of my own from the reading. Page 41 of the RISC-V spec 2.2 says:
An SC can succeed if no accesses from other harts to the address can be observed to have occurred between the SC and the last LR in this hart to reserve the address.

I read this as 'access' meaning ld, lr.d, sd, sc.d, and AMO* (using doubleword in this case, but similar instructions for word, half, and byte). Is that correct? I understand why AMOs, stores, and store conditionals would cause an invalidation of the reservation, but a load or load-reserved really shouldn't affect the condition. If loads do count as invalidating accesses, I see many opportunities for livelock.

For example, one may implement an atomic_decrement_if_positive(uint64_t *value) as (in pseudocode):
1. while (!done) {
2.   while (*value == 0) sleep_some_with_backoff();
3.   int64_t new_value = lr.d *value
4.   new_value--;
5.   done = (sc.d new_value, *value succeeded)   ; note sc.d writes 0 to rd on success
6. }

The above code run by multiple threads will livelock if loads on line 2 are considered 'invalidating accesses'. I would think that loads would not be considered invalidating accesses.

Also, if an lr.d of the same address by another hart is considered invalidating, then we can get livelock: the lr.d of hart 1 will invalidate the lr.d of hart 2, and then the lr.d of hart 2 will invalidate the lr.d of hart 1 on the next iteration. Instead, I would think that both reservations would be valid and the first store conditional will invalidate all reservations to this address.
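
For reference, roughly the same routine in C11 (a sketch; the weak compare-exchange is what an lr.d/sc.d pair implements, and the line-2 poll becomes an ordinary relaxed load):

#include <stdatomic.h>
#include <stdint.h>

/* Sketch: the inner poll is a plain load that should not need to disturb
 * anyone's reservation; only the compare-exchange (lr.d/sc.d) is atomic. */
static void atomic_decrement_if_positive(_Atomic uint64_t *value)
{
    for (;;) {
        uint64_t v = atomic_load_explicit(value, memory_order_relaxed);
        if (v == 0) {
            /* sleep_some_with_backoff();  -- placeholder from the pseudocode */
            continue;
        }
        if (atomic_compare_exchange_weak_explicit(value, &v, v - 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            return;   /* the sc.d succeeded */
    }
}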

If this is all in agreement, here's my suggestion to change the spec language:
An SC can succeed if no atomic memory operations (AMOs), stores, or store conditionals (SC) from other harts to the address can be observed to have occurred between the SC and the last LR in this hart to reserve the address. Further, if multiple harts issue an LR to the same address, they will all reserve the address, but the first hart to store or store conditionally will succeed and invalidate the reservation for all other harts.

Is this correct?

Rishi

Rishi Khan

May 10, 2018, 10:43:50 PM
to RISC-V HW Dev
I see that the new 2.3 draft spec addresses many of these issues. Specifically, ‘accesses’ have been changed to ‘stores’. I assume that should include atomic operations that change the value (i.e. atomic add would invalidate the load reservation, but atomic max that didn’t change the value wouldn’t). Is this correct?


Daniel Lustig

May 21, 2018, 5:36:47 PM
to Rishi Khan, RISC-V HW Dev
Hi Rishi,

AMOs are always considered stores, even in the case you described.
Likewise, an AMOADD with rs2=x0 is also considered a store, even though
it won't change the value in memory either. ...and so on. It would be
pretty difficult to reason about the memory consistency behavior of AMOs
if this weren't the case.

We'll add a blurb into the spec to clarify this.

Thanks for the question,
Dan

On 5/10/2018 7:43 PM, Rishi Khan wrote:
> I see that the new 2.3 draft spec addresses many of these issues. Specifically,
> ‘accesses' have been changed to ‘stores’. I assume that should include atomic
> operations that change the value (i.e. atomic add would invalidate the load
> reservation, but atomic max that didn’t change the value wouldn’t). Is this
> correct?