Zacas specification: AMOCAS.D

L Peter Deutsch

Sep 8, 2023, 7:40:10 PM
to isa...@groups.riscv.org
The spec looks good overall, but I wondered about one thing. The proposed
AMOCAS.D instructions execute quite differently on RV32 and RV64, using
paired registers on the former and single registers on the latter. The
rationale for allowing this is explained well in Section 1.3 of the master
RISC-V Instruction Set Manual. However, it's my perception that one of the
architectural guidelines for extensions in general is to support RV32
capabilities in RV64. This would suggest to me an alternative consisting of
.W and .WW (the present .D) on RV32, and .W, .WW (not currently supported),
.D, and .DD (the present .Q) on RV64 (names perhaps not well chosen, but I
hope they convey the idea). Was this alternative considered?

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Ved Shanbhogue

Sep 9, 2023, 2:27:47 PM
to L Peter Deutsch, isa...@groups.riscv.org
Thank you, Peter. As you rightly pointed out, Section 1.3 of the
Unpriv specification delves into this topic. The non-normative
text on page 5 further discusses the nuances about supporting
all operations of RV32 in RV64. The amocas.d provides identical
capability in RV32 and RV64 of performing a compare-and-swap
operation on a doubleword. The ".d" is used to emphasise that
the instruction operates on a data value of doubleword width and
".q" to emphasise that it operates on a data value of quadword
width.

regards
ved


L Peter Deutsch

Sep 9, 2023, 3:05:25 PM
to Ved Shanbhogue, isa...@groups.riscv.org
> The amocas.d provides identical capability in RV32 and RV64 of
> performing a compare-and-swap operation on a doubleword.

I see two slightly different ways of looking at this.

#1 (the one adopted in the present specification): The operation being
supported is "compare-and-swap on a datum of length <N>," for N = 32
and 64 on RV32, and N = 64 and 128 on RV64. The mapping of the datum to
registers may be different on RV32 and RV64.

#2 (the alternative I'm suggesting for consideration): The operation being
supported is "compare-and-swap on a datum of length <N> or a pair of data of
length <N>," for N = 32 on RV32 and N = 32 and 64 on RV64. The mapping of
the datum or data to registers is identical on RV32 and RV64.

I'd just like to know that the use case of a pair of 32-bit data items (as
opposed to a single 64-bit item) was considered for RV64 and omitted for
insufficient utility. (Of course on RV64 it's possible to take apart a
64-bit item into two 32-bits with 1 or 2 additional instructions, or
assemble a 64-bit value from two 32-bits with 2 instructions.)
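
As a rough illustration of the split and merge steps mentioned above, here is a minimal C sketch. The helper names (pack32x2, unpack_lo, unpack_hi) are made up for this example and the placement of the low value in bits 31:0 is an assumption; on RV64 each helper would be expected to compile to a handful of shift/OR instructions, with the exact count depending on the compiler and the extensions available.

```c
#include <stdint.h>

/* Merge two 32-bit values into one 64-bit word.
   Assumed layout: `lo` in bits 31:0, `hi` in bits 63:32. */
static inline uint64_t pack32x2(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Extract the low 32-bit value (truncation keeps bits 31:0). */
static inline uint32_t unpack_lo(uint64_t w) {
    return (uint32_t)w;
}

/* Extract the high 32-bit value (shift bits 63:32 down). */
static inline uint32_t unpack_hi(uint64_t w) {
    return (uint32_t)(w >> 32);
}
```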

I would note that I see double-register / double-word instructions as a
larger issue. The base (G) ISA doesn't have any, and I gather they were
considered controversial or undesirable since even the M extension doesn't
have N x N -> 2N or 2N / N instructions. Also, IIRC double-word loads and
stores have been proposed and not adopted. But double-datum instructions
clearly have utility, since they keep cropping up in other extensions.
Since Zacas has to support double-register / double-word accesses, it seemed
a little arbitrary to me to add a double-register 32-bit load/store in RV32
and take it away in RV64. Perhaps this is a time to reconsider double-word
loads and stores in general?

Thanks -

L Peter Deutsch

Sep 9, 2023, 3:14:13 PM
to Ved Shanbhogue, isa...@groups.riscv.org
Sorry, I wrote:

> #1 (the one adopted in the present specification): The operation being
> supported is "compare-and-swap on a datum of length <N>," for N = 32
> and 64 on RV32, and N = 64 and 128 on RV64. The mapping of the datum to
> registers may be different on RV32 and RV64.

My oversight, N = 32 is supported on RV64 as well.

--

Ved Shanbhogue

Sep 10, 2023, 10:35:48 AM
to L Peter Deutsch, isa...@groups.riscv.org
L Peter Deutsch wrote:
>I'd just like to know that the use case of a pair of 32-bit data items (as
>opposed to a single 64-bit item) was considered for RV64 and omitted for
>insufficient utility. (Of course on RV64 it's possible to take apart a
>64-bit item into two 32-bits with 1 or 2 additional instructions, or
>assemble a 64-bit value from two 32-bits with 2 instructions.)

Thank you, Peter. You're right; in RV64, one could technically
represent the 64-bit data values, both source and result, using
two 32-bit registers.

However, I trust you'd concur that this approach may not be the
most efficient for RV64. Given that XLEN=64 is RV64's native
data width, such a design would demand extra registers, potentially
causing spills. This would also necessitate further instructions
to break apart the compare value, split the swap value, and
subsequently merge the register pair holding the prior value.

The act of splitting a 64-bit data value into two 32-bit segments
and subsequently recombining them doesn't seem to offer substantial
benefits for algorithms that utilize these instructions. Hence,
avoiding this added complexity seems prudent.

>a little arbitrary to me to add a double-register 32-bit load/store in RV32

The guiding principle is: when an amocas instruction operates on a
2xXLEN data width, it fetches its operands from a pair of
XLEN-wide registers and outputs its result to a similar pair. I'd
also like to direct you to the non-normative note in the proposal
discussing the potential use of amocas.q in a future RV128 context.

>and take it away in RV64. Perhaps this is a time to reconsider double-word
>loads and stores in general?

Multi-XLEN load/store operations might indeed be beneficial
for certain application domains. The recent Zc extension has
incorporated support for them. The broader utility of operations
such as load-pair/store-pair or even load-multiple/store-multiple
might warrant independent consideration.

regards
ved

L Peter Deutsch

Sep 10, 2023, 1:04:50 PM
to Ved Shanbhogue, isa...@groups.riscv.org
> The act of splitting a 64-bit data value into two 32-bit segments
> and subsequently recombining them doesn't seem to offer substantial
> benefits for algorithms that utilize these instructions. Hence,
> avoiding this added complexity seems prudent.

Sorry, I didn't make my use case clear enough.

The amocas.q example on p. 6 of the draft spec illustrates atomically
accessing two *semantically distinct but coupled, and physically adjacent*
64-bit values on RV64. amocas.d provides the same function for 32-bit
values on RV32. My question was whether there is a significant use case for
*this function* (two coupled values) for 32-bit values on RV64. If there
is, the current proposal would require taking apart the two packed 32-bit
values and putting them back together. I'm not familiar enough with
multithreaded algorithms to know whether such a use case might exist: the
values would presumably not be pointers.

I don't want to spin this discussion out unnecessarily, and I apologize for
not having posed my question in enough detail before.

--

Ved Shanbhogue

Sep 10, 2023, 3:30:58 PM
to L Peter Deutsch, isa...@groups.riscv.org
L Peter Deutsch wrote:
> Sorry, I didn't make my use case clear enough.
>
> values on RV32. My question was whether there is a significant use case for
> *this function* (two coupled values) for 32-bit values on RV64. If there

Thank you for the clarification; I now fully grasp
your question. Most algorithms employing amocas
target individual data values, typically spanning
either 32 or 64 bits. Two coupled 32-bit values
are less common on RV64. CAS operations that
make use of the 2xXLEN width are vital in addressing
the ABA issue inherent in CAS-based, lock-free
algorithms that usually operate on XLEN wide pointers
and associate a version tag with the pointer.
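
The pointer-plus-version-tag idea can be sketched in portable C11 as follows. This is only an illustration under stated assumptions: it models the RV32 case (XLEN = 32, so 2xXLEN = 64) by packing a 32-bit node index, standing in for a pointer, with a 32-bit tag into one 64-bit word. The names tagged_head_t, make_tagged, tagged_cas and aba_is_detected are hypothetical, and whether a compiler lowers the compare-exchange to amocas.d depends on the toolchain and target.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: node index in bits 31:0, version tag in bits 63:32. */
typedef _Atomic(uint64_t) tagged_head_t;

static uint64_t make_tagged(uint32_t index, uint32_t tag) {
    return ((uint64_t)tag << 32) | index;
}

/* Try to swing the head from `observed` to `new_index`, bumping the tag.
   Fails if either the index or the tag changed since `observed` was read. */
static bool tagged_cas(tagged_head_t *head, uint64_t observed,
                       uint32_t new_index) {
    uint32_t next_tag = (uint32_t)(observed >> 32) + 1u;
    uint64_t desired = make_tagged(new_index, next_tag);
    return atomic_compare_exchange_strong(head, &observed, desired);
}

/* Single-threaded replay of an A -> B -> A interleaving.
   Returns true if the CAS based on the stale snapshot is rejected. */
static bool aba_is_detected(void) {
    tagged_head_t head;
    atomic_init(&head, make_tagged(1u, 0u));   /* head = node A (index 1) */

    uint64_t stale = atomic_load(&head);       /* "thread 1" reads A */

    /* "thread 2" changes A -> B, then B -> A; each step bumps the tag. */
    tagged_cas(&head, atomic_load(&head), 2u);
    tagged_cas(&head, atomic_load(&head), 1u);

    /* The index is A again, but the tag differs, so this must fail. */
    return !tagged_cas(&head, stale, 3u);
}
```

A CAS on the index alone would have succeeded in the last step, because the index has returned to its original value; the tag in the upper half is what exposes the intervening updates.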

regards
ved

L Peter Deutsch

Sep 10, 2023, 3:42:07 PM
to Ved Shanbhogue, isa...@groups.riscv.org
> Most algorithms employing amocas target individual data values, typically
> spanning either 32 or 64 bits. Two coupled 32-bit values are less
> common on RV64. CAS operations that make use of the 2xXLEN width are vital
> in addressing the ABA issue inherent in CAS-based, lock-free algorithms
> that usually operate on XLEN wide pointers and associate a version tag
> with the pointer.

Thanks, I understand. Then I agree that the present spec is appropriate.

Iztok Jeras

Sep 12, 2023, 12:14:39 PM
to RISC-V ISA Dev, L Peter Deutsch, isa...@groups.riscv.org, Ved Shanbhogue
Hi Ved,

Regarding 2xXLEN instructions in general,
is there some written argumentation in the base standard on why only even-numbered registers can be used,
or more precisely why the LO part is stored in the even GPR R and the HI part in the odd GPR R+1?

I can think of two basic options for implementing this in hardware.
1. The simplest option is to access the GPR register file sequentially over two separate clock periods.
2. In case a 32-bit CPU has a 64-bit system bus, this could be done in a single clock period.
The register file can be split into even/odd parts, thus enabling double throughput.

At first glance the even R requirement might reduce logic complexity,
but in practice I do not think it would spare even an XLEN wide multiplexer compared to unrestricted R value.

The even R requirement makes it easy to avoid the case where R=31 and R+1=X0; is this the reason?

Are there any considerations on how 2xXLEN instructions relate to the ABI?
I am just curious here, I do not have enough ABI experience to comment on it.

My interest comes from planning to add to my RV32 implementation a custom extension supporting atomic 64-bit load/store,
and I am writing a system bus protocol which would support atomic 64-bit transactions on a 32-bit data bus (locked transfer pair).
The primary purpose of these instructions would be atomic access to 64-bit timer/counter registers in peripherals,
thus avoiding the need to check whether the HI part changed after reading the LO part.
Also it would enable atomic access to 64-bit GPIO registers and similar use cases.
This could be useful for CSR instructions too.
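
For context, the software check that an atomic 64-bit load would make unnecessary can be sketched as below. This is a generic sketch, not code for any particular peripheral: read_counter64 and the two register pointers are hypothetical, and the counter is assumed to be free-running and monotonically increasing.

```c
#include <stdint.h>

/* Read a 64-bit counter exposed as two 32-bit registers on a 32-bit bus.
   `lo_reg` and `hi_reg` are hypothetical memory-mapped register addresses. */
static uint64_t read_counter64(const volatile uint32_t *lo_reg,
                               const volatile uint32_t *hi_reg) {
    uint32_t hi, lo, hi_again;
    do {
        hi = *hi_reg;
        lo = *lo_reg;
        hi_again = *hi_reg;   /* HI changed: LO wrapped in between, retry */
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}
```

With a single atomic 64-bit access, the loop and the third register read would not be needed.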

Such 64-bit load/store instructions on RV32 could reuse the existing codes for RV64 64-bit load/store, both the base I and C extension.
The encoding for RV128 128-bit load/store is a bit different and thus not a perfect fit to extend RV32.

Another use case I am (will be) looking into is the P extension.
See 'Processing of 64-bit Values in RV32' in Chapter 7, which states:
"Use of misaligned (odd-numbered) registers for 64-bit operands is reserved.".
An RV32IP CPU with a 64-bit load/store data bus could take advantage of 64-bit load/store instructions
to increase memory throughput.
This in general would be an advantage for any implementation focused on DSP functionality needing high data throughput.

The load-multiple/store-multiple approach in the Zc extension is rather different and focused on stack push/pop.

So what I am looking for is argumentation in the base standard that would deal with various aspects
(odd/even, X0 source/destination, atomicity, endianness, ABI, ...) of load-pair/store-pair instructions,
providing general recommendations for various extensions.

Regards,
Iztok Jeras

Ved Shanbhogue

Sep 12, 2023, 1:59:41 PM
to Iztok Jeras, RISC-V ISA Dev, L Peter Deutsch
Iztok Jeras wrote:
> Regarding 2xXLEN instructions in general, is there some written
> argumentation in the base standard on why only even numbered
> registers can be used, or more precisely the LO part is stored
> in the R even GPR and the HI part is stored in the R+1 odd GPR.

Here are the rationales for the choice made by Zacas:
1. This convention aligns with other RISC-V extensions,
such as the P extension and Zdinx.
2. Generating the paired register through a logical OR
operation is simpler and obviates the need for an adder.
3. Allows use of x0 when a zero source is needed or the
result is to be discarded, avoiding special treatment of
x1 or x31.
4. The integer calling convention passes 2xXLEN arguments in
registers as an aligned even-odd pair, with the even
register containing the least-significant bits. This
scheme is amenable to the convention, and compilers
comprehend such constraints.
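
Point 2 can be illustrated with a small C model of the register-index computation. The function names are hypothetical, and pair_hi_add shows an unrestricted alternative that is not part of the proposal. The separate treatment of x0 from point 3 is deliberately not modelled here.

```c
#include <assert.h>

/* Even-aligned pairing: the odd partner is the same index with bit 0 set.
   In hardware this amounts to forcing one index bit high; no carry chain. */
static unsigned pair_hi_or(unsigned rd) {
    assert(rd < 32u && (rd & 1u) == 0u);   /* only even indices are valid */
    return rd | 1u;
}

/* Hypothetical unrestricted pairing: partner = rd + 1. This needs a 5-bit
   incrementer, and x31 would wrap around to x0 unless handled specially. */
static unsigned pair_hi_add(unsigned rd) {
    assert(rd < 32u);
    return (rd + 1u) & 31u;
}
```

For even indices both functions agree (for example, 10 maps to 11); they differ only in the cases the even-register restriction rules out, such as 31 mapping to 0.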

regards
ved

Iztok Jeras

Sep 13, 2023, 2:29:25 AM
to RISC-V ISA Dev, Ved Shanbhogue, RISC-V ISA Dev, L Peter Deutsch, Iztok Jeras
Thanks Ved, I definitely forgot about the register index adder.

Michael Chapman

Sep 13, 2023, 1:32:22 PM
to Ved Shanbhogue, Iztok Jeras, RISC-V ISA Dev, L Peter Deutsch
An XOR rather than logical OR may generate some interesting
possibilities by reversing the high/low parts of the register pair
for very little additional implementation cost (XOR vs logical OR).

On 12/09/2023 19:59, Ved Shanbhogue wrote:
> ...
> 2. Generating the paired register through a logical OR
>    operation is simpler and obviates the need for an adder.
> ...
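
A minimal sketch of how such an XOR pairing might behave, assuming the named register always holds the low half; the struct and function names are hypothetical, and this is not what Zacas specifies.

```c
#include <assert.h>

struct reg_pair {
    unsigned lo;   /* register holding the least-significant half */
    unsigned hi;   /* register holding the most-significant half */
};

/* Hypothetical XOR pairing: the partner index is rd with bit 0 flipped.
   An even rd gives (rd, rd + 1); an odd rd gives (rd, rd - 1), which is
   the same physical pair with the high and low halves exchanged. */
static struct reg_pair pair_xor(unsigned rd) {
    assert(rd < 32u);
    struct reg_pair p = { rd, rd ^ 1u };
    return p;
}
```

For example, pair_xor(10) yields lo = 10, hi = 11, while pair_xor(11) yields lo = 11, hi = 10.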

Allen Baum

Sep 13, 2023, 1:43:48 PM
to Michael Chapman, Ved Shanbhogue, Iztok Jeras, RISC-V ISA Dev, L Peter Deutsch
Would there be an actual advantage to allowing that?
I think "interesting possibilities" that had any measurable performance impact would be hard to come up with.
