Floating point atomic memory operations

Eric Lorimer

unread,
Sep 15, 2023, 8:56:44 AM
to RISC-V ISA Dev
Hello,

As I understand from reading the 'A' spec, the atomic memory operations (specifically atomic add) only apply to integers. Is this correct? Is there any plan or proposal to support floating point atomic memory operations (like atomic add), or is this even desirable?


- Eric

Nick Knight

unread,
Sep 15, 2023, 10:28:06 AM
to Eric Lorimer, RISC-V ISA Dev
Correct, the A-extension instructions only support a small collection of integer arithmetic and bitwise logical operations.

Software can use these atomic building blocks to perform more general atomic operations: for example, C++ floating-point atomic fetch_{add,sub} can be implemented with a compare-exchange retry loop.
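A minimal sketch of such a loop (the function name is illustrative, not a standard API):

```cpp
#include <atomic>

// Sketch: fetch_add on a float built from a compare-exchange loop.
// The name atomic_fetch_add_float is illustrative, not a standard API.
float atomic_fetch_add_float(std::atomic<float>& obj, float arg) {
    float expected = obj.load();
    // On failure, compare_exchange_weak reloads 'expected' with the
    // current value, so we simply retry with the fresh value.
    while (!obj.compare_exchange_weak(expected, expected + arg)) {
    }
    return expected;  // previous value, matching fetch_add semantics
}
```

(Since C++20, std::atomic<float> provides fetch_add directly; on RISC-V it is typically lowered to a retry loop along these lines.)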

I was not involved in the design of the A-extension, but I suspect one reason atomic floating-point arithmetic was omitted is the implementation challenge of handling floating-point exception flags, needed for compliance with IEEE-754 and for consistency with the rest of RISC-V's floating-point support.

There are certainly important use-cases, like parallel reductions. To justify pursuing hardware support, it would help to demonstrate that current software-based approaches incur a substantial overhead in the context of the overall application. Depending on the (HW/SW) platform, the bottleneck may be elsewhere: threading library, OS scheduler, cache coherence mechanism, etc.

Best,
Nick Knight

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/a4c2c04d-6211-42d9-a255-f74db24916ebn%40groups.riscv.org.

Abel Bernabeu

unread,
Sep 16, 2023, 3:23:41 AM
to Nick Knight, Eric Lorimer, RISC-V ISA Dev
Parallel reduction needs fast messaging rather than atomics, and Zawrs provides that.

Floating point atomics are uncommon in real-world applications. That may be one reason why the need has not been flagged before.

If you come up with a killer use case, it of course deserves consideration.

Regards.

Paul Campbell

unread,
Sep 16, 2023, 6:38:32 PM
to RISC-V ISA Dev
On Sat, Sep 16, 2023 at 7:23 PM 'Abel Bernabeu' via RISC-V ISA Dev
<isa...@groups.riscv.org> wrote:
>
> Parallel reduction needs fast messaging, rather than atomics. Zawrs provides that.
>
> Floating point atomics are uncommon in real-world applications. That may be one reason why the need has not been flagged before.

Also, FMV.X.*/FMV.*.X may just be register renaming in some implementations.

- Paul

David Chisnall

unread,
Sep 25, 2023, 12:26:52 PM
to Nick Knight, Eric Lorimer, RISC-V ISA Dev
On 15 Sep 2023, at 15:27, 'Nick Knight' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
>
> I was not involved in the design of the A-extension, but I suspect one reason atomic floating-point arithmetic was omitted is the implementation challenge of handling floating-point exception flags, needed for compliance with IEEE-754 and for consistency with the rest of RISC-V's floating-point support.

For anyone who is interested, the C11 spec has pseudocode for how atomic operations should be lowered:

fenv_t fenv;
T1 *addr = &E1;
T2 val = E2;
T1 old = *addr;
T1 new;
feholdexcept(&fenv);
for (;;) {
    new = old op val;
    if (atomic_compare_exchange_strong(addr, &old, new))
        break;
    feclearexcept(FE_ALL_EXCEPT);
}
feupdateenv(&fenv);
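Instantiated for an atomic += on a float, that pattern looks something like this in C++ (a sketch; the function name is illustrative, and real code may also need `#pragma STDC FENV_ACCESS ON` where the compiler supports it):

```cpp
#include <atomic>
#include <cfenv>

// Sketch instantiating the C11 pseudocode for atomic += on float.
// Illustrative name; not a standard API.
float atomic_add_with_fenv(std::atomic<float>& obj, float val) {
    std::fenv_t fenv;
    float oldv = obj.load();
    std::feholdexcept(&fenv);  // save FP environment, clear exception flags
    for (;;) {
        float newv = oldv + val;
        if (obj.compare_exchange_strong(oldv, newv))
            break;
        // Discard flags raised by the failed attempt before retrying.
        std::feclearexcept(FE_ALL_EXCEPT);
    }
    std::feupdateenv(&fenv);  // restore env, re-raising the saved flags
    return oldv;
}
```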

Encoding all of that in a single hardware atomic operation would be quite painful.

GPUs typically have different requirements, and atomic operations on floating-point values for a GPU or other accelerator that doesn’t need to run arbitrary C[++] code would be a different matter.

David



MitchAlsup

unread,
Oct 5, 2023, 6:46:03 PM
to RISC-V ISA Dev, Eric Lorimer
On Friday, September 15, 2023 at 7:56:44 AM UTC-5 Eric Lorimer wrote:
Hello,

As I understand from reading the 'A' spec, the atomic memory operations (specifically atomic add) only apply to integers. Is this correct? Is there any plan or proposal to support floating point atomic memory operations (like atomic add), or is this even desirable?

While a case may be made that this is desirable, one must recognize that an integer ALU is "not very big" (~2K gates) while an FPU is "quite large" (~50K-100K gates). So while we can plop integer ALUs where we need them at low power and area overhead, the same is not true for FPUs. And then there is the issue of exceptions: memory-based atomics can be defined with exception-free behavior (unsigned), while this becomes significantly harder with FPUs.

So, because of these things, FP atomics are not on the drawing board, and likely never will be.

BGB

unread,
Oct 6, 2023, 12:47:51 PM
to isa...@groups.riscv.org
Yes, agreed.


Integer atomic instructions are annoying in that they "merely" require
shoving some ALU logic into the L1 cache or load/store mechanism or
similar.

Also, one can "mostly" pull off these sorts of ALU ops within a single
clock cycle, so presumably they would not increase latency or add any
pipeline stalls or similar.

So, "not too horrible" for the most part.

But, for floating point, it is very different in that one is almost
invariably going to need multiple clock cycles for this, which means
either needing to make Load/Store operations have a significantly longer
latency (unrealistic) or (more likely) needing to handle each
floating-point atomic with a stall or similar (in addition to needing to
bolt an FPU onto the memory-access logic, which is also not ideal).


Similarly, if one assumes that the main use case of atomic operations is
for writing code that deals with synchronization between threads or
similar, then floating-point atomic operations don't make sense.


I decided to leave out possible alternative pipelines that would allow
for this sort of thing (even if one could make it work and have it fully
pipelined, if it effectively required a 10- or 11-stage pipeline as a
result, would it be worth it?).

Though one possible merit is that an 11-stage pipeline could more
directly mimic x86, it would (in a reasonable implementation) likely
end up with worse average-case performance than a load/store-oriented
pipeline (in a naive design, one would likely be looking at 3 or 4
cycles of latency even for a plain integer ADD).



Robert Finch

unread,
Oct 6, 2023, 5:41:48 PM
to RISC-V ISA Dev, BGB

Correct me if I am wrong, but I think it is possible to build up more complex atomic operations from simpler integer-based ones. That is, a floating-point sequence could be protected by a mutex or semaphore and the result transferred atomically. I am wondering whether an atomic FP sequence could be wrapped up in microcode that uses integer atomics? That would give instruction density if FP atomics were needed. Would it be worthwhile to reserve a few opcodes for FP atomics implemented in microcode? I am guessing not. Maybe an instruction modifier could handle it.

I wonder if there is an HLL that handles atomic sequences?

Also, only a small subset of integer operations is typically available as atomics. My memory controller does not support multi-bit shifts by default, since that is a lot of hardware. As with FP, multi-cycle integer ops (MUL, DIV, etc.) are also not supported. So how would those ops be made atomic?
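For illustration, the mutex-protected sequence described above could be sketched like this (names are made up; a real implementation might use std::mutex or a more careful lock):

```cpp
#include <atomic>

// Sketch of the mutex-protected approach: a spinlock built from an
// integer atomic guards a plain floating-point update. The names
// g_lock, g_sum, and locked_fadd are illustrative.
std::atomic_flag g_lock = ATOMIC_FLAG_INIT;
float g_sum = 0.0f;

void locked_fadd(float v) {
    // Spin until we observe the lock previously clear.
    while (g_lock.test_and_set(std::memory_order_acquire)) {
    }
    g_sum += v;  // ordinary FP add inside the critical section
    g_lock.clear(std::memory_order_release);
}
```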

Daniel Petrisko

unread,
Oct 6, 2023, 6:04:29 PM
to Robert Finch, RISC-V ISA Dev, BGB
One can achieve atomic operations on floats with an LR/SC loop, moving the value between the integer and FP register files:

retry:
    lr.w    t0, (a0)       # load-reserved the word at address a0
    fmv.w.x f0, t0         # move the raw bits into an FP register
    fadd.s  f0, f0, f1     # do the FP add
    fmv.x.w t0, f0         # move the result back to an integer register
    sc.w    t1, t0, (a0)   # store-conditional the result
    bnez    t1, retry      # retry if the reservation was broken

If there were an FLR/FSC pair, one could get rid of the intermediate moves and shorten the loop to four instructions. That would seem to be a fairly cheap extension in comparison to full float atomics. Not sure if this has been proposed / discussed in the past.
 
Best,
- Dan


Nick Knight

unread,
Oct 6, 2023, 6:22:21 PM
to Daniel Petrisko, Robert Finch, RISC-V ISA Dev, BGB
> Not sure if this has been proposed / discussed in the past

There was some related discussion in this PR conversation:

Eric Lorimer

unread,
Oct 9, 2023, 9:47:06 AM
to RISC-V ISA Dev, Nick Knight, Robert Finch, RISC-V ISA Dev, BGB, Daniel Petrisko
Thanks for the good discussion, everyone. For context, the application is a graph neural network (GNN) with a push-based update algorithm that uses atomic add. (The code was ported from somewhere else, but I believe atomic add, or some other reduction, is common in push-based graph processing algorithms.) The target hardware is a SIMT architecture. Indeed, the floating-point atomic add gets lowered to the four-instruction compare-and-swap loop shown in the 'A' extension spec. This necessitates divergence handling and the overhead that comes with it, since some threads may need to exit the loop while others take another iteration. Naturally, it would require much more investigation to accurately quantify the overheads, costs, and potential gains, but it sounds non-trivial and not of general interest, so I don't think I will pursue this idea further right now.


Thanks,
- Eric

K. York

unread,
Oct 14, 2023, 5:29:18 AM
to Eric Lorimer, RISC-V ISA Dev, Nick Knight, Robert Finch, BGB, Daniel Petrisko
If you're building a SIMT core, you're acting like a GPU, and including some custom instructions to handle this may be prudent. I suspect you'd just be pushing the retry-loop divergence down into hardware, but the code-size savings of letting your GPU driver simply emit X.amofadd are probably worth it.
