## Summary
This proposal gives a brief overview of the challenges of lowering to LL/SC
loops and details the approach I am taking for RISC-V. Beyond getting feedback
on that work, my intention is to find consensus on moving other backends
towards a similar approach and sharing common code where feasible. Scroll down
to 'Questions' for a summary of the issues I think need feedback and
agreement.
For the original discussion of LL/SC lowering, please refer to James
Knight's 2016 thread on the topic:
http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html
I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
generous with their review feedback on this atomics work so far.
## Background: Atomics in LLVM
See the documentation for full details <https://llvm.org/docs/Atomics.html>.
In short: LLVM defines memory ordering constraints to match the C11/C++11
memory model (unordered, monotonic, acquire, release, acq_rel, seq_cst).
These can be given as parameters to the atomic operations supported in LLVM
IR (a short example follows the list below):
* Fences with the fence instruction
* Atomic load and store with the 'load atomic' and 'store atomic' variants of
the load/store instructions.
* Fetch-and-binop / read-modify-write operations through the atomicrmw
instruction.
* Compare and exchange via the cmpxchg instruction. Takes memory orderings for
both the success and failure cases. Can also specify a 'weak' vs 'strong'
cmpxchg, where the weak variant is allowed to fail spuriously.
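For illustration, here is a small IR fragment using these operations (the
pointer and value names are arbitrary):

  fence acquire
  %v = load atomic i32, i32* %p acquire, align 4
  store atomic i32 %v, i32* %q release, align 4
  %old = atomicrmw add i32* %p, i32 1 acq_rel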
## Background: Atomics in RISC-V
For full details see a recent draft of the ISA manual
<https://github.com/riscv/riscv-isa-manual/releases/download/draft-20180612-548fd40/riscv-spec.pdf>,
which incorporates work from the Memory Consistency Model Task Group to define
the memory model. RISC-V implements a weak memory model.
For those not familiar, RISC-V is a modular ISA, with standard extensions
indicated by single letters. Baseline 'RV32I' or 'RV64I' instruction sets
don't support atomic operations beyond fences. However the RV32A and RV64A
instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
(load-linked/store-conditional on other architectures). 32-bit atomic
operations are supported natively on RV32, and both 32 and 64-bit atomic
operations are supported natively on RV64.
AMOs such as 'amoadd.w' implement simple fetch-and-binop behaviour. For
LR/SC: LR loads a word and registers a reservation on the source memory address.
SC writes the given word to the memory address and writes success (zero) or
failure (non-zero) into the destination register. LR/SC can be used to
implement compare-and-exchange or to implement AMOs that don't have a native
instruction. To do so, you would typically perform LR and SC in a loop.
However, there are strict limits on the instructions that can be placed
between a LR and an SC while still guaranteeing forward progress:
"""
The static code for the LR/SC sequence plus the code to retry the sequence in
case of failure must comprise at most 16 integer instructions placed
sequentially in memory. For the sequence to be guaranteed to eventually
succeed, the dynamic code executed between the LR and SC instructions can only
contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
instructions. The code to retry a failing LR/SC sequence can contain backward
jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraints.
"""
The native AMOs and LR/SC allow ordering constraints to be specified in the
instruction. This isn't possible for load/store instructions, so fences must
be inserted to represent the ordering constraints. 8 and 16-bit atomic
load/store are therefore supported using 8 and 16-bit load/store plus
appropriate fences.
See Table A.6 on page 187 in the linked specification for a mapping from C/C++
atomic constructs to RISC-V instructions.
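As a loose IR-level sketch (IR fences are coarser-grained than the fences
listed in Table A.6, so this is indicative only): an acquire i8 load can be
expressed as a monotonic load followed by an acquire fence, which then maps
onto an 8-bit load plus the appropriate RISC-V fence:

  %v = load atomic i8, i8* %p monotonic, align 1
  fence acquire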
## Background: Lowering atomics in LLVM
The AtomicExpandPass can help you support atomics for your target in a number
of ways, e.g. inserting fences around atomic loads/stores, or converting an
atomicrmw/cmpxchg to a LL/SC loop. It operates as an IR-level pass, meaning
the latter ability is problematic - there is no way to guarantee that the
invariants for the LL/SC loop required by the target architecture will be
maintained. This shows up most frequently when register spills are introduced
at O0, but spills could theoretically still occur at higher optimisation
levels and there are other potential sources of issues: inappropriate
instruction selection, machine block placement, machine outlining (though see
D47654 and D47655), and likely more.
I highly encourage you to read James Knight's previous post on this topic,
which goes into much more detail about the issues handling LL/SC
<http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
remains pretty much the same:
* ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
and above but use a custom post-regalloc expansion for O0
* MIPS doesn't use AtomicExpandPass, but selects atomic pseudoinstructions
which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
the problems described above, so MIPS is in the process of moving towards a
two-stage lowering, with the LL/SC loop lowered after register allocation. See
D31287 <https://reviews.llvm.org/D31287>.
* Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.
Lowering a word-sized atomic operation to an LL/SC loop is typically trivial,
requiring little surrounding code. Part-word atomics require additional
shifting and masking as a word-size access is used. It would be beneficial if
the code to generate this shifting and masking could be shared between
targets, and if the operations that don't need to be in the LL/SC loop are
exposed for LLVM optimisation.
The path forwards is very clearly to treat the LL/SC loop as an indivisible
operation which is expanded as late as possible (and certainly after register
allocation). However, there are a few ways of achieving this.
If atomic operations of a given size aren't supported, then calls should be
created to the helper functions in libatomic, and this should be done
consistently for all atomic operations of that size. I actually found GCC is
buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.
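As a rough sketch of this libcall path (the helper is part of the libatomic
ABI; the final argument encodes the memory ordering, with 5 being
__ATOMIC_SEQ_CST), a seq_cst 64-bit fetch-and-add on RV32 would become
something like:

  declare i64 @__atomic_fetch_add_8(i8*, i64, i32)

  define i64 @fetch_add_64(i64* %p, i64 %inc) {
    %p8 = bitcast i64* %p to i8*
    %old = call i64 @__atomic_fetch_add_8(i8* %p8, i64 %inc, i32 5)
    ret i64 %old
  }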
## Proposed lowering strategy (RISC-V)
Basic principles:
* The LL/SC loop should be treated as a black box, and expanded post-RA.
* Don't introduce intrinsics that simply duplicate existing IR instructions
* If code can be safely expanded in the IR, do it there. [I'd welcome feedback
on this one - should I be taking a closer look at expansion in SelectionDAG
legalisation?]
The above can be achieved by extending AtomicExpandPass to support a 'Custom'
expansion method, which uses a TargetLowering call to expand to custom IR,
including an appropriate intrinsic representing the LL+SC loop.
Atomic operations are lowered in the following ways:
* Atomic load/store: Allow AtomicExpandPass to insert appropriate fences
* Word-sized AMO supported by a native instruction: Leave the IR unchanged and
use the normal instruction selection mechanism
* Word-sized AMO without a native instruction: Select a pseudo-instruction
using the normal instruction selection mechanism. This pseudo-instruction will
be expanded after register allocation.
* Part-word AMO without a native instruction: Shifting and masking that occurs
outside of the LL/SC loop is expanded in the IR, and a call to a
target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
pseudo-instruction which is expanded after register allocation (see the sketch
after this list).
* Part-word AMO without a native instruction that can be implemented by a
native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
transformation.
* Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
normal instruction selection mechanism. This pseudo-instruction will be
expanded after register allocation.
* Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
llvm.riscv.masked.cmpxchg.i32.
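To make the part-word AMO case concrete, here is a rough sketch of what an
atomicrmw add i8 might expand to on RV32 (assuming 32-bit pointers). The
intrinsic name is the one proposed above, but its exact signature and the
integer encoding of the ordering operand are illustrative rather than final:

  define i8 @atomicrmw_add_i8(i8* %p, i8 %v) {
    %addr = ptrtoint i8* %p to i32
    %alignedaddr = and i32 %addr, -4        ; word-align the address
    %wordptr = inttoptr i32 %alignedaddr to i32*
    %byteoff = and i32 %addr, 3
    %shamt = shl i32 %byteoff, 3            ; bit offset of the byte within the word
    %mask = shl i32 255, %shamt             ; mask selecting the target byte
    %vext = zext i8 %v to i32
    %vshifted = shl i32 %vext, %shamt       ; operand positioned within the word
    ; The LL/SC loop itself stays behind the intrinsic and is expanded post-RA;
    ; the final operand encodes the memory ordering (encoding illustrative).
    %oldword = call i32 @llvm.riscv.masked.atomicrmw.add.i32(i32* %wordptr, i32 %vshifted, i32 %mask, i32 2)
    %oldshifted = lshr i32 %oldword, %shamt
    %old = trunc i32 %oldshifted to i8
    ret i8 %old
  }

  declare i32 @llvm.riscv.masked.atomicrmw.add.i32(i32*, i32, i32, i32)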
Scratch registers for these pseudo-instructions are modelled as in ARM and
AArch64, by specifying multiple outputs and an @earlyclobber constraint to
ensure the register allocator assigns them unique registers, e.g.:
class PseudoCmpXchg
    : Pseudo<(outs GPR:$res, GPR:$scratch),
             (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
  let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
  let mayLoad = 1;
  let mayStore = 1;
  let hasSideEffects = 0;
}
Note that there are additional complications with cmpxchg such as supporting
weak cmpxchg (which requires returning a success value), or supporting
different failure orderings. It looks like the differentiation between
strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
Supporting only strong cmpxchg and using the success ordering for the failure
case is conservative but correct, I believe.
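For reference, a weak cmpxchg with distinct success and failure orderings is
written as below; under the conservative scheme just described it would be
treated as a strong cmpxchg with acq_rel used for both orderings:

  %r = cmpxchg weak i32* %p, i32 %expected, i32 %desired acq_rel acquire
  %old     = extractvalue { i32, i1 } %r, 0
  %success = extractvalue { i32, i1 } %r, 1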
In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
latest possible moment. The RISCVExpandPseudoInsts pass is registered with
addPreEmitPass2.
The main aspect I'm unhappy with in this approach is the need to introduce new
intrinsics. Ideally these would be documented as not for use by frontends and
subject to future removal or alteration - is there precedent for this?
Alternatively, see the suggestion below to introduce target-independent masked
AMO intrinsics.
## Alternative options
1. Don't expand anything in IR, and lower to a single monolithic
pseudo-instruction that is expanded at the last minute.
2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
loop is expanded in EmitInstrWithCustomInserter but the LL/SC loop is
represented by a new pseudoinstruction. This final pseudoinstruction is then
expanded after register allocation. This minimises the possibility for sharing
logic between backends, but does mean we don't need to expose new intrinsics.
Mips adopts this approach in D31287.
3. Target-independent SelectionDAG expansion code converts unsupported atomic
operations. e.g. rather than converting `atomicrmw add i8` to AtomicLoadAdd,
expand to nodes that align the address and calculate the mask as well as an
AtomicLoadAddMasked node. I haven't looked at this in great detail.
4. Introducing masked atomic operations to the IR. Mentioned for completeness,
I don't think anyone wants this.
5. Introduce target-independent intrinsics for masked atomic operations. This
seems worthy of consideration.
For 1. and 2. the possibility for sharing logic between backends is minimised
and the address calculation, masking and shifting logic is mostly hidden from
optimisations (though option 2. allows e.g. MachineCSE). There is the
advantage of avoiding the need for new intrinsics.
## Patches up for review
I have patches up for review which implement the described strategy. More
could be done to increase the potential for code reuse across targets, but I
thought it would be worth getting feedback on the path forwards first.
* D47587: [RISCV] Codegen support for atomic operations on RV32I.
<https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
uses AtomicExpandPass to generate libatomic calls otherwise. Committed in
rL334590.
* D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
<https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
atomic load/store. Committed in rL334591.
* D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
<https://reviews.llvm.org/D47882>. Implements the lowering strategy described
above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
Under review.
* D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
<https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
* D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
<https://reviews.llvm.org/D48130> Separated patch as this modifies the
existing shouldExpandAtomicCmpXchgInIR interface. Under review.
* D48141: [RISCV] Implement codegen for cmpxchg on RV32I.
<https://reviews.llvm.org/D48141>. Implements the lowering strategy described
above. Under review.
## Questions
To pick a few to get started:
* How do you feel about the described lowering strategy? Am I unfairly
overlooking a SelectionDAG approach?
* How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
other architectures to use such an approach?
* If there is enthusiasm, how worthwhile is it to share logic for generation
of masks+shifts needed for part-word atomics?
* I'd like to see ARM+AArch64+Hexagon move away from the problematic
expansion in IR and to have that code deleted from AtomicExpandPass. Are
there any objections?
* What are your thoughts on the introduction of new target-independent
intrinsics for masked atomics?
Many thanks for your feedback,
Alex Bradbury, lowRISC CIC
I hadn't noticed that either - excellent point! The RISC-V requirements
for forward progress encompass the LR, the SC, the instructions
between them, and the code to retry the LR/SC. A weak cmpxchg by
definition separates the LR/SC from the code to retry it and so
suffers from the same sort of problems that we see when exposing LL and
SC individually as IR intrinsics. I agree with your diagnosis as far
as RISC-V is concerned, at least.
Best,
Alex
I'm not sure deprecation is justified. A "Sufficiently Smart Compiler"
could make use of the knowledge that a weak cmpxchg was considered
sufficient by the programmer, and use the weak cmpxchg form if the
surrounding loop can be shown to meet the forward progress guarantees
of the architecture. In LLVM, this could be done by analyzing the
surrounding basic block when expanding the cmpxchg pseudo after
regalloc. It's unlikely it would be worth the hassle, but it does seem
a viable approach.
Best,
Alex
I think it would be a great shame and I'd like to avoid it if at all
possible, though I'm also uncomfortable with the current situation.
The problem with late expansion is that even the simplest variants add
some rather unpleasant pseudo-instructions, and I've never even seen
anyone attempt to get good CodeGen out of it (simplest example being
"add xD, xD, #1" in the loop for an increment but there are obviously
more). Doing so would almost certainly involve duplicating a lot of
basic arithmetic instructions into AtomicRMW variants.
Added to that is the fact that actual CPU implementations are often a
lot less restrictive about what can go into a loop than is required
(for example even spilling is fine on ours as long as the atomic
object is not in the same cache-line; I suspect that's normal). That
casts the issue in a distinctly theoretical light -- we've been doing
this for years and as far as I'm aware nobody has ever hit the issue
in the real world, or even had CodeGen go wrong in a way that *could*
do so outside the -O0 situation.
OTOH that *is* an argument for performance over correctness when you
get right down to it, so I'm not sure I can be too forceful about it.
At least not without a counter-proposal to restore guaranteed
correctness.
Tim.
Thanks for the reply Tim. It's definitely fair to highlight that ARM
and AArch64 have few documented restrictions on code within an LL/SC
loop when compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.
> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.
Let's separate the concerns here:
1) Quality of codegen within the LL/SC loop
* What is the specific concern here? The LL/SC loop contains a very
small number of instructions, even for the masked atomicrmw case. Are
you worried about an extra arithmetic instruction or two? Sub-optimal
control-flow? Something else?
2) Number of new pseudo-instructions which must be introduced
* You'd need new pseudos for each word-sized atomicrmw which expands
to an ll/sc loop, and an additional one for the masked form of the
operation. You could reduce the number of pseudos by taking the
AtomicRMWInst::BinOp as a parameter. The code to map the atomic op to
the appropriate instruction is tedious but straight-forward.
> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.
That's true as long as the "Exclusives Reservation Granule" == 1
cache-line and you don't deterministically cause the reservation to be
cleared in some other way: e.g. repeatedly generating a conflict miss
or triggering a trap, etc. I don't think any Arm cores ship with
direct-mapped caches, so I'll admit this is unlikely.
The possibility for issues increases if the Exclusives Reservation
Granule is larger. For the Cortex-M4, the ERG is the entire address
range <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.
There also seem to be documented and very strict forward progress
constraints for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207
"""
Forward progress can only be made using LoadExcl/StoreExcl loops if,
for any LoadExcl/StoreExcl loop within a single thread of execution if
both of the following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
register writes, cache maintenance instructions, SVC instructions, or
exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having
returned a fail result and the retry of the Load-Exclusive:
– There are no stores to any location within the same Exclusives
reservation granule that the StoreExclusive is accessing.
– There are no direct or indirect register writes, other than
changes to the flag fields in APSR or FPSCR, caused by data processing
or comparison instructions.
– There are no direct or indirect cache maintenance instructions,
SVC instructions, or exception returns
"""
Of course it also states that the upper limit for the Exclusives
Reservation Granule is 2048 bytes, but the Cortex-M33 has an ERG of
the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...
> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.
I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then
warn/fail if they occur. As you point out, this would be conservative
for many Arm implementations.
Best,
Alex
Oh I see what you're saying, it's the fact that by bypassing
instruction selection we're missing cases where an ADDI could be
selected rather than an ADD, which would potentially free up a
register and save the instruction generated for materialising the
constant. I don't like to see the compiler generate code that's
obviously dumber than what a human would write, but in this case do we
really think there would be any sort of measurable impact on
performance?
Yes.
> I don't like to see the compiler generate code that's
> obviously dumber than what a human would write, but in this case do we
> really think there would be any sort of measurable impact on
> performance?
It's certainly going to be marginal, but then so is the benefit of
late expansion.
There's also the barrier that this genuinely is a place where people
are willing to hand-code assembly, and go rooting around in the
compiler's output to check we're doing what they expect even if they
don't use assembly. The assembly is possibly mostly for historical
reasons, but it's still out there and we want to convince people to
move away from it.
I think I'm going to try to implement that verifier pass you mentioned
and run it across an iOS build.
Cheers.
Tim.
> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?
Hexagon has very few restrictions on what will cause loss of a
reservation: those are stores to the same address (a 64-bit granule) or
any sort of exception/interrupt/system call. Other than that the
reservation should stay. The architecture doesn't explicitly guarantee
this, but in the absence of the elements listed above, a program
with LL/SC can be expected to make progress.
Consequently, the best way for us to handle LL/SC would be to expand
them early and let them be optimized as any other code. The usual
optimization restrictions should be sufficient to prevent introduction
of factors causing a loss of reservation.
With the constraints on LL/SC varying wildly between architectures,
maybe we should have several options available for different targets?
I feel differently: to my mind there's a lot to gain by late expansion
and relatively little to lose. By expanding only the LL/SC loop late
and potentially making use of 'asm goto' support in the future
(as James suggests), there should be minimal/no codegen impact. Output
that is guaranteed to meet the platform requirements for forward
progress with a codegen method that is resilient to the introduction
of new in-tree or out-of-tree passes is a huge win to me vs the status
quo. Although the current approach appears to work ok in practice we
know expansion in IR for LL/SC is fundamentally on shaky ground - at
least for archs other than Hexagon.
> There's also the barrier that this genuinely is a place where people
> are willing to hand-code assembly, and go rooting around in the
> compiler's output to check we're doing what they expect even if they
> don't use assembly. The assembly is possibly mostly for historical
> reasons, but it's still out there and we want to convince people to
> move away from it.
>
> I think I'm going to try to implement that verifier pass you mentioned
> and run it across an iOS build.
That would be interesting. My concern with the verifier approach is
that it may not leave the end-user with many options if it fails. It
would still be a huge improvement over possibly generating questionable
code with no way of checking, of course.
Best,
Alex
Thanks Krzysztof, it sounds like Hexagon gives so much freedom for
LL/SC loops that there really isn't a need to be concerned about
expansion in IR. As such, it would clearly make sense to retain that
codepath over the long-term, even if architectures are discouraged
from using it unless they're _really_ sure it's safe for their
architecture.
> With the constraints on LL/SC varying wildly between architectures, maybe we
> should have several options available for different targets?
Hexagon is the first architecture I've encountered where there are
essentially no restrictions, so that certainly changes things. I was
mainly trying to determine consensus on the ultimate goal. If
everyone agreed that late expansion is the path forwards, we could
plan to deprecate and eventually remove the current early expansion
codepath.
Best,
Alex
Indeed, even if everyone agreed this was a good idea I wasn't
expecting to do this all at once.
>> The above can be achieved by extending AtomicExpandPass to support a
>> 'Custom'
>>
>> expansion method, which uses a TargetLowering call to expand to custom IR,
>> including an appropriate intrinsic representing the LL+SC loop
>
>
> I think it'd be better to sink more of the functionality into
> AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
> ties into whether to introduce a common intrinsic that can be used across
> architectures...
Yes, I'd like to do more in AtomicExpandPass. Adding the 'Custom' hack
was the easiest way of prototyping this, and this thread will
hopefully give good guidance on the level of interest in using this in
a target-independent way.
>> * Word-sized compared-and-exchange: Lower to a pseudo-instruction using
>> the
>> normal instruction selection mechanism. This pseudo-instruction will be
>> expanded after register allocation.
>
>
> On RISCV, implementing the whole thing in the pseudo is probably right,
> since you only really have the inner-loop.
>
> But for other archs like ARMv7, I think it'll probably makes sense to
> continue to handle a bunch of the cmpxchg expansion in IR. There, the
> current cmpxchg expansion can be quite complex, but only loop really needs
> to be a primitive (we'd need two loop variants, both "strex, ldrex, loop" or
> "ldrex, strex, loop", depending on whether it generates an initial
> ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex,
> barriers-- can all remain IR-level expanded.
Good point. As you say, the RISC-V expansion is much more
straightforward. Although the reservation could be cleared eagerly after
compare failure by an SC to a dummy memory location, I don't currently
intend to do so:
1) GCC also doesn't intend to use such an expansion
2) No existing microarchitectural implementations have been shown to
benefit from this manual eager reservation clearing
3) Sticking to the simplest expansion is a good starting point, and
future microarchitects are most likely to optimise for code that is
out in the wild
> I'll note that in all cases, both for RISCV and ARM and others, we _really_
> would like to be able to have the compare-failure jump to a different
> address than success. That isn't possible for an intrinsic call at the
> moment, but I believe it will be possible to make that work soon, due to
> already ongoing work for "asm goto", which requires similar. Once we have
> that ability, I don't see any reason why the late-lowering cmpxchg pseudo
> should have any performance downside vs IR expansion, at least w.r.t. any
> correct optimizations.
I've seen periodically recurring discussions, but is someone actually
actively working on this?
>> The main aspect I'm unhappy with in this approach is the need to introduce
>> new
>> intrinsics. Ideally these would be documented as not for use by frontends
>> and
>> subject to future removal or alteration - is there precedent for this?
>> Alternatively, see the suggestion below to introduce target-independent
>> masked
>> AMO intrinsics.
>
>
> I agree -- it'd be great to somehow name/annotate these intrinsics, such
> that it's clear that they're a private implementation detail _INSIDE_ the
> llvm target codegen passes, with no stability guaranteed. Not even "opt"
> passes should be emitting them, so they should never end up in a bitcode
> file where we'd need to provide backwards compatibility.
>
> Maybe we can call them something like
> "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
> random number in the middle is something that changes), and document that
> nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.
llvm.internal_use_only.masked.atomicrmw.add.i32 would get the point
across I think.
Is it not possible someone would generate a .bc after AtomicExpandPass
has run? Of course, even now there's no guarantee such a file would
work on a future version of LLVM, e.g. the atomics lowering strategy
could change from one release to the next.
>> ## Alternative options
>
> [A bunch of alternatives I don't like, so i'm omitting them]
>
>>
>> 5. Introduce target-independent intrinsics for masked atomic operations.
>> This
>>
>> seems worthy of consideration.
>
>
> I think it's definitely worthwhile looking to see if the set of intrinsics
> for the atomicrmw operations (in particular, the set of additional arguments
> computed in the pre-loop section, and the return value) are likely going to
> be the same on the different architectures we have now. If so, I think it's
> definitely worthwhile making a commonly-named intrinsic. However, even so,
> it should remain for internal use only, and only implemented on targets
> which tell AtomicExpandPass to generate it. I'd like it to be common only to
> enhance code-reuse among the targets, not to provide a user-facing API.
Yes, giving the impression of providing a new user-facing API is my
concern. Particularly as we might want to define a comprehensive
set of intrinsics but have targets support only the subset they
require for comprehensive atomics support (as some instructions may
go through the usual SDISel route without transformation to
intrinsics).
A common set of intrinsics is probably ok, but there are cases where
you might want a different interface. For instance clearing a
reservation requires an SC to a dummy address on RISC-V. Although we
have no intention of doing this in the RISC-V backend currently,
targets that wanted to implement such an approach might want to have
that dummy address as an argument to the intrinsic.
Best,
Alex
Hi Simon. Although expanding at the MC layer has the nice property of
being as close to just emitting inline ASM as possible, that's
actually not what I'm advocating. If you look at e.g.
https://reviews.llvm.org/D47882 you'll see that as much expansion
as possible takes place in IR, while the LL/SC loop is expanded in
a very late-stage MachineFunctionPass. For the reasons you stated, it
sounds like the Mips backend wouldn't want to perform the last-stage
expansion quite as late as I'm doing in RISC-V, which is a choice each
backend is free to make.
Surely the strategy described is no worse than the scheme in D31287 in
terms of delay slot filling etc, and possibly has minor advantages by
allowing a little more to be expanded at the IR level and thus sharing
a little more code with other backends. Or am I misunderstanding?
To update on this, everything has now landed other than D48141 which
implements cmpxchg for RISC-V and introduces target-independent code
to help lowering part-word cmpxchg. In response to the comments on
this thread and review comments I revised the patches so that as much
lowering as possible is done in AtomicExpandPass, thus maximising the
potential for code reuse across different targets.
I've additionally posted https://reviews.llvm.org/D52234 which updates
the AtomicExpandPass docs. If there is consensus, I think both James
and I would prefer to more strongly encourage that targets avoid
expanding LL/SC loops in IR on the grounds that LLVM has no way to
guarantee that the generated code complies with the architecture's
requirements for forward progress.
I believe it would be worthwhile for the Mips backend to evaluate
using AtomicExpandPass in the same way the RISC-V backend now does. It
wouldn't provide a correctness benefit over the current logic, but
should help reduce the amount of target-specific code for handling
atomics.
Tim: did you get a chance at looking at a verifier pass?
James suggested that the eventual solution for "asm goto" may well
help in cleaning up the suboptimal control-flow around a cmpxchg
lowered using the approach described in this RFC. Does anybody know of
any work in this area?
Best,
Alex
Just wanted to ping this thread given the ongoing Dev Meeting. If
anyone wants to discuss atomics lowering in person, I'd be keen to do
so. https://reviews.llvm.org/D48141 is still awaiting review, but once
landed atomics lowering should be "complete" using this proposed
strategy for the RISC-V backend.
I understand an RFC is soon going to be posted for asm goto, which as
James noted may allow more-efficient branching in the cmpxchg case.