LDAXR (Rarg0), Rout
ADD Rarg1, Rout
STLXR Rout, (Rarg0), Rtmp
CBNZ Rtmp, -3(PC)
But with ARMv8.1, a single instruction is enough to implement it, as shown below:
LDADDA (Rarg0), Rtmp, Rout
So we are considering supporting the new instructions introduced by ARMv8.1 for better performance and scalability. Broadly, there are two solutions:
Solution 1: add a variant, just like the variants of GOARCH="arm". We can reuse GoArm (https://github.com/golang/go/wiki/GoArm) or add a new setting to specify the CPU variant at compile time.
Solution 2: dynamic feature detection, as is already done for some optional features of ARMv8.0 (such as CRC). The feature detection happens when the atomic package is imported.
Obviously, solution 1 performs better, since the variant is selected statically at compile time and the compiler (gc) can inline the instruction into its caller. But it may cause other problems. For example, OCI (Open Container Initiative) refers to the architectures and variants supported by Go, so adding more variants for arm64 may fragment container images.
Solution 2 has no fragmentation problem, and the user only needs to build one binary for both ARMv8.0 and ARMv8.1 CPUs. But it cannot reach the best performance, since additional instructions (something like a C function pointer) are needed to choose the right implementation at run time, and that overhead is likely to be nontrivial. There may even be a performance regression on ARMv8.0 CPUs due to the additional overhead incurred by dynamic feature detection.
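To make the run-time cost of solution 2 more concrete, here is a minimal sketch in plain Go of what the dispatch could look like. It is only an illustration, not the runtime's actual code: `hasLSE`, `xaddLLSC`, `xaddLSE` and `Xadd` are hypothetical names, the flag is set by hand rather than by real CPU detection, and `sync/atomic` calls stand in for the two assembly sequences shown above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hasLSE is a hypothetical flag. In a real implementation it would be
// set once at startup by CPU feature detection; here it is set by hand.
var hasLSE bool

// xaddLLSC stands in for the ARMv8.0 LDAXR/ADD/STLXR/CBNZ retry loop.
// A compare-and-swap loop is used as the portable equivalent.
func xaddLLSC(addr *uint32, delta uint32) uint32 {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old+delta) {
			return old + delta
		}
	}
}

// xaddLSE stands in for the single ARMv8.1 atomic add instruction.
func xaddLSE(addr *uint32, delta uint32) uint32 {
	return atomic.AddUint32(addr, delta)
}

// Xadd adds delta to *addr atomically and returns the new value.
// The branch on hasLSE is the extra work every call pays under solution 2.
func Xadd(addr *uint32, delta uint32) uint32 {
	if hasLSE {
		return xaddLSE(addr, delta)
	}
	return xaddLLSC(addr, delta)
}

func main() {
	var counter uint32
	fmt.Println(Xadd(&counter, 2)) // takes the LL/SC-style path
	hasLSE = true
	fmt.Println(Xadd(&counter, 3)) // takes the single-instruction-style path
}
```

With solution 1, the check in `Xadd` would disappear: the variant would be fixed at compile time, so the compiler could emit the chosen sequence directly at the call site.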
Is there any other solution that avoids fragmentation and still achieves the best performance? If not, which of the two is more suitable to implement?