RV64 -> x86_64 binary translation POC (4000 MIPS)


Michael Clark

Mar 14, 2017, 2:10:43 AM
to RISC-V SW Dev
Hi,

As some of you may know I have been working on an alternative RISC-V ISA simulator, with an eventual goal of exploring RISC-V binary translation to x86_64.


rv-sim ABI Proxy Simulator - small Linux ABI subset for emulation of simple static executables

rv-sys Privileged System Emulator - similar to Spike; can boot riscv-linux to a busybox shell

rv-jit Proof of Concept ABI Proxy Simulator with JIT binary translation (alpha code)

The idea has always been to explore a monomorphism (or perhaps epimorphism) between the RISC-V ISA and a well known legacy 16-register 64-bit CISC ISA, potentially coalescing RISC sequences into CISC sequences, to achieve a relatively dense packing in real time. The principle is to explore a unidirectional mapping in contrast to a mapping to a target independent IR. The internal representation is the RISC-V ISA and fused RISC-V pseudo instructions (call, tail, la, li, etc). In any case, there has been substantial work to get a framework and scaffolding in place for “return to interpreter” before starting on trace-based binary translation. A priv-1.9 interpreter was completed a couple of months ago and rv-sys can now boot riscv-linux. Next up is binary translation…

I have just added a new program “rv-jit” and am now exploring how to map from the RISC-V ISA to x86_64. The intention is to implement a “fusion matcher” to coalesce RISC-V instruction sequences and translate them into CISC instruction sequences, for example potentially merging sequences into CISC memory operands (displacement, base, index, scale), only to later be broken back into micro-ops by the CISC micro-architecture. The principal idea is to maintain instruction density so that instruction level parallelism can exist in the translation.


There are many interesting challenges:

- Lifting RISC-V assembler back into pseudos
call, tail, la, li, plus more complex patterns (PLT+GOT indirect jumps)
- Register allocation (31 registers to 16 registers)
dynamic spill slots used as CISC memory operands to ALU ops (L1 latency)
- Handling mapping 3 operands to 2 operand destructive ALU ops (ADD, SUB, IMUL, IDIV)
potentially perform lookahead liveness analysis to eliminate temporaries  
- Branches and mapping flags to registers, CMP, SETcc (zero extending for SLT[i][u])
- Indirect jumps and call stack emulation
- Efficient register save restore when jumping to/from JIT traces
only save and restore registers used in the trace
- Efficient JIT of instret and cycles at trace tails
potentially a cycle accurate translation mode

However, I am now at the early stage where I have a very simple working proof of concept that can execute native code for simple loops. There was a reasonable amount of work to get the framework in place.

The interpreter (similarly to Spike) has a memory-backed array for registers, e.g.

add       r[rd] = r[rs1] + r[rs2]

The JIT assigns commonly used target registers to host registers (currently using a static mapping). The JIT exploits CISC memory operands for spilt registers to maintain instruction density. The interpreter maintains a program counter execution count history (not optimised yet - uses std::map internally) and once an instruction meets a given threshold, the interpreter switches to trace and JIT compile, concurrently interpreting and emitting JIT code. The tracer exits on loop termination or when it finds an instruction it can’t translate (currently the proof of concept has just minimal instructions for optimising simple loops). Before executing any instruction the JIT trace cache is looked up by program counter and if a JIT trace exists, the native trace is executed from the trace cache.

- look up trace cache for program counter; execute native trace if it exists
- fetch instruction, update program counter stats, execute
- if the execution threshold is met, run the JIT codegen tracer and cache the trace

The abstraction is that the instruction fetch throws an exception (internally similar to a misaligned or any other fetch exception) when the instruction count threshold is met. The tracer is then invoked; it concurrently interprets and JIT compiles code, and exits when it detects loop termination. It could potentially do non-interpretive, opportunistic JIT of branch tails. Something to explore…

The proof of concept works, and there is now a scaffolding to explore RISC-V to x86_64 translation. The POC has just enough instructions to optimise simple loops (add, addi, bne, ld).  Encouraging early results show a 40X speed improvement over interpretation in a synthetic loop sum benchmark. This is not surprising as interpretation is dominated by decode and memory access to registers.

Interestingly the test loop gets translated 3 times as the code does not yet index on basic block. The interpreter simply starts a trace on any instruction that reaches the execution threshold, and on tail exit from JIT another instruction entry point might hit the threshold and cause re-translation. Even so, it achieves a 40X speed boost, likely with a reasonable amount of instruction level parallelism.

I still need to enable the fusion matcher and translate many more instructions… and once I have a more complete JIT, the JIT engine will be integrated with the full system emulator to accelerate riscv-linux on x86_64 hosts.

Note: in the asm below [rbp + (reg+1) * 8] refers to spilt registers, with [rbp + 0] as the program counter. In the trace you can see the tails of branches storing the guest program counter for re-entry into the interpreter. The translated code below only happens to be using hot registers, so the trace doesn’t show any [rbp + offset] accesses to spilt registers.

rbp -> u64 pc
u64 regs[32] /* translator drops writes to zero */

test-loop-2 (see below) has 24 million loop iterations or 128,000,176 RISC-V instructions retired

- interp 97.5 MIPS
- jit-interp 4000 MIPS

Here is the simple loop sum benchmark: 


The benchmark is run on a single core of a 3.1GHz Broadwell i7-5557U so the JITed RISC-V x86_64 code is retiring approximately 1.3 instructions per cycle. The prolog/epilog of the trace can be further optimised. This is version 0.0-prealpha. It will still be some time before the translator will be able to run useful workloads (like linux kernel and gcc) but the initial results are promising.

Michael.


$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2  

rv-jit-0.0.0-prealpha-0

total=300000000

real 0m1.312s
user 0m1.308s
sys 0m0.000s

$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2  --trace 

rv-jit-0.0.0-prealpha-0

total=300000000

real 0m0.032s
user 0m0.032s
sys 0m0.000s

$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2 --trace --trace-iters 100 --log-jit-trace

rv-jit-0.0.0-prealpha-0

jit-trace-begin pc=0x00000000000100d4
# 0x00000000000100d4 add         a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld          a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi        a4, a4, 8
add r12, 8
# 0x00000000000100e0 add         a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne         a4, a2, pc - 16
cmp r12, r10
jne 1f
mov [rbp + 0], 0x100e8
jmp term
1:
mov [rbp + 0], 0x100d4
jmp term
# 0x00000000000100e8 addi        a5, a5, -1
add r13, -1
# 0x00000000000100ec bne         a5, zero, pc - 28
cmp r13, 0
jne 1f
mov [rbp + 0], 0x100f0
jmp term
1:
mov [rbp + 0], 0x100d0
jmp term
# 0x00000000000100d0 addi        a4, zero, 0
mov r12, 0
term:
jit-trace-end   pc=0x00000000000100d4
jit-trace-begin pc=0x00000000000100d0
# 0x00000000000100d0 addi        a4, zero, 0
mov r12, 0
# 0x00000000000100d4 add         a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld          a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi        a4, a4, 8
add r12, 8
# 0x00000000000100e0 add         a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne         a4, a2, pc - 16
cmp r12, r10
jne 0x00000000000100d4
mov [rbp + 0], 0x100e8
jmp term
term:
jit-trace-end   pc=0x00000000000100d4
jit-trace-begin pc=0x00000000000100e8
# 0x00000000000100e8 addi        a5, a5, -1
add r13, -1
# 0x00000000000100ec bne         a5, zero, pc - 28
cmp r13, 0
jne 1f
mov [rbp + 0], 0x100f0
jmp term
1:
mov [rbp + 0], 0x100d0
jmp term
# 0x00000000000100d0 addi        a4, zero, 0
mov r12, 0
# 0x00000000000100d4 add         a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld          a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi        a4, a4, 8
add r12, 8
# 0x00000000000100e0 add         a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne         a4, a2, pc - 16
cmp r12, r10
jne 0x00000000000100d4
jmp 0x00000000000100e8
term:
jit-trace-end   pc=0x00000000000100d4
total=300000000

real 0m0.060s
user 0m0.056s
sys 0m0.000s



Andrew Waterman

Mar 14, 2017, 3:30:48 AM
to Michael Clark, RISC-V SW Dev
4 BIPS is a very encouraging result. Good luck with the binary translator!

Michael Clark

Mar 14, 2017, 4:05:11 AM
to Andrew Waterman, RISC-V SW Dev
Hi Andrew,

Thanks!

Yes, I was surprised. It’s going to be more interesting when I have completed IMA so we can make some more meaningful comparisons. I’ll then be curious how it will work on a more dynamic workload like linux-kernel (which is IMA centric). I'll likely need a custom trie for fast trace cache lookups. I don’t even have FENCE.I implemented yet… but I had closed the loop on getting simple native traces running, which was a milestone.

I’d really like to see kubernetes and Docker on RISC-V in the cloud. Ideally silicon, but translators will be essential for running RISC-V cloud software images on development machines until server silicon arrives.

Michael.

Michael Clark

Mar 21, 2017, 10:16:10 PM
to RISC-V SW Dev, Petr Kobalicek
Hi,

I’ve completed JIT translation of RVI so it is now possible to do a more meaningful comparison with riscv-qemu. I’m cc’ing Petr Kobalicek as I have used his excellent asmjit library for x86_64 codegen.

I wrote 114 JIT test cases to flush out bugs in the translator (test-jit). There were some sign extension bugs and some issues with mapping 3 operands to 2-operand destructive ALU ops, along with testing the many combinations of register and memory operands for frame-pointer-relative access to statically spilt register slots (something the complex CISC memory operands enable, helping us map 31 registers onto the smaller number of host registers). I finally added an --audit option to rv-jit that saves and restores the processor state, then compares it after executing one instruction under the JIT followed by interpreting it. I then ran the riscv-qemu-tests. The JIT binary translation is now accurate enough to run complex integer codes.

I’ve run a benchmark “test-sha512” that computes a SHA-512 hash of 1 million 64-byte blocks, averaging the result of 10 runs, each run executing 3,773,554,857 instructions:

                        Runtime (seconds)
riscv-meta (interp)     29.10
riscv-meta (JIT)         1.19
qemu-user                1.21
native (gcc -O3)         0.28

                        Instructions/sec
riscv-meta (interp)        129,673,642 (129.6 MIPS)
riscv-meta (JIT)         3,158,080,129 (3.158 BIPS)
qemu-user                3,111,497,362 (3.111 BIPS)
native (gcc -O3)        13,493,044,780 (13.49 BIPS)

                        SHA-512 MB/sec
riscv-meta (interp)       2.10
riscv-meta (JIT)         51.08
qemu-user                50.33
native (gcc -O3)        218.24

This is the benchmark code:


This is an example of one trace translated from riscv64 to x86_64 using rv-jit’s --log-jit-trace option:


Notes:

- The riscv-meta JIT codegen is a first cut and has yet to be optimised.
- Currently supports x86_64 SysV ABI (Windows requires changes to the trace prolog/epilog).
- JALR is currently interpreted as it needs to look up the JIT trace cache. JALR/RET can be further optimised.
- The fusion coalescing state machine (LUI, AUIPC pairs for loads, stores, far jumps, etc) is disabled until I squash bugs.
- JIT trace cache lookup is unoptimised and uses out-of-the-box std::map<addr_t,trace_fn_t>
- RISC-V code is compiled with -Os riscv64-unknown-elf-gcc (GCC) 6.1.0
- x86_64 native code is compiled with -O3 gcc (Debian 6.3.0-6) 6.3.0 20170205
- The host is an Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz running Debian Stretch / Linux kernel 4.8.0
- rv-jit is in a common address space with the target executable (similar to qemu-user)
- loads and stores are translated directly (masking of loads/stores can be added for sandboxing).
- rv-sys will eventually be able to use the JIT engine with soft-mmu loads and stores, to accelerate linux-kernel
- rv-sys with shadow paging is the ultimate goal (hard-mmu to accelerate address translated loads and stores)
- QEMU command: time qemu-riscv64 build/riscv64-unknown-elf/bin/test-sha512
- riscv-meta command: time rv-jit --trace build/riscv64-unknown-elf/bin/test-sha512

I thought debugging a JIT engine would be ominously complex so was surprised how quickly it came together. I had no idea how it would compare with qemu, but I am encouraged by parity, as I have yet to perform any optimisations on the translation (nor have I looked at TCG). My first commit on rv-jit was 23-Feb-2017, at which point riscv-meta was just an interpreter. Petr Kobalicek’s asmjit made the translator very easy to write.

Michael.

Michael Clark

Apr 1, 2017, 4:32:41 PM
to RISC-V SW Dev
A brief analysis of a SHA-512 implementation on RISC-V

Summary: SHA-512 has a large number of rotates and byte swaps in its inner loop. gcc translates rotates and byte swaps to ROR, ROL and BSWAP on x86. The rotates are expressed in C as a rotate right but gcc infers some as rotate right and some as rotate left. Each rotate immediate expands to 3 instructions on RISC-V (SRLI, SLLI, OR) as the subtract is constant (rotate with register operand requires at least 5 instructions, LI, SUB, SRLI, SLLI, OR).

The amortised saving of a rotate primitive for SHA-512 would likely be relatively high due to the frequency of use; however, we would need to modify the compiler or do a more detailed analysis to characterise the loop complexity to get accurate instruction counts with and without rotate and byte swap instructions. Similarly, SHA-256, ChaCha20, NORX and many other cryptographic algorithms make extensive use of rotate.

Also, the __builtin_bswap64 call expands to a single inline instruction on x86 but to a JAL to a 30-instruction function on RISC-V.

Side note: Some modern ciphers such as ChaCha20 are optimised for little-endian words and the algorithm reference implementations can be rewritten to work on little endian words, however the constants and algorithm in the analysed reference implementation of SHA-512 relies on big endian input (network byte order).

Here is a snippet of some of the inner loop primitives that make up SHA-512:

static inline uint64_t rotate_r(uint64_t x, int d) {
    return (x >> d) | (x << (64 - d));
}

static inline uint64_t S0 (uint64_t h1) {
    return rotate_r(h1, 28) ^ rotate_r(h1, 34) ^ rotate_r(h1, 39);
}

static inline uint64_t S1 (uint64_t h4) {
    return rotate_r(h4,14) ^ rotate_r(h4,18) ^ rotate_r(h4,41);
}

There are two loops
- one calls 16 rounds of bswap, S0 and a sequence of several ADD, AND and XOR
- one calls 80 rounds of S0, S1 and a sequence of several ADD, AND and XOR

S0 permute on x86
- 3 ROR and 2 XOR
- Total 4 instructions

S0 permute on RISC-V
- 3 SRLI, 3 SLLI, 3 OR, 2 XOR
- Total 11 instructions

BSWAP on x86
- __builtin_bswap64 which is called for all input data on x86 is a single inline BSWAP instruction.
- Total 1 instruction per 8-byte input word

BSWAP on RISC-V
- __builtin_bswap64 makes a JAL to __bswapdi2 for all input data
- Total 31 instructions including one direct jump and one indirect jump per 8-byte input word


0x0000000000010804: LOC_000051:<__bswapdi2>
               10804:   03851613             slli           a2, a0, 56
               10808:   03855793             srli           a5, a0, 56
               1080c:   8fd1                 or             a5, a5, a2
               1080e:   6641                 lui            a2, 65536
               10810:   02855693             srli           a3, a0, 40
               10814:   f0060613             addi           a2, a2, -256
               10818:   8ef1                 and            a3, a3, a2
               1081a:   0ff00713             addi           a4, zero, 255
               1081e:   01855613             srli           a2, a0, 24
               10822:   8fd5                 or             a5, a5, a3
               10824:   00ff06b7             lui            a3, 16711680
               10828:   8e75                 and            a2, a2, a3
               1082a:   01871593             slli           a1, a4, 24
               1082e:   00855693             srli           a3, a0, 8
               10832:   8eed                 and            a3, a3, a1
               10834:   8fd1                 or             a5, a5, a2
               10836:   02071593             slli           a1, a4, 32
               1083a:   00851613             slli           a2, a0, 8
               1083e:   8e6d                 and            a2, a2, a1
               10840:   8fd5                 or             a5, a5, a3
               10842:   02871593             slli           a1, a4, 40
               10846:   01851693             slli           a3, a0, 24
               1084a:   8fd1                 or             a5, a5, a2
               1084c:   8eed                 and            a3, a3, a1
               1084e:   1742                 slli           a4, a4, 48
               10850:   1522                 slli           a0, a0, 40
               10852:   8fd5                 or             a5, a5, a3
               10854:   8d79                 and            a0, a0, a4
               10856:   8d5d                 or             a0, a0, a5
               10858:   8082                 jalr           zero, ra, 0

Errata: the native instructions per second in the previous email are based on the RISC-V instruction count, not the native instruction count, so the native numbers misrepresent the actual number of instructions retired on x86.

Regarding binary translation. I have performed a number of peephole optimisations (e.g. use LEA as a 3 operand ADD) to increase JIT performance by about 15% since the previous numbers, but many optimisations remain (threading trace branches, learning/caching indirect call trace addresses, shadow stacks with trace addresses for RET acceleration, etc). I need to find some code that doesn’t exercise future ISA extensions to get a more favourable comparison. I think a reasonable proportion of the 3.5X slowdown compared to native x86 is due to rotate and byte swap.

I will likely try to benchmark the algorithms in Cifra. I note that its SHA-512 implementation is also formulated to use big endian input. It may be worthwhile to use these algorithms as benchmarks for the Bit Manipulation Extension. 


• AES in the GCM, CCM, EAX and OCB authenticated encryption modes.
• NORX authenticated encryption system.
• SHA224, SHA256, SHA384 and SHA512 hash functions (including HMAC and PBKDF2).
• SHA3-224, SHA3-256, SHA3-384, SHA3-512 hash functions (FIPS 202 compatible).
• ChaCha20 and Salsa20 stream ciphers.
• Poly1305 one time MAC.

It will be interesting to study the performance difference with XLEN-bit primitives versus separate 32-bit and 64-bit primitives (given the shifter already has 32-bit shifts on RV64, it may make sense to have 32-bit rotates, if it is just a loop bit feeding into the shifter). 

Andrew Waterman

Apr 1, 2017, 5:00:10 PM
to Michael Clark, RISC-V SW Dev
That byte-swapping implementation is particularly bad. When swapping
bytes in bulk, it takes 13 instructions per word (with a lot of ILP).

Nevertheless, the point stands; a strong case can be made for BSWAP
(and even ROL and ROR) in the B extension.

Michael Clark

unread,
Apr 1, 2017, 5:15:39 PM4/1/17
to Andrew Waterman, RISC-V SW Dev
Yep. We could think about implementing some intrinsics using inline asm. I hadn’t seen this asm until now. I just decided to exercise crypto MAC functions and started disassembling to see what was behind the numbers. Last time I was looking at bswap16 or bswap32.

> Nevertheless, the point stands; a strong case can be made for BSWAP
> (and even ROL and ROR) in the B extension.

ROL and ROR are certainly at the borderline of instruction count where one would think twice about their inclusion. Many other 3 instruction substitutions will not be justifiable but given the frequency of rotates in crypto and the relative simplicity in a barrel shifter, I think rotates should be in B. It would be nice however to quantify it. I might have to get more familiar with gcc internals to run some experiments.

Andrew Waterman

Apr 1, 2017, 5:39:18 PM
to Michael Clark, RISC-V SW Dev
Indeed. These evaluations should consider the microarchitectures
likely to be used to execute these algorithms. In some situations,
reducing the instruction count will commensurately reduce ILP so
there's little performance gain.

Michael Clark

Apr 2, 2017, 6:42:42 PM
to Andrew Waterman, RISC-V SW Dev
On 2 Apr 2017, at 8:59 AM, Andrew Waterman <and...@sifive.com> wrote:


Can we really bswap64 in 13 instructions?

a) this is the canonical bswap macro which I can hand write in 29 instructions (incl. ret) by shifting and reusing the mask, saving 1 instruction on gcc's builtin.

#if defined __GNUC__
#define bswap64(x) __builtin_bswap64(x)
#else
#define bswap64(x) \
    ((uint64_t)(x) << 56) | \
    (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \
    (((uint64_t)(x) << 24) & 0x00ff0000000000ULL) | \
    (((uint64_t)(x) << 8)  & 0x0000ff00000000ULL) | \
    (((uint64_t)(x) >> 8)  & 0x000000ff000000ULL) | \
    (((uint64_t)(x) >> 24) & 0x00000000ff0000ULL) | \
    (((uint64_t)(x) >> 40) & 0x0000000000ff00ULL) | \
    ((uint64_t)(x)  >> 56))
#endif

b) this is likely getting close to the 13 instruction variant but it requires large constants, i.e. loads

uint64_t bswap64(uint64_t val)
{
    val = ((val << 8) & 0xFF00FF00FF00FF00ULL ) | ((val >> 8) & 0x00FF00FF00FF00FFULL );
    val = ((val << 16) & 0xFFFF0000FFFF0000ULL ) | ((val >> 16) & 0x0000FFFF0000FFFFULL );
    return (val << 32) | (val >> 32);
}

I think I see what you mean regarding bulk byte-swapping, as option b) works well in bulk mode, but a compiler is never going to be able to use the latter code sequence for the vast majority of code out there which uses the standard header macros that evaluate to __builtin_bswap64, i.e. the common case.

I think the __bswapdi2 is practically optimal. We could save 1 instruction.

Andrew Waterman

Apr 2, 2017, 7:00:59 PM
to Michael Clark, RISC-V SW Dev
Yes, this is what I was getting at. (Though you only need 2 large
constants, not 4, because the mask can be applied before the shift,
i.e., ((val << 8) & mask) | ((val & mask) >> 8).)

Loading large constants from memory isn't inherently bad; they will be
cache hits if bswapping frequently (which this discussion
presupposes).

Michael Clark

Apr 2, 2017, 7:18:27 PM
to Andrew Waterman, RISC-V SW Dev
OK. Right.

It may be worthwhile to change those intrinsics. It’s likely that the C codegen using two constants is not too bad.

I’ll give it a try…

Bruce Hoult

Apr 2, 2017, 7:36:09 PM
to Michael Clark, Andrew Waterman, RISC-V SW Dev
This is one of the subtle and clever little points of AArch64, where the immediate forms of the logical operators (but not the arithmetic operators) use 12 bits in the instruction to specify *patterns*, rather than values.

uint64_t bswap64(uint64_t val)
{
    val = ((val << 8) & 0xFF00FF00FF00FF00ULL ) | ((val >> 8) & 0x00FF00FF00FF00FFULL );
    val = ((val << 16) & 0xFFFF0000FFFF0000ULL ) | ((val >> 16) & 0x0000FFFF0000FFFFULL );
    return (val << 32) | (val >> 32);
}

0000000000000000 <bswap64>:
   0: d378dc01 lsl x1, x0, #8
   4: 92089c21 and x1, x1, #0xff00ff00ff00ff00
   8: d348fc00 lsr x0, x0, #8
   c: 92009c00 and x0, x0, #0xff00ff00ff00ff
  10: aa000020 orr x0, x1, x0
  14: d370bc01 lsl x1, x0, #16
  18: 92103c21 and x1, x1, #0xffff0000ffff0000
  1c: d350fc00 lsr x0, x0, #16
  20: 92003c00 and x0, x0, #0xffff0000ffff
  24: aa000020 orr x0, x1, x0
  28: 93c08000 ror x0, x0, #32
  2c: d65f03c0 ret

11 instructions (omitting the ret), 44 bytes, no memory references, no preloading of 64 bit literals -- they're all included inside the 32 bit instructions (magic).

That would be 13 instructions if the ror had to be expanded, of course.

(Note: at -O2 or above this function compiles to simply "rev x0, x0")





Michael Clark

Apr 2, 2017, 7:47:50 PM
to Bruce Hoult, Andrew Waterman, RISC-V SW Dev
On 3 Apr 2017, at 11:36 AM, Bruce Hoult <br...@hoult.org> wrote:

This is one of the interesting little subtle and clever points on Aarch64, where the immediate forms of the logical operators (but not arithmetic operators) use 12 bits in the instruction to specify *patterns*, rather than values.


11 instructions (omitting the ret), 44 bytes, no memory references, no preloading of 64 bit literals — they're all included inside the 32 bit instructions (magic).

That’s quite neat!

Annoyingly gcc is performing constant folding, even when I try to rewrite the code to foil it. The costs are slightly wrong, as there is more memory bandwidth for 4 x 8-byte constants and an AUIPC+LD pair versus a single shift to reuse the constant. It probably has to be written in assembler unless I can figure out how to foil the constant folding. This should be two instructions fewer and have two loads. I’ll try inline asm.

0000000000010222 <__bswap64>:
   10222: 00007717           auipc a4,0x7
   10226: 13673703           ld a4,310(a4) # 17358 <__trunctfdf2+0x32e>
   1022a: 00851793           slli a5,a0,0x8
   1022e: 8ff9                 and a5,a5,a4
   10230: 8121                 srli a0,a0,0x8
   10232: 00007717           auipc a4,0x7
   10236: 12e73703           ld a4,302(a4) # 17360 <__trunctfdf2+0x336>
   1023a: 8d79                 and a0,a0,a4
   1023c: 8fc9                 or a5,a5,a0
   1023e: 00007717           auipc a4,0x7
   10242: 12a73703           ld a4,298(a4) # 17368 <__trunctfdf2+0x33e>
   10246: 01079513           slli a0,a5,0x10
   1024a: 8d79                 and a0,a0,a4
   1024c: 83c1                 srli a5,a5,0x10
   1024e: 00007717           auipc a4,0x7
   10252: 12273703           ld a4,290(a4) # 17370 <__trunctfdf2+0x346>
   10256: 8ff9                 and a5,a5,a4
   10258: 8fc9                 or a5,a5,a0
   1025a: 02079513           slli a0,a5,0x20
   1025e: 9381                 srli a5,a5,0x20
   10260: 8d5d                 or a0,a0,a5
   10262: 8082                 ret



Michael Clark

Apr 2, 2017, 8:42:49 PM
to Andrew Waterman, RISC-V SW Dev, Bruce Hoult
This 19 instruction bswap64 implementation should be faster on Rocket. I have confirmed it is faster under a translated model. It would need to be re-jigged to work as inline asm however I think it is easier to express intrinsics like these as assembly files versus trying to futz with gcc inline asm register specifiers for in, out and clobbers.

I could have a look at how libgcc handles architecture specific intrinsics, and a 32-bit version (with a single 32-bit constant), if you think it’s worthwhile…

# bswap64.s

.section .text
.globl __bswap64

__bswap64:
1: auipc   a4, %pcrel_hi(__bswap64_c1)
ld      a4, %pcrel_lo(1b)(a4)
slli    a5, a0, 8
and     a5, a5, a4
srli    a0, a0, 8
srli    a4, a4, 8
and     a0, a0, a4
or      a5, a5, a0
1: auipc   a4, %pcrel_hi(__bswap64_c2)
ld      a4, %pcrel_lo(1b)(a4)
slli    a0, a5, 16
and     a0, a0, a4
srli    a5, a5, 16
srli    a4, a4, 16
and     a5, a5, a4
or      a5, a5, a0
slli    a0, a5, 32
srli    a5, a5, 32
or      a0, a0, a5
ret

.section .rodata
__bswap64_c1: .8byte 0xFF00FF00FF00FF00
__bswap64_c2: .8byte 0xFFFF0000FFFF0000


Michael Clark

Apr 2, 2017, 9:42:04 PM
to Andrew Waterman, RISC-V SW Dev, Bruce Hoult
It appears the approach is only a win for bswap64 (aka __bswapdi2).

David Chisnall

Apr 3, 2017, 4:45:37 AM
to Michael Clark, Andrew Waterman, RISC-V SW Dev, Bruce Hoult
On 3 Apr 2017, at 01:42, Michael Clark <michae...@mac.com> wrote:
>
> This 19 instruction bswap64 implementation should be faster on Rocket. I have confirmed it is faster under a translated model. It would need to be re-jigged to work as inline asm however I think it is easier to express intrinsics like these as assembly files versus trying to futz with gcc inline asm register specifiers for in, out and clobbers.
>
> I could have a look how libgcc handles architecture specific intrinsics and a 32-bit version (with a single 32-bit constant) if you think its worthwhile…

Note that the loads of the constant are likely to be in a different page and so take up an extra TLB entry and cache line. This kind of implementation generally does much better in microbenchmarks, but can cause performance cliffs in larger code to appear earlier unless the constants end up on a page that a number of other hot code paths are using. If they end up in a constant island near the code, then they have other negative effects (being mapped executable and so potential sources of gadgets for exploits, filling up i-cache with data that is never executed, appearing as potential branch-predictor targets, and so on). Many of these effects only manifest on more complex pipeline designs.

David
