Interpreting fused AUIPC+JALR as a direct jump (with or without register side effect)


Michael Clark

Mar 2, 2017, 10:55:49 PM3/2/17
to RISC-V ISA Dev
Just sharing thoughts on fused AUIPC+JALR being considered a direct jump and link…

In static analysis of large binaries, many library calls take the form of AUIPC+JALR rather than JAL (which has only a ±1 MiB range). It is common for modern applications to have 10 MB to 100 MB of text, with most function calls expressed as AUIPC+JALR.

$ stat -f %z "/Applications/Google Chrome.app/Contents/Versions/56.0.2924.87/Google Chrome Framework.framework/Google Chrome Framework"
112012064
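To put numbers on the two call forms (a sketch; the constants are the architectural offset ranges from the base ISA, and the helper is purely illustrative):

```python
# Reach of the two RISC-V direct-call forms (a sketch; the constants are
# architectural offset ranges, not measured from any particular binary).

JAL_REACH = 1 << 20          # JAL: 21-bit signed offset, +/-1 MiB
AUIPC_JALR_REACH = 1 << 31   # AUIPC+JALR: 32-bit signed offset, ~+/-2 GiB

def call_needs_auipc_jalr(displacement: int) -> bool:
    """True if a call displacement exceeds JAL's +/-1 MiB reach."""
    return not (-JAL_REACH <= displacement < JAL_REACH)

# A call spanning the 112 MB text measured above is far out of JAL range:
print(call_needs_auipc_jalr(112012064))  # prints True
```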

JALR is, as we know, a register-indirect jump-and-link instruction, and register-indirect calls and returns are particularly hard for dynamic binary translation (my specific interest). A translator typically needs to inject a stub at the translation point that looks up the translation for the 'dynamic' target address; since the target address is not known at translation time, a translator can't always translate past indirect jumps (and obviously not past returns). There are some interesting techniques in the literature, such as stubs that learn a stable target address and later rewrite the indirect jump as a direct jump (for the indirect call case, but obviously not for return).
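A minimal sketch of that stub mechanism (all names hypothetical; a real translator's data structures are far more involved):

```python
# A sketch of a translator's indirect-jump stub (all names hypothetical).
# The stub maps a dynamic guest target to its translated host address; a
# one-entry inline cache "learns" a stable target so repeat hits skip the
# translation-cache lookup, mirroring the rewrite trick described above.

translation_cache = {}  # guest pc -> translated (host) pc

def translate_block(guest_pc):
    # Stand-in for real translation: derive a fake host address.
    return 0x100000 + guest_pc

def indirect_jump_stub(guest_target, inline_cache):
    # Fast path: this call site has resolved to the same target before.
    if inline_cache.get('guest') == guest_target:
        return inline_cache['host']
    # Slow path: translation-cache lookup, translating on a miss.
    if guest_target not in translation_cache:
        translation_cache[guest_target] = translate_block(guest_target)
    host = translation_cache[guest_target]
    inline_cache['guest'], inline_cache['host'] = guest_target, host
    return host
```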

Indirect jumps are also likely harder for microarchitectures, since a register read is required to obtain the target address for instruction prefetch; i.e. the jump target may resolve with higher latency, further down the pipeline, versus being decoded early as an immediate.

While JALR is technically a register-indirect jump, the fused adjacent combination of AUIPC+JALR can be seen as a direct PC-relative jump and link that also loads the target address (as a side effect). It can therefore be translated efficiently, and in a microarchitecture the prefetch of the jump-target instruction can start before the side-effect register is committed.
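Concretely, the fused target depends only on the pc of the AUIPC and the two immediates, so it can be formed entirely in decode. A sketch of the arithmetic (field widths per the base ISA; the helper names are mine):

```python
# Computing an AUIPC+JALR target purely from the two immediates (a sketch):
# the pair behaves like a direct PC-relative jump with a ~32-bit offset.

def sext(value, bits):
    """Sign-extend a bits-wide field."""
    mask = 1 << (bits - 1)
    return (value & (mask - 1)) - (value & mask)

def fused_target(pc, auipc_imm20, jalr_imm12):
    # AUIPC adds imm20 << 12 to the pc of the AUIPC itself; JALR then adds
    # its 12-bit signed offset, and the result's LSB is cleared per the spec.
    return (pc + (sext(auipc_imm20, 20) << 12) + sext(jalr_imm12, 12)) & ~1
```

For example, with the <strcmp> pair above (1576960 = 385 << 12), the target works out to pc + 1576960 - 164 for an even pc.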

The observation (a thought experiment) is that one of these AUIPC+JALR pairs can later be split, and the JALR can potentially be used as a ROP gadget: given enough diversity of offsets, one might be able to get a return address onto the stack pointing to an adjacent function, given a known value for the temporary (the t1 value left by the last indirect call in the code being exploited). From a binary translation perspective, the trace for the basic block at the split entry point would not exist; it would need to be re-translated starting at the JALR, and that JALR would be treated as an unfused indirect JALR and require a runtime translation stub.

auipc t1, pc + 1589248
1:
jalr ra, t1, 324 # <memset>



auipc t1, pc + 1576960
jalr ra, t1, -164 # <strcmp>
ret



jal x0, 1b # or ra value restored from xyz(sp)

I am just mentally questioning the safety of treating the AUIPC+JALR pair as a direct jump (with register side-effect) instead of as an indirect call.

With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.

Given that the register side effect is redundant in a fused-decode implementation, this leads to the possibility of an extension like this:

auipc zero, pc + 1576960
jalr ra, zero, -164 # <strcmp>

AUIPC with rd=zero is a NOP. I'm not suggesting this is a good idea; it just came to mind when considering the register side effect redundant in the fused variant, i.e. we just need to decode the immediate across two instructions.

Michael

Sober Liu

Mar 2, 2017, 11:12:52 PM3/2/17
to Michael Clark, RISC-V ISA Dev
I am not sure I fully get your idea, but are you expecting both code and data to be within a 32-bit range?
And for "a direct PC relative jump", do you mean static libs rather than dynamic libs?
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/367A58E1-7532-409F-AB6B-5A762411BC2E%40mac.com.


Michael Clark

Mar 2, 2017, 11:15:55 PM3/2/17
to Sober Liu, RISC-V ISA Dev
There are two meanings of "direct" here: direct PC-relative versus absolute indirect.

Michael Clark

Mar 2, 2017, 11:18:57 PM3/2/17
to RISC-V ISA Dev
The somewhat off-topic catalogue of possible translations for AUIPC+JALR is here:

https://github.com/michaeljclark/riscv-meta/blob/master/doc/src/jumps.md

I am discovering that there are many potential ways to translate AUIPC+JALR. RET (jalr zero, ra) may need a hidden stack containing pair<ra,translated_ra>, with a fallback hash-table lookup for misses, and indirect JALR through function pointers will need hash-table lookups: likely dozens of cycles for truly indirect calls. RET (jalr zero, ra) can be optimised by assuming a normal call stack is being used.
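A sketch of that hidden return stack (illustrative names only; a real translator must also handle shadow-stack overflow and re-entrancy):

```python
# Sketch of the hidden return stack: on a translated CALL, push
# (guest_ra, translated_ra); on RET, pop and check the guest ra matches,
# falling back to a hash-table lookup on a mismatch (e.g. after longjmp).

shadow_stack = []
fallback_table = {}  # guest ra -> translated ra

def translated_call(guest_ra, translated_ra):
    shadow_stack.append((guest_ra, translated_ra))
    fallback_table[guest_ra] = translated_ra

def translated_ret(guest_ra):
    if shadow_stack and shadow_stack[-1][0] == guest_ra:
        return shadow_stack.pop()[1]   # fast path: prediction hit
    shadow_stack.clear()               # stale stack; resync conservatively
    return fallback_table[guest_ra]    # slow path: hash-table lookup
```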

Michael Clark

Mar 2, 2017, 11:20:16 PM3/2/17
to Sober Liu, RISC-V ISA Dev

> On 3 Mar 2017, at 5:12 PM, Sober Liu <sob...@nvidia.com> wrote:
>
> I am not sure I fully get your idea, but are you expecting both code and data to be within a 32-bit range?

Yes, a 32-bit (±2 GiB) PC-relative range, as per the AUIPC+JALR pair.

> And for "a direct PC relative jump", do you mean static libs rather than dynamic libs?

Yes. I am thinking about the static case. I need to analyse GOT offset calls in dynamic libs. Next…

Michael Clark

Mar 2, 2017, 11:39:54 PM3/2/17
to Sober Liu, RISC-V ISA Dev
On 3 Mar 2017, at 5:20 PM, Michael Clark <michae...@mac.com> wrote:

> Yes. I am thinking about the static case. I need to analyse GOT offset calls in dynamic libs. Next…

Sorry I meant in dynamic libs. I am presently analysing vmlinux.

I will have to think about PLT stubs to GOT offsets and lazy resolution:

               1aec0:   00018e17             auipc          t3, pc + 98304
               1aec4:   880e3e03             ld             t3, -1920(t3)       # 0x0000000000032740
               1aec8:   000e0367             jalr           t1, t3, 0

We can assume a dynamic linker doesn’t unlink a GOT entry, and trace AUIPC+LD+JALR a few times (after the resolver has populated the GOT entry) to avoid a hash-table lookup in translated code. It would likely be possible, though, to make a test case that changes a GOT entry and demonstrates the processor is not a RISC-V but rather makes assumptions about jump targets. A translator should be able to pass such tests. This would be an interesting, tricky test.

Michael Clark

Mar 3, 2017, 12:04:20 AM3/3/17
to RISC-V ISA Dev
On 3 Mar 2017, at 5:39 PM, Michael Clark <michae...@mac.com> wrote:

> We can assume a dynamic linker doesn’t unlink a GOT entry, and trace AUIPC+LD+JALR a few times (after the resolver has populated the GOT entry) to avoid a hash-table lookup in translated code.

We would need to fault on writes to the GOT to make a translator pass tests, i.e. watch writes to pages containing function pointers referenced in translated code and recognise the pattern (AUIPC+LD+JALR). Quite complex to translate shared library calls efficiently. Sorry, diverging now.
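In outline, that write-watch scheme might look like this (all names hypothetical; a real implementation would rely on page protection and a fault handler rather than an explicit callback):

```python
# Sketch: cache a translated GOT-based call target, and invalidate the
# cache when a write lands in a watched GOT page (as a page-protection
# fault handler would report). Names here are illustrative only.

PAGE = 4096
got = {0x32740: 0x2a000}    # GOT slot address -> resolved function address
cached_targets = {}         # call-site pc -> cached resolved target
watched_pages = {0x32740 // PAGE}

def call_via_got(site_pc, got_slot):
    # Translated AUIPC+LD+JALR site: reuse the cached target if present.
    if site_pc not in cached_targets:
        cached_targets[site_pc] = got[got_slot]
    return cached_targets[site_pc]

def on_write(addr, value):
    # Guest store; if it hits a watched GOT page, drop all cached targets.
    got[addr] = value
    if addr // PAGE in watched_pages:
        cached_targets.clear()  # conservative invalidation
```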

Jacob Bachmeyer

Mar 3, 2017, 12:33:16 AM3/3/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
> With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.
>
> Given the register side effect is redundant in a fused decode implementation it leads to the possibility of an extension like this:
>
> auipc zero, pc + 1576960
> jalr ra, zero, -164 # <strcmp>
>
> AUIPC with rd=zero is a nop. I’m not suggesting this is a good idea; it just came to mind when considering the register side effect redundant with the fused variant. i.e. we just need to decode the immediate over two instructions.

No extension needed and perfectly consistent:

AUIPC ra, 1576960
JALR ra, ra, -164 # <strcmp>


In fact, this is the *only* way for AUIPC+JALR as a function call to be
a valid fusion pair. The example you give is a no-op followed by an
absolute jump of the type originally envisioned as an SBI call.

Remember: macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.


-- Jacob
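Jacob's clobber rule can be stated as a predicate over the pair's register fields (a sketch; register numbers follow the standard ABI, where ra is x1 and t1 is x6):

```python
# A sketch of the clobber rule for an AUIPC+JALR fusion pair: the pair is
# only fusible as a single-write macro-op if the JALR both consumes and
# overwrites the AUIPC's destination, i.e. auipc rd == jalr rs1 == jalr rd.

def fusible_single_write(auipc_rd, jalr_rd, jalr_rs1):
    return jalr_rs1 == auipc_rd and jalr_rd == auipc_rd

RA, T1 = 1, 6  # x1 (ra) and x6 (t1) in the standard ABI
print(fusible_single_write(RA, RA, RA))  # prints True: Jacob's version
print(fusible_single_write(T1, RA, T1))  # prints False: CALL leaves t1 live
```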

Michael Clark

Mar 3, 2017, 12:36:10 AM3/3/17
to jcb6...@gmail.com, RISC-V ISA Dev
Yes. Good idea. I like your version.

zero was just what immediately came to mind as a no side effect version. ra is perfect.

Michael Clark

Mar 3, 2017, 12:37:46 AM3/3/17
to jcb6...@gmail.com, RISC-V ISA Dev
In fact it’s such a good idea that CALL should emit it. The pseudo-instruction is currently hard-coded to use `t1`.

Michael Clark

Mar 3, 2017, 1:11:16 AM3/3/17
to jcb6...@gmail.com, RISC-V ISA Dev
On 3 Mar 2017, at 6:33 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Remember: macro-op fusion requires that all side-effects of earlier instructions be clobbered by later instructions in the fusion group.

I had been thinking about this liveness constraint earlier.

CALL is a non-invasive change, as nothing should depend on t1 for the no-argument version. The two-argument rd version of CALL would need to use rd as the temporary, and that change would be more invasive. gcc emits the no-argument version by default. TAIL can't really be changed, as we'd be clobbering ra. CALL is the common case.

Something like this in riscv-binutils-gdb:

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },

Allen J. Baum

Mar 3, 2017, 2:31:00 AM3/3/17
to Michael Clark, RISC-V ISA Dev
OK, I'm feeling dense.
I don't understand the statement:
macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.

Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement, but it could be an implementation requirement (since it saves register file ports, though perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could require that the first op of a fused pair can never cause a trap or exception; that would solve part of the problem. Or you could modify the statement to be specific about register side effects.

Regarding the zero case:
Macro-op fusion is nice to have, but not a guarantee in a dynamic sense.
It's possible that sometimes ops will be fused and other times the same ops won't be (or that some pairs of the same ops will be fused while other pairs won't). An obvious case is when the pair crosses a cache-line or page boundary, in which case using zero is broken.

Am I missing something?
--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Andrew Waterman

Mar 3, 2017, 3:10:04 AM3/3/17
to Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
We don't use rs1=ra here because that serves as a hint to pop the
return-address stack for some implementations.

Many instruction fusion opportunities will involve writing multiple
registers. For example, to reduce latency you'd also want to fuse
things like

lui t0, sym
ld t1, offset(t0)

which shows up in cases that t0 is later reused.

Superscalars typically over-provision write ports, so this is a matter
of control complexity, not extra datapath.

Bruce Hoult

Mar 3, 2017, 3:47:50 AM3/3/17
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
From the 2.1 spec:

"Return-address prediction stacks are a common feature of high-performance instruction-fetch units. We note that rd and rs1 can be used to guide an implementation’s instruction-fetch prediction logic, indicating whether JALR instructions should push (rd=x1), pop (rd=x0, rs1=x1), or not touch (otherwise) a return-address stack."

Here, rs1 is ra, but rd is not zero, so return/pop return buffer should not be assumed.

On the contrary, rd *is* ra, so call/push return buffer should be matched.

Seems ok to me, and in fact a very good idea.



Andrew Waterman

Mar 3, 2017, 4:05:52 AM3/3/17
to Bruce Hoult, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
My comment is consistent with the current spec's commentary: "push
(rd=x1/x5), pop (rs1=x1/x5), or not touch (otherwise)"

Both pushing and popping the RAS is useful for coroutines. Alpha has
a similar hint on its JSR instruction.

Bruce Hoult

Mar 3, 2017, 4:19:03 AM3/3/17
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
2.1 is the most recent spec on riscv.org. It's not the current spec?




Michael Clark

Mar 3, 2017, 7:10:04 AM3/3/17
to Andrew Waterman, Jacob Bachmeyer, RISC-V ISA Dev

> On 3/03/2017, at 9:09 PM, Andrew Waterman <and...@sifive.com> wrote:
>
> We don't use rs1=ra here because that serves as a hint to pop the
> return-address stack for some implementations.

I see. The stack hint is useful for handling fast indirect return. It would work if the pop hint was rs1=ra and rd=zero (assuming tail still uses rs1=t1).

> Many instruction fusion opportunities will involve writing multiple
> registers. For example, to reduce latency you'd also want to fuse
> things like
>
> lui t0, sym
> ld t1, offset(t0)
>
> which shows up in cases that t0 is later reused.

Yes, we can avoid the write of t0 if the register is killed near the site of this expression. I will check again, but I don't remember seeing the absolute load pattern (well, not in vmlinux; I saw it only for constants).

This is the form that we can safely fuse into a single load without looking too far ahead:

lui t0, sym
ld t0, offset(t0)

I do remember seeing the lui pattern where the register is reused, but that is in non-PIC/PIE code; of course there is also the equivalent auipc pattern for GOT references.

More complex fuse patterns might need an 8- to 12-instruction window. The problem is registers that are possibly not killed until after branches and jumps. The register must not be live, to avoid committing the partially formed address (when it is a temporary used to form an address for a load/store).
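The liveness condition in that last paragraph can be sketched as a forward scan (hypothetical IR: each instruction is an (op, rd, srcs) tuple; all names are mine):

```python
# A sketch of the liveness check: a pair that forms an address in a
# temporary may only have the temporary's write elided if that temporary
# is overwritten before any later use, scanning a small forward window
# and giving up conservatively at control flow.

def can_elide_temp(window, temp):
    """window: list of (op, rd, srcs) tuples following the candidate pair."""
    for op, rd, srcs in window:
        if temp in srcs:
            return False  # temp is read later: live, cannot elide the write
        if rd == temp:
            return True   # temp is overwritten before any use: dead
        if op in ('branch', 'jal', 'jalr'):
            return False  # control flow: be conservative, keep the write
    return False          # window exhausted without a kill: keep the write
```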

Jacob Bachmeyer

Mar 3, 2017, 5:23:04 PM3/3/17
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
Allen J. Baum wrote:
> OK, I'm feeling dense.
> I don't understand the statement:
> macro-op fusion requires that all side-effects of earlier
> instructions be clobbered by later instructions in the fusion group.
>
> Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement - but it could be an implementation requirement (since it saves register file ports - but perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could make a requirement that the first op of fused pair can never cause a trap or exception; that would solve part of the problem, or you could modify the statement to be specific about register side effects.

I was assuming that RISC-V implementations would do at most one regfile write per macro-op. Under this constraint, my statement is correct, but it has been mentioned that implementations complex enough to fuse AUIPC+JALR are probably complex enough to have multiple regfile write ports. Another requirement of macro-op fusion is that the result of a fusion group must be identical to the same instructions executed individually. Exceptions at the first instruction are easy: just trap on the entire group. Exceptions on subsequent instructions get complicated; I believe speculative execution has been offered as a solution to this.

> Regarding the zero case:
> Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
> It's possible that sometimes ops will be fused - and other times the same op won't be fused (or that some pair of ops will be fused while other pairs of the same ops won't be). An obvious case is when the pair cross a cache or page boundary - in which case using zero is broken.
>

Using zero in AUIPC+JALR is broken in any case, since the result of
macro-op fusion must be identical to the result of executing the
individual fused instructions in sequence.

> Am I missing something?

I do not believe so, but I clearly missed that there are side effects
other than regfile writes.


-- Jacob

Michael Clark

Mar 3, 2017, 5:35:47 PM3/3/17
to Andrew Waterman, Bruce Hoult, Jacob Bachmeyer, RISC-V ISA Dev
It’s unfortunate the pop constraint doesn’t also include rd=0. I guess keeping the call-stack hints for push and pop simple simplifies the call-stack implementation.

Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.

For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I would otherwise need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument; I would use the potential for microarchitectural optimisation of inter-module calls in large executables.

The trade-off, I guess, is between the simplicity of the pop constraint and whether or not the register-write elision on inter-module calls is worth it.


Jacob Bachmeyer

Mar 3, 2017, 6:35:42 PM3/3/17
to Michael Clark, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev
Michael Clark wrote:
> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess
> making the call stack hint for push and pop simple, simplifies the
> call stack implementation.

Andrew Waterman hinted at a more-important reason: coroutines need to
both push *and* pop the return address stack on the same instruction.

> Jacob’s version allows a register write elision and potentially
> decoding the CALL target immediate in early decode.

It does, but it also breaks the return-stack hints on JALR for
coroutines (which are not mentioned in spec v2.1).

> For me, from a binary translation perspective, it lets me elide a
> redundant mov of the target address that I will need to populate into
> a temporary. One less instruction. Although I wouldn’t use binary
> translation as an argument. I would use the potential for
> micro-architectural optimisation of inter-module calls in large
> executables.
>
> The trade-off I guess is between the pop constraint simplicity and
> whether or not the register write elision on inter-module calls is
> worth it.

Another factor is support for coroutines--the push constraint and pop
constraint must be compatible in that case. The return-address stack
hints in v2.1 do not meet this criterion and also do not mention using x5
as an alternate link register for millicode.

-- Jacob

Michael Clark

Mar 3, 2017, 7:29:42 PM3/3/17
to jcb6...@gmail.com, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev

> On 4 Mar 2017, at 12:35 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess making the call stack hint for push and pop simple, simplifies the call stack implementation.
>
> Andrew Waterman hinted at a more-important reason: coroutines need to both push *and* pop the return address stack on the same instruction.

Okay I understand now.

>> Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.
>
> It does, but it also breaks the return-stack hints on JALR for coroutines (which are not mentioned in spec v2.1).

Interesting trade off.

>> For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I will need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument. I would use the potential for micro-architectural optimisation of inter-module calls in large executables.
>>
>> The trade-off I guess is between the pop constraint simplicity and whether or not the register write elision on inter-module calls is worth it.
>
> Another factor is support for coroutines--the push constraint and pop constraint must be compatible in that case. The return-address stack hints in v2.1 do not meet this criteria and also do not mention using x5 as an alternate link register for millicode.

The call stack and coroutine hints will also be useful for binary translators, for fast coroutines and procedure returns.

So we can’t elide the AUIPC+JALR temporary unless we trace past the CALL and see whether it is used.

The microarchitecture can still decode the target address before the register file read if it looks at the opcodes and immediates of the two adjacent instructions; it just needs to write the jump-target register (optimally, once).

Michael.

Rogier Brussee

Mar 8, 2017, 4:35:05 PM3/8/17
to RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
For all intents and purposes

auipc ra, imm20
jalr ra, ra, imm11*

is indistinguishable from "jal ra, imm20imm11", except that such long immediates can't be encoded (the last bit of the jalr immediate is ignored), and it seems natural to handle them the same internally, i.e. after fusion. It should be by far the most common case (in fact, in Xcondensed I envisioned jalr_ra_ra imm11 as a 2-byte Xcondensed instruction). The normal return is jalr zero, ra, 0, and fwiw on many implementations that instruction is the compressed (C.jr ra) instruction.

In fact the coroutine case seems not to be the case to optimise for, and surely if simultaneously popping and pushing the call stack really is important, one can use a different calling convention for coroutines and do

auipc x5, imm20
jalr x5, x5, imm11*

I know this is swearing in church, but x5 was only added as a register that may manipulate the call stack in v2.1 (if I understood correctly, for implementing register-save-and-restore calls that would be called with jalr x5, zero, imm), and this is a quality-of-implementation issue. Can't the spec simply be updated to insist on jalr x0, ra, 0 (aka C.jr ra) or jalr rd, x5, imm to pop the return stack (so jalr x5, x5, imm would push and pop the return stack)?

Rogier

Op zaterdag 4 maart 2017 00:35:42 UTC+1 schreef Jacob Bachmeyer:

kr...@berkeley.edu

Apr 24, 2017, 3:02:00 AM4/24/17
to Rogier Brussee, RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com

We agree that the fusion case is more important to optimize for than the coroutine case, but we do want to support both, and to allow fused calls with one regfile write port when using the alternate link register (it was added for calling register-save code in the -Os case, so it will be frequently used in that code).

The new proposal is that push+pop is hinted only when rs1!=rd (and
rs1=x1/x5,rd=x1/x5), so

auipc ra, imm20; jalr ra, ra, imm11

can be fused with either x1 or x5, writes only a single value, and
only pushes the RAS.

Coroutines would have to use

jalr x1, x5, imm11
or
jalr x5, x1, imm11

to hint push+pop.

New text:
"JALR instructions should:
push only (rd=x1/x5, rs1!=x1/x5 or rs1=rd)
pop only (rs1=x1/x5, rd!=x1/x5),
push and pop (rd=x1/x5, rs1=x1/x5, and rs1!=rd)
or not touch (otherwise) a return-address stack."

Truth table:

rd       rs1      rd==rs1   RAS action
!x1/x5   !x1/x5   X         nothing
x1/x5    !x1/x5   X         push
x1/x5    x1/x5    0         push+pop
x1/x5    x1/x5    1         push
!x1/x5   x1/x5    X         pop

Krste
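The proposed rules transcribe directly into a small predicate mirroring the truth table (a sketch, with x1/x5 as the link registers):

```python
# Krste's proposed RAS hint rules, transcribed from the text above
# (x1 = ra, x5 = alternate link register).

LINK = {1, 5}

def ras_hint(rd, rs1):
    if rd in LINK and rs1 in LINK and rd != rs1:
        return 'push+pop'
    if rd in LINK:       # covers rs1 not a link register, and rs1 == rd
        return 'push'
    if rs1 in LINK:      # rd is not a link register
        return 'pop'
    return 'nothing'
```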

Bruce Hoult

Apr 24, 2017, 8:44:29 AM4/24/17
to Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Andrew Waterman, Jacob Bachmeyer
Perfect! Thank you. +1



Michael Clark

Apr 24, 2017, 8:16:42 PM4/24/17
to Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Andrew Waterman, Jacob Bachmeyer
Excellent!

We’ll need to test a change to the gcc CALL macro to emit the following, to take full advantage of the change, as per Jacob’s original suggestion:

1: AUIPC ra, %pcrel_hi(symbol)
JALR ra, %pcrel_lo(1b)(ra)

TAIL uses zero as the link register so it can’t have its address register write elided.

Something like this in binutils (needs testing):

mclark@minty:~/src/riscv-gnu-toolchain/riscv-binutils-gdb$ git diff

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },

Rogier Brussee

unread,
Apr 25, 2017, 4:07:20 PM4/25/17
to RISC-V ISA Dev, rogier....@gmail.com, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
That seems optimal!

Thanks Rogier


On Monday, 24 April 2017 at 09:02:00 UTC+2, krste wrote:

Andrew Waterman

unread,
Apr 26, 2017, 4:26:12 AM4/26/17
to Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Yeah, there's nothing to be done for TAIL.

Can you PR this change against the riscv-binutils github repo? I'm
pretty sure it will "just work," so we can test it in conjunction with
other toolchain improvements sometime in May.

Michael Clark

unread,
Apr 26, 2017, 4:28:20 AM4/26/17
to Andrew Waterman, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Hi Andrew,

No problem. Yes, it’s a simple change, but I’ll test compile the toolchain, compile something and do an objdump and then make a pull request…

Cheers,
Michael.

kr...@berkeley.edu

unread,
Apr 26, 2017, 5:09:43 AM4/26/17
to Andrew Waterman, Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer

Though the TAIL case can still be fused with a single register write,
given that JALR writes x0.

Krste

Rogier Brussee

unread,
Apr 26, 2017, 5:40:15 AM4/26/17
to RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, rogier....@gmail.com, and...@sifive.com, jcb6...@gmail.com
There are now effectively two canonical, potentially hardware-optimised, ways to do a call which can be replaced by JAL without observable effects (if the immediate is small enough): with ra == x1, and with t0 == x5, as the link register and temporary. The t0 version is supposed to be used for saving and restoring registers, but it seems more generally useful for guaranteed leaf calls (or, more generally, for calls with a calling convention where ra is callee-saved). I don't know how difficult it is to teach gcc "leaf_call" (or "call_ra_callee_saved") calls, but it seems like just another calling convention. They should also be useful for calling static leaf functions that do not leave file scope as function pointers; in any case, it seems like a useful self-documenting asm macro.

Perhaps call_absolute and leaf_call_absolute (with auipc replaced with lui) should also be canonicalised as macros?

Likewise, perhaps call_coroutine_ra and call_coroutine_t0 should also be canonicalised as macros (arguably with names that reflect that they trash t0 and ra, respectively)?


Rogier

On Tuesday, 25 April 2017 at 02:16:42 UTC+2, michaeljclark wrote:

Michael Clark

unread,
Jun 18, 2017, 12:57:55 PM6/18/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com

On 19 Jun 2017, at 1:09 AM, Rogier Brussee <rogier....@gmail.com> wrote:

Recently, the CALL macro has been changed in the assembly and ELF spec. I wondered whether it would not make sense to also change the TAIL macro from

1:
AUIPC t0, %pcrel_hi(symbol)
JALR  ra, %pcrel_lo(1b)(t0)

AUIPC t1, %pcrel_hi(symbol)
JALR  zero, %pcrel_lo(1b)(t1)

to

AUIPC t1, %pcrel_hi(symbol)
JALR  t1, %pcrel_lo(1b)(t1)

We take t1 = x6 so as to leave call stacks alone. Sure, this sets t1, but who cares: if call stacks are left alone it does no harm. It may even be useful for stack unwinding in exception handling and debugging to have a chance of knowing where a call came from, even if it is a tail call. The main point is that hardware then needs to match only one pattern for call fusion, as in the absence of tail calls the original tail-call == jump pattern would only be useful for very long jumps within a function, which should be too rare to be worth the trouble.

CALL no longer has a target address side effect, but changing TAIL has no benefit as we can’t eliminate one side effect like we could for CALL, and in fact it just introduces a different side effect.

There is nothing that can be elided, unlike with CALL, which previously had two side effects (ra and t1). In the fusion case for TAIL, we trade a write of the target address to t1 for a write of the link address. In fact, for simple implementations it introduces a redundant write (two writes and one read of t1, instead of the current single write and single read). I don’t think we should change TAIL.

Michael



Rogier Brussee

unread,
Jun 18, 2017, 1:39:39 PM6/18/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com
I thought I had deleted this message and put it in the thread “CALL macro in the assembler cookbook and ELF spec”, explaining the slightly non-obvious rationale better. Here is the gist:


The main point of the change would be that hardware would only need to match _one_ pattern for a macro-op fuse

AUIPC rd, imm1       # with rd != zero
JALR  rd, rd, imm2

mapping to the _standard_ JAL behaviour (except for the longer immediate, but including the side effect on call stacks) that covers _both_ the CALL and the TAIL use case.

It would make a macro-op fuse of

AUIPC rd, imm1       # with rd != zero
JALR  zero, rd, imm2

mapping to a strange _new_ "jump with 'link' the upper part of the destination address as output" superfluous, as the only remaining use case of the pattern would be very long (±2 GiB) jumps within a subroutine.

I recognise that in the non-fused case it requires two writes and could be a trifle more expensive.

Ciao
Rogier

On Sunday, 18 June 2017 at 18:57:55 UTC+2, michaeljclark wrote:

Michael Clark

unread,
Jun 18, 2017, 1:42:49 PM6/18/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com
Sorry to hijack this thread, but it prompted me to properly test the CALL macro register-elision case in the binary translator I am working on <https://rv8.io/>, which actually implements macro-op fusion today. I’ve recently enabled the macro-op fusion code as it is now stable. I need to write about it, because the current description belies the translator’s current capabilities: it is now at 2.4X QEMU performance (or over 4X QEMU ARM).

I have translations for 3 types of macro-op fusion below, including the register write elision that is made possible by the changes to the CALL macro. It is somewhat interesting as this is technically a working (albeit software) implementation of macro-op fusion.

The first two cases are JALR (indirect jumps) that due to macro-op fusion are able to be translated as direct jumps (the original plan as is the subject of this thread), or in the case of this hot path translator, are in fact inline cached.

The third case calculates an address separately before the JALR and the translator is not yet smart enough to optimise this case and it is treated as an indirect jump which requires a jump target cache lookup to find the translated address (also accelerated). In the case of a GOT load for something in the PLT, acceleration would require a comparison against the learned function address, to find the translated code address, e.g. if we wanted to accelerate shared library calls. It would be possible but quite some work.
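
The learned-target idea above can be sketched as a per-call-site inline cache (all names here are invented for illustration; `slow_lookup` stands in for the translator's jump-target-cache lookup):

```c
#include <stdint.h>

/* Per-call-site inline cache: remember the last guest target seen at
 * this site and its translation; fall back to a slower global lookup
 * on a miss. Guest address 0 is assumed never used (empty marker). */
typedef struct {
    uint64_t guest_target;   /* last seen guest address (0 = empty) */
    uint64_t host_target;    /* its cached translated address */
} call_site_cache;

/* Stand-in for the translator's jump-target-cache lookup. */
static uint64_t slow_lookup(uint64_t guest)
{
    return guest + 0x1000;
}

/* Resolve an indirect-call target, learning a new target on a miss. */
static uint64_t resolve(call_site_cache *c, uint64_t guest)
{
    if (c->guest_target != guest) {
        c->guest_target = guest;
        c->host_target = slow_lookup(guest);
    }
    return c->host_target;
}
```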

In any case when I started this thread I had not yet started the translator and was just thinking of possible optimisations. Now the translator is stable on quite complex codes and macro-op fusion works. Most of the deficiencies are now due to lack of syscall coverage in the user-mode translator.

Here is a test program for macro-op fusion, including the register elision case:

#include <stdio.h>

size_t add(size_t a, size_t b)
{
    return a + b;
}

int main()
{
    size_t total = 0;
    for (size_t i = 0; i < 1000; i++) {
#if defined (MACRO_FUSION)
        __asm__ __volatile__(
            "   mv a0, %1\n"
            "   mv a1, %2\n"
            "1: auipc t1, %%pcrel_hi(add)\n"
            "   jalr ra, %%pcrel_lo(1b)(t1)\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#elif defined (MACRO_FUSION_ELISION)
        __asm__ __volatile__(
            "   mv a0, %1\n"
            "   mv a1, %2\n"
            "1: auipc ra, %%pcrel_hi(add)\n"
            "   jalr ra, %%pcrel_lo(1b)(ra)\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#elif defined (MACRO_INDIRECT)
        __asm__ __volatile__(
            "   mv a0, %1\n"
            "   mv a1, %2\n"
            "1: auipc t1, %%pcrel_hi(add)\n"
            "   addi t1, t1, %%pcrel_lo(1b)\n"
            "   jalr ra, t1\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#else
        total = add(total, i);
#endif
    }
    printf("total=%lu\n", total);
    return 0;
}


macro fusion of AUIPC+JALR into a direct jump (inline cached during translation)

$ riscv64-unknown-elf-gcc -DMACRO_FUSION -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-macro.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-macro.o -o build/riscv64-unknown-elf/bin/test-fusion-macro
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-macro

Note: the translator has coalesced (or perhaps lifted) the AUIPC+JALR into a CALL (see 0x100c0) with two side effects. Notice the PC jumps by 8. The target address is written to rdi (t1) and the link address is written to rdx (ra), which is later compared at the RET (if it differs, which is unlikely, e.g. setjmp/longjmp, the code branches out of the trace). The call has been inlined by the JIT.

L2:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L3:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L4:
# 0x00000000000100c0 call t1, 0x188
mov rdi, 10248                          ; BF48020100
mov rdx, 100C8                          ; BAC8000100
L5:
# 0x0000000000010248 add         a0, a0, a1
add r8, r9                              ; 4D03C1
L6:
# 0x000000000001024a jalr        zero, ra, 0
cmp rdx, 100C8                          ; 4881FAC8000100
je L7                                   ; 0F84........
mov qword [rbp], 1024A                  ; 48C745004A020100
jmp L0                                  ; E9........
L7:
L8:
# 0x00000000000100c8 add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L9:
# 0x00000000000100ca addi        a5, a5, 1
add r13, 1                              ; 4983C501
L10:
# 0x00000000000100cc bne         a5, a4, pc - 16
cmp r13, r12                            ; 4D3BEC
short jne L2                            ; 75C7


macro fusion of AUIPC+JALR into a direct jump (inline cached during translation) with elision of the address temporary

$ riscv64-unknown-elf-gcc -DMACRO_FUSION_ELISION -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-elision.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-elision.o -o build/riscv64-unknown-elf/bin/test-fusion-elision
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-elision

Note: the translator has lifted the AUIPC+JALR into a CALL (see 0x100c0) and elided the address temporary rdi (t1). Only the link address is written to rdx (ra), which is later compared at the RET (if it differs, which is unlikely, e.g. setjmp/longjmp, the code branches out of the trace). The call has been inlined by the JIT.

L2:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L3:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L4:
# 0x00000000000100c0 call ra, 0x188
mov rdx, 100C8                          ; BAC8000100
L5:
# 0x0000000000010248 add         a0, a0, a1
add r8, r9                              ; 4D03C1
L6:
# 0x000000000001024a jalr        zero, ra, 0
cmp rdx, 100C8                          ; 4881FAC8000100
je L7                                   ; 0F84........
mov qword [rbp], 1024A                  ; 48C745004A020100
jmp L0                                  ; E9........
L7:
L8:
# 0x00000000000100c8 add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L9:
# 0x00000000000100ca addi        a5, a5, 1
add r13, 1                              ; 4983C501
L10:
# 0x00000000000100cc bne         a5, a4, pc - 16
cmp r13, r12                            ; 4D3BEC
short jne L2                            ; 75CC


macro fusion of LA with coalescing of the two writes for AUIPC+ADDI into a single write

$ riscv64-unknown-elf-gcc -DMACRO_INDIRECT -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-indirect.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-indirect.o -o build/riscv64-unknown-elf/bin/test-fusion-indirect
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-indirect

Note: the translator has lifted the AUIPC+ADDI into LA (see 0x100c0) and emits a single register write. It has moved the target address into the program counter backing store [rbp] and the link address into rdx (ra) and is unconditionally branching to the jump target cache stub (part of the JALR acceleration mechanism).

L2:
# 0x00000000000100cc add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L3:
# 0x00000000000100ce addi        a5, a5, 1
add r13, 1                              ; 4983C501
L4:
# 0x00000000000100d0 bne         a5, a4, pc - 20
cmp r13, r12                            ; 4D3BEC
jne L5                                  ; 0F85........
mov qword [rbp], 100D4                  ; 48C74500D4000100
jmp 7FFF02001000                        ; 40E900000000
L5:
L6:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L7:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L8:
# 0x00000000000100c0 la t1, 0x18c
mov rdi, 1024C                          ; BF4C020100
L9:
# 0x00000000000100c8 jalr        ra, t1, 0
mov qword [rbp], rdi                    ; 48897D00
mov rdx, 100CC                          ; BACC000100
jmp 7FFF02001000                        ; 40E900000000


Andrew Waterman

unread,
Jun 19, 2017, 1:52:08 AM6/19/17
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
On Sun, Jun 18, 2017 at 6:09 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
> Recently, the CALL macro has been changed in the assembly and ELF spec. I
> wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR ra, t0 %pcrel_hi(1b)
>
> to
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR t1 , t1 %pcrel_hi(1b)
>
> We take t1 = x6 so as to leave callstacks alone. Sure this sets t1, but who
> cares: if callstacks are left alone that does no harm. It may even be useful
> for stack unwinding for exception handling and debugging to have a chance to
> know where a call came from even if it is a tail call. The main point is
> that hardware needs only match one pattern for call fusion, as in the
> absence of tail calls, the original tail-call == jump pattern would only be
> useful for very very long jumps in a function which should be rare enough to
> be worth the trouble.

TAIL is currently defined as

auipc t1, ...
jalr x0, t1, ...

We already avoid using t0, to avoid messing up the call stacks.

DWARF information provides sufficient information to recover the call
graph. Furthermore, PLTs can destroy the value in t1, so this isn't
helpful in the general case.

Finally, some low-end unpipelined implementations will execute
JAL/JALR more slowly when rd != 0.

On balance, I don't support linking to t1 in the tail-call case.


Rogier Brussee

unread,
Jun 19, 2017, 6:42:23 AM6/19/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
TL;DR
With my proposed change to the TAIL macro, opcode fusion of TAIL _comes for free_ if you opcode-fuse CALL, unlike with the current definition of TAIL.

In the non-opcode-fused case on non-pipelined processors, however, the proposed TAIL macro may be a trifle more expensive than the current version.


On Monday, 19 June 2017 at 07:52:08 UTC+2, andrew wrote:
On Sun, Jun 18, 2017 at 6:09 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
> Recently, the CALL macro has been changed in the assembly and ELF spec. I
> wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR  ra, t0 %pcrel_hi(1b)
>
> to
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR t1 , t1 %pcrel_hi(1b)
>
> We take t1 = x6  so as to leave callstacks alone. Sure this sets t1, but who
> cares: if callstacks are left alone that does no harm. It may even be useful
> for stack unwinding for exception handling and debugging to have a chance to
> know where a call came from even if it is a tail call. The main point is
> that hardware needs only match one pattern for call fusion, as in the
> absence of tail calls, the original tail-call == jump pattern would only be
> useful for very very long jumps in a function which should be rare enough to
> be worth the trouble.

TAIL is currently defined as

  auipc t1, ...
  jalr x0, t1, ...

We already avoid using t0, to avoid messing up the call stacks.

OK my bad, but that was not really the point. 


DWARF information provides sufficient information to recover the call
graph.  

I see. Anyway, I should not even have mentioned this point.
 
Furthermore, PLTs can destroy the value in t1, so this isn't
helpful in the general case.
 

Finally, some low-end unpipelined implementations will execute
JAL/JALR more slowly when rd != 0.

This I recognise, and it is a real downside in the non-op-fused case, but would it really matter?


I repeat and slightly extend my answer to Michael Clark, because I may not have been sufficiently clear. My apologies in advance if I am merely being pedantic in spelling it out in excruciating detail.


The point of the change would be that hardware would only need to match _one_  pattern to macro-op fuse both CALL and TAIL


I assume that in a typical RV processor pipeline, the hardware equivalent of JAL does something like

JAL rd, imm20
--> Internal.JAL rd, SEXT(imm20)<<1

where Internal.JAL has semantics something like

Internal.JAL rd, immXLEN:

      if (rd == ra || rd == t0)
           CALLSTACK.PUSH(PC_next)    # the link address, i.e. the PC after the (possibly fused) instruction
      rd <-- PC_next
      PC <-- PC + immXLEN

For a CALL macro-op fuse you would then want to macro-op fuse

AUIPC rd, imm1       # with rd != zero
JALR  rd, rd, imm2

--> Internal.JAL rd, SEXT(imm1)<<12 + SEXT(imm2)

But that _also_ implements fusing the  proposed TAIL macro.

On the other hand, one can macro-op fuse the current TAIL macro

AUIPC rd, imm1       # with rd != zero
JALR  zero, rd, imm2

--> Internal.JALUD rd, SEXT(imm1)<<12 + SEXT(imm2)

but it is an additional pattern to match, and it necessarily involves a strange new, otherwise useless, internal JALUD ("jump and 'link' the upper part of the destination") instruction. The semantics of Internal.JALUD would be something like

Internal.JALUD rd, immXLEN:

      if (rd == ra || rd == t0)
           CALLSTACK.POP()

      rd <-- PC + immXLEN - SEXT(immXLEN & ((1 << 12) - 1))
      PC <-- PC + immXLEN

Even though in practice the rd value would never actually be used because of the calling convention, it still has to be set for correctness, and so we cannot reuse an internal J instruction that only jumps.
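
The identity behind that rd assignment can be sanity-checked in C (a sketch; the helper names are mine): since immXLEN = SEXT(imm1)<<12 + SEXT(imm2) and the first term has zero low 12 bits, subtracting SEXT(immXLEN & 0xfff) recovers PC + SEXT(imm1)<<12, exactly what AUIPC wrote.

```c
#include <stdint.h>

/* Sign-extend the low 12 bits of a value, avoiding signed-shift UB. */
static int64_t sext12(int64_t v)
{
    v &= 0xfff;
    return (v & 0x800) ? v - 0x1000 : v;
}

/* What AUIPC rd, imm1 writes to rd: PC + (imm1 << 12). */
static int64_t auipc_rd(int64_t pc, int32_t imm1)
{
    return pc + (int64_t)imm1 * 4096;
}

/* rd <-- PC + immXLEN - SEXT(immXLEN & ((1 << 12) - 1)) */
static int64_t jalud_rd(int64_t pc, int64_t imm_xlen)
{
    return pc + imm_xlen - sext12(imm_xlen);
}
```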

Ciao
Rogier
 

Michael Clark

unread,
Jun 19, 2017, 8:22:26 PM6/19/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Rogier,

I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.

It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:

{ auipc rd=x }, { jalr rd=ra, rs1=x }

The CALL macro explicitly sets rd to ra, and in my implementation I need to add a second macro-op pattern match for TAIL:

{ auipc rd=x }, { jalr rd=zero, rs1=x }

This is a small price. I still think the price of having two pattern matches in the macro-op fusion case is a better trade than giving all simple implementations the cost of the redundant write. That aside, it is novel. I had not thought about the rationale of having a single pattern match.
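
As a sketch of what the two matches amount to over raw instruction words (field positions per the base RV32I/RV64I encoding; the function names are invented):

```c
#include <stdint.h>

enum fuse { FUSE_NONE, FUSE_CALL, FUSE_TAIL };

/* Classify an adjacent instruction pair against the two fusion
 * patterns: CALL (jalr rd=ra) and TAIL (jalr rd=zero), requiring
 * the JALR's rs1 to be the AUIPC's rd. */
static enum fuse classify(uint32_t auipc, uint32_t jalr)
{
    if ((auipc & 0x7f) != 0x17)          /* AUIPC opcode */
        return FUSE_NONE;
    if ((jalr & 0x7f) != 0x67 ||         /* JALR opcode */
        ((jalr >> 12) & 0x7) != 0)       /* funct3 == 0 */
        return FUSE_NONE;
    uint32_t auipc_rd = (auipc >> 7) & 0x1f;
    uint32_t jalr_rd  = (jalr >> 7) & 0x1f;
    uint32_t jalr_rs1 = (jalr >> 15) & 0x1f;
    if (jalr_rs1 != auipc_rd || auipc_rd == 0)
        return FUSE_NONE;
    if (jalr_rd == 1) return FUSE_CALL;  /* rd = ra */
    if (jalr_rd == 0) return FUSE_TAIL;  /* rd = zero */
    return FUSE_NONE;
}
```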

Michael.


Andrew Waterman

unread,
Jun 19, 2017, 9:23:36 PM6/19/17
to Michael Clark, Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
On Mon, Jun 19, 2017 at 5:22 PM, Michael Clark <michae...@mac.com> wrote:
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion pattern now.
> That wasn’t clear to me, however I still think the JALR should avoid the
> redundant register write as simple implementations won’t be able to do
> anything about this extra write, which they don’t have now (and that was not
> the case for CALL), and macro-op implementations, being more sophisticated,
> are more able to bear the cost of having two patterns.
>
> It’s interesting that you point this out as my macro-fusion pattern for CALL
> is as follows:
>
> (auipc, rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra and in my implementation
>
> I need to add a second macro-op pattern match for TAIL:
>
> (auipc, rd=x }, { jalr rd=zero, rs1=x }
>
> This is a small price. I still think the price of having two pattern matches
> in the macro-op fusion case is a better trade than giving all simple
> implementations the cost of the redundant write.

This is still my POV. The main cost in fusion is supporting the first
pair; the incremental cost for additional patterns is minor by
comparison.

(Incidentally, the current TAIL fusion has some logic in common with
fusing AUIPC + SW, in that neither pair writes a second register, so
both need to write the AUIPC result to a register.)

Rogier Brussee

unread,
Jun 20, 2017, 3:31:54 AM6/20/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Michael and Andrew,

thanks for your response. I trust your judgement now that my point is clear, you have far more experience than I have. But some concluding remarks inline anyway. 

On Tuesday, 20 June 2017 at 02:22:26 UTC+2, michaeljclark wrote:
Hi Rogier,

I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.

 
TAIL is still just an assembly macro, so the linker could change it to the more simple-implementation-friendly version if needed. The two versions are semantically equivalent at the calling-convention level. The default just determines what the natural fuse-op pairs are.

 
It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:

(auipc, rd=x }, { jalr rd=ra, rs1=x }

The call macro explicitly sets rd to ra and in my implementation 

I figured that. It seemed to me that { auipc rd=x where (x != zero) }, { jalr rd=x, rs1=x } would be just as easy, because they would map uniformly to JAL with rd != zero and a 32-bit immediate.
 
I need to add a second macro-op pattern match for TAIL:

(auipc, rd=x }, { jalr rd=zero, rs1=x }


and a separate implementation, the internal "JALUD" instruction (see my response to Andrew). Andrew points out that it shares some logic with AUIPC + SW. I had not thought of that.

This is a small price.



 
I still think the price of having two pattern matches in the macro-op fusion case is a better trade than giving all simple implementations the cost of the redundant write. That aside, it is novel. I had not thought about the rationale of having a single pattern match.


I trust your judgements.

Ciao
Rogier

Andrew Waterman

unread,
Jun 20, 2017, 3:44:45 AM6/20/17
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
Thanks for making the suggestion, and for identifying the errors in the specs.


Albert Cahalan

unread,
Jun 20, 2017, 3:38:02 PM6/20/17
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
On 3/3/17, Allen J. Baum <allen...@esperantotech.com> wrote:

> Regarding the zero case:
> Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
> It's possible that sometimes ops will be fused - and other times the same op
> won't be fused (or that some pair of ops will be fused while other pairs of
> the same ops won't be). An obvious case is when the pair crosses a cache or
> page boundary - in which case using zero is broken.

If fusion is required, zero is perfectly fine when crossing a page boundary.
Fault on the address of the missing page, but without having advanced the
instruction pointer.

This is likely where you are headed anyway. It happened for ARM Thumb,
which did a similar thing for jumps. Originally, prior to Thumb2, all opcodes
were 16 bits in size. Jumps and calls would be preceded by an extra opcode
that would load 13 bits into a register. The assembler hid this, so it looked a
bit like a double-wide opcode, but you could actually split it and put other
opcodes in the middle. The two 16-bit halves of a jump ran separately. Well,
along comes Thumb2 and the need for more opcodes. Five of the bits in the
fused opcode pair were redundant if fusion were required, so they could be
reused for additional opcodes. Unfortunately, the bits were non-zero, so the
resulting encoding got really nasty. To extract a number from the opcode can
require pulling out 5 fields, XORing a couple of them together (!!!), and then
concatenating the bits in some screwy order.

So, to summarize: ARM Thumb also insisted on fixed-size instructions with
fusion to handle large offsets. This was a failure, and a side-effect of being
forced to give up on the idealism is a nasty instruction encoding. There was
also some slight compatibility breakage involving page fault addresses and
involving "improper" code that took advantage of the ability to put stuff into
the middle of a fused pair. It looks like you're headed toward the same mess.

Bruce Hoult

unread,
Jun 20, 2017, 4:38:06 PM6/20/17
to Albert Cahalan, Allen J. Baum, Michael Clark, RISC-V ISA Dev
While what you say about ARM and thumb2 is true [1], that is irrelevant to RISC-V. Fusion is NOT required.

In RISC-V fusing two instructions must *always* be optional, and the results precisely the same whether it happens or not.

The Thumb2 situation is relevant to another aspect of RISC-V, which is that both have variable-length instructions that need not be aligned or contained in a single VM page or cache line. In the case of Thumb2 this is any time the high three bits of the first 16 bits of an instruction are 0b111, indicating a 32-bit instruction. In the case of RISC-V this is any time the lowest two bits of the first 16 bits of an instruction are 0b11, indicating a 32-bit (or longer, if supported) instruction.

These two aspects are directly comparable.

Fused pairs of 16-bit RISC-V instructions are *not* comparable, because the first 16 bits indicate a valid 16-bit instruction, just as instructions with high bits of 0b111 were valid individual 16-bit instructions in Thumb1 (but not Thumb2 ... a forward incompatibility).

[1] I haven't verified the XOR etc details


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Albert Cahalan

unread,
Jun 20, 2017, 5:38:36 PM6/20/17
to Bruce Hoult, Allen J. Baum, Michael Clark, RISC-V ISA Dev
On 6/20/17, Bruce Hoult <br...@hoult.org> wrote:

> While what you say about ARM and thumb2 is true [1], that is irrelevant to
> RISC-V. Fusion is NOT required.

RISC-V and original Thumb are thus the same.

> In RISC-V fusing two instructions must *always* be optional, and the
> results precisely the same whether it happens or not.

Yep, that was how Thumb was originally specified.

> Fused pairs of 16 bit RISC-V instructions are *not* comparable, because the
> first 16 bits indicate a valid 16 bit instruction, just as instructions
> with high bits of 0x111 were valid individual 16 bit instructions in Thumb1
> (but not Thumb2 ... a forward incompatibility).

For RISC-V, a fused pair of AUIPC+JALR looks mighty comparable to the
situation with Thumb and Thumb2. At some point, as with ARM, you may
decide that the bits to represent JALR are partly redundant. As ARM did,
you will want to reclaim those bits by mandating that AUIPC+JALR be fused.
Then stuff that is now AUIPC+AUIPC for example can be interpreted as
some other new instruction which is completely unrelated to AUIPC.

(Divide opcodes into two sets, those that make sense after an AUIPC and
those that do not, and then for any X in the latter set make AUIPC+X a
completely unrelated instruction.)

This is a path you appear to be headed down. The encodings will be gross.

Bruce Hoult

unread,
Jun 20, 2017, 5:56:47 PM6/20/17
to Albert Cahalan, Allen J. Baum, Michael Clark, RISC-V ISA Dev
The results of re-interpreting two currently valid individual consecutive RISC-V instructions to mean something different and incompatible in future would indeed be gross.

Fortunately, it is NOT the path anyone is headed down.

It goes directly against the clearly stated philosophy that programs consisting of a frozen RISC-V instruction set (which C is on the verge of) will remain forward compatible forever.

It would have no purpose, as RISC-V already defines perfectly good 32 bit, 48 bit, and longer encodings with plenty of room for extension within the system, unlike Thumb1.

All anyone is discussing is advanced implementations opportunistically recognising certain instruction patterns in order to execute them in fewer uOps/cycles than a simpler implementation would. With 100% forward and backward compatibility. 


Rogier Brussee

unread,
Jun 21, 2017, 6:09:43 AM6/21/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Michael,


Op dinsdag 20 juni 2017 02:22:26 UTC+2 schreef michaeljclark:
Hi Rogier,

I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me; however, I still think the JALR should avoid the redundant register write, as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), while macro-op implementations, being more sophisticated, are better able to bear the cost of having two patterns.

It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:

{ auipc rd=x }, { jalr rd=ra, rs1=x }

The call macro explicitly sets rd to ra and in my implementation

On second thought, this pattern does not fit the (max) two input one output mould of a standard RV instruction as it has two outputs (ra and x as a clobber) and one input. This is no problem for the CALL macro (and I guess for your software implementation), but if you insist on fixing the link register to ra and want to stay in the general mould, the  fusion pattern would have to be

{auipc rd=ra }, { jalr rd=ra, rs1=ra }

ciao 

Rogier


 

I need to add a second macro-op pattern match for TAIL:

{ auipc rd=x }, { jalr rd=zero, rs1=x }



It would perhaps be interesting to know how, in your software implementation, the two fusions for CALL { auipc rd=ra }, { jalr rd=ra, rs1=ra } and TAIL { auipc rd=x }, { jalr rd=zero, rs1=x } (perhaps with x = t1 hardwired?) compare to a JAL with long immediate { auipc rd=x }, { jalr rd=x, rs1=x } fusion!

Michael Clark

unread,
Jun 21, 2017, 5:45:06 PM6/21/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
On 21 Jun 2017, at 10:09 PM, Rogier Brussee <rogier....@gmail.com> wrote:

Hi Michael,

Op dinsdag 20 juni 2017 02:22:26 UTC+2 schreef michaeljclark:
Hi Rogier,

I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.

It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:

{ auipc rd=x }, { jalr rd=ra, rs1=x }

The call macro explicitly sets rd to ra and in my implementation

On second thought, this pattern does not fit the (max) two input one output mould of a standard RV instruction as it has two outputs (ra and x as a clobber) and one input. This is no problem for the CALL macro (and I guess for your software implementation), but if you insist on fixing the link register to ra and want to stay in the general mould, the  fusion pattern would have to be

{auipc rd=ra }, { jalr rd=ra, rs1=ra }

Yes, you are correct. This is a concern for a micro-architecture with a single write port; however, it’s not really applicable to my case, which is a binary translator.

The binary translator has an implementation-specific constraint of a single explicit destination operand, due to the fused pseudo-ops sharing the decode structure of regular RISC-V ops, so I can only have one explicit destination operand. The pseudo-op in the case of the old CALL macro has an implicit link register operand that is always ra, and rs1 is the variable target address. I do this so I can coalesce the AUIPC and subsequent addition for the target address in the case of the old macro. Given this is an internal implementation detail that helps me improve codegen, I think it’s reasonable to do. Likewise, a micro-architecture that has more than one write port may have more fusion opportunities, e.g. address calculation for far LOADs, which have two outputs:

{auipc rd=x }, { ld rd=y, rs1=x }

An architecture with two write ports could do a far load in one fewer cycle. I think the 2-in 1-out constraint is really designed to be imposed on the ISA itself, not necessarily on fused instructions. The single-write-port micro-architecture constraint may only apply to conventional hardware implementations; it is a lower bound that the ISA enforces on implementations, not an upper bound on more sophisticated implementations that are capable of macro-fusion. We can reason about implicit (or more than one explicit) destination operands in custom fused sequences to make translation more efficient. In the case of far relative load/store, a store only requires one write port, but optimising a fused auipc+load requires two. I don’t know if anyone would do this - it requires a 4th operand lane in the reorder buffer for OoO (e.g. rs3 from FMA), but as a write operand, not a read operand. Someone might consider it to make far loads execute in one cycle, assuming it is a common enough sequence to be profitable.

Cheers,
Michael.

Jacob Bachmeyer

unread,
Jun 21, 2017, 6:31:41 PM6/21/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu
Rogier Brussee wrote:
> Hi Michael,
>
> Op dinsdag 20 juni 2017 02:22:26 UTC+2 schreef michaeljclark:
>
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion
> pattern now. That wasn’t clear to me, however I still think the
> JALR should avoid the redundant register write as simple
> implementations won’t be able to do anything about this extra
> write, which they don’t have now (and that was not the case for
> CALL), and macro-op implementations, being more sophisticated, are
> more able to bear the cost of having two patterns.
>
> It’s interesting that you point this out as my macro-fusion
> pattern for CALL is as follows:
>
> { auipc rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra and in my implementation
>
>
> On second thought, this pattern does not fit the (max) two input one
> output mould of a standard RV instruction as it has two outputs (ra
> and x as a clobber) and one input. This is no problem for the CALL
> macro (and I guess for your software implementation), but if you
> insist on fixing the link register to ra and want to stay in the
> general mould, the fusion pattern would have to be
>
> {auipc rd=ra }, { jalr rd=ra, rs1=ra }


Am I correct that we are assuming that millicode calls will always use
intramodule JAL?


-- Jacob

Jacob Bachmeyer

unread,
Jun 21, 2017, 6:49:50 PM6/21/17
to Albert Cahalan, Bruce Hoult, Allen J. Baum, Michael Clark, RISC-V ISA Dev
This is where RISC-V diverges from ARM -- *all* instructions make sense
after AUIPC, even AUIPC itself. (Example: loading pointers to multiple,
distant, PC-relative data objects.) The latter set that you describe is
empty.

Also, unlike ARM, we have a standard encoding for variable-length
instructions. Fused "not really" AUIPC+AUIPC gives two 20-bit immediates
for a 40-bit encoding space in a 64-bit instruction; a sub-minor opcode
(one of 2^12) in the standard 64-bit encoding gives a 45-bit encoding
space. There is no reason to even consider using AUIPC+AUIPC to perform
some other operation.


-- Jacob

Bruce Hoult

unread,
Jun 21, 2017, 6:54:55 PM6/21/17
to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, Krste Asanovic
There is a suggestion in commentary in section 2.5 of the RISC-V ISA specification that millicode could be called by JALR with x0 (zero) as the base register. This would require it to fit into the first and/or last 2 KB of the address space, which seems quite reasonable size-wise -- load/store-multiple function prologue/epilogue routines are under 200 bytes in total -- but this does not seem to have been pursued in actual millicode implementations so far.



Jacob Bachmeyer

unread,
Jun 21, 2017, 7:16:12 PM6/21/17
to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev, Krste Asanovic
Bruce Hoult wrote:
> There is a suggestion in commentary in section 2.5 of the RISC-V ISA
> specification that millicode could be called by JALR with ZERO as the
> base register. This would require it to fit into the first and/or last
> 2 KB of the address space, which seems quite reasonable size-wise --
> load/store multiple function prologue/epilogue routines are under 200
> bytes in total, but this does not seem to have been persued in actual
> millicode implementations so far.

That was also the first plan I saw for SBI linkage -- put the SBI entry
points in the topmost 2KiB and somehow arrange for the supervisor to
find which is which. I still favor it.

Putting millicode in the first page has the price that NULL pointers no
longer cause traps, but I suppose that page zero could be execute only
and address zero would hold a EBREAK instruction to catch calls through
NULL pointers. At least with the "S-mode cannot execute from user
pages" rule, supervisor calls through a NULL pointer are guaranteed to
still trap even if user code puts millicode in that page.

The important part is that millicode calls will not use AUIPC, otherwise
we would need another fusion pattern that recognizes the millicode link
register. Or we could generalize the fusion pattern for "far call" to
"far jump-and-link" as:

{ auipc rd=X }, { jalr rd=X, rs1=X }

This would have the same effects as independent AUIPC/JALR instructions,
including pushing a return stack if X is either x1 or x5.


-- Jacob

Michael Clark

unread,
Jun 21, 2017, 8:10:21 PM6/21/17
to jcb6...@gmail.com, Albert Cahalan, Bruce Hoult, Allen J. Baum, RISC-V ISA Dev
Agreed.

With respect to AUIPC, redundant calculations of the higher address part can be optimised away. e.g.

1: AUIPC a0, %pcrel_hi(sym)
LD a0, %pcrel_lo(1b)(a0) # can be fused and side effect can be elided
ADDI a0,a0,1
2: AUIPC a1, %pcrel_hi(sym) # can’t be eliminated because a0 is lost, but can be fused
SD a0, %pcrel_lo(2b)(a1)

The optimisation requires the load to use a register allocation where the high part of the address is preserved in another register:

1: AUIPC a1, %pcrel_hi(sym)
LD a0, %pcrel_lo(1b)(a1) # address temporary side effect can’t be elided as it’s later reused
ADDI a0,a0,1
SD a0, %pcrel_lo(1b)(a1)

The interesting thing to note is that the store is easier to fuse (there is only one register write for auipc+store versus two for fused auipc+load), but the opportunity is missed due to the optimisation, assuming simple fusion implementations have a sequential-instruction constraint. RISC-V has enough registers that rematerialization is not a win; however, fusion may make alternative codegen better on some future micro-architectures.

I would like to comment that other architectures have sequential-instruction constraints for their fusion implementations. For example, this x86 sequence is fused if a number of constraints are met (newer micro-architectures can fuse both signed and unsigned comparisons, but earlier ones were limited to unsigned):

CMP rax, rdx
JB foo

These instructions can only be fused if they are sequential. This fusion is useful because it means a RISC-V branch (BLTU in this case) can emit a sequence of two x86 instructions that is fused into a single micro-op, i.e. the internal micro-op architecture of x86 matches the RISC-V ISA. This is perhaps why binary translation of RISC-V performs better than ARM, which has to emulate condition codes (Bruce’s suggestion as to why QEMU ARM is slower than the much newer QEMU RISC-V).

Anecdotally it seems that x86 sign and zero extension need to immediately follow the op they are to be fused with. I tried an experiment of eliminating redundant sign extensions by moving them to the trace exit, but it slowed down even though there were fewer instructions, likely due to missed fusion opportunities. e.g.

SETNE al
MOVZX eax, al

or

ADD eax, edx
MOVSX rax, eax

Essentially, a MOVZX or MOVSX suffix can be fused, with the micro-op recording that the 32-bit operation’s result is to be sign extended instead of zero extended. These opcodes are listed as 0-1 cycles, so I guess they are 0 cycles in the case they are fused.

RISC-V zero extension requires detecting two shift operations, and it seems to be done near the use due to the ABI passing around sign-extended forms, e.g. a cast to unsigned int requires zero extension like this:

SLLI a1,a1,32
SRLI a1,a1,32

I haven’t looked at the frequency of such shift-shift pairs to know whether they are worthwhile optimisation candidates. In fact I think it may well be a worthwhile fusion pair on RV64, e.g.:

{ SLLI x,x,32 } { SRLI x,x,32 } -> movzx r10, r10d

It’s relatively easy for me to try this, along with auipc+load and auipc+store fusion…

Michael.

Jacob Bachmeyer

unread,
Jun 21, 2017, 9:33:45 PM6/21/17
to Michael Clark, Albert Cahalan, Bruce Hoult, Allen J. Baum, RISC-V ISA Dev
This optimization is implementation-dependent for hardware -- AUIPC/LD,
ADDI, AUIPC/SD would execute in three cycles, while AUIPC, LD, ADDI, SD
requires four cycles. Oddly enough, the latter sequence is faster on
hardware that doesn't fuse AUIPC/LD and AUIPC/SD, since the first
sequence needs five cycles without fusion. Binary translation, however,
should be able to recognize the equivalence of these sequences and
adjust the %pcrel_lo offsets to match.

> RISC-V Zero extension requires detecting two shift operations and seems to be done near use due to the ABI passing around sign extended forms. e.g. a cast to unsigned int requires zero extension like this.
>
> SLLI a1,a1,32
> SRLI a1,a1,32
>

As long as one of the "common" registers x8 - x15 is used, this sequence
fits in 32-bits with RVC. An RV64 hardware implementation could
recognize it and simply clear the upper half of the register, but RV128
will need to shift by 96 bits both ways. Interestingly, C.SRLI can
encode this, but C.SLLI cannot. The best for 32-bit zero extension on
RV128 is 48-bits: either a 32-bit SLLI or two C.SLLI, followed by C.SRLI.


-- Jacob

Michael Clark

unread,
Jun 21, 2017, 9:41:39 PM6/21/17
to jcb6...@gmail.com, Albert Cahalan, Bruce Hoult, Allen J. Baum, RISC-V ISA Dev

On 22 Jun 2017, at 1:33 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Agreed.

With respect to AUIPC, redundant calculations of the higher address part can be optimised away. e.g.

1: AUIPC a0, %pcrel_hi(sym)
LD a0, %pcrel_lo(1b)(a0)   # can be fused and side effect can be elided
ADDI a0,a0,1
2: AUIPC a1, %pcrel_hi(sym) # can’t be eliminated because a0 is lost, but can be fused
SD a0, %pcrel_lo(2b)(a1)

The optimisation requires the load to use a register allocation where the high part of the address is preserved in another register

1: AUIPC a1, %pcrel_hi(sym)
LD a0, %pcrel_lo(1b)(a1)  # address temporary side effect can’t be elided as its later reused
ADDI a0,a0,1
SD a0, %pcrel_lo(1b)(a1)
 

This optimization is implementation-dependent for hardware -- AUIPC/LD, ADDI, AUIPC/SD would execute in three cycles, while AUIPC, LD, ADDI, SD requires four cycles.  Oddly enough, the latter sequence is faster on hardware that doesn't fuse AUIPC/LD and AUPIC/SD, since the first sequence needs five cycles without fusion.  Binary translation, however, should be able to recognize the equivalence of these sequences and adjust the %pcrel_lo offsets to match.

Yes. The first case is 3 potentially fused ops, plus load store latency. I suspect -mtune will be required when there is silicon with fusion.

Rogier Brussee

unread,
Jun 22, 2017, 4:34:26 AM6/22/17
to RISC-V ISA Dev, br...@hoult.org, rogier....@gmail.com, kr...@berkeley.edu, jcb6...@gmail.com


Op donderdag 22 juni 2017 01:16:12 UTC+2 schreef Jacob Bachmeyer:
<snip>
 
The important part is that millicode calls will not use AUIPC, otherwise
we would need another fusion pattern that recognizes the millicode link
register.  Or we could generalize the fusion pattern for "far call" to
"far jump-and-link" as:

    { auipc rd=X }, { jalr rd=X, rs1=X }

This would have the same effects as independent AUIPC/JALR instructions,
including pushing a return stack if X is either x1 or x5.


 

A JAL with a 32-bit immediate fits nicely with the rest of the RV ISA, and this fusion is the natural way to express it.

E.g. with X = t1 it implements TAIL :-) . 
 
-- Jacob

Rogier Brussee

unread,
Jun 22, 2017, 4:55:48 AM6/22/17
to RISC-V ISA Dev, michae...@mac.com, acah...@gmail.com, br...@hoult.org, allen...@esperantotech.com, jcb6...@gmail.com


Op donderdag 22 juni 2017 03:33:45 UTC+2 schreef Jacob Bachmeyer:
As long as one of the "common" registers x8 - x15 is used, this sequence
fits in 32-bits with RVC.  An RV64 hardware implementation could
recognize it and simply clear the upper half of the register, but RV128
will need to shift by 96 bits both ways.  Interestingly, C.SRLI can
encode this, but C.SLLI cannot.  The best for 32-bit zero extension on
RV128 is 48-bits:  either a 32-bit SLLI or two C.SLLI, followed by C.SRLI.


 
That seems to be just an unfortunate formulation in the RVC spec that should be clarified. I think the intention of _all_ the shift immediates is to be signed 6-bit numbers, with -32 <= imm == shamt < 0 or 0 < imm == shamt <= 31, and for RV128 the special case imm = 0 -> shamt = 64 == -64.

 
-- Jacob

Andrew Waterman

unread,
Jun 22, 2017, 5:20:37 AM6/22/17
to Rogier Brussee, RISC-V ISA Dev, Michael Clark, Albert Cahalan, Bruce Hoult, Allen Baum, Jacob Bachmeyer
This all presupposes that RV128 systems are going to spend a lot of
time zero-extending 32-bit quantities to 128 bits. In RV64I, it
primarily shows up in bad C code that doesn't use size_t to index
arrays, or in ancient bad code that supposed unsigned types were
faster than signed ones.

If this is the biggest problem facing RV128C, we are in great shape.

Michael Clark

unread,
Jun 25, 2017, 4:13:41 AM6/25/17
to Jacob Bachmeyer, Albert Cahalan, Bruce Hoult, Allen J. Baum, RISC-V ISA Dev
On 22 Jun 2017, at 1:33 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

RISC-V Zero extension requires detecting two shift operations and seems to be done near use due to the ABI passing around sign extended forms. e.g. a cast to unsigned int requires zero extension like this.

SLLI a1,a1,32
SRLI a1,a1,32
 

As long as one of the "common" registers x8 - x15 is used, this sequence fits in 32-bits with RVC.  An RV64 hardware implementation could recognize it and simply clear the upper half of the register, but RV128 will need to shift by 96 bits both ways.  Interestingly, C.SRLI can encode this, but C.SLLI cannot.  The best for 32-bit zero extension on RV128 is 48-bits:  either a 32-bit SLLI or two C.SLLI, followed by C.SRLI.

I implemented the zero-extend fusion pattern for both compressed and non-compressed opcodes (any combination), which is one pattern because I pattern match after decompressing into canonical opcodes:

SLLI a1,a1,32
SRLI a1,a1,32

Now I’ve noticed new and interesting patterns.

Like this one: zero-extended add register immediate (and potentially the register-register form too):

# 0x0000000000010994 addiw       a1, a1, -1
add r9d, -1                                 ; 4183C1FF
movsxd r9, r9d                          ; 4D63C9
L134:
# 0x0000000000010996 zext a1
movzx r9, r9d                           ; 4D0FB7C9
L135:
# 0x000000000001099a addi        a1, a1, 1
add r9, 1                               ; 4983C101
L136:

You can see it has matched a compressed sequence as the PC increments by 4 on zext.

So now I am going to pattern match “zero extended 32-bit add”, a 48-bit fusion pattern:

ADDIW a1,a1, n
SLLI a1,a1,32
SRLI a1,a1,32

ADDIWZ will translate to this:

add r9d, -1                                 ; 4183C1FF

Given the amount of code that has int or unsigned int as loop induction variables, I wouldn’t be surprised if these patterns are relatively common.

This is inside an AES cipher so it may make a noticeable difference. A lot of ciphers use 32-bit unsigned integers.

Michael Clark

unread,
Jun 25, 2017, 4:21:26 AM6/25/17
to Jacob Bachmeyer, Albert Cahalan, Bruce Hoult, Allen J. Baum, RISC-V ISA Dev
On 25 Jun 2017, at 8:13 PM, Michael Clark <michae...@mac.com> wrote:


On 22 Jun 2017, at 1:33 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

RISC-V Zero extension requires detecting two shift operations and seems to be done near use due to the ABI passing around sign extended forms. e.g. a cast to unsigned int requires zero extension like this.

SLLI a1,a1,32
SRLI a1,a1,32
 

As long as one of the "common" registers x8 - x15 is used, this sequence fits in 32-bits with RVC.  An RV64 hardware implementation could recognize it and simply clear the upper half of the register, but RV128 will need to shift by 96 bits both ways.  Interestingly, C.SRLI can encode this, but C.SLLI cannot.  The best for 32-bit zero extension on RV128 is 48-bits:  either a 32-bit SLLI or two C.SLLI, followed by C.SRLI.

I implemented the zero extend fusion pattern; for both compressed and non-compressed opcodes (any combination); which is one pattern because I pattern match after decompressing into canonical opcodes:

SLLI a1,a1,32
SRLI a1,a1,32

Now I’ve noticed new and interesting patterns.

This this one; zero extended add register immediate (and potentially the register register form too):

# 0x0000000000010994 addiw       a1, a1, -1
add r9d, -1                                 ; 4183C1FF
movsxd r9, r9d                          ; 4D63C9
L134:
# 0x0000000000010996 zext a1
movzx r9, r9d                           ; 4D0FB7C9
L135:
# 0x000000000001099a addi        a1, a1, 1
add r9, 1                               ; 4983C101
L136:

You can see it has matched a compressed sequence as the PC increments by 4 on zext.

Note: It should be zext.w to be consistent with the sext.w pseudo.

So now I am going to pattern match “zero extended 32-bit add”, a 48-bit fusion pattern:

ADDIW a1,a1, n
SLLI a1,a1,32
SRLI a1,a1,32

ADDIWZ will translate to this:

add r9d, -1                                 ; 4183C1FF

Given the amount of code that has int or unsigned int as loop induction variables, I wouldn’t be surprised if these patterns are relatively common.

This is inside an AES cipher so it may make a noticeable difference. A lot of ciphers use 32-bit unsigned integers.


Andrew Waterman

unread,
Jun 25, 2017, 4:46:30 AM6/25/17
to Jacob Bachmeyer, Michael Clark, Albert Cahalan, Allen J. Baum, Bruce Hoult, RISC-V ISA Dev
Many of these represent missed optimization opportunities in the compiler; it may be premature to optimize implementations around bad code generation. Understanding and fixing this in GCC will benefit all implementations.


Michael Clark

unread,
Jun 25, 2017, 5:10:03 AM6/25/17
to Andrew Waterman, Jacob Bachmeyer, Albert Cahalan, Allen J. Baum, Bruce Hoult, RISC-V ISA Dev
I will try to find the lines of code so we can look at the RTL and have context for the codegen in question.

However, I did have a discussion with someone regarding upcasting unsigned integers and the trade-off between sign extending and zero extending, given that zero extension may indeed be necessary, valid, and not a missed optimisation. I'll find the code so we can have a reasoned discussion about what's going on...

There are three distinct issues here. One is that there may very well be an issue with the codegen; the second is how to handle valid cases of zero-extended 32-bit additions on RISC-V, which may simply require additional instructions; and the third (less important for RISC-V) is my case of mapping to an architecture that zero extends by default (and has zero-cost sign-extension instructions that can be fused with the preceding ALU op). It could be that I need to detect this case and what the compiler is doing is fine.

It makes one pause to think about zero-cost zero extension, given that architectures with zero-extension semantics have implemented zero-cost sign extension as a fusion suffix.

Sent from my iPhone

Bruce Hoult

unread,
Jun 25, 2017, 5:33:50 AM6/25/17
to Andrew Waterman, Jacob Bachmeyer, Michael Clark, Albert Cahalan, Allen J. Baum, RISC-V ISA Dev
It's very unfortunate that the more or less universal choice of LP64 has made int a bad choice for loop variables used to index arrays, as it's a royal PITA to have to remember to put ptrdiff_t everywhere. And very very few individual arrays ever actually have more than 2B elements. If it wasn't for @#$%^ LLP64 Windows then "long" would be a perfectly fine choice.

The good thing about "int" though, as far as I can tell, is that since it is UB for the int to wrap, it's perfectly legal for the compiler to silently replace int with long/ptrdiff_t, so it really is just a question of quality of optimisation.


Rogier Brussee

unread,
Jun 25, 2017, 7:25:54 AM6/25/17
to RISC-V ISA Dev, jcb6...@gmail.com, acah...@gmail.com, br...@hoult.org, allen...@esperantotech.com


Op zondag 25 juni 2017 10:13:41 UTC+2 schreef michaeljclark:

Now I’ve noticed new and interesting patterns.

This this one; zero extended add register immediate (and potentially the register register form too):

# 0x0000000000010994 addiw       a1, a1, -1
add r9d, -1                                 ; 4183C1FF
movsxd r9, r9d                          ; 4D63C9
L134:
# 0x0000000000010996 zext a1
movzx r9, r9d                           ; 4D0FB7C9
L135:
# 0x000000000001099a addi        a1, a1, 1
add r9, 1                               ; 4983C101
L136:

You can see it has matched a compressed sequence as the PC increments by 4 on zext.

So now I am going to pattern match “zero extended 32-bit add”, a 48-bit fusion pattern:

ADDIW a1,a1, n
SLLI a1,a1,32
SRLI a1,a1,32

ADDIWZ will translate to this:

add r9d, -1                                 ; 4183C1FF

Given the amount of code that has int or unsigned int as loop induction variables, I wouldn’t be surprised if these patterns are relatively common.


addw and addiw do a perfectly good job of adding uint32_t (i.e. numbers mod 2^32), because uint32_t and int32_t are both represented as sign-extended XLEN-sized integers. It is only when adding them to (u)int64_t or pointers (i.e. numbers mod 2^64) that you get a problem. Hence I think you actually want to pattern match


SLLI a1,a1,32
SRLI a1,a1,32
ADDI a1,a1, n
 
(or, in full generality and RV128-clean:

addzxwi rd, rs1, imm   (add zero-extended word to XLEN immediate)
SLLI rd, rs1, -32
SRLI rd, rd, -32
ADDI rd, rd, imm
)

which also shows up in your 

Michael Clark

unread,
Jun 25, 2017, 7:01:18 PM6/25/17
to Andrew Waterman, Jacob Bachmeyer, Albert Cahalan, Allen J. Baum, Bruce Hoult, RISC-V ISA Dev
Hi Andrew,

The weird zero extend is around the loop exit in two places here:


It’s weird because the loop induction variable is an int, assuming this is r being adjusted at loop exit. I don’t know why 1 is subtracted, the variable zero extended then 1 added. I guess it is being normalised to zero extended form at loop exit (which would be the compiler fighting against the machine semantics)? It gets called for every block which is every 16 bytes of data in a loop over a 32MiB buffer so it affects the profile slightly.

I’ve confirmed the double shift is being produced with two compiler snapshots from April 3rd and June 8th.

$ /opt/riscv/toolchain-master-20170608/bin/riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (GCC) 7.1.1 20170509

$ /opt/riscv/toolchain-master-20170608/bin/riscv64-unknown-elf-gcc -Os -g -march=rv64imafdc -Wall -fpie -ffunction-sections -fdata-sections src/test/test-aes.c -S -o build/riscv64-unknown-elf/asm/test-aes.s -dP -dA -dD

If you check out the rv8 repo, you can do a “make test-build” and assuming there is a toolchain in your path it will build all the samples I am testing with.

Michael.

# SUCC: 3 [100.0%]  (DFS_BACK,CAN_FALLTHRU)
        # src/test/test-aes.c:894
        .loc 1 894 0
#(jump_insn 730 775 731 (set (pc)
#        (label_ref 411)) "src/test/test-aes.c":894 205 {jump}
#     (nil)
# -> 411)
        j       .L16    # 730   jump    [length = 4]
.LVL49:
# BLOCK 5 freq:1500 seq:3
# PRED: 3 [15.0%]  (CAN_FALLTHRU,LOOP_EXIT)
.L15:
#(insn 416 607 661 (set (reg:SI 11 a1 [690])
#        (plus:SI (reg:SI 11 a1 [408])
#            (const_int -1 [0xffffffffffffffff]))) 3 {addsi3}
#     (nil))
        addiw   a1,a1,-1        # 416   addsi3/2        [length = 4]
#(insn 661 416 662 (set (reg:DI 11 a1 [691])
#        (ashift:DI (reg:DI 11 a1 [690])
#            (const_int 32 [0x20]))) 149 {ashldi3}
#     (nil))
        slli    a1,a1,32        # 661   ashldi3 [length = 4]
#(insn 662 661 418 (set (reg:DI 11 a1 [691])
#        (lshiftrt:DI (reg:DI 11 a1 [691])
#            (const_int 32 [0x20]))) 151 {lshrdi3}
#     (nil))
        srli    a1,a1,32        # 662   lshrdi3 [length = 4]
#(insn 418 662 419 (set (reg:DI 11 a1 [692])
#        (plus:DI (reg:DI 11 a1 [691])
#            (const_int 1 [0x1]))) 4 {adddi3}
#     (nil))
        addi    a1,a1,1 # 418   adddi3/2        [length = 4]
#(insn 419 418 420 (set (reg:DI 11 a1 [693])
#        (ashift:DI (reg:DI 11 a1 [692])
#            (const_int 5 [0x5]))) 149 {ashldi3}
#     (nil))
        slli    a1,a1,5 # 419   ashldi3 [length = 4]
#(insn 420 419 423 (set (reg/v/f:DI 10 a0 [orig:319 rk ] [319])
#        (plus:DI (reg/v/f:DI 10 a0 [orig:344 rk ] [344])
#            (reg:DI 11 a1 [693]))) 4 {adddi3}
#     (expr_list:REG_DEAD (reg:DI 11 a1 [693])
#        (nil)))
        add     a0,a0,a1        # 420   adddi3/1        [length = 4]

Andrew Waterman

unread,
Jun 26, 2017, 1:56:48 AM6/26/17
to Michael Clark, RISC-V ISA Dev
I don't think I'll have time to look into this in depth for a while,
but there does seem to be a pattern here.

In general, GCC seems not to know that conversion from uint32 to
uint64 is a no-op when it knows that bit 31 is clear. In your test
case, this shows up several times (e.g., it right-shifts a uint32 by
24, so bit 31 must be clear, but it still emits SLLI/SRLI to
redundantly clear the upper 32 bits). Solving this problem in our GCC
port should get rid of the bulk of these extra instructions.
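A toy model of that known-bits reasoning (illustrative only; the types and names are mine, not GCC's or LLVM's actual internals):

```c
#include <stdint.h>
#include <assert.h>

/* Toy known-zero-bits propagation: known_zero has a bit set for every
 * bit of the tracked value that is provably zero. */
typedef struct { uint64_t known_zero; } bits_t;

/* A uint32 held in a 64-bit register: upper 32 bits provably zero. */
static bits_t from_u32(void)
{
    return (bits_t){ 0xFFFFFFFF00000000ULL };
}

/* Logical shift right: the top sh bits become provably zero. */
static bits_t srl(bits_t a, unsigned sh)
{
    return (bits_t){ (a.known_zero >> sh) | ~(~0ULL >> sh) };
}

/* The SLLI/SRLI zero-extension is redundant iff bit 31 is provably 0. */
static int zext_redundant(bits_t a)
{
    return (a.known_zero & 0x80000000ULL) != 0;
}
```

E.g. a uint32 right-shifted by 24 has bit 31 provably clear, so the two-shift zero extension that follows it is redundant, matching the case described above.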

Michael Clark

unread,
Jun 26, 2017, 3:12:52 PM6/26/17
to Andrew Waterman, RISC-V ISA Dev
No worries. I may investigate a little further if I get time. It is confusing because the IR-annotated asm for this case appears in the loop-exit basic block; the subtract of 1 corresponds to the decrement of the loop induction variable, which is an int. I'll see if I can dump information from earlier passes to make it clearer, as the IR has already been lowered to a point where some of the context linking it to the original lines of C code has been lost, which makes it harder to analyse.


I know the loop induction variable analysis is called scalar evolution, but I don't know what this kind of known-bits tracking is called internally within gcc. As you mention, if the compiler is able to prove that bit 31 is not set by any previous operation, then it can promote to a wider unsigned type without zero extending. In any case, the zero extension does seem like a redundant operation, as the round temporary variables are unsigned int and the loop induction variable is int. i.e. all variables in that block of code besides pointers are signed or unsigned 32-bit and not apparently promoted, and we can also be certain it's not a pointer variable.

Bruce Hoult

unread,
Jun 26, 2017, 3:21:39 PM6/26/17
to Michael Clark, Andrew Waterman, RISC-V ISA Dev
llvm has the computeKnownBits() function -- and spends an inordinate amount of time in it, overall. But I don't know the gcc internals.




Michael Clark

unread,
Jun 26, 2017, 6:22:53 PM6/26/17
to Andrew Waterman, RISC-V ISA Dev, Bruce Hoult
I can confirm it is the LOOP_EXIT block because when I compile with -DAES_FULL_UNROLL the compiler does not emit the zero extension.

I’ve fully analysed it. There are two scalars that evolve in the loop, one of them is const u32 rk[] and the other is int r.

void aes_rijndael_encrypt(const u32 rk[], int Nr, const u8 pt[16], u8 ct[16])
{
/* Nr - 1 full rounds: */
r = Nr >> 1;
for (;;) {
ROUND(1,t,s);
rk += 8;
if (--r == 0)
break;
ROUND(0,s,t);
}
}

The loop expression has rk += 8; however, SCEV has determined an expression to adjust rk in one go at loop exit, in terms of int Nr (a signed 32-bit integer): it has turned it into subtract 1 from Nr, zero extend, shift left by 5. e.g.

rk += (Nr-1) << 5;

Given int Nr is passed in a register the compiler can’t really make any assumptions about bit 31. I looked closer at the IR below and spotted the reference to rk, and it is now clear it has to promote int Nr -1 to unsigned 64-bit (zext.w) before adding it to rk, which is a pointer.

Due to RISC-V’s large supply of registers, a0 and a1 still map to the argument registers which is nice.

I think the compiler is doing the right thing here.

Michael Clark

unread,
Jun 26, 2017, 6:29:05 PM6/26/17
to Andrew Waterman, RISC-V ISA Dev, Bruce Hoult
On 27 Jun 2017, at 10:22 AM, Michael Clark <michae...@mac.com> wrote:

I can confirm it is the LOOP_EXIT block because when I compile with -DAES_FULL_UNROLL the compiler does not emit the zero extension.

I’ve fully analysed it. There are two scalars that evolve in the loop, one of them is const u32 rk[] and the other is int r.

void aes_rijndael_encrypt(const u32 rk[], int Nr, const u8 pt[16], u8 ct[16])
{
/* Nr - 1 full rounds: */
r = Nr >> 1;
for (;;) {
ROUND(1,t,s);
rk += 8;
if (--r == 0)
break;
ROUND(0,s,t);
}
}

The loop expression has rk += 8; however, SCEV has determined an expression to adjust rk in one go at loop exit, in terms of int Nr (a signed 32-bit integer): it has turned it into subtract 1 from Nr, zero extend, shift left by 5. e.g.

rk += (Nr-1) << 5;

semi-correction (accounting for pointer arithmetic):

rk += (Nr-1) << 3;

/* or 5 if rk is uintptr_t which it is in a register */


Michael Clark

unread,
Jun 26, 2017, 7:53:07 PM6/26/17
to Andrew Waterman, RISC-V ISA Dev, Bruce Hoult
So I think we should consider ZEXT.W as a fusion pair on RV64. It was a win.

ZEXT.W (rv64)

SLLI a1,a1,32
SRLI a1,a1,32


Fusing ADDIW followed by ZEXT.W was a win too.

ADDIW.ZX (rv64)


ADDIW a1,a1, n
SLLI a1,a1,32
SRLI a1,a1,32


ADDIW.ZX let me remove a redundant MOVSX, MOVZX pair; however, eliminating both explicit sign- and zero-extend instructions introduced a stall for a subsequent 64-bit operation using a register holding a 32-bit result. As a result, ADDIW.ZX does a 32-bit add followed by MOVZX, versus ADDIW, which does a 32-bit add followed by MOVSX. It seems the Intel ALU can either zero- or sign-extend 32-bit adds via fusion of MOVSX/MOVZX suffixes. Any architecture that implements 48-bit decode could use a 48-bit ADDIW.ZX pattern to supply a zero-extend bit to the ALU, potentially converting 3-cycle sequences into 1-cycle (ADDIW.ZX) or 2-cycle (ZEXT.W) ops. I guess the fusion pattern matches in the decoder would need to override the instruction length and function, as it is similar to an RVC op where rd=rs1.

I’m done with fusion for the moment. The JIT codegen at -O3 is now averaging between a 2.0X slowdown (RV32) and a 2.5X slowdown (RV64) over native on x86 and x86_64 respectively, with best cases where the slowdown is only 1.4X. It was your mentioning that the difficulty is mainly in adding the first pair that made me hunt for some simple fusion opportunities.
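The matching itself is cheap; a sketch of the kind of peephole a translator front end could run over decoded instructions (the struct and enum here are illustrative, not rv8's actual IR):

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative decoded-instruction record (not rv8's actual IR). */
enum op { SLLI, SRLI, ADDIW, ZEXT_W, ADDIW_ZX, OTHER };

struct insn { enum op op; uint8_t rd, rs1; int32_t imm; };

/* Try to fuse a macro-op starting at in[i]; writes the (possibly fused)
 * op to *out and returns how many source instructions were consumed.
 * Both patterns require rd == rs1 throughout, like an RVC op. */
static int fuse(const struct insn *in, int n, int i, struct insn *out)
{
    /* ADDIW x,x,k ; SLLI x,x,32 ; SRLI x,x,32  =>  ADDIW.ZX x,x,k */
    if (i + 2 < n && in[i].op == ADDIW && in[i].rd == in[i].rs1 &&
        in[i+1].op == SLLI && in[i+1].imm == 32 &&
        in[i+1].rd == in[i].rd && in[i+1].rs1 == in[i].rd &&
        in[i+2].op == SRLI && in[i+2].imm == 32 &&
        in[i+2].rd == in[i].rd && in[i+2].rs1 == in[i].rd) {
        *out = (struct insn){ ADDIW_ZX, in[i].rd, in[i].rs1, in[i].imm };
        return 3;
    }
    /* SLLI x,x,32 ; SRLI x,x,32  =>  ZEXT.W x,x */
    if (i + 1 < n && in[i].op == SLLI && in[i].imm == 32 &&
        in[i+1].op == SRLI && in[i+1].imm == 32 &&
        in[i+1].rd == in[i].rd && in[i+1].rs1 == in[i].rd) {
        *out = (struct insn){ ZEXT_W, in[i].rd, in[i].rs1, 0 };
        return 2;
    }
    *out = in[i];
    return 1;
}
```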

Tommy Thorn

unread,
Jun 26, 2017, 8:39:41 PM6/26/17
to Michael Clark, Andrew Waterman, RISC-V ISA Dev, Bruce Hoult
I'm happy you found the root cause, but this is concerning as I see this pattern quite a bit (mixing pointers with int). I wonder if a case could be made for an ADD[I]WZ instruction?

Even so, for an implementation without fusion, the two shifts seem an expensive way to zero extend when a "simple" AND could suffice (I assume that the mask generation/loading will be hoisted out of loops).
Of course such a sequence would be harder to fuse on.

Tommy


Andrew Waterman

unread,
Jun 26, 2017, 8:49:31 PM6/26/17
to Tommy Thorn, Michael Clark, RISC-V ISA Dev, Bruce Hoult
On Mon, Jun 26, 2017 at 5:39 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
> I'm happy you found the root cause but this is concerning as I see this
> pattern quite a bit (mixing pointers with int). I wonder if a case could
> be made for an ADD[I]WZ instruction?

It's still quite premature to propose an ISA solution to this problem,
as improvements to the compiler have barely been explored. Even this
case resulted from a compiler optimization and wasn't fundamental.
Many cases I've seen are truly unnecessary.

Also, RV64I is frozen :-) Could stuff something like this into the B extension

>
> Even so, for an implementation without fusion, the two shifts seems an
> expensive way to zero extend when a "simple" AND could suffice (I'm assume
> that the mask generation/loading will be hoisted out of loops).
> Of course such would be harder to fuse on.

When the zero-extension is in a loop, it makes sense to AND with a
mask. (This is 3 static instructions and 2 + n dynamic instructions,
vs. 2 static and 2n dynamic.)

Michael Clark

unread,
Jun 26, 2017, 10:23:02 PM6/26/17
to Andrew Waterman, Tommy Thorn, RISC-V ISA Dev, Bruce Hoult
On 27 Jun 2017, at 12:49 PM, Andrew Waterman <and...@sifive.com> wrote:

On Mon, Jun 26, 2017 at 5:39 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
I'm happy you found the root cause but this is concerning as I see this
pattern quite a bit (mixing pointers with int).  I wonder if a case could
be made for an ADD[I]WZ instruction?

or equivalently ADD[I]WU which has uint32 underflow/overflow semantics.

SCEV is promoting the value synthesized from int Nr - 1 as uint32 to uint64 before adding it to the pointer. If it were int32 to int64, the zero extension would be unnecessary. 

Thinking twice about it, I am not sure about the defined behaviour here, whether the zero extension can be elided, or whether this really is a gcc issue. Changing the Nr argument to unsigned Nr generates the same code (which is a hint), but using size_t Nr of course fixes the issue. I wonder if gcc is defaulting to uint32 promotion because it is a no-op on other architectures (those that zero extend narrower operations by default); otherwise it would need to explicitly sign extend on them. i.e. an optimisation for one platform ends up creating extra code for another platform. The underflow semantics are more clearly defined when adding a sign-extended value. In any case I'm pretty sure it's in SCEV (scalar evolution) somewhere. Integer promotion rules in SCEV, perhaps.
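Concretely, the two possible promotions only disagree when bit 31 of the 32-bit value is set, which is exactly the case the compiler cannot rule out for a value arriving in a register (a minimal illustration; helper names are mine):

```c
#include <stdint.h>
#include <assert.h>

/* Zero-extending vs sign-extending promotion of a 32-bit value to 64
 * bits: they agree when bit 31 is clear and differ when it is set,
 * so the zext can only be elided if bit 31 is provably zero. */
static uint64_t zext32(uint32_t x) { return (uint64_t)x; }
static uint64_t sext32(uint32_t x) { return (uint64_t)(int64_t)(int32_t)x; }
```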

It's still quite premature to propose an ISA solution to this problem,
as improvements to the compiler have barely been explored.  Even this
case resulted from a compiler optimization and wasn't fundamental.
Many cases I've seen are truly unnecessary.

Also, RV64I is frozen :-) Could stuff something like this into the B extension

Yes. I’m fine with them remaining as fusion patterns.

However B extension could help

Constant synthesis using parameters compressed into a 12-bit immediate. Help with the hoisted case.

- 0x00000000FFFFFFFF
- 0xFFFFFFFF00000000
- 0x0000FFFF0000FFFF
- 0xFFFF0000FFFF0000
- 0xFF00FF00FF00FF00
- 0x00FF00FF00FF00FF
- 0xF0F0F0F0F0F0F0F0
- 0x0F0F0F0F0F0F0F0F

Arbitrary offset bit extend (or perhaps some sort of signed bitfield extract and insert).

BEXT a0, a0, 31 # zext.w
BEXT a0, a0, 15 # zext.h
BEXT a0, a0, 7 # zext.b

I had an arbitrary offset bit extend in my BMI proposal but removed it because it can be done with two shifts, and thus was a low instruction count. However, thinking about it, it's pretty heavyweight, energy- and area-wise, to crank up a barrel shifter twice versus copying or broadcasting a bit selected by an input parameter. A barrel shifter has to wire every bit to every other bit position; a bit extend just needs 5, 6 or 7 wires to select the bit to broadcast. n^2 vs n.log(n)
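One reading of the zero-extend case of such a bit-extend instruction, first via the two-shift idiom and then as the single mask operation an implementation could use (function names are mine, and this models only the unsigned variant):

```c
#include <stdint.h>
#include <assert.h>

/* BEXT rd, rs1, b read as: zero-extend rs1 from bit position b,
 * i.e. keep bits [b:0] and clear the rest (b in 0..63). */

/* Reference semantics via the two-shift idiom. */
static uint64_t bext_shifts(uint64_t x, unsigned b)
{
    return (x << (63 - b)) >> (63 - b);
}

/* Equivalent single mask operation; b == 63 yields an all-ones mask
 * because 2ULL << 63 wraps to 0 in unsigned arithmetic. */
static uint64_t bext_mask(uint64_t x, unsigned b)
{
    return x & ((2ULL << b) - 1);
}
```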

Even so, for an implementation without fusion, the two shifts seems an
expensive way to zero extend when a "simple" AND could suffice (I'm assume
that the mask generation/loading will be hoisted out of loops).
Of course such would be harder to fuse on.

When the zero-extension is in a loop, it makes sense to AND with a
mask.  (This is 3 static instructions, and 2 + n, dynamic
instructions, vs. 2 and 2*n.)

In this case it is in a loop but behind the function call ABI so it can’t be hoisted. i.e. if it was inlined it could be.

Jacob Bachmeyer

unread,
Jun 26, 2017, 10:40:49 PM6/26/17
to Michael Clark, Andrew Waterman, Tommy Thorn, RISC-V ISA Dev, Bruce Hoult
Michael Clark wrote:
>> On 27 Jun 2017, at 12:49 PM, Andrew Waterman <and...@sifive.com
>> <mailto:and...@sifive.com>> wrote:
>>
>> It's still quite premature to propose an ISA solution to this problem,
>> as improvements to the compiler have barely been explored. Even this
>> case resulted from a compiler optimization and wasn't fundamental.
>> Many cases I've seen are truly unnecessary.
>>
>> Also, RV64I is frozen :-) Could stuff something like this into the B
>> extension
>
> Yes. I’m fine with them remaining as fusion patterns.
>
> However B extension could help
>
> Constant synthesis using parameters compressed into a 12-bit
> immediate. Help with the hoisted case.
>
> - 0x00000000FFFFFFFF
> - 0xFFFFFFFF00000000
> - 0x0000FFFF0000FFFF
> - 0xFFFF0000FFFF0000
> - 0xFF00FF00FF00FF00
> - 0x00FF00FF00FF00FF
> - 0xF0F0F0F0F0F0F0F0
> - 0x0F0F0F0F0F0F0F0F

How to encode these constants into 12 bits? (I would suggest using the
sign bit to invert the generated mask, leaving 11 bits for choosing a
pattern.)
An idea: "item width" (number of consecutive set bits in the pattern)
as a power of two, so the immediates for your examples would be 5, -5,
4, -4, -3, 3, -2, 2. This leads to a very small range, needing only
four bits for alternating masks with up to 128 bit item width (-7/7
would be all-bits-set/all-bits-clear on RV128). Could this even fit
into RVC?
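A sketch of that encoding in C, treating |imm| as the log2 of the item width and a negative imm as inversion (function name mine; RV64 only):

```c
#include <stdint.h>
#include <assert.h>

/* Decode a signed "item width" immediate into an alternating mask:
 * e = 2^|imm| consecutive set bits in each 2e-bit group, with a
 * negative imm inverting the result. */
static uint64_t alt_mask(int imm)
{
    unsigned e = 1u << (imm < 0 ? -imm : imm);
    uint64_t item = (e >= 64) ? ~0ULL : (1ULL << e) - 1;
    uint64_t m = 0;
    for (unsigned i = 0; i < 64; i += 2 * e)
        m |= item << i;                 /* replicate the e-bit run */
    return imm < 0 ? ~m : m;
}
```

This reproduces the mapping above: 5 gives 0x00000000FFFFFFFF, -5 its complement, 4 gives 0x0000FFFF0000FFFF, 3 gives 0x00FF00FF00FF00FF, and so on.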

> Arbitrary offset bit extend (or perhaps some sort of signed bitfield
> extract and insert).
>
> BEXT a0, a0, 31 # zext.w
> BEXT a0, a0, 15 # zext.h
> BEXT a0, a0, 7 # zext.b
>
> I had an arbitrary offset bit extend in my BMI proposal but removed it
> because it can be done with two shifts, and thus was a low instruction
> count, however thinking about it, it’s pretty heavyweight to crank up
> a barrel shifter twice energy and area wise versus copying a bit or
> broadcasting a bit depending on an input parameter. A barrel shifter
> has to be wire every bit to every other bit position. A bit extend
> just needs 5, 6 or 7 wires to select the bit to broadcast. n^2 vs n.log(n)

Once you have the barrel shifter for SLL/SRL/SRA, the area cost has
already been paid. A bit extend element would be additional hardware
and more area on top of the barrel shifter. General field
extract/deposit instructions do not have the same problem, since the ALU
must already have both the shifter and masking logic (for AND/OR/XOR),
so field access instructions only add mask generation and a constraint
on ALU topology that the shifter output must feed into the masking logic.



-- Jacob

Bruce Hoult

unread,
Jun 27, 2017, 8:43:42 AM6/27/17
to Andrew Waterman, Tommy Thorn, Michael Clark, RISC-V ISA Dev
If the B extension gets an "extract field" instruction then zero and/or sign extension from an arbitrary bit position could fall out of that, as it does with AArch64's Bitfield Move instruction, i.e. extracting a bitfield into a full register-sized signed or unsigned integer when the bitfield just happens to already be right-aligned.



Bruce Hoult

unread,
Jun 27, 2017, 8:47:39 AM6/27/17
to Michael Clark, Andrew Waterman, Tommy Thorn, RISC-V ISA Dev
On Tue, Jun 27, 2017 at 5:22 AM, Michael Clark <michae...@mac.com> wrote:
However B extension could help

Constant synthesis using parameters compressed into a 12-bit immediate. Help with the hoisted case.

- 0x00000000FFFFFFFF
- 0xFFFFFFF000000000
- 0x0000FFFF0000FFFF
- 0xFFFF0000FFFF0000
- 0xFF00FF00FF00FF00
- 0x00FF00FF00FF00FF
- 0xF0F0F0F0F0F0F0F0
- 0x0F0F0F0F0F0F0F0F

AArch64's immediate bit patterns for the bitwise immediate instructions are very nice. So nice that one imagines there is a good chance they've patented it.

Bruce Hoult

unread,
Jun 27, 2017, 9:03:37 AM6/27/17
to Jacob Bachmeyer, Michael Clark, Andrew Waterman, Tommy Thorn, RISC-V ISA Dev
On Tue, Jun 27, 2017 at 5:40 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Michael Clark wrote:
On 27 Jun 2017, at 12:49 PM, Andrew Waterman <and...@sifive.com <mailto:and...@sifive.com>> wrote:

It's still quite premature to propose an ISA solution to this problem,
as improvements to the compiler have barely been explored.  Even this
case resulted from a compiler optimization and wasn't fundamental.
Many cases I've seen are truly unnecessary.

Also, RV64I is frozen :-) Could stuff something like this into the B extension

Yes. I’m fine with them remaining as fusion patterns.

However B extension could help

Constant synthesis using parameters compressed into a 12-bit immediate. Help with the hoisted case.

- 0x00000000FFFFFFFF
- 0xFFFFFFFF00000000
- 0x0000FFFF0000FFFF
- 0xFFFF0000FFFF0000
- 0xFF00FF00FF00FF00
- 0x00FF00FF00FF00FF
- 0xF0F0F0F0F0F0F0F0
- 0x0F0F0F0F0F0F0F0F

How to encode these constants into 12 bits?  (I would suggest using the sign bit to invert the generated mask, leaving 11 bits for choosing a pattern.)
An idea:  "item width" (number of consecutive set bits in the pattern) as a power of two, so the immediates for your examples would be 5, -5, 4, -4, -3, 3, -2, 2.  This leads to a very small range, needing only four bits for alternating masks with up to 128 bit item width (-7/7 would be all-bits-set/all-bits-clear on RV128).  Could this even fit into RVC?


imms and immr are each 6 bit fields extracted from the instruction. immn is a further 1 bit.

So it's actually 13 bits, not 12.

static bool logic_imm_decode_wmask(uint64_t *result, unsigned int immn,
                                   unsigned int imms, unsigned int immr)
{
    uint64_t mask;
    unsigned e, levels, s, r;
    int len;

    assert(immn < 2 && imms < 64 && immr < 64);

    /* The bit patterns we create here are 64 bit patterns which
     * are vectors of identical elements of size e = 2, 4, 8, 16, 32 or
     * 64 bits each. Each element contains the same value: a run
     * of between 1 and e-1 non-zero bits, rotated within the
     * element by between 0 and e-1 bits.
     *
     * The element size and run length are encoded into immn (1 bit)
     * and imms (6 bits) as follows:
     * 64 bit elements: immn = 1, imms = <length of run - 1>
     * 32 bit elements: immn = 0, imms = 0 : <length of run - 1>
     * 16 bit elements: immn = 0, imms = 10 : <length of run - 1>
     *  8 bit elements: immn = 0, imms = 110 : <length of run - 1>
     *  4 bit elements: immn = 0, imms = 1110 : <length of run - 1>
     *  2 bit elements: immn = 0, imms = 11110 : <length of run - 1>
     * Notice that immn = 0, imms = 11111x is the only combination
     * not covered by one of the above options; this is reserved.
     * Further, <length of run - 1> all-ones is a reserved pattern.
     *
     * In all cases the rotation is by immr % e (and immr is 6 bits).
     */

    /* First determine the element size */
    len = 31 - clz32((immn << 6) | (~imms & 0x3f));
    if (len < 1) {
        /* This is the immn == 0, imms == 0x11111x case */
        return false;
    }
    e = 1 << len;

    levels = e - 1;
    s = imms & levels;
    r = immr & levels;

    if (s == levels) {
        /* <length of run - 1> mustn't be all-ones. */
        return false;
    }

    /* Create the value of one element: s+1 set bits rotated
     * by r within the element (which is e bits wide)...
     */
    mask = bitmask64(s + 1);
    if (r) {
        mask = (mask >> r) | (mask << (e - r));
        mask &= bitmask64(e);
    }
    /* ...then replicate the element over the whole 64 bit value */
    mask = bitfield_replicate(mask, e);
    *result = mask;
    return true;
}

/* The input should be a value in the bottom e bits (with higher
 * bits zero); returns that value replicated into every element
 * of size e in a 64 bit integer.
 */
static uint64_t bitfield_replicate(uint64_t mask, unsigned int e)
{
    assert(e != 0);
    while (e < 64) {
        mask |= mask << e;
        e *= 2;
    }
    return mask;
}

/* Return a value with the bottom len bits set (where 0 < len <= 64) */
static inline uint64_t bitmask64(unsigned int length)
{
    assert(length > 0 && length <= 64);
    return ~0ULL >> (64 - length);
}
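As a quick sanity check on the scheme, a condensed, self-contained restatement of the decoder above (helper names are mine, not QEMU's), decoding a couple of the masks from earlier in the thread:

```c
#include <stdint.h>
#include <assert.h>

static uint64_t ones(unsigned n)
{
    return n >= 64 ? ~0ULL : (1ULL << n) - 1;
}

/* Condensed version of logic_imm_decode_wmask: returns 1 and writes the
 * decoded mask, or returns 0 for a reserved encoding. */
static int wmask(uint64_t *out, unsigned immn, unsigned imms, unsigned immr)
{
    unsigned v = (immn << 6) | (~imms & 0x3f);
    if (v == 0)
        return 0;                         /* immn = 0, imms = 0x3f: reserved */
    int len = 31 - __builtin_clz(v);      /* log2 of the element size */
    if (len < 1)
        return 0;                         /* imms = 11111x: reserved */
    unsigned e = 1u << len;
    unsigned s = imms & (e - 1), r = immr & (e - 1);
    if (s == e - 1)
        return 0;                         /* run of e ones: reserved */
    uint64_t m = ones(s + 1);             /* s+1 set bits... */
    if (r)                                /* ...rotated right by r within e */
        m = ((m >> r) | (m << (e - r))) & ones(e);
    for (unsigned w = e; w < 64; w *= 2)  /* replicate element to 64 bits */
        m |= m << w;
    *out = m;
    return 1;
}
```

E.g. immn=1, imms=31 decodes to 0x00000000FFFFFFFF, and immn=0, imms=15 decodes to 0x0000FFFF0000FFFF, two of the constants discussed above.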
