Interpreting fused AUIPC+JALR as a direct jump (with or without register side effect)

Michael Clark

Mar 2, 2017, 10:55:49 PM
to RISC-V ISA Dev
Just sharing thoughts on fused AUIPC+JALR being considered a direct jump and link…

In static analysis of large binaries, many library calls are in the form of AUIPC+JALR versus JAL (which has +/-1MiB range). It is common for modern applications to have from 10MB to 100MB of text and most function calls expressed as AUIPC+JALR.

$ stat -f %z "/Applications/Google Chrome.app//Contents/Versions/56.0.2924.87/Google Chrome Framework.framework/Google Chrome Framework"
112012064

JALR is as we know a register indirect jump and link return address instruction, and interestingly register indirect calls and returns are particularly hard for dynamic binary translation (my specific interest). A translator typically needs to inject a stub at the translation point that looks up the address of the translation for the ‘dynamic’ target address, and as the target address is not known at the time of translation, a translator can’t always translate past indirect jumps (and obviously return). There are some interesting techniques in the literature, such that the inserted stubs can learn a static address and later rewrite the indirect jump as a direct jump (for the indirect call case, but obviously not return).

Indirect jumps are also likely harder for microarchitectures due to requiring a register read to decode the target address for instruction prefetch. i.e. there may be a higher latency to resolve the jump target address further down the pipeline versus decoding it early as an immediate.

While JALR is technically a register indirect jump, the fused adjacent combination of AUIPC+JALR can be seen as a direct PC relative jump and link that also loads the target address into a register (as a side effect). Unlike a true indirect jump, it can be translated efficiently, and in a microarchitecture the prefetch of the jump target instruction address can be started before the register commit (of the side effect).
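
To make the "direct jump" reading concrete, here is a minimal Python sketch of how a decoder or translator could compute the target of an adjacent AUIPC+JALR pair from the two immediates alone, without reading the temporary. The function names and the example pc value are made up for illustration; the arithmetic follows the base ISA definitions of AUIPC (pc plus the 20-bit immediate shifted left by 12) and JALR (rs1 plus the sign-extended 12-bit immediate, with bit 0 of the result cleared).

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fused_call_target(pc: int, auipc_imm20: int, jalr_imm12: int, xlen: int = 64) -> int:
    """Target of `auipc rt, imm20` located at `pc`, followed by `jalr rd, rt, imm12`.

    Both immediates are passed as raw instruction fields (20 and 12 bits).
    """
    mask = (1 << xlen) - 1
    rt = (pc + (sign_extend(auipc_imm20, 20) << 12)) & mask  # value AUIPC would write
    return (rt + sign_extend(jalr_imm12, 12)) & mask & ~1     # JALR clears bit 0


# Hypothetical pc; the immediates are the ones from the memset/strcmp examples.
pc = 0x10000
memset_target = fused_call_target(pc, 1589248 >> 12, 324)
strcmp_target = fused_call_target(pc, 1576960 >> 12, -164 & 0xFFF)
```

With these inputs the memset call resolves to pc + 1589248 + 324 and the strcmp call to pc + 1576960 - 164, which is all a translator needs in order to emit a direct call.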

The observation (a thought experiment) is that one of these AUIPC+JALR pairs can later be split, i.e. control can enter at the JALR alone. The JALR can then potentially be used as a ROP gadget: given enough diversity of offsets, one might be able to get a return address onto the stack pointing to an adjacent function, given a known value for the temporary (the t1 temporary from the last indirect call in the code that is being exploited). From a binary translation perspective, the trace for the basic block starting at the split entry point would not exist and would need to be re-translated starting with the JALR, and the JALR would be treated as an unfused indirect JALR and require a runtime translation stub.

auipc t1, pc + 1589248
1:
jalr ra, t1, 324 # <memset>



auipc t1, pc + 1576960
jalr ra, t1, -164 # <strcmp>
ret



jal x0, 1b # or ra value restored from xyz(sp)

I am just mentally questioning the safety of treating the AUIPC+JALR pair as a direct jump (with register side-effect) instead of as an indirect call.

With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.

Given the register side effect is redundant in a fused decode implementation it leads to the possibility of an extension like this:

auipc zero, pc + 1576960
jalr ra, zero, -164 # <strcmp>

AUIPC with rd=zero is a nop. I’m not suggesting this is a good idea; it just came to mind when considering the register side effect redundant with the fused variant. i.e. we just need to decode the immediate over two instructions.

Michael

Sober Liu

Mar 2, 2017, 11:12:52 PM
to Michael Clark, RISC-V ISA Dev
I am not sure I fully get your idea. But do you expect both code and data to be within a 32-bit range?
And for "a direct PC relative jump", do you expect static libs instead of dynamic libs?
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/367A58E1-7532-409F-AB6B-5A762411BC2E%40mac.com.

Michael Clark

Mar 2, 2017, 11:15:55 PM
to Sober Liu, RISC-V ISA Dev
Two meanings for direct. Direct relative vs absolute indirect.

Michael Clark

Mar 2, 2017, 11:18:57 PM
to RISC-V ISA Dev
Somewhat off-topic: my notes on possible translations for AUIPC+JALR are here:

https://github.com/michaeljclark/riscv-meta/blob/master/doc/src/jumps.md

I am discovering that there are many potential ways to translate AUIPC+JALR. RET (jalr zero, ra) may need a hidden stack that contains pair<ra,translated_ra>, with a fallback hash table lookup for misses. Indirect JALR for function pointers will need hash table lookups, which are likely to cost dozens of cycles for truly indirect calls. RET (jalr zero, ra) can be optimised assuming a normal call stack is being used.
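
The hidden stack with a hash table fallback could look roughly like the following Python sketch. It is only an illustration of the idea, not the scheme used in riscv-meta: the class and method names are hypothetical, and translated addresses are plain integers standing in for host code pointers.

```python
class ReturnCache:
    """Shadow stack of (guest_ra, translated_ra) pairs with a hash table fallback."""

    def __init__(self):
        self.shadow = []   # hidden stack, pushed on translated calls
        self.table = {}    # guest return address -> translated address

    def on_call(self, guest_ra: int, translated_ra: int) -> None:
        # Record the mapping in both structures when a call is translated/executed.
        self.table[guest_ra] = translated_ra
        self.shadow.append((guest_ra, translated_ra))

    def on_return(self, guest_ra: int):
        """Return (translated address or None, which path was used)."""
        if self.shadow:
            top_ra, top_translated = self.shadow.pop()
            if top_ra == guest_ra:          # fast path: normal call/return nesting
                return top_translated, "stack"
        # Miss (e.g. longjmp or a hand-built return address): slow hash lookup.
        return self.table.get(guest_ra), "table"


cache = ReturnCache()
cache.on_call(0x1004, 0x7F001000)
cache.on_call(0x2008, 0x7F002000)
fast = cache.on_return(0x2008)   # matches the top of the shadow stack
miss = cache.on_return(0x9999)   # unknown return address, falls back to the table
slow = cache.on_return(0x1004)   # shadow stack is now empty, found via the table
```

The fast path is a pop and a compare; only mismatches pay for the hash lookup, which is where the "dozens of cycles" would be spent.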

Michael Clark

Mar 2, 2017, 11:20:16 PM
to Sober Liu, RISC-V ISA Dev

> On 3 Mar 2017, at 5:12 PM, Sober Liu <sob...@nvidia.com> wrote:
>
> I am not sure I fully get your idea. But do you expect both code and data to be within a 32-bit range?

Yes. A 32-bit signed range (roughly +/-2GiB) as per the AUIPC+JALR pair.

> And for "a direct PC relative jump", do you expect static libs instead of dynamic libs?

Yes. I am thinking about the static case. I need to analyse GOT offset calls is dynamic libs. Next…

Michael Clark

Mar 2, 2017, 11:39:54 PM
to Sober Liu, RISC-V ISA Dev
Sorry I meant in dynamic libs. I am presently analysing vmlinux.

I will have to think about PLT stubs to GOT offsets and lazy resolution:

               1aec0:   00018e17             auipc          t3, pc + 98304
               1aec4:   880e3e03             ld             t3, -1920(t3)       # 0x0000000000032740
               1aec8:   000e0367             jalr           t1, t3, 0

We can assume a dynamic linker doesn’t unlink a GOT entry, by tracing AUIPC+LD+JALR a few times (after the resolver has populated the GOT entry), to avoid a hash table lookup in translated code. However, it would likely be possible to write a test case that changes a GOT entry and demonstrates that the processor is not a RISC-V but rather makes some assumptions about jump targets. A translator should be able to pass tests. This would be an interesting, tricky test.
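
As a worked check of the PLT stub above, this Python sketch decodes the AUIPC and LD words and recomputes the GOT slot address shown in the disassembly comment. The helper names are made up for illustration; the field positions are the standard U-type and I-type layouts.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field word[hi:lo]."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext(value: int, width: int) -> int:
    """Sign-extend a `width`-bit field."""
    return value - (1 << width) if value >> (width - 1) else value


def got_slot(pc: int, auipc_word: int, ld_word: int) -> int:
    """Address loaded by `auipc rt, hi` at `pc` followed by `ld rd, lo(rt)`."""
    if bits(auipc_word, 6, 0) != 0x17:
        raise ValueError("first word is not AUIPC")
    if bits(ld_word, 6, 0) != 0x03 or bits(ld_word, 14, 12) != 3:
        raise ValueError("second word is not LD")
    if bits(ld_word, 19, 15) != bits(auipc_word, 11, 7):
        raise ValueError("LD base register is not the AUIPC destination")
    upper = sext(bits(auipc_word, 31, 12), 20) << 12   # 98304 for 0x00018e17
    lower = sext(bits(ld_word, 31, 20), 12)            # -1920 for 0x880e3e03
    return pc + upper + lower


slot = got_slot(0x1AEC0, 0x00018E17, 0x880E3E03)

# The trailing JALR word: opcode 0x67, rd = x6 (t1), rs1 = x28 (t3), immediate 0.
jalr_word = 0x000E0367
jalr_fields = (bits(jalr_word, 6, 0), bits(jalr_word, 11, 7),
               bits(jalr_word, 19, 15), sext(bits(jalr_word, 31, 20), 12))
```

Here 0x1aec0 + 98304 - 1920 gives 0x32740, matching the address in the listing, so the slot a translator would have to watch is known statically even though the jump target stored in it is not.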

Michael Clark

Mar 3, 2017, 12:04:20 AM
to RISC-V ISA Dev
We would need to fault on writes to the GOT to make a translator pass tests, i.e. watch writes to pages containing function pointers referenced in translated code and recognise the pattern (AUIPC+LD+JALR). Quite complex to translate shared library calls efficiently. Sorry, diverging now.
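
A rough Python model of that idea, under simplifying assumptions: guest memory is a dict, a "write fault" is modelled as a check inside the store path, and the target is cached on first use rather than after tracing the stub a few times. The class name and the 4 KiB page size are illustrative choices, not anything from an existing translator.

```python
class GotTargetCache:
    """Caches jump targets read from GOT slots and drops them when the slot is written."""

    PAGE_SIZE = 4096  # assumed page granularity for write watching

    def __init__(self, memory: dict):
        self.memory = memory        # guest address -> 64-bit value (stand-in for guest RAM)
        self.cached = {}            # GOT slot address -> target assumed by translated code
        self.watched_pages = set()  # pages that would be write-protected

    def resolve(self, slot: int) -> int:
        """Target for an AUIPC+LD+JALR stub that loads from `slot`."""
        if slot not in self.cached:
            self.cached[slot] = self.memory[slot]
            self.watched_pages.add(slot // self.PAGE_SIZE)
        return self.cached[slot]

    def store(self, addr: int, value: int) -> None:
        """Guest store; a hit on a watched page models the write fault."""
        self.memory[addr] = value
        if addr // self.PAGE_SIZE in self.watched_pages:
            self.cached.pop(addr, None)   # invalidate the assumption for this slot


memory = {0x32740: 0x5000}
got = GotTargetCache(memory)
before = got.resolve(0x32740)
got.store(0x32740, 0x6000)      # the "tricky test": rewrite the GOT entry
after = got.resolve(0x32740)
```

A real translator would also have to invalidate or patch any translated code that had the old target baked in as a direct jump; the sketch only tracks the cached value.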

Jacob Bachmeyer

Mar 3, 2017, 12:33:16 AM
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
> With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.
>
> Given the register side effect is redundant in a fused decode implementation it leads to the possibility of an extension like this:
>
> auipc zero, pc + 1576960
> jalr ra, zero, -164 # <strcmp>
>
> AUIPC with rd=zero is a nop. I’m not suggesting this is a good idea; it just came to mind when considering the register side effect redundant with the fused variant. i.e. we just need to decode the immediate over two instructions.

No extension needed and perfectly consistent:

AUIPC ra, 1576960
JALR ra, ra, -164 # <strcmp>


In fact, this is the *only* way for AUIPC+JALR as a function call to be
a valid fusion pair. The example you give is a no-op followed by an
absolute jump of the type originally envisioned as an SBI call.

Remember: macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.
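
Under the assumption stated here (at most one register-file write per fused group), the constraint on an AUIPC+JALR pair can be written as a small predicate. This is a Python sketch with a hypothetical function name; register numbers follow the standard ABI names (x0 = zero, x1 = ra, x6 = t1).

```python
ZERO, RA, T1 = 0, 1, 6  # x0, x1, x6


def fusable_with_one_write(auipc_rd: int, jalr_rd: int, jalr_rs1: int) -> bool:
    """True if `auipc auipc_rd, ...; jalr jalr_rd, jalr_rs1, ...` can be fused
    into a PC-relative call that performs a single register write."""
    if auipc_rd == ZERO:
        # AUIPC to x0 writes nothing, so the JALR would not see pc + imm at all.
        return False
    if jalr_rs1 != auipc_rd:
        # The JALR does not consume the AUIPC result; not a PC-relative pair.
        return False
    # The AUIPC result must be overwritten by the JALR link write.
    return jalr_rd == auipc_rd


ra_pair = fusable_with_one_write(RA, RA, RA)        # auipc ra; jalr ra, ra
t1_pair = fusable_with_one_write(T1, RA, T1)        # auipc t1; jalr ra, t1
zero_pair = fusable_with_one_write(ZERO, RA, ZERO)  # auipc zero; jalr ra, zero
```

Only the first pair passes. The t1 form leaves t1 live after the call, so an implementation that fuses it has to perform two writes.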


-- Jacob

Michael Clark

Mar 3, 2017, 12:36:10 AM
to jcb6...@gmail.com, RISC-V ISA Dev
Yes. Good idea. I like your version.

zero was just what immediately came to mind as a no side effect version. ra is perfect.

Michael Clark

Mar 3, 2017, 12:37:46 AM
to jcb6...@gmail.com, RISC-V ISA Dev
In fact it’s such a good idea that CALL should emit it. The pseudo is hard-coded to use `t1`.

Michael Clark

Mar 3, 2017, 1:11:16 AM
to jcb6...@gmail.com, RISC-V ISA Dev
I had been thinking about this liveness constraint earlier. 

CALL is a non-invasive change as nothing should depend on t1 for the no argument version. The 2 argument rd version of CALL would need to use rd as the temporary and the change would be more invasive. gcc emits the no argument version by default. TAIL can’t really be changed as we’d be clobbering ra. CALL is the common case.

Something like this in riscv-binutils-gdb:

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },

Allen J. Baum

Mar 3, 2017, 2:31:00 AM
to Michael Clark, RISC-V ISA Dev
OK, I'm feeling dense.
I don't understand the statement:
macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.

Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement - but it could be an implementation requirement (since it saves register file ports - but perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could make a requirement that the first op of a fused pair can never cause a trap or exception; that would solve part of the problem, or you could modify the statement to be specific about register side effects.

Regarding the zero case:
Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
It's possible that sometimes ops will be fused - and other times the same op won't be fused (or that some pair of ops will be fused while other pairs of the same ops won't be). An obvious case is when the pair crosses a cache or page boundary - in which case using zero is broken.

Am I missing something?
--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Andrew Waterman

Mar 3, 2017, 3:10:04 AM
to Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
We don't use rs1=ra here because that serves as a hint to pop the
return-address stack for some implementations.

Many instruction fusion opportunities will involve writing multiple
registers. For example, to reduce latency you'd also want to fuse
things like

lui t0, sym
ld t1, offset(t0)

which shows up in cases that t0 is later reused.

Superscalars typically over-provision write ports, so this is a matter
of control complexity, not extra datapath.

Bruce Hoult

Mar 3, 2017, 3:47:50 AM
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
From the 2.1 spec:

"Return-address prediction stacks are a common feature of high-performance instruction-fetch units. We note that rd and rs1 can be used to guide an implementation’s instruction-fetch prediction logic, indicating whether JALR instructions should push (rd=x1), pop (rd=x0, rs1=x1), or not touch (otherwise) a return-address stack."

Here, rs1 is ra, but rd is not zero, so return/pop return buffer should not be assumed.

On the contrary, rd *is* ra, so call/push return buffer should be matched.

Seems ok to me, and in fact a very good idea.


Andrew Waterman

Mar 3, 2017, 4:05:52 AM
to Bruce Hoult, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
My comment is consistent with the current spec's commentary: "push
(rd=x1/x5), pop (rs1=x1/x5), or not touch (otherwise)"

Both pushing and popping the RAS is useful for coroutines. Alpha has
a similar hint on its JSR instruction.
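
The difference between the two wordings quoted in this exchange is easiest to see side by side. The Python sketch below encodes the v2.1 text as quoted by Bruce and the newer commentary as quoted here; the function names are made up, and treating "rd and rs1 both match" as push+pop in the newer wording is my reading of the coroutine remark rather than quoted spec text.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (alternate link register)


def ras_hint_v2_1(rd: int, rs1: int) -> str:
    """v2.1 wording: push (rd=x1), pop (rd=x0, rs1=x1), else not touch."""
    if rd == 1:
        return "push"
    if rd == 0 and rs1 == 1:
        return "pop"
    return "none"


def ras_hint_commentary(rd: int, rs1: int) -> str:
    """Newer wording: push (rd=x1/x5), pop (rs1=x1/x5), else not touch."""
    push = rd in LINK_REGS
    pop = rs1 in LINK_REGS
    if push and pop:
        return "push+pop"
    if push:
        return "push"
    if pop:
        return "pop"
    return "none"


# jalr ra, ra, imm: the proposed fused CALL form
old_call = ras_hint_v2_1(rd=1, rs1=1)
new_call = ras_hint_commentary(rd=1, rs1=1)
```

Under the v2.1 wording `jalr ra, ra` only pushes, while under the newer wording it would also pop, which is the objection to using rs1=ra in CALL.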

Bruce Hoult

Mar 3, 2017, 4:19:03 AM
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
2.1 is the most recent spec on riscv.org. It's not the current spec?



Michael Clark

Mar 3, 2017, 7:10:04 AM
to Andrew Waterman, Jacob Bachmeyer, RISC-V ISA Dev

> On 3/03/2017, at 9:09 PM, Andrew Waterman <and...@sifive.com> wrote:
>
> We don't use rs1=ra here because that serves as a hint to pop the
> return-address stack for some implementations.

I see. The stack hint is useful for handling fast indirect return. It would work if the pop hint was rs1=ra and rd=zero (assuming tail still uses rs1=t1).

> Many instruction fusion opportunities will involve writing multiple
> registers. For example, to reduce latency you'd also want to fuse
> things like
>
> lui t0, sym
> ld t1, offset(t0)
>
> which shows up in cases that t0 is later reused.

Yes, we can avoid the load of t0 if the register is killed shortly after this expression. I will check again, but I don't remember seeing the absolute load pattern (well, not in vmlinux; I saw it only for constants).

This is the form that we can safely fuse into a single load without looking too far ahead:

lui t0, sym
ld t0, offset(t0)

I do remember seeing the lui pattern where the register is reused, but only in code that is not PIC or PIE; of course there is also the equivalent auipc pattern for GOT references.

More complex fuse patterns might need an 8 to 12 instruction window. The problem is registers that are possibly not killed until after branches and jumps. The register has to not be live to avoid the partially formed address being committed (when it's a temporary for forming an address for a load / store).

Jacob Bachmeyer

Mar 3, 2017, 5:23:04 PM
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
Allen J. Baum wrote:
> OK, I'm feeling dense.
> I don't understand the statement:
> macro-op fusion requires that all side-effects of earlier
> instructions be clobbered by later instructions in the fusion group.
>
> Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement - but it could be an implementation requirement (since it saves register file ports - but perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could make a requirement that the first op of a fused pair can never cause a trap or exception; that would solve part of the problem, or you could modify the statement to be specific about register side effects.

I was assuming that RISC-V implementations would do at most one regfile
write per macro-op. Under this constraint, my statement is correct, but
it has been mentioned that implementations complex enough to fuse
AUIPC+JALR are probably complex enough to have multiple regfile write
ports. Another requirement of macro-op fusion is that the result of a
fusion group must be identical to the same instructions executed
individually. Exceptions at the first instruction are easy, just trap
at the entire group, but exceptions on subsequent instructions get
complicated. I believe that speculative execution has been offered as a
solution to this.

> Regarding the zero case:
> Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
> It's possible that sometimes ops will be fused - and other times the same op won't be fused (or that some pair of ops will be fused while other pairs of the same ops won't be). An obvious case is when the pair crosses a cache or page boundary - in which case using zero is broken.
>

Using zero in AUIPC+JALR is broken in any case, since the result of
macro-op fusion must be identical to the result of executing the
individual fused instructions in sequence.

> Am I missing something?

I do not believe so, but I clearly missed that there are side effects
other than regfile writes.


-- Jacob

Michael Clark

Mar 3, 2017, 5:35:47 PM
to Andrew Waterman, Bruce Hoult, Jacob Bachmeyer, RISC-V ISA Dev
It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess making the call stack hint for push and pop simple, simplifies the call stack implementation.

Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.

For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I will need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument. I would use the potential for micro-architectural optimisation of inter-module calls in large executables.

The trade-off, I guess, is between the simplicity of the pop constraint and whether or not the register write elision on inter-module calls is worth it.


Jacob Bachmeyer

Mar 3, 2017, 6:35:42 PM
to Michael Clark, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev
Michael Clark wrote:
> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess
> making the call stack hint for push and pop simple, simplifies the
> call stack implementation.

Andrew Waterman hinted at a more-important reason: coroutines need to
both push *and* pop the return address stack on the same instruction.

> Jacob’s version allows a register write elision and potentially
> decoding the CALL target immediate in early decode.

It does, but it also breaks the return-stack hints on JALR for
coroutines (which are not mentioned in spec v2.1).

> For me, from a binary translation perspective, it lets me elide a
> redundant mov of the target address that I will need to populate into
> a temporary. One less instruction. Although I wouldn’t use binary
> translation as an argument. I would use the potential for
> micro-architectural optimisation of inter-module calls in large
> executables.
>
> The weight I guess is the between the pop constraint simplicity and
> whether or not the register write elision on inter-module calls is
> worth it.

Another factor is support for coroutines--the push constraint and pop
constraint must be compatible in that case. The return-address stack
hints in v2.1 do not meet this criteria and also do not mention using x5
as an alternate link register for millicode.

-- Jacob

Michael Clark

Mar 3, 2017, 7:29:42 PM
to jcb6...@gmail.com, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev

> On 4 Mar 2017, at 12:35 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess making the call stack hint for push and pop simple, simplifies the call stack implementation.
>
> Andrew Waterman hinted at a more-important reason: coroutines need to both push *and* pop the return address stack on the same instruction.

Okay I understand now.

>> Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.
>
> It does, but it also breaks the return-stack hints on JALR for coroutines (which are not mentioned in spec v2.1).

Interesting trade off.

>> For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I will need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument. I would use the potential for micro-architectural optimisation of inter-module calls in large executables.
>>
>> The weight I guess is the between the pop constraint simplicity and whether or not the register write elision on inter-module calls is worth it.
>
> Another factor is support for coroutines--the push constraint and pop constraint must be compatible in that case. The return-address stack hints in v2.1 do not meet this criteria and also do not mention using x5 as an alternate link register for millicode.

The call stack and coroutine hints will also be useful for binary translators, for fast coroutines and procedure returns.

So we can’t elide the AUIPC+JALR temporary unless we trace past a CALL and see it used.

The microarchitecture can still decode the target address before register decoding if it looks at the opcode and immediate of the two adjacent instructions, it just needs to write the jump target register (optimally once).

Michael.

Rogier Brussee

Mar 8, 2017, 4:35:05 PM
to RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
For all intents and purposes 

auipc ra imm20
jalr ra ra imm11*

is indistinguishable from "jal ra imm20imm11", except that such long immediates can't be encoded (and the last bit in the immediate of jalr is ignored), and it seems completely reasonable to handle them the same way internally, i.e. after fusion. It should be by far the most common case (in fact, in Xcondensed I envisioned jalr_ra_ra imm11 as a 2-byte Xcondensed instruction). The normal return is jalr zero ra 0, and FWIW in many implementations that instruction is the compressed C.jr ra instruction.

In fact the coroutine case seems not the case to optimise for, and surely if simultaneous  popping and pushing the call stack really is important one can use a different calling convention for them and do

auipc x5 imm20
jalr x5 x5 imm11*

I know this is swearing in church, but x5 was added as a register that may manipulate the call stack if any only in v2.1 (if I understood correctly for implementing register saving and restoring calls that would be called with jalr x5 zero imm), and this is a quality of implementation issue. Can't  the spec be simply updated to insist on jalr x0 ra 0 aka C.jr ra  or jalr rd x5 imm to pop the return stack (so jalr x5 x5 imm would push and pop the return stack) ? 

Rogier

On Saturday, 4 March 2017 at 00:35:42 UTC+1, Jacob Bachmeyer wrote:

kr...@berkeley.edu

Apr 24, 2017, 3:02:00 AM
to Rogier Brussee, RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com

We agree that the fusion case is more important to optimize for than
the coroutine case, but we do want to support both and to allow fused
calls with one reg write port when using the alternate link register
(it was added to call register save code for -Os case, so will be
frequently used in that code)

The new proposal is that push+pop is hinted only when rs1!=rd (and
rs1=x1/x5,rd=x1/x5), so

auipc ra, imm20; jalr ra, ra, imm11

can be fused with either x1 or x5, writes only a single value, and
only pushes the RAS.

Coroutines would have to use

jalr x1, x5, imm11
or
jalr x5, x1, imm11

to hint push+pop.

New text:
"JALR instructions should:
push only (rd=x1/x5, rs1!=x1/x5 or rs1=rd)
pop only (rs1=x1/x5, rd!=x1/x5),
push and pop (rd=x1/x5, rs1=x1/x5, and rs1!=rd)
or not touch (otherwise) a return-address stack."

Truth table

rd      rs1     rd==rs1  action
!x1/x5  !x1/x5  X        nothing
x1/x5   !x1/x5  X        push
x1/x5   x1/x5   0        push+pop
x1/x5   x1/x5   1        push
!x1/x5  x1/x5   X        pop
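
For anyone who wants to check the table mechanically, here is a short Python transcription of the proposed text. The function name is made up; the three conditions are copied from the "New text" wording, in the same order.

```python
def ras_action(rd: int, rs1: int) -> str:
    """Return-address stack hint for `jalr rd, rs1, imm` under the proposed text."""
    rd_link = rd in (1, 5)    # rd = x1/x5
    rs1_link = rs1 in (1, 5)  # rs1 = x1/x5
    if rd_link and (not rs1_link or rs1 == rd):
        return "push"
    if rs1_link and not rd_link:
        return "pop"
    if rd_link and rs1_link and rs1 != rd:
        return "push+pop"
    return "nothing"


# One representative per truth-table row, plus the fused CALL with either link register.
rows = {
    (0, 6): ras_action(0, 6),   # !x1/x5, !x1/x5
    (1, 6): ras_action(1, 6),   # x1/x5,  !x1/x5
    (1, 5): ras_action(1, 5),   # x1/x5,  x1/x5, rd != rs1 (coroutine swap)
    (1, 1): ras_action(1, 1),   # x1/x5,  x1/x5, rd == rs1 (fused call via ra)
    (5, 5): ras_action(5, 5),   # same, via the alternate link register
    (0, 1): ras_action(0, 1),   # !x1/x5, x1/x5 (ret)
}
```

The rows come out as nothing, push, push+pop, push, push and pop respectively, so `jalr ra, ra` and `jalr x5, x5` only push, while `jalr x1, x5` and `jalr x5, x1` remain available for coroutines.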

Krste

Bruce Hoult

Apr 24, 2017, 8:44:29 AM
to Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Andrew Waterman, Jacob Bachmeyer
Perfect! Thank you. +1


Michael Clark

Apr 24, 2017, 8:16:42 PM
to Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Andrew Waterman, Jacob Bachmeyer
Excellent!

We’ll need to test a change to the gcc CALL macro to emit the following to take full advantage of the change as per Jacob’s original suggestion:

1: AUIPC ra, %pcrel_hi(symbol)
JALR ra, %pcrel_lo(1b)(ra)

TAIL uses zero as the link register so it can’t have its address register write elided.
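
To make the elided write concrete, here is a small sketch that models the two-instruction sequence in Python. The helper names (pcrel_split, run_call) and the example addresses are assumptions for illustration; the +0x800 rounding is the usual way to split a PC-relative offset when the low 12 bits are sign-extended.

```python
MASK64 = (1 << 64) - 1


def pcrel_split(offset):
    """Split a PC-relative offset into (hi20, lo12).

    JALR sign-extends its 12-bit immediate, so hi20 is rounded with
    +0x800 and lo12 ends up in the range -2048..2047.
    """
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12


def run_call(pc, symbol, tmp, link="ra"):
    """Model AUIPC tmp, hi ; JALR link, lo(tmp) with the AUIPC at pc.

    Returns (target, regs), where regs holds every register that is
    architecturally modified once both instructions have retired.
    """
    hi20, lo12 = pcrel_split(symbol - pc)
    regs = {}
    regs[tmp] = (pc + (hi20 << 12)) & MASK64      # AUIPC result
    target = (regs[tmp] + lo12) & MASK64 & ~1     # JALR reads tmp first...
    regs[link] = (pc + 8) & MASK64                # ...then writes the link
    return target, regs


if __name__ == "__main__":
    for tmp in ("t1", "ra"):
        target, regs = run_call(0x10000, 0x212344, tmp)
        print(tmp, hex(target), {k: hex(v) for k, v in regs.items()})
```

With tmp="t1" (the old expansion) both t1 and ra are left modified; with tmp="ra" only ra is, and it holds the return address, which is what allows the pair to be treated as a direct jump and link.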

Something like this in binutils (needs testing):

mclark@minty:~/src/riscv-gnu-toolchain/riscv-binutils-gdb$ git diff

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },
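
A rough sketch of what the one-line change does to the macro entry's register fields. The shift amounts follow the base instruction format (rd in bits 11:7, rs1 in bits 19:15), and x1 = ra, x6 = t1. That binutils uses exactly these constant values, and that the macro expander reads rd and rs1 back out of this word, are assumptions here and should be checked against the binutils sources.

```python
# Assumed values, mirroring the RISC-V base encoding and ABI register names.
OP_SH_RD = 7     # rd occupies bits 11:7
OP_SH_RS1 = 15   # rs1 occupies bits 19:15
REG_MASK = 0x1F
X_RA = 1         # x1
X_T1 = 6         # x6


def macro_regs(match):
    """Extract (rd, rs1) from a macro entry's match word."""
    return (match >> OP_SH_RD) & REG_MASK, (match >> OP_SH_RS1) & REG_MASK


old_call = (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD)   # line removed by the diff
new_call = (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD)   # line added by the diff
tail = X_T1 << OP_SH_RS1                              # unchanged

if __name__ == "__main__":
    for name, word in [("old call", old_call), ("new call", new_call), ("tail", tail)]:
        rd, rs1 = macro_regs(word)
        print(f"{name}: match={word:#x} rd=x{rd} rs1=x{rs1}")
```

Under those assumptions, the patch only changes the temporary (rs1) of the one-operand call form from t1 to ra; the link register and the tail entry are untouched.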

Rogier Brussee

Apr 25, 2017, 4:07:20 PM
to RISC-V ISA Dev, rogier....@gmail.com, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
That seems optimal!

Thanks Rogier


On Monday, 24 April 2017 at 09:02:00 UTC+2, krste wrote:

Andrew Waterman

Apr 26, 2017, 4:26:12 AM
to Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Yeah, there's nothing to be done for TAIL.

Can you PR this change against the riscv-binutils github repo? I'm
pretty sure it will "just work," so we can test it in conjunction with
other toolchain improvements sometime in May.

Michael Clark

Apr 26, 2017, 4:28:20 AM
to Andrew Waterman, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Hi Andrew,

No problem. Yes, it’s a simple change, but I’ll test compile the toolchain, compile something and do an objdump and then make a pull request…

Cheers,
Michael.

kr...@berkeley.edu

Apr 26, 2017, 5:09:43 AM
to Andrew Waterman, Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer

Though the TAIL case can still be fused with a single register write,
given that JALR writes x0.

Krste
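
A minimal sketch of the fusion check discussed here, working on raw 32-bit instruction words. The function names (fuse, enc_auipc, enc_jalr) are hypothetical; the opcodes (0x17 for AUIPC, 0x67 with funct3 = 0 for JALR) and field positions are from the base ISA. Address wrap-around is ignored for brevity.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def enc_auipc(rd, imm20):
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x17


def enc_jalr(rd, rs1, imm12):
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67


def fuse(pc, insn0, insn1):
    """Treat AUIPC+JALR at pc as one direct jump, or return None.

    Returns (target, writes); writes maps register number to the value
    that must still be committed after fusion.
    """
    is_auipc = (insn0 & 0x7F) == 0x17
    is_jalr = (insn1 & 0x7F) == 0x67 and ((insn1 >> 12) & 7) == 0
    tmp = (insn0 >> 7) & 0x1F
    rd = (insn1 >> 7) & 0x1F
    rs1 = (insn1 >> 15) & 0x1F
    if not (is_auipc and is_jalr and rs1 == tmp and tmp != 0):
        return None
    base = pc + (sext(insn0 >> 12, 20) << 12)
    target = (base + sext(insn1 >> 20, 12)) & ~1
    writes = {tmp: base}          # AUIPC side effect
    if rd != 0:
        writes[rd] = pc + 8       # link write; overwrites tmp when rd == tmp
    return target, writes


if __name__ == "__main__":
    pc = 0x10000
    print("old call:", fuse(pc, enc_auipc(6, 0x202), enc_jalr(1, 6, 0x344)))
    print("new call:", fuse(pc, enc_auipc(1, 0x202), enc_jalr(1, 1, 0x344)))
    print("tail    :", fuse(pc, enc_auipc(6, 0x202), enc_jalr(0, 6, 0x344)))
```

In this model the TAIL pair (rd = x0) leaves exactly one register write, the AUIPC result in t1, which is the single-write case described above. The new CALL pair also leaves one write, but of the link address.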

Rogier Brussee

Apr 26, 2017, 5:40:15 AM
to RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, rogier....@gmail.com, and...@sifive.com, jcb6...@gmail.com
There are now effectively two canonical, potentially hardware-optimised ways to do a call that can be replaced by jal without observable effects (if the immediate is small enough): with ra == x1, and with t0 == x5, as link register and temporary. The t0 version is supposed to be used for storing and restoring registers, but it seems more generally useful for guaranteed leaf calls (or, more generally, for calls with a calling convention where ra is callee-saved). I don't know how difficult it is to teach gcc "leaf_call" (or "call_ra_callee_saved") calls, but it seems like just another calling convention. They should also be useful for static leaf functions that do not leave file scope as function pointers; in any case, it seems like a useful self-documenting asm macro.

Perhaps call_absolute and leaf_call_absolute (with auipc replaced with lui) should also be canonicalised as macros?

Likewise, perhaps call_coroutine_ra and call_coroutine_t0 should also be canonicalised as macros (arguably with names that reflect that they trash t0 respectively ra)?


Rogier
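
The "replaced by jal without observable effects" condition can be sketched as a range check. The names fits_jal and select_call are made up for illustration, and this is only a model of the choice a relaxing linker could make, not the linker's actual code. JAL carries a 21-bit signed, even offset, so it reaches from -1 MiB to +1 MiB - 2 bytes.

```python
def fits_jal(pc, target):
    """True if target is reachable from pc with a single JAL."""
    offset = target - pc
    return offset % 2 == 0 and -(1 << 20) <= offset < (1 << 20)


def select_call(pc, target, link="ra"):
    """Return the instruction sequence for a call using `link` as both
    link register and temporary (ra == x1 or t0 == x5)."""
    if link not in ("ra", "t0"):
        raise ValueError("link register must be ra (x1) or t0 (x5)")
    if fits_jal(pc, target):
        return [f"jal {link}, {target - pc}"]
    return [f"1: auipc {link}, %pcrel_hi(symbol)",
            f"jalr {link}, %pcrel_lo(1b)({link})"]


if __name__ == "__main__":
    print(select_call(0x10000, 0x10040, "t0"))
    print(select_call(0x10000, 0x10000 + (1 << 24)))
```

Because the long form leaves only the return address in the link register, both forms have the same architectural side effect, apart from the return address itself differing by the size of the sequence.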

On Tuesday, 25 April 2017 at 02:16:42 UTC+2, michaeljclark wrote:

Michael Clark

Jun 18, 2017, 12:57:55 PM
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com

On 19 Jun 2017, at 1:09 AM, Rogier Brussee <rogier....@gmail.com> wrote:

> Recently, the CALL macro has been changed in the assembly and ELF spec. I wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR  ra, %pcrel_lo(1b)(t0)
>
> 1:
> AUIPC t1, %pcrel_hi(symbol)
> JALR  zero, %pcrel_lo(1b)(t1)
>
> to
>
> 1:
> AUIPC t1, %pcrel_hi(symbol)
> JALR  t1, %pcrel_lo(1b)(t1)
>
> We take t1 = x6 so as to leave call stacks alone. Sure, this sets t1, but who cares: if call stacks are left alone, that does no harm. It may even be useful for stack unwinding for exception handling and debugging to have a chance to know where a call came from even if it is a tail call. The main point is that hardware then needs to match only one pattern for call fusion, as in the absence of tail calls, the original tail-call == jump pattern would only be useful for very long jumps within a function, which should be rare enough not to be worth the trouble.

CALL no longer has a target-address side effect, but changing TAIL has no benefit: we can’t eliminate a side effect as we could for CALL, and in fact the change just introduces a different one.

CALL previously had two side effects (ra and t1), so one of them could be elided; TAIL has nothing to elide. In the fused case for TAIL, we would trade a write of the target address to t1 for a write of the link address to t1. In fact, for simple implementations the change introduces a redundant write (two writes to and one read from t1, instead of the current single write and single read). I don’t think we should change TAIL.

Michael
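
The counting argument can be written down as a toy model. The function name side_effects, the labels "auipc_result" and "link", and the simplification that a fused implementation commits only the values still visible after the pair are all assumptions for illustration, not a description of any real core.

```python
def side_effects(tmp, rd, fused):
    """Count register-file accesses for AUIPC tmp ; JALR rd, lo(tmp).

    Returns (writes, reads, final), where final maps each written
    register to the kind of value it holds after the pair has executed.
    """
    final = {tmp: "auipc_result"}     # AUIPC writes tmp
    writes, reads = 1, 1              # JALR then reads tmp
    if rd != "zero":
        final[rd] = "link"            # link write; may overwrite tmp
        writes += 1
    if fused:
        # Assumed fused behaviour: the AUIPC result is forwarded
        # internally, so only the surviving values are committed.
        writes, reads = len(final), 0
    return writes, reads, final


if __name__ == "__main__":
    cases = [("current TAIL", "t1", "zero"), ("proposed TAIL", "t1", "t1"),
             ("old CALL", "t1", "ra"), ("new CALL", "ra", "ra")]
    for name, tmp, rd in cases:
        print(name, side_effects(tmp, rd, False), side_effects(tmp, rd, True))
```

In this model the proposed TAIL costs two writes instead of one when unfused and still one write when fused, so it only swaps which value lands in t1. CALL, by contrast, drops from two committed writes to one.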

