Interpreting fused AUIPC+JALR as a direct jump (with or without register side effect)


Michael Clark

Mar 2, 2017, 10:55:49 PM
to RISC-V ISA Dev
Just sharing thoughts on fused AUIPC+JALR being considered a direct jump and link…

In static analysis of large binaries, many library calls take the form of AUIPC+JALR rather than JAL (which has a ±1 MiB range). It is common for modern applications to have 10 MB to 100 MB of text, with most function calls expressed as AUIPC+JALR.

$ stat -f %z "/Applications/Google Chrome.app//Contents/Versions/56.0.2924.87/Google Chrome Framework.framework/Google Chrome Framework"
112012064

JALR is as we know a register indirect jump and link return address instruction, and interestingly register indirect calls and returns are particularly hard for dynamic binary translation (my specific interest). A translator typically needs to inject a stub at the translation point that looks up the address of the translation for the ‘dynamic’ target address, and as the target address is not known at the time of translation, a translator can’t always translate past indirect jumps (and obviously return). There are some interesting techniques in the literature, such that the inserted stubs can learn a static address and later rewrite the indirect jump as a direct jump (for the indirect call case, but obviously not return).

Indirect jumps are also likely harder for microarchitectures due to requiring a register read to decode the target address for instruction prefetch. i.e. there may be a higher latency to resolve the jump target address further down the pipeline versus decoding it early as an immediate.

While JALR is technically a register indirect jump, the fused adjacent combination of AUIPC+JALR can be seen as a direct PC relative jump and link with load target address (as a side effect) and on the contrary can be efficiently translated, or in a microarchitecture, the jump target instruction address prefetch can be started before register commit (of the side effect).
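
To make the "direct PC-relative jump" reading concrete, here is a minimal Python sketch of the arithmetic. It assumes the usual hi20/lo12 split of a PC-relative offset; the function names are hypothetical, and the numbers come from the strcmp listing in this message.

```python
def split_offset(offset):
    """Split a signed PC-relative offset into (hi20, lo12) for AUIPC+JALR."""
    hi20 = (offset + 0x800) >> 12      # round so that lo12 fits in [-2048, 2047]
    lo12 = offset - (hi20 << 12)
    return hi20, lo12


def fused_target(pc, hi20, lo12):
    """Target of `auipc rd, hi20` at pc followed by `jalr ra, rd, lo12`."""
    return (pc + (hi20 << 12) + lo12) & ~1   # JALR clears bit 0 of the target


# auipc t1, pc + 1576960 ; jalr ra, t1, -164   (1576960 == 385 << 12)
print(split_offset(1576960 - 164))
```

Both immediates are available at decode time, so the target depends only on the pc of the AUIPC and never on a register read.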

The observation (a thought experiment) is that one of these AUIPC+JALR pairs can later be split, and the JALR can potentially be used as a ROP gadget: given enough diversity of offsets, one might be able to get a return address onto the stack pointing to an adjacent function, given a known value for the temporary (the t1 temporary from the last indirect call in the code being exploited). From a binary translation perspective, the trace for the basic block at the split entry point would not exist; it would need to be re-translated starting at the JALR, which would then be treated as an unfused indirect JALR and require a runtime translation stub.

auipc t1, pc + 1589248
1:
jalr ra, t1, 324 # <memset>



auipc t1, pc + 1576960
jalr ra, t1, -164 # <strcmp>
ret



jal x0, 1b # or ra value restored from xyz(sp)

I am just mentally questioning the safety of treating the AUIPC+JALR pair as a direct jump (with register side-effect) instead of as an indirect call.

With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.
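
As a rough illustration of decoding the pair in one step, the following Python sketch inspects two adjacent 32-bit instruction words and, if they form an AUIPC+JALR pair through the same register, returns the jump target. The field positions follow the base ISA encodings; the helper names (`encode_auipc`, `encode_jalr`, `fused_call_target`) are made up for this example.

```python
OP_AUIPC, OP_JALR = 0x17, 0x67


def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def encode_auipc(rd, imm20):
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | OP_AUIPC


def encode_jalr(rd, rs1, imm12):
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | OP_JALR


def fused_call_target(pc, insn0, insn1):
    """Return the direct target of an adjacent AUIPC+JALR pair, else None."""
    if insn0 & 0x7F != OP_AUIPC or insn1 & 0x7F != OP_JALR:
        return None
    if (insn1 >> 12) & 0x7 != 0:            # JALR has funct3 == 0
        return None
    auipc_rd = (insn0 >> 7) & 0x1F
    jalr_rs1 = (insn1 >> 15) & 0x1F
    if auipc_rd == 0 or auipc_rd != jalr_rs1:
        return None                          # JALR does not consume the AUIPC result
    hi = sext(insn0 >> 12, 20) << 12
    lo = sext(insn1 >> 20, 12)
    return (pc + hi + lo) & ~1               # JALR clears bit 0 of the target
```

The sketch only computes the target; the write of the temporary (rd of the AUIPC) still has to happen separately.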

Given the register side effect is redundant in a fused decode implementation it leads to the possibility of an extension like this:

auipc zero, pc + 1576960
jalr ra, zero, -164 # <strcmp>

AUIPC with rd=zero is a nop. I’m not suggesting this is a good idea; it just came to mind when considering the register side effect redundant with the fused variant. i.e. we just need to decode the immediate over two instructions.

Michael

Sober Liu

Mar 2, 2017, 11:12:52 PM
to Michael Clark, RISC-V ISA Dev
I am not sure I fully get your idea. Do you expect both code and data to be within a 32-bit range?
And for "a direct PC relative jump", do you expect static libs rather than dynamic libs?
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/367A58E1-7532-409F-AB6B-5A762411BC2E%40mac.com.


Michael Clark

Mar 2, 2017, 11:15:55 PM
to Sober Liu, RISC-V ISA Dev
Two meanings for direct. Direct relative vs absolute indirect.

Michael Clark

Mar 2, 2017, 11:18:57 PM
to RISC-V ISA Dev
Somewhat off-topic: my notes on possible translations for AUIPC+JALR are here:

https://github.com/michaeljclark/riscv-meta/blob/master/doc/src/jumps.md

I am discovering that there are many potential ways to translate AUIPC+JALR. RET (jalr zero, ra) may need a hidden stack that contains pair<ra,translated_ra>, with a fallback hash table lookup for misses, and indirect JALR for function pointers will need hash table lookups. Truly indirect calls are likely to cost dozens of cycles. RET (jalr zero, ra) can be optimised assuming a normal call stack is being used.
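
The hidden-stack idea could look roughly like the Python sketch below. This is not the riscv-meta implementation; the class and method names are hypothetical, and the translation table is modelled as a plain dict from guest address to translated address.

```python
class ReturnCache:
    """Hidden stack of (guest_ra, translated_ra) pairs with a hash-table fallback."""

    def __init__(self, translations):
        self.translations = translations   # guest address -> translated address
        self.stack = []

    def on_call(self, guest_ra):
        # Push the pair at a translated call site; the return point may not
        # have been translated yet, in which case the second element is None.
        self.stack.append((guest_ra, self.translations.get(guest_ra)))

    def on_ret(self, guest_ra):
        # Fast path: the top of the hidden stack matches the guest's ra.
        if self.stack and self.stack[-1][0] == guest_ra:
            translated = self.stack.pop()[1]
            if translated is not None:
                return translated
        # Slow path (miss): fall back to the hash table lookup.
        # A result of None means the target still has to be translated.
        return self.translations.get(guest_ra)
```

The fast path relies on calls and returns nesting normally; anything else (longjmp, hand-written assembly) simply falls through to the lookup.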

Michael Clark

Mar 2, 2017, 11:20:16 PM
to Sober Liu, RISC-V ISA Dev

> On 3 Mar 2017, at 5:12 PM, Sober Liu <sob...@nvidia.com> wrote:
>
> I am not sure I fully get your idea. Do you expect both code and data to be within a 32-bit range?

Yes. A signed 32-bit offset (roughly ±2 GiB), as per the AUIPC+JALR pair.

> And for "a direct PC relative jump", do you expect static libs rather than dynamic libs?

Yes. I am thinking about the static case. I need to analyse GOT offset calls is dynamic libs. Next…

Michael Clark

Mar 2, 2017, 11:39:54 PM
to Sober Liu, RISC-V ISA Dev
Sorry I meant in dynamic libs. I am presently analysing vmlinux.

I will have to think about PLT stubs to GOT offsets and lazy resolution:

               1aec0:   00018e17             auipc          t3, pc + 98304
               1aec4:   880e3e03             ld             t3, -1920(t3)       # 0x0000000000032740
               1aec8:   000e0367             jalr           t1, t3, 0

We can assume a dynamic linker doesn't unlink a GOT entry once we have traced AUIPC+LD+JALR a few times (after the resolver has populated the GOT entry), which avoids a hash table lookup in translated code. However, it would likely be possible to write a test case that changes a GOT entry and demonstrates that the processor is not a real RISC-V but one that makes assumptions about jump targets. A translator should be able to pass such tests. This would be an interesting and tricky test.
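
For reference, recognising the stub and finding the GOT slot it reads is straightforward. The Python sketch below decodes the three words from the listing above; the function name is hypothetical and only this exact AUIPC+LD+JALR shape through one register is handled.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def plt_got_slot(pc, auipc, ld, jalr):
    """Return the GOT slot address read by an AUIPC+LD+JALR stub, or None."""
    if (auipc & 0x7F, ld & 0x7F, jalr & 0x7F) != (0x17, 0x03, 0x67):
        return None
    if (ld >> 12) & 0x7 != 0x3:              # funct3 == 3 selects LD (64-bit load)
        return None
    reg = (auipc >> 7) & 0x1F
    # The same register must flow AUIPC -> LD (base and destination) -> JALR.
    if reg != (ld >> 15) & 0x1F or reg != (ld >> 7) & 0x1F:
        return None
    if reg != (jalr >> 15) & 0x1F:
        return None
    return pc + (sext(auipc >> 12, 20) << 12) + sext(ld >> 20, 12)


# Words from the listing: auipc t3 / ld t3, -1920(t3) / jalr t1, t3, 0
print(hex(plt_got_slot(0x1aec0, 0x00018e17, 0x880e3e03, 0x000e0367)))  # 0x32740
```

The result matches the address in the disassembler comment, so a translator can know which GOT slot a given stub depends on at translation time.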

Michael Clark

Mar 3, 2017, 12:04:20 AM
to RISC-V ISA Dev
We would need to fault on writes to the GOT to make a translator pass tests, i.e. watch writes to pages containing function pointers referenced in translated code and recognise the pattern (AUIPC+LD+JALR). Quite complex to translate shared library calls efficiently. Sorry, diverging now.
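
A very rough Python sketch of the bookkeeping this implies: remember which translated blocks baked in a GOT entry, keyed by page, and hand them back for invalidation when a write to that page faults. The 4 KiB page size, the class name, and the block identifiers are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class GotWatch:
    """Track which translated blocks depend on GOT slots, by page."""

    def __init__(self):
        self.blocks_by_page = {}

    def watch(self, got_slot, block_id):
        # Record that block_id inlined the target currently stored at got_slot.
        page = got_slot // PAGE_SIZE
        self.blocks_by_page.setdefault(page, set()).add(block_id)

    def on_write_fault(self, addr):
        # A store hit a watched page: return the blocks to invalidate
        # and stop watching the page until they are re-translated.
        return self.blocks_by_page.pop(addr // PAGE_SIZE, set())
```

Watching at page granularity over-invalidates when unrelated data shares the page, which is part of why this gets complex.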

Jacob Bachmeyer

Mar 3, 2017, 12:33:16 AM
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
> With parallel instruction decode, the immediate for the AUIPC+JALR jump could be decoded in one step, however the address temporary still needs to be committed to the register file for consistency.
>
> Given the register side effect is redundant in a fused decode implementation it leads to the possibility of an extension like this:
>
> auipc zero, pc + 1576960
> jalr ra, zero, -164 # <strcmp>
>
> AUIPC with rd=zero is a nop. I’m not suggesting this is a good idea; it just came to mind when considering the register side effect redundant with the fused variant. i.e. we just need to decode the immediate over two instructions.

No extension needed and perfectly consistent:

AUIPC ra, 1576960
JALR ra, ra, -164 # <strcmp>


In fact, this is the *only* way for AUIPC+JALR as a function call to be
a valid fusion pair. The example you give is a no-op followed by an
absolute jump of the type originally envisioned as an SBI call.

Remember: macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.
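
Stated as a check, under the assumption of at most one register-file write per macro-op, the rule might be sketched in Python as follows (register numbers are the usual x-register indices; the function name is hypothetical):

```python
def is_single_write_fusion(auipc_rd, jalr_rd, jalr_rs1):
    """True if AUIPC+JALR can retire as one macro-op with a single register write."""
    consumes = auipc_rd != 0 and jalr_rs1 == auipc_rd   # JALR reads the AUIPC result
    clobbers = jalr_rd == auipc_rd                      # the link write overwrites it
    return consumes and clobbers


RA, T1 = 1, 6
print(is_single_write_fusion(RA, RA, RA))   # auipc ra / jalr ra, ra
print(is_single_write_fusion(T1, RA, T1))   # auipc t1 / jalr ra, t1
```

The first call prints True and the second prints False: with t1 as the temporary, both t1 and ra have to be written.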


-- Jacob

Michael Clark

Mar 3, 2017, 12:36:10 AM
to jcb6...@gmail.com, RISC-V ISA Dev
Yes. Good idea. I like your version.

zero was just what immediately came to mind as a no side effect version. ra is perfect.

Michael Clark

Mar 3, 2017, 12:37:46 AM
to jcb6...@gmail.com, RISC-V ISA Dev
In fact it’s such a good idea that CALL should emit it. The pseudo is hard-coded to use `t1`.

Michael Clark

Mar 3, 2017, 1:11:16 AM
to jcb6...@gmail.com, RISC-V ISA Dev
Jacob Bachmeyer wrote:
> Remember: macro-op fusion requires that all side-effects of earlier instructions be clobbered by later instructions in the fusion group.

I had been thinking about this liveness constraint earlier. 

CALL is a non-invasive change as nothing should depend on t1 for the no argument version. The 2 argument rd version of CALL would need to use rd as the temporary and the change would be more invasive. gcc emits the no argument version by default. TAIL can’t really be changed as we’d be clobbering ra. CALL is the common case.

Something like this in riscv-binutils-gdb:

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },

Allen J. Baum

Mar 3, 2017, 2:31:00 AM
to Michael Clark, RISC-V ISA Dev
OK, I'm feeling dense.
I don't understand the statement:
macro-op fusion requires that all side-effects of earlier
instructions be clobbered by later instructions in the fusion group.

Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement - but it could be an implementation requirement (since it saves register file ports - but perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could make a requirement that the first op of a fused pair can never cause a trap or exception; that would solve part of the problem, or you could modify the statement to be specific about register side effects.

Regarding the zero case:
Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
It's possible that sometimes ops will be fused - and other times the same ops won't be fused (or that some pair of ops will be fused while other pairs of the same ops won't be). An obvious case is when the pair crosses a cache or page boundary - in which case using zero is broken.

Am I missing something?
--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Andrew Waterman

Mar 3, 2017, 3:10:04 AM
to Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
We don't use rs1=ra here because that serves as a hint to pop the
return-address stack for some implementations.

Many instruction fusion opportunities will involve writing multiple
registers. For example, to reduce latency you'd also want to fuse
things like

lui t0, sym
ld t1, offset(t0)

which shows up in cases that t0 is later reused.

Superscalars typically over-provision write ports, so this is a matter
of control complexity, not extra datapath.

Bruce Hoult

Mar 3, 2017, 3:47:50 AM
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
From the 2.1 spec:

"Return-address prediction stacks are a common feature of high-performance instruction-fetch units. We note that rd and rs1 can be used to guide an implementation’s instruction-fetch prediction logic, indicating whether JALR instructions should push (rd=x1), pop (rd=x0, rs1=x1), or not touch (otherwise) a return-address stack."

Here, rs1 is ra, but rd is not zero, so return/pop return buffer should not be assumed.

On the contrary, rd *is* ra, so call/push return buffer should be matched.

Seems ok to me, and in fact a very good idea.
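
The reading above can be checked mechanically. A minimal Python sketch, assuming the quoted v2.1 sentence is the whole rule (the function name is hypothetical):

```python
def ras_hint_v2_1(rd, rs1):
    """Return-address stack hint for JALR as worded in the v2.1 commentary."""
    if rd == 1:                  # rd = x1 (ra)
        return "push"
    if rd == 0 and rs1 == 1:     # rd = x0, rs1 = x1
        return "pop"
    return "none"


print(ras_hint_v2_1(rd=1, rs1=1))   # jalr ra, ra, imm
print(ras_hint_v2_1(rd=0, rs1=1))   # ret
```

Under this wording `jalr ra, ra, imm` is classified as a push only, and a plain `ret` as a pop.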



Andrew Waterman

Mar 3, 2017, 4:05:52 AM
to Bruce Hoult, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
My comment is consistent with the current spec's commentary: "push
(rd=x1/x5), pop (rs1=x1/x5), or not touch (otherwise)"

Both pushing and popping the RAS is useful for coroutines. Alpha has
a similar hint on its JSR instruction.

Bruce Hoult

Mar 3, 2017, 4:19:03 AM
to Andrew Waterman, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
2.1 is the most recent spec on riscv.org. It's not the current spec?




Michael Clark

Mar 3, 2017, 7:10:04 AM
to Andrew Waterman, Jacob Bachmeyer, RISC-V ISA Dev

> On 3/03/2017, at 9:09 PM, Andrew Waterman <and...@sifive.com> wrote:
>
> We don't use rs1=ra here because that serves as a hint to pop the
> return-address stack for some implementations.

I see. The stack hint is useful for handling fast indirect return. It would work if the pop hint was rs1=ra and rd=zero (assuming tail still uses rs1=t1).

> Many instruction fusion opportunities will involve writing multiple
> registers. For example, to reduce latency you'd also want to fuse
> things like
>
> lui t0, sym
> ld t1, offset(t0)
>
> which shows up in cases that t0 is later reused.

Yes, we can avoid loading t0 if the register is killed shortly after this expression. I will check again, but I don't remember seeing the absolute load pattern (well, not in vmlinux; I saw it only for constants).

This is the form that we can safely fuse into a single load without looking too far ahead:

lui t0, sym
ld t0, offset(t0)

Although I do remember seeing the lui pattern where the register is reused, that code is not PIC or PIE; of course, there is also the equivalent auipc pattern for GOT references.

More complex fuse patterns might need an 8 to 12 instruction window. The problem is registers that are possibly not killed until after branches and jumps. The register has to not be live to avoid the partially formed address being committed (when it's a temporary for forming an address for a load / store).
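
The liveness test described here can be sketched as a bounded forward scan. In the Python below each following instruction is modelled as a hypothetical `(dest, sources, is_control_flow)` tuple, and the default window of 8 is just the low end of the range mentioned above.

```python
def dead_after(reg, following, window=8):
    """True if `reg` is provably overwritten before any read within the window."""
    for dest, sources, is_control_flow in following[:window]:
        if reg in sources:
            return False         # the value is read, so it is still live
        if dest == reg:
            return True          # killed: overwritten before any read
        if is_control_flow:
            return False         # cannot see past a branch or jump
    return False                 # window exhausted; conservatively assume live


# After `lui t0, sym ; ld t1, offset(t0)`: is t0 dead?
code = [("t2", ("t1",), False), ("t0", ("a0",), False)]
print(dead_after("t0", code))
```

Only when the scan returns True can the write of the partially formed address be dropped; in every other case the temporary has to be committed.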

Jacob Bachmeyer

Mar 3, 2017, 5:23:04 PM
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
Allen J. Baum wrote:
> OK, I'm feeling dense.
> I don't understand the statement:
> macro-op fusion requires that all side-effects of earlier
> instructions be clobbered by later instructions in the fusion group.
>
> Why? The side effect of not modifying a register in a fused pair isn't an architectural requirement - but it could be an implementation requirement (since it saves register file ports - but perhaps not always). There are other side effects (exceptions) that must not be "clobbered". I suppose you could make a requirement that the first op of a fused pair can never cause a trap or exception; that would solve part of the problem, or you could modify the statement to be specific about register side effects.

I was assuming that RISC-V implementations would do at most one regfile
write per macro-op. Under this constraint, my statement is correct, but
it has been mentioned that implementations complex enough to fuse
AUIPC+JALR are probably complex enough to have multiple regfile write
ports. Another requirement of macro-op fusion is that the result of a
fusion group must be identical to the same instructions executed
individually. Exceptions at the first instruction are easy, just trap
at the entire group, but exceptions on subsequent instructions get
complicated. I believe that speculative execution has been offered as a
solution to this.

> Regarding the zero case:
> Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
> It's possible that sometimes ops will be fused - and other times the same ops won't be fused (or that some pair of ops will be fused while other pairs of the same ops won't be). An obvious case is when the pair crosses a cache or page boundary - in which case using zero is broken.
>

Using zero in AUIPC+JALR is broken in any case, since the result of
macro-op fusion must be identical to the result of executing the
individual fused instructions in sequence.

> Am I missing something?

I do not believe so, but I clearly missed that there are side effects
other than regfile writes.


-- Jacob

Michael Clark

Mar 3, 2017, 5:35:47 PM
to Andrew Waterman, Bruce Hoult, Jacob Bachmeyer, RISC-V ISA Dev
It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess making the call stack hint for push and pop simple, simplifies the call stack implementation.

Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.

For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I will need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument. I would use the potential for micro-architectural optimisation of inter-module calls in large executables.

The trade-off, I guess, is between the simplicity of the pop constraint and whether or not the register write elision on inter-module calls is worth it.


Jacob Bachmeyer

Mar 3, 2017, 6:35:42 PM
to Michael Clark, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev
Michael Clark wrote:
> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess
> making the call stack hint for push and pop simple, simplifies the
> call stack implementation.

Andrew Waterman hinted at a more-important reason: coroutines need to
both push *and* pop the return address stack on the same instruction.

> Jacob’s version allows a register write elision and potentially
> decoding the CALL target immediate in early decode.

It does, but it also breaks the return-stack hints on JALR for
coroutines (which are not mentioned in spec v2.1).

> For me, from a binary translation perspective, it lets me elide a
> redundant mov of the target address that I will need to populate into
> a temporary. One less instruction. Although I wouldn’t use binary
> translation as an argument. I would use the potential for
> micro-architectural optimisation of inter-module calls in large
> executables.
>
> The trade-off, I guess, is between the simplicity of the pop constraint
> and whether or not the register write elision on inter-module calls is
> worth it.

Another factor is support for coroutines--the push constraint and pop
constraint must be compatible in that case. The return-address stack
hints in v2.1 do not meet this criteria and also do not mention using x5
as an alternate link register for millicode.

-- Jacob

Michael Clark

Mar 3, 2017, 7:29:42 PM
to jcb6...@gmail.com, Andrew Waterman, Bruce Hoult, RISC-V ISA Dev

> On 4 Mar 2017, at 12:35 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> It’s unfortunate the pop constraint doesn’t also contain rd=0. I guess making the call stack hint for push and pop simple, simplifies the call stack implementation.
>
> Andrew Waterman hinted at a more-important reason: coroutines need to both push *and* pop the return address stack on the same instruction.

Okay I understand now.

>> Jacob’s version allows a register write elision and potentially decoding the CALL target immediate in early decode.
>
> It does, but it also breaks the return-stack hints on JALR for coroutines (which are not mentioned in spec v2.1).

Interesting trade off.

>> For me, from a binary translation perspective, it lets me elide a redundant mov of the target address that I will need to populate into a temporary. One less instruction. Although I wouldn’t use binary translation as an argument. I would use the potential for micro-architectural optimisation of inter-module calls in large executables.
>>
>> The trade-off, I guess, is between the simplicity of the pop constraint and whether or not the register write elision on inter-module calls is worth it.
>
> Another factor is support for coroutines--the push constraint and pop constraint must be compatible in that case. The return-address stack hints in v2.1 do not meet this criteria and also do not mention using x5 as an alternate link register for millicode.

The call stack and coroutine hints will also be useful for binary translators, for fast coroutines and procedure returns.

So we can’t elide the AUIPC+JALR temporary unless we trace past a CALL and see it used.

The microarchitecture can still decode the target address before register decoding if it looks at the opcode and immediate of the two adjacent instructions, it just needs to write the jump target register (optimally once).

Michael.

Rogier Brussee

Mar 8, 2017, 4:35:05 PM
to RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
For all intents and purposes 

auipc ra imm20
jalr ra ra imm11*

is indistinguishable from "jal ra imm20imm11", except that such long immediates can't be encoded (the last bit in the immediate of jalr is ignored), and it seems entirely reasonable to handle them the same way internally, i.e. after fusion. It should be by far the most common case (in fact, in Xcondensed I envisioned jalr_ra_ra imm11 as a 2-byte Xcondensed instruction). The normal return is jalr zero ra 0 and, FWIW, for many implementations that instruction is a compressed (C.jr ra) instruction.

In fact the coroutine case seems not the case to optimise for, and surely if simultaneous  popping and pushing the call stack really is important one can use a different calling convention for them and do

auipc x5 imm20
jalr x5 x5 imm11*

I know this is swearing in church, but x5 was added as a register that may manipulate the call stack only in v2.1, if at all (if I understood correctly, for implementing register save and restore routines that would be called with jalr x5 zero imm), and this is a quality-of-implementation issue. Can't the spec simply be updated to insist on jalr x0 ra 0 (aka C.jr ra) or jalr rd x5 imm to pop the return stack (so jalr x5 x5 imm would push and pop the return stack)?

Rogier

On Saturday, 4 March 2017 at 00:35:42 UTC+1, Jacob Bachmeyer wrote:

kr...@berkeley.edu

Apr 24, 2017, 3:02:00 AM
to Rogier Brussee, RISC-V ISA Dev, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com

We agree that the fusion case is more important to optimize for than
the coroutine case, but we do want to support both and to allow fused
calls with one reg write port when using the alternate link register
(it was added to call register save code for -Os case, so will be
frequently used in that code)

The new proposal is that push+pop is hinted only when rs1!=rd (and
rs1=x1/x5,rd=x1/x5), so

auipc ra, imm20; jalr ra, ra, imm11

can be fused with either x1 or x5, writes only a single value, and
only pushes the RAS.

Coroutines would have to use

jalr x1, x5, imm11
or
jalr x5, x1, imm11

to hint push+pop.

New text:
"JALR instructions should:
push only (rd=x1/x5, rs1!=x1/x5 or rs1=rd)
pop only (rs1=x1/x5, rd!=x1/x5),
push and pop (rd=x1/x5, rs1=x1/x5, and rs1!=rd)
or not touch (otherwise) a return-address stack."

Truth table

rd        rs1       rd==rs1   action
!x1/x5    !x1/x5    X         nothing
x1/x5     !x1/x5    X         push
x1/x5     x1/x5     0         push+pop
x1/x5     x1/x5     1         push
!x1/x5    x1/x5     X         pop
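
The truth table translates directly into a small classifier. The Python sketch below is only an illustration of the proposed text; the function name and the returned strings are made up.

```python
LINK_REGS = {1, 5}   # x1 (ra) and x5 (t0, the alternate link register)


def ras_hint(rd, rs1):
    """Return-address stack action for JALR under the proposed rule."""
    rd_link, rs1_link = rd in LINK_REGS, rs1 in LINK_REGS
    if rd_link and rs1_link:
        return "push" if rd == rs1 else "push+pop"
    if rd_link:
        return "push"
    if rs1_link:
        return "pop"
    return "nothing"


print(ras_hint(1, 1))   # auipc ra ; jalr ra, ra, imm11
print(ras_hint(1, 5))   # coroutine switch: jalr x1, x5, imm11
```

The fused call with rd == rs1 comes out as a plain push, while the two coroutine forms with rd != rs1 are the only ones that both push and pop.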

Krste

Bruce Hoult

Apr 24, 2017, 8:44:29 AM
to Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Andrew Waterman, Jacob Bachmeyer
Perfect! Thank you. +1



Michael Clark

Apr 24, 2017, 8:16:42 PM
to Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Andrew Waterman, Jacob Bachmeyer
Excellent!

We’ll need to test a change to the gcc CALL macro to emit the following to take full advantage of the change as per Jacob’s original suggestion:

1: AUIPC ra, %pcrel_hi(symbol)
JALR ra, %pcrel_lo(1b)(ra)

TAIL uses zero as the link register so it can’t have its address register write elided.
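
For reference, the `%pcrel_hi`/`%pcrel_lo` pair splits a 32-bit PC-relative offset so that AUIPC's upper 20 bits plus JALR's sign-extended 12 bits add back up to the original offset. A minimal C sketch of that arithmetic follows; the helper names `pcrel_lo12` and `pcrel_hi20` are hypothetical (they are not binutils functions), and offsets are assumed to lie within the range the pair can reach.

```c
#include <stdint.h>

/* Low part: the bottom 12 bits of the offset, sign-extended as JALR does. */
int32_t pcrel_lo12(int32_t offset)
{
    uint32_t low = (uint32_t)offset & 0xfffu;
    return (int32_t)(low ^ 0x800u) - 0x800;
}

/*
 * High part: whatever remains after removing the sign-extended low part,
 * expressed in units of 4096 (the value AUIPC shifts left by 12).
 */
int32_t pcrel_hi20(int32_t offset)
{
    int64_t upper = (int64_t)offset - pcrel_lo12(offset);
    return (int32_t)(upper / 4096);
}
```

For an offset of 0x1800 this gives a high part of 2 and a low part of -2048, because JALR would sign-extend 0x800 to a negative value.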

Something like this in binutils (needs testing):

mclark@minty:~/src/riscv-gnu-toolchain/riscv-binutils-gdb$ git diff

diff --git a/opcodes/riscv-opc.c b/opcodes/riscv-opc.c
index cc39390ec8..0c87b735dd 100644
--- a/opcodes/riscv-opc.c
+++ b/opcodes/riscv-opc.c
@@ -147,7 +147,7 @@ const struct riscv_opcode riscv_opcodes[] =
 {"jal",       "32C", "Ca",  MATCH_C_JAL, MASK_C_JAL, match_opcode, INSN_ALIAS },
 {"jal",       "I",   "a",  MATCH_JAL | (X_RA << OP_SH_RD), MASK_JAL | MASK_RD, match_opcode, INSN_ALIAS },
 {"call",      "I",   "d,c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
-{"call",      "I",   "c", (X_T1 << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
+{"call",      "I",   "c", (X_RA << OP_SH_RS1) | (X_RA << OP_SH_RD), (int) M_CALL,  match_never, INSN_MACRO },
 {"tail",      "I",   "c", (X_T1 << OP_SH_RS1), (int) M_CALL,  match_never, INSN_MACRO },
 {"jump",      "I",   "c,s", 0, (int) M_CALL,  match_never, INSN_MACRO },
 {"nop",       "C",   "",  MATCH_C_ADDI, 0xffff, match_opcode, INSN_ALIAS },

Rogier Brussee

unread,
Apr 25, 2017, 4:07:20 PM4/25/17
to RISC-V ISA Dev, rogier....@gmail.com, michae...@mac.com, and...@sifive.com, br...@hoult.org, jcb6...@gmail.com
That seems optimal!

Thanks Rogier


On Monday 24 April 2017 09:02:00 UTC+2, krste wrote:

Andrew Waterman

unread,
Apr 26, 2017, 4:26:12 AM4/26/17
to Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Yeah, there's nothing to be done for TAIL.

Can you PR this change against the riscv-binutils github repo? I'm
pretty sure it will "just work," so we can test it in conjunction with
other toolchain improvements sometime in May.

Michael Clark

unread,
Apr 26, 2017, 4:28:20 AM4/26/17
to Andrew Waterman, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer
Hi Andrew,

No problem. Yes, it’s a simple change, but I’ll test compile the toolchain, compile something and do an objdump and then make a pull request…

Cheers,
Michael.

kr...@berkeley.edu

unread,
Apr 26, 2017, 5:09:43 AM4/26/17
to Andrew Waterman, Michael Clark, Bruce Hoult, Krste Asanovic, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer

Though the TAIL case can still be fused with a single register write,
given that JALR writes x0.

Krste

Rogier Brussee

unread,
Apr 26, 2017, 5:40:15 AM4/26/17
to RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, rogier....@gmail.com, and...@sifive.com, jcb6...@gmail.com
There are now effectively two canonical, potentially hardware-optimised ways to do a call that can be replaced by jal without observable effects (if the immediate is small enough): with ra == x1, and with t0 == x5, as both link register and temporary. The t0 version is supposed to be used for storing and restoring registers, but it seems more generally useful for guaranteed leaf calls (or, more generally, for calls with a calling convention where ra is callee-saved). I don't know how difficult it is to teach gcc "leaf_call" (or "call_ra_callee_saved") calls, but it seems like just another calling convention. They should also be useful for static leaf functions that do not leave file scope as function pointers; in any case, it seems like a useful self-documenting asm macro.

Perhaps call_absolute and leaf_call_absolute (with auipc replaced with lui) should also be canonicalised as macros?

Likewise, perhaps call_coroutine_ra and call_coroutine_t0 should also be canonicalised as macros (arguably with names that reflect that they trash t0 respectively ra)?


Rogier

On Tuesday 25 April 2017 02:16:42 UTC+2, michaeljclark wrote:

Michael Clark

unread,
Jun 18, 2017, 12:57:55 PM6/18/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com

On 19 Jun 2017, at 1:09 AM, Rogier Brussee <rogier....@gmail.com> wrote:

> Recently, the CALL macro has been changed in the assembly and ELF spec. I wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR  ra, t0 %pcrel_lo(1b)
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR  zero, t1 %pcrel_lo(1b)
>
> to
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR t1, t1 %pcrel_lo(1b)
>
> We take t1 = x6 so as to leave callstacks alone. Sure this sets t1, but who cares: if callstacks are left alone that does no harm. It may even be useful for stack unwinding for exception handling and debugging to have a chance to know where a call came from even if it is a tail call. The main point is that hardware needs only match one pattern for call fusion, as in the absence of tail calls, the original tail-call == jump pattern would only be useful for very very long jumps in a function, which should be rare enough not to be worth the trouble.

CALL no longer has a target address side effect, but changing TAIL has no benefit as we can’t eliminate one side effect like we could for CALL, and in fact it just introduces a different side effect.

There is nothing that can be elided like there is with CALL which previously had two side effects (ra and t1). In the fusion case for TAIL, we trade a write of t1 with the target address with a write of the link address. In fact for simple implementations it introduces a redundant write (two writes and one read from t1 instead of the current case which is a single write and single read to/from t1). I don’t think we should change TAIL.

Michael



Rogier Brussee

unread,
Jun 18, 2017, 1:39:39 PM6/18/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com
I thought I had deleted this message and put it in the thread "CALL macro in the assembler cookbook and ELF spec", explaining the slightly non-obvious rationale better. Here is the gist:


The main point of the change would be that hardware would only need to match _one_ pattern for a macro-op fuse

AUIPC rd Imm1       # with rd != zero
JALR rd  rd imm2 

mapping to the _standard_ JAL behaviour (except for the longer immediate, but including the side effect on callstacks), which covers _both_ the CALL and the TAIL use cases.

It would make a macro-op fuse of

AUIPC rd Imm1       # with rd != zero
JALR zero  rd imm2 

which maps to a strange _new_ "jump and 'link' the upper part of the destination address" operation, superfluous, as the only remaining use case of the pattern would be very long (±2GB) jumps within a subroutine.

I recognise that in the non-fused case it requires two writes and could be a trifle more expensive.

Ciao
Rogier

On Sunday 18 June 2017 18:57:55 UTC+2, michaeljclark wrote:

Michael Clark

unread,
Jun 18, 2017, 1:42:49 PM6/18/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, and...@sifive.com, jcb6...@gmail.com
Sorry to hijack this thread, but it prompted me to properly test the CALL macro register elision case in the binary translator I am working on <https://rv8.io/> that actually implements macro-op fusion today. I’ve recently enabled the macro-op fusion code as it is now stable. I need to write about it because the current description belies the translator’s current capabilities, which are now 2.4X QEMU performance (or over 4X QEMU ARM).

I have translations for 3 types of macro-op fusion below, including the register write elision that is made possible by the changes to the CALL macro. It is somewhat interesting as this is technically a working (albeit software) implementation of macro-op fusion.

The first two cases are JALR (indirect jumps) that due to macro-op fusion are able to be translated as direct jumps (the original plan as is the subject of this thread), or in the case of this hot path translator, are in fact inline cached.

The third case calculates an address separately before the JALR and the translator is not yet smart enough to optimise this case and it is treated as an indirect jump which requires a jump target cache lookup to find the translated address (also accelerated). In the case of a GOT load for something in the PLT, acceleration would require a comparison against the learned function address, to find the translated code address, e.g. if we wanted to accelerate shared library calls. It would be possible but quite some work.

In any case when I started this thread I had not yet started the translator and was just thinking of possible optimisations. Now the translator is stable on quite complex codes and macro-op fusion works. Most of the deficiencies are now due to lack of syscall coverage in the user-mode translator.
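
To make the first two cases concrete, here is a rough C sketch of how a translator might recognise an adjacent AUIPC+JALR pair and compute the direct target. It is not the rv8 code; `fused_jump_t` and `fuse_auipc_jalr` are names invented for this illustration, and only the standard 32-bit AUIPC (opcode 0x17) and JALR (opcode 0x67, funct3 0) encodings are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of fusing AUIPC+JALR into a direct jump (hypothetical layout). */
typedef struct {
    uint64_t target;      /* statically known jump target */
    uint64_t link;        /* address after the pair: pc + 8 */
    uint32_t link_rd;     /* JALR rd: ra for CALL, zero for TAIL */
    uint32_t temp_rd;     /* AUIPC rd: the address temporary */
    bool     temp_elided; /* JALR overwrites the temporary with the link */
} fused_jump_t;

/* pc is the address of insn0; insn1 is the instruction at pc + 4. */
bool fuse_auipc_jalr(uint64_t pc, uint32_t insn0, uint32_t insn1,
                     fused_jump_t *out)
{
    if ((insn0 & 0x7fu) != 0x17u)      /* insn0 must be AUIPC */
        return false;
    if ((insn1 & 0x707fu) != 0x67u)    /* insn1 must be JALR (funct3 == 0) */
        return false;

    uint32_t auipc_rd = (insn0 >> 7) & 0x1fu;
    uint32_t jalr_rd = (insn1 >> 7) & 0x1fu;
    uint32_t jalr_rs1 = (insn1 >> 15) & 0x1fu;

    /* The JALR must consume the register the AUIPC just produced. */
    if (auipc_rd == 0 || jalr_rs1 != auipc_rd)
        return false;

    /* Sign-extend both immediates without relying on signed shifts. */
    int64_t hi = (int64_t)((insn0 & 0xfffff000u) ^ 0x80000000u) - 0x80000000LL;
    int64_t lo = (int64_t)((insn1 >> 20) ^ 0x800u) - 0x800;

    out->target = (pc + (uint64_t)(hi + lo)) & ~(uint64_t)1; /* JALR clears bit 0 */
    out->link = pc + 8;
    out->link_rd = jalr_rd;
    out->temp_rd = auipc_rd;
    out->temp_elided = (jalr_rd == auipc_rd);
    return true;
}
```

With pc = 0x100c0 and the pair `auipc t1, 0` / `jalr ra, 0x188(t1)`, the sketch yields target 0x10248 and link 0x100c8, the same addresses that appear for `call t1, 0x188` in the first trace in this message. When the pair uses ra for both registers, `temp_elided` is set and only the link write remains.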

Here is a test program for macro-op fusion, including the register elision case:

#include <stdio.h>

size_t add(size_t a, size_t b)
{
    return a + b;
}

int main()
{
    size_t total = 0;
    for (size_t i = 0; i < 1000; i++) {
#if defined (MACRO_FUSION)
        __asm__ __volatile__(
            " mv a0, %1\n"
            " mv a1, %2\n"
            "1: auipc t1, %%pcrel_hi(add)\n"
            " jalr ra, %%pcrel_lo(1b)(t1)\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#elif defined (MACRO_FUSION_ELISION)
        __asm__ __volatile__(
            " mv a0, %1\n"
            " mv a1, %2\n"
            "1: auipc ra, %%pcrel_hi(add)\n"
            " jalr ra, %%pcrel_lo(1b)(ra)\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#elif defined (MACRO_INDIRECT)
        __asm__ __volatile__(
            " mv a0, %1\n"
            " mv a1, %2\n"
            "1: auipc t1, %%pcrel_hi(add)\n"
            " addi t1, t1, %%pcrel_lo(1b)\n"
            "   jalr ra, t1\n"
            "   mv %0, a0\n"
            : "=r"(total)
            : "r"(total), "r"(i)
        );
#else
        total = add(total, i);
#endif
    }
    printf("total=%lu\n", total);
    return 0;
}


macro fusion of auipc+jalr into a direct jump (inline cached during translation)

$ riscv64-unknown-elf-gcc -DMACRO_FUSION -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-macro.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-macro.o -o build/riscv64-unknown-elf/bin/test-fusion-macro
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-macro

Note: the translator has coalesced (or perhaps lifted) the AUIPC+JALR into a CALL (see 0x100c0) with two side effects. Notice the PC jumps by 8. The target address is written to rdi (t1) and the link address is written to rdx (ra), which is later compared in RET (if it differs, which is unlikely, e.g. setjmp/longjmp, it branches out of the trace). The call has been inlined by the JIT.

L2:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L3:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L4:
# 0x00000000000100c0 call t1, 0x188
mov rdi, 10248                          ; BF48020100
mov rdx, 100C8                          ; BAC8000100
L5:
# 0x0000000000010248 add         a0, a0, a1
add r8, r9                              ; 4D03C1
L6:
# 0x000000000001024a jalr        zero, ra, 0
cmp rdx, 100C8                          ; 4881FAC8000100
je L7                                   ; 0F84........
mov qword [rbp], 1024A                  ; 48C745004A020100
jmp L0                                  ; E9........
L7:
L8:
# 0x00000000000100c8 add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L9:
# 0x00000000000100ca addi        a5, a5, 1
add r13, 1                              ; 4983C501
L10:
# 0x00000000000100cc bne         a5, a4, pc - 16
cmp r13, r12                            ; 4D3BEC
short jne L2                            ; 75C7
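
The return check in the trace above (the `cmp rdx, 100C8` / `je L7` sequence) can be read as a guard on an inlined call. A loose C model of that idea follows; the `guest_t` state, the body callbacks, and `run_inlined_call` are all hypothetical names for this sketch and are not taken from rv8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest state: only the registers this sketch needs. */
typedef struct {
    uint64_t pc;
    uint64_t ra;
    uint64_t a0, a1;
} guest_t;

typedef void (*inlined_body_fn)(guest_t *);

/* Body of add() as it appears inlined in the trace: a0 = a0 + a1. */
void body_add(guest_t *g)
{
    g->a0 += g->a1;
}

/* A body that changes ra, standing in for something like longjmp. */
void body_clobber_ra(guest_t *g)
{
    g->ra = 0xdead0000u;
}

/*
 * Model of an inlined call: write the link address, run the inlined body,
 * then check that the guest return address is still the expected one.
 * Returns true if execution can continue in the trace at the link address;
 * otherwise records the address of the guest return instruction in pc so a
 * dispatcher could resume there, and returns false.
 */
bool run_inlined_call(guest_t *g, uint64_t link, uint64_t ret_pc,
                      inlined_body_fn body)
{
    g->ra = link;
    body(g);
    if (g->ra == link) {
        g->pc = link;
        return true;
    }
    g->pc = ret_pc;
    return false;
}
```

In the common case the guard passes and the loop continues without leaving the trace; the failing path corresponds to the `mov qword [rbp], 1024A` / `jmp L0` exit.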


macro fusion of auipc+jalr into a direct jump (inline cached during translation) with elision of address temporary

$ riscv64-unknown-elf-gcc -DMACRO_FUSION_ELISION -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-elision.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-elision.o -o build/riscv64-unknown-elf/bin/test-fusion-elision
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-elision

Note: the translator has lifted the AUIPC+JALR into a CALL (see 0x100c0) and elided the address temporary rdi (t1). Only the link address is written to rdx (ra), which is later compared in RET (if it differs, which is unlikely, e.g. setjmp/longjmp, it branches out of the trace). The call has been inlined by the JIT.

L2:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L3:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L4:
# 0x00000000000100c0 call ra, 0x188
mov rdx, 100C8                          ; BAC8000100
L5:
# 0x0000000000010248 add         a0, a0, a1
add r8, r9                              ; 4D03C1
L6:
# 0x000000000001024a jalr        zero, ra, 0
cmp rdx, 100C8                          ; 4881FAC8000100
je L7                                   ; 0F84........
mov qword [rbp], 1024A                  ; 48C745004A020100
jmp L0                                  ; E9........
L7:
L8:
# 0x00000000000100c8 add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L9:
# 0x00000000000100ca addi        a5, a5, 1
add r13, 1                              ; 4983C501
L10:
# 0x00000000000100cc bne         a5, a4, pc - 16
cmp r13, r12                            ; 4D3BEC
short jne L2                            ; 75CC


macro fusion of la with coalescing of two writes for auipc+addi into a single write

$ riscv64-unknown-elf-gcc -DMACRO_INDIRECT -Os -c src/test/test-fusion.c -o build/riscv64-unknown-elf/obj/test-fusion-indirect.o
$ riscv64-unknown-elf-gcc -Os build/riscv64-unknown-elf/obj/test-fusion-indirect.o -o build/riscv64-unknown-elf/bin/test-fusion-indirect
$ rv-jit --log-jit-trace build/riscv64-unknown-elf/bin/test-fusion-indirect

Note: the translator has lifted the AUIPC+ADDI into LA (see 0x100c0) and emits a single register write. It has moved the target address into the program counter backing store [rbp] and the link address into rdx (ra) and is unconditionally branching to the jump target cache stub (part of the JALR acceleration mechanism).

L2:
# 0x00000000000100cc add         a1, zero, a0
mov r9, r8                              ; 4D8BC8
L3:
# 0x00000000000100ce addi        a5, a5, 1
add r13, 1                              ; 4983C501
L4:
# 0x00000000000100d0 bne         a5, a4, pc - 20
cmp r13, r12                            ; 4D3BEC
jne L5                                  ; 0F85........
mov qword [rbp], 100D4                  ; 48C74500D4000100
jmp 7FFF02001000                        ; 40E900000000
L5:
L6:
# 0x00000000000100bc add         a0, zero, a1
mov r8, r9                              ; 4D8BC1
L7:
# 0x00000000000100be add         a1, zero, a5
mov r9, r13                             ; 4D8BCD
L8:
# 0x00000000000100c0 la t1, 0x18c
mov rdi, 1024C                          ; BF4C020100
L9:
# 0x00000000000100c8 jalr        ra, t1, 0
mov qword [rbp], rdi                    ; 48897D00
mov rdx, 100CC                          ; BACC000100
jmp 7FFF02001000                        ; 40E900000000
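
The jump target cache mentioned in the note can be pictured as a small table from guest PC to translated code address. The direct-mapped layout below is an assumption made purely for illustration (the thread does not describe rv8's actual structure), and `jtc_insert` and `jtc_lookup` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define JTC_SIZE 1024u /* number of slots; must be a power of two */

typedef struct {
    uint64_t guest_pc;     /* guest address of the jump target */
    const void *host_code; /* translated code for that address, or NULL */
} jtc_entry_t;

static jtc_entry_t jtc[JTC_SIZE];

/* Guest instructions are at least 2-byte aligned, so drop bit 0. */
static size_t jtc_index(uint64_t guest_pc)
{
    return (size_t)((guest_pc >> 1) & (JTC_SIZE - 1));
}

/* Record a translation, evicting whatever occupied the slot before. */
void jtc_insert(uint64_t guest_pc, const void *host_code)
{
    jtc_entry_t *e = &jtc[jtc_index(guest_pc)];
    e->guest_pc = guest_pc;
    e->host_code = host_code;
}

/* Return the translated address, or NULL so the caller can fall back. */
const void *jtc_lookup(uint64_t guest_pc)
{
    const jtc_entry_t *e = &jtc[jtc_index(guest_pc)];
    if (e->host_code != NULL && e->guest_pc == guest_pc)
        return e->host_code;
    return NULL;
}
```

On a miss, a translator would presumably fall back to a slower lookup or to the interpreter; that path is outside this sketch.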


Andrew Waterman

unread,
Jun 19, 2017, 1:52:08 AM6/19/17
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
On Sun, Jun 18, 2017 at 6:09 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
> Recently, the CALL macro has been changed in the assembly and ELF spec. I
> wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR ra, t0 %pcrel_lo(1b)
>
> to
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR t1, t1 %pcrel_lo(1b)
>
> We take t1 = x6 so as to leave callstacks alone. Sure this sets t1, but who
> cares: if callstacks are left alone that does no harm. It may even be useful
> for stack unwinding for exception handling and debugging to have a chance to
> know where a call came from even if it is a tail call. The main point is
> that hardware needs only match one pattern for call fusion, as in the
> absence of tail calls, the original tail-call == jump pattern would only be
> useful for very very long jumps in a function, which should be rare enough
> not to be worth the trouble.

TAIL is currently defined as

auipc t1, ...
jalr x0, t1, ...

We already avoid using t0, to avoid messing up the call stacks.

DWARF information provides sufficient information to recover the call
graph. Furthermore, PLTs can destroy the value in t1, so this isn't
helpful in the general case.

Finally, some low-end unpipelined implementations will execute
JAL/JALR more slowly when rd != 0.

On balance, I don't support linking to t1 in the tail-call case.


Rogier Brussee

unread,
Jun 19, 2017, 6:42:23 AM6/19/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
TL;DR
With my proposed change of the TAIL macro, opcode fusion of TAIL _comes for free_ if you opcode fuse CALL, unlike the current definition of TAIL.   

In the non-opcode fused case for non pipelined processors the proposed TAIL macro may be a trifle more expensive than the current version, however.


On Monday 19 June 2017 07:52:08 UTC+2, andrew wrote:
On Sun, Jun 18, 2017 at 6:09 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
> Recently, the CALL macro has been changed in the assembly and ELF spec. I
> wondered whether it would not make sense to also change the TAIL macro from
>
> 1:
> AUIPC t0, %pcrel_hi(symbol)
> JALR  ra, t0 %pcrel_lo(1b)
>
> to
>
> AUIPC t1, %pcrel_hi(symbol)
> JALR t1, t1 %pcrel_lo(1b)
>
> We take t1 = x6  so as to leave callstacks alone. Sure this sets t1, but who
> cares: if callstacks are left alone that does no harm. It may even be useful
> for stack unwinding for exception handling and debugging to have a chance to
> know where a call came from even if it is a tail call. The main point is
> that hardware needs only match one pattern for call fusion, as in the
> absence of tail calls, the original tail-call == jump pattern would only be
> useful for very very long jumps in a function, which should be rare enough
> not to be worth the trouble.

> TAIL is currently defined as
>
>   auipc t1, ...
>   jalr x0, t1, ...
>
> We already avoid using t0, to avoid messing up the call stacks.

OK, my bad, but that was not really the point.

> DWARF information provides sufficient information to recover the call
> graph.

I see. Anyway, I should not even have mentioned this point.

> Furthermore, PLTs can destroy the value in t1, so this isn't
> helpful in the general case.
>
> Finally, some low-end unpipelined implementations will execute
> JAL/JALR more slowly when rd != 0.

This I recognise, and it is a real downside in the non-op-fused case, but would it really matter?


I repeat and slightly extend my answer to Michael Clark, because I may not have been sufficiently clear. My apologies in advance if I am merely being pedantic in spelling it out in excruciating detail.


The point of the change would be that hardware would only need to match _one_  pattern to macro-op fuse both CALL and TAIL


I assume that in a typical RV processor pipeline, JAL is decoded into the hardware equivalent of an Internal.JAL, something like


JAL rd imm20
--> Internal.JAL rd, SEXT(imm20)<<1

where Internal.JAL has semantics something like

Internal.JAL rd, immXLEN:

      if(rd == ra || rd == t0)
           CALLSTACK.PUSH(link)
      rd <-- link           # link = address of the instruction after the JAL (or fused pair)
      PC <-- PC + immXLEN

For a CALL macro-op fuse you would therefore want to fuse

AUIPC rd Imm1       # with rd != zero
JALR rd  rd imm2 

--> Internal.JAL rd, (SEXT(imm1)<<12) + SEXT(imm2)

But that _also_ implements fusing the  proposed TAIL macro.

On the other hand one can macro op-fuse the current TAIL macro

AUIPC rd Imm1       # with rd != zero
JALR zero  rd imm2 

--> Internal.JALUD rd, (SEXT(imm1)<<12) + SEXT(imm2)

but it is an additional pattern to match, and it necessarily involves a strange new, otherwise useless internal JALUD ("jump and 'link' the upper part of the destination") instruction.
The semantics of Internal.JALUD would be something like

Internal.JALUD rd, immXLEN:

      if(rd == ra || rd == t0)
           CALLSTACK.POP(PC)

      rd <-- PC + immXLEN - SEXT(immXLEN & ((1 << 12) - 1))
      PC <-- PC + immXLEN

even though in practice the rd value would never actually be used because of the calling convention, it still has to be set for correctness, and so we cannot reuse an internal.J instruction that only jumps.
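
The difference between the two internal operations can be sketched in C on a toy machine state. This is only a reading of the pseudocode above, not anyone's hardware: `hart_t`, `sext12`, `fused_jal`, and `fused_jalud` are illustrative names, the call-stack hints are left out, and the pair is assumed to be two 4-byte instructions with an even offset.

```c
#include <stdint.h>

/* Toy architectural state: program counter and integer registers. */
typedef struct {
    uint64_t pc;
    uint64_t x[32];
} hart_t;

/* Sign-extend the low 12 bits of v. */
int64_t sext12(uint64_t v)
{
    return (int64_t)((v & 0xfffu) ^ 0x800u) - 0x800;
}

/* AUIPC rd + JALR rd, rd: behaves like a JAL with a 32-bit offset. */
void fused_jal(hart_t *h, uint32_t rd, int64_t imm)
{
    uint64_t link = h->pc + 8; /* address after the two instructions */
    if (rd != 0)
        h->x[rd] = link;
    h->pc += (uint64_t)imm;
}

/*
 * AUIPC rd + JALR zero, rd: the jump discards the link, so the value
 * left in rd is the AUIPC result, i.e. pc plus the upper part of imm.
 */
void fused_jalud(hart_t *h, uint32_t rd, int64_t imm)
{
    uint64_t upper = h->pc + (uint64_t)(imm - sext12((uint64_t)imm));
    if (rd != 0)
        h->x[rd] = upper;
    h->pc += (uint64_t)imm;
}
```

For example, with pc = 0x10000 and an offset of 0x1ffc, `fused_jalud` leaves 0x12000 in rd (the AUIPC result for an upper immediate of 2) while jumping to 0x11ffc.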

Ciao
Rogier
 

Michael Clark

unread,
Jun 19, 2017, 8:22:26 PM6/19/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Rogier,

I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.

It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:

{ auipc rd=x }, { jalr rd=ra, rs1=x }

The call macro explicitly sets rd to ra, and in my implementation I need to add a second macro-op pattern match for TAIL:

{ auipc rd=x }, { jalr rd=zero, rs1=x }

This is a small price. I still think the price of having two pattern matches in the macro-op fusion case is a better trade than giving all simple implementations the cost of the redundant write. That aside, it is novel. I had not thought about the rationale of having a single pattern match.

Michael.


Andrew Waterman

unread,
Jun 19, 2017, 9:23:36 PM6/19/17
to Michael Clark, Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
On Mon, Jun 19, 2017 at 5:22 PM, Michael Clark <michae...@mac.com> wrote:
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion pattern now.
> That wasn’t clear to me, however I still think the JALR should avoid the
> redundant register write as simple implementations won’t be able to do
> anything about this extra write, which they don’t have now (and that was not
> the case for CALL), and macro-op implementations, being more sophisticated,
> are more able to bear the cost of having two patterns.
>
> It’s interesting that you point this out as my macro-fusion pattern for CALL
> is as follows:
>
> { auipc rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra, and in my implementation
> I need to add a second macro-op pattern match for TAIL:
>
> { auipc rd=x }, { jalr rd=zero, rs1=x }
>
> This is a small price. I still think the price of having two pattern matches
> in the macro-op fusion case is a better trade than giving all simple
> implementations the cost of the redundant write.

This is still my POV. The main cost in fusion is supporting the first
pair; the incremental cost for additional patterns is minor by
comparison.

(Incidentally, the current TAIL fusion has some logic in common with
fusing AUIPC + SW, in that neither pair writes a second register, so
both need to write the AUIPC result to a register.)

Rogier Brussee

unread,
Jun 20, 2017, 3:31:54 AM6/20/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Michael and Andrew,

thanks for your response. I trust your judgement now that my point is clear, you have far more experience than I have. But some concluding remarks inline anyway. 

On Tuesday 20 June 2017 02:22:26 UTC+2, michaeljclark wrote:
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.

TAIL is still just an assembly macro, so the linker could change it to the version that is friendlier to simple implementations if needed. The two versions are semantically equivalent at the calling-convention level. The default just determines what the natural fuse-op pairs are.

> It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:
>
> { auipc rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra and in my implementation

I figured that. It seemed to me that { auipc rd=x where (x != zero) }, { jalr rd=x, rs1=x } would be just as easy, because they would map uniformly to JAL with rd != zero and a 32-bit immediate.

> I need to add a second macro-op pattern match for TAIL:
>
> { auipc rd=x }, { jalr rd=zero, rs1=x }

and a separate implementation, the internal "JALUD" instruction (see my response to Andrew). Andrew points out that it shares some logic with AUIPC + SW. I had not thought of that.

> This is a small price. I still think the price of having two pattern matches in the macro-op fusion case is a better trade than giving all simple implementations the cost of the redundant write. That aside, it is novel. I had not thought about the rationale of having a single pattern match.

I trust your judgements.

Ciao
Rogier

Andrew Waterman

unread,
Jun 20, 2017, 3:44:45 AM6/20/17
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Krste Asanovic, Jacob Bachmeyer
Thanks for making the suggestion, and for identifying the errors in the specs.


Albert Cahalan

unread,
Jun 20, 2017, 3:38:02 PM6/20/17
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
On 3/3/17, Allen J. Baum <allen...@esperantotech.com> wrote:

> Regarding the zero case:
> Macro op fusion is a nice to have, but not a guarantee, in a dynamic sense.
> It's possible that sometimes ops will be fused - and other times the same op
> won't be fused (or that some pair of ops will be fused while other pairs of
> the same ops won't be). An obvious case is when the pair cross a cache or
> page boundary - in which case using zero is broken.

If fusion is required, zero is perfectly fine when crossing a page boundary.
Fault on the address of the missing page, but without having advanced the
instruction pointer.

This is likely where you are headed anyway. It happened for ARM Thumb,
which did a similar thing for jumps. Originally, prior to Thumb2, all opcodes
were 16 bits in size. Jumps and calls would be preceded by an extra opcode
that would load 13 bits into a register. The assembler hid this, so it looked a
bit like a double-wide opcode, but you could actually split it and put other
opcodes in the middle. The two 16-bit halves of a jump ran separately. Well,
along comes Thumb2 and the need for more opcodes. Five of the bits in the
fused opcode pair were redundant if fusion were required, so they could be
reused for additional opcodes. Unfortunately, the bits were non-zero, so the
resulting encoding got really nasty. To extract a number from the opcode can
require pulling out 5 fields, XORing a couple of them together (!!!), and then
concatenating the bits in some screwy order.

So, to summarize: ARM Thumb also insisted on fixed-size instructions with
fusion to handle large offsets. This was a failure, and a side-effect of being
forced to give up on the idealism is a nasty instruction encoding. There was
also some slight compatibility breakage involving page fault addresses and
involving "improper" code that took advantage of the ability to put stuff into
the middle of a fused pair. It looks like you're headed toward the same mess.

Bruce Hoult

unread,
Jun 20, 2017, 4:38:06 PM6/20/17
to Albert Cahalan, Allen J. Baum, Michael Clark, RISC-V ISA Dev
While what you say about ARM and thumb2 is true [1], that is irrelevant to RISC-V. Fusion is NOT required.

In RISC-V fusing two instructions must *always* be optional, and the results precisely the same whether it happens or not.

The Thumb2 situation is relevant to another aspect of RISC-V, which is that both have variable-length instructions that need not be aligned or contained in a single VM page or cache line. In the case of Thumb2 this is any time the high three bits of the first 16 bits of an instruction are 0b111, indicating a 32 bit instruction. In the case of RISC-V this is any time the lowest two bits of the first 16 bits of an instruction are 0b11, indicating a 32 bit (or longer, if supported) instruction.
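
The RISC-V half of that rule is simple enough to write down. The C sketch below covers only the 16-bit and 32-bit encodings and returns 0 for anything longer; the function name `rv_insn_length` is a made-up helper, not part of any toolchain.

```c
#include <stdint.h>

/*
 * Length in bytes implied by the first 16-bit parcel of a RISC-V
 * instruction. Returns 0 for the longer encodings, which this sketch
 * does not attempt to decode.
 */
unsigned rv_insn_length(uint16_t parcel)
{
    if ((parcel & 0x3u) != 0x3u)
        return 2;   /* low two bits != 11: 16-bit compressed instruction */
    if ((parcel & 0x1cu) != 0x1cu)
        return 4;   /* low bits 11, bits [4:2] != 111: 32-bit instruction */
    return 0;       /* longer-than-32-bit encoding */
}
```

Because the length is decided entirely by the first parcel, a 32-bit instruction can straddle a cache line or page boundary, which is the case being compared with Thumb2 here.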

These two aspects are directly comparable.
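To illustrate the comparison, here is a minimal Python sketch of how a decoder could determine instruction length from the first 16 bits in each ISA. The function names are made up for this example. The RISC-V part assumes the variable-length encoding scheme from the user-level ISA specification (16, 32, 48 and 64 bit formats only; anything longer is rejected), and the Thumb2 part assumes the usual rule on bits [15:11].

```python
def riscv_insn_length(first_halfword: int) -> int:
    """Length in bytes of a RISC-V instruction, judged from its first 16 bits."""
    if first_halfword & 0b11 != 0b11:
        return 2  # compressed (16-bit) instruction
    if first_halfword & 0b11100 != 0b11100:
        return 4  # standard 32-bit instruction
    if first_halfword & 0b111111 == 0b011111:
        return 6  # 48-bit encoding
    if first_halfword & 0b1111111 == 0b0111111:
        return 8  # 64-bit encoding
    raise ValueError("longer or reserved encoding, not handled in this sketch")


def thumb2_insn_length(first_halfword: int) -> int:
    """Length in bytes of a Thumb2 instruction, judged from its first 16 bits."""
    top5 = (first_halfword >> 11) & 0b11111
    # Only these three prefixes start a 32-bit instruction; 0b11100 is still
    # a 16-bit unconditional branch.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2
```

Either way the decoder must look at the first halfword before it knows whether a second one belongs to the same instruction, which is why an instruction can straddle a page or cache line boundary in both ISAs.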

Fused pairs of 16 bit RISC-V instructions are *not* comparable, because the first 16 bits indicate a valid 16 bit instruction, just as instructions with high bits of 0b111 were valid individual 16 bit instructions in Thumb1 (but not Thumb2 ... a forward incompatibility).

[1] I haven't verified the XOR etc details



Albert Cahalan

Jun 20, 2017, 5:38:36 PM6/20/17
to Bruce Hoult, Allen J. Baum, Michael Clark, RISC-V ISA Dev
On 6/20/17, Bruce Hoult <br...@hoult.org> wrote:

> While what you say about ARM and thumb2 is true [1], that is irrelevant to
> RISC-V. Fusion is NOT required.

RISC-V and original Thumb are thus the same.

> In RISC-V fusing two instructions must *always* be optional, and the
> results precisely the same whether it happens or not.

Yep, that was how Thumb was originally specified.

> Fused pairs of 16 bit RISC-V instructions are *not* comparable, because the
> first 16 bits indicate a valid 16 bit instruction, just as instructions
> with high bits of 0b111 were valid individual 16 bit instructions in Thumb1
> (but not Thumb2 ... a forward incompatibility).

For RISC-V, a fused pair of AUIPC+JALR looks mighty comparable to the
situation with Thumb and Thumb2. At some point, as with ARM, you may
decide that the bits to represent JALR are partly redundant. As ARM did,
you will want to reclaim those bits by mandating that AUIPC+JALR be fused.
Then stuff that is now AUIPC+AUIPC for example can be interpreted as
some other new instruction which is completely unrelated to AUIPC.

(divide opcodes into two sets, those that make sense after an AUIPC and
those that do not, and then for any X in the latter set make AUIPC+X be a
completely unrelated instruction)

This is a path you appear to be headed down. The encodings will be gross.

Bruce Hoult

Jun 20, 2017, 5:56:47 PM6/20/17
to Albert Cahalan, Allen J. Baum, Michael Clark, RISC-V ISA Dev
The results of re-interpreting two currently valid individual consecutive RISC-V instructions to mean something different and incompatible in future would indeed be gross.

Fortunately, it is NOT the path anyone is headed down.

It goes directly against the clearly stated philosophy that programs consisting of a frozen RISC-V instruction set (which C is on the verge of) will remain forward compatible forever.

It would have no purpose, as RISC-V already defines perfectly good 32 bit, 48 bit, and longer encodings with plenty of room for extension within the system, unlike Thumb1.

All anyone is discussing is advanced implementations opportunistically recognising certain instruction patterns in order to execute them in fewer uOps/cycles than a simpler implementation would. With 100% forward and backward compatibility. 
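The compatibility claim can be made concrete with a small Python model. This is only a sketch under stated assumptions: RV64 wrap-around, a register file modelled as a list of 32 integers, and hypothetical function names `run_unfused` and `run_fused`. It executes AUIPC rd=x followed by JALR rd=link, rs1=x one instruction at a time, then as a single fused step, and the resulting register file and jump target should be identical.

```python
MASK = (1 << 64) - 1  # assumption: RV64, so values wrap at 64 bits


def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def run_unfused(regs, pc, x, link, imm20, imm12):
    """Execute AUIPC rd=x, then JALR rd=link, rs1=x, one at a time."""
    regs = list(regs)
    # AUIPC at pc
    if x != 0:
        regs[x] = (pc + (sext(imm20, 20) << 12)) & MASK
    pc = (pc + 4) & MASK
    # JALR at pc + 4: read rs1 before writing rd
    target = (regs[x] + sext(imm12, 12)) & MASK & ~1
    if link != 0:
        regs[link] = (pc + 4) & MASK
    return regs, target


def run_fused(regs, pc, x, link, imm20, imm12):
    """Execute the same pair as one step; the target comes straight from pc."""
    if x == 0:
        # auipc x0 discards its result, so the pair is not a PC-relative jump
        raise ValueError("pair with x = zero is not fusable")
    regs = list(regs)
    base = (pc + (sext(imm20, 20) << 12)) & MASK
    target = (base + sext(imm12, 12)) & MASK & ~1
    regs[x] = base
    if link != 0:
        regs[link] = (pc + 8) & MASK  # written last, so link == x also matches
    return regs, target
```

Note that the fused step never reads a register to find the target, which is the property that makes the pair behave like a direct jump while leaving the architectural state unchanged.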


Rogier Brussee

Jun 21, 2017, 6:09:43 AM6/21/17
to RISC-V ISA Dev, rogier....@gmail.com, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
Hi Michael,


On Tuesday, 20 June 2017 at 02:22:26 UTC+2, michaeljclark wrote:
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.
>
> It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:
>
> { auipc rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra and in my implementation

On second thought, this pattern does not fit the (max) two input one output mould of a standard RV instruction as it has two outputs (ra and x as a clobber) and one input. This is no problem for the CALL macro (and I guess for your software implementation), but if you insist on fixing the link register to ra and want to stay in the general mould, the fusion pattern would have to be

{auipc rd=ra }, { jalr rd=ra, rs1=ra }

ciao 

Rogier


 

> I need to add a second macro-op pattern match for TAIL:
>
> { auipc rd=x }, { jalr rd=zero, rs1=x }



It would perhaps be interesting to know how, in your software implementation, the two fusions for CALL {auipc rd=ra }, { jalr rd=ra, rs1=ra } and TAIL {auipc rd=x }, { jalr rd=zero, rs1=x } (perhaps with x = t1 hardwired?) compare to a JAL with long immediate {auipc rd=x }, { jalr rd=x, rs1=x } fusion!
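The patterns under discussion can be written down as a small matcher over two adjacent 32-bit instruction words. The Python below is only an illustrative sketch: the helper names (`auipc`, `jalr`, `classify_pair`) and the labels it returns are invented for this example and are not taken from any existing translator. It uses the standard base opcodes (AUIPC = 0x17, JALR = 0x67 with funct3 = 0) and the ABI register numbers ra = x1 and zero = x0.

```python
RA, ZERO = 1, 0  # ABI register numbers for ra and zero


def auipc(rd, imm20):
    """Encode AUIPC rd, imm20 (U-type)."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x17


def jalr(rd, rs1, imm12):
    """Encode JALR rd, imm12(rs1) (I-type, funct3 = 0)."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67


def classify_pair(first, second):
    """Return a label for a fusable AUIPC+JALR pair, or None."""
    if first & 0x7F != 0x17:
        return None  # first word is not AUIPC
    if second & 0x7F != 0x67 or (second >> 12) & 0x7 != 0:
        return None  # second word is not JALR
    x = (first >> 7) & 0x1F      # AUIPC destination
    link = (second >> 7) & 0x1F  # JALR destination (link register)
    rs1 = (second >> 15) & 0x1F  # JALR base register
    if x == ZERO or rs1 != x:
        return None  # JALR does not consume the AUIPC result
    if link == RA:
        return "CALL"      # covers both rd=x/rd=ra and the all-ra form
    if link == ZERO:
        return "TAIL"
    if link == x:
        return "LONG_JAL"  # JAL with long immediate, link in x
    return None
```

In this sketch the all-ra CALL form is simply the long-JAL pattern with x = ra, so a matcher that fixes the link register to ra gets it without a separate case.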

Michael Clark

Jun 21, 2017, 5:45:06 PM6/21/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu, jcb6...@gmail.com
On 21 Jun 2017, at 10:09 PM, Rogier Brussee <rogier....@gmail.com> wrote:

> Hi Michael,
>
> On Tuesday, 20 June 2017 at 02:22:26 UTC+2, michaeljclark wrote:
>> Hi Rogier,
>>
>> I understand what you mean regarding sharing the macro-fusion pattern now. That wasn’t clear to me, however I still think the JALR should avoid the redundant register write as simple implementations won’t be able to do anything about this extra write, which they don’t have now (and that was not the case for CALL), and macro-op implementations, being more sophisticated, are more able to bear the cost of having two patterns.
>>
>> It’s interesting that you point this out as my macro-fusion pattern for CALL is as follows:
>>
>> { auipc rd=x }, { jalr rd=ra, rs1=x }
>>
>> The call macro explicitly sets rd to ra and in my implementation
>
> On second thought, this pattern does not fit the (max) two input one output mould of a standard RV instruction as it has two outputs (ra and x as a clobber) and one input. This is no problem for the CALL macro (and I guess for your software implementation), but if you insist on fixing the link register to ra and want to stay in the general mould, the fusion pattern would have to be
>
> {auipc rd=ra }, { jalr rd=ra, rs1=ra }

Yes, you are correct. This is a concern for a micro-architecture with a single write port; however, it’s not really applicable to my case, which is a binary translator.

The binary translator has an implementation-specific constraint of a single explicit destination operand, because the fused pseudo-ops share the decode structure of regular RISC-V ops. The pseudo-op in the case of the old CALL macro has an implicit link register operand that is always ra, and rs1 is the variable target address. I do this so I can coalesce the AUIPC and subsequent addition for the target address in the case of the old macro. Given this is an internal implementation detail that helps me improve codegen, I think it’s reasonable to do. Likewise, a micro-architecture that has more than one write port may have more fusion opportunities, e.g. address calculation for far LOADs, which have two outputs:

{auipc rd=x }, { ld rd=y, rs1=x }

An architecture with two write ports could do a far load in one fewer cycle. I think the 2-in 1-out constraint is really designed to be imposed on the ISA itself and not necessarily on fused instructions. The single write port microarchitecture constraint may only apply to conventional hardware implementations and is a lower bound that the ISA enforces on implementations, not an upper bound on more sophisticated implementations that are capable of macro-fusion. We can reason about implicit or more than one explicit destination operands in custom fused sequences to make more efficient translation. In the case of far relative load/store, a store only requires one write port but optimising an auipc+load requires two. I don’t know if anyone would do this - it requires a 4th operand lane in the reorder buffer for OoO (e.g. rs3 from FMA), but as a write operand, not a read operand. Someone might consider it to make far loads execute in one cycle, assuming it is a common enough sequence to be profitable.
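To make the write-port count concrete, here is a minimal Python sketch of a fused far load, under the assumptions of RV64 wrap-around and memory modelled as a plain dict from address to value. The function name `fused_far_load` is hypothetical. It returns the effective address and the set of register writes the fused operation has to commit.

```python
MASK = (1 << 64) - 1  # assumption: RV64, so addresses wrap at 64 bits


def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def fused_far_load(pc, x, y, imm20, imm12, memory):
    """Model { auipc rd=x }, { ld rd=y, rs1=x } as a single operation.

    Returns (effective address, {register: value}) where the dict holds
    every register write the fused operation must commit.
    """
    if x == 0:
        # auipc x0 discards its result, so ld would read zero, not pc+offset
        raise ValueError("pair with x = zero is not fusable")
    base = (pc + (sext(imm20, 20) << 12)) & MASK
    addr = (base + sext(imm12, 12)) & MASK
    writes = {x: base}
    if y != 0:
        writes[y] = memory[addr]  # overwrites base when y == x
    return addr, writes
```

With distinct x and y the dict holds two entries, which is the two-write-port case. When the load reuses the AUIPC register (y == x) only one write remains, so that form would fit a single write port.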

Cheers,
Michael.

Jacob Bachmeyer

Jun 21, 2017, 6:31:41 PM6/21/17
to Rogier Brussee, RISC-V ISA Dev, br...@hoult.org, kr...@berkeley.edu
Rogier Brussee wrote:
> Hi Michael,
>
> On Tuesday, 20 June 2017 at 02:22:26 UTC+2, michaeljclark wrote:
>
> Hi Rogier,
>
> I understand what you mean regarding sharing the macro-fusion
> pattern now. That wasn’t clear to me, however I still think the
> JALR should avoid the redundant register write as simple
> implementations won’t be able to do anything about this extra
> write, which they don’t have now (and that was not the case for
> CALL), and macro-op implementations, being more sophisticated, are
> more able to bear the cost of having two patterns.
>
> It’s interesting that you point this out as my macro-fusion
> pattern for CALL is as follows:
>
> { auipc rd=x }, { jalr rd=ra, rs1=x }
>
> The call macro explicitly sets rd to ra and in my implementation
>
>
> On second thought, this pattern does not fit the (max) two input one
> output mould of a standard RV instruction as it has two outputs (ra
> and x as a clobber) and one input. This is no problem for the CALL
> macro (and I guess for your software implementation), but if you
> insist on fixing the link register to ra and want to stay in the
> general mould, the fusion pattern would have to be
>
> {auipc rd=ra }, { jalr rd=ra, rs1=ra }


Am I correct that we are assuming that millicode calls will always use
intramodule JAL?


-- Jacob

Jacob Bachmeyer

Jun 21, 2017, 6:49:50 PM6/21/17
to Albert Cahalan, Bruce Hoult, Allen J. Baum, Michael Clark, RISC-V ISA Dev
This is where RISC-V diverges from ARM -- *all* instructions make sense
after AUIPC, even AUIPC itself. (Example: loading pointers to multiple,
distant, PC-relative data objects.) The latter set that you describe is
empty.

Also, unlike ARM, we have a standard encoding for variable-length
instructions. Fused "not really" AUIPC+AUIPC gives 2 20-bit immediates
for a 40-bit encoding space in a 64-bit instruction; a sub-minor opcode
(one of 2^12) in the standard 64-bit encoding gives a 45-bit encoding
space. There is no reason to even consider using AUIPC+AUIPC to perform
some other operation.
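One way to reproduce those two figures, written as a short Python calculation. This is a reading of the numbers rather than anything stated in the specification text: it assumes the re-purposed AUIPC+AUIPC pair keeps both opcodes and both rd fields fixed, so that only the two immediates are free, and that the standard 64-bit format spends 7 bits on its length prefix (bits [6:0] = 0111111) before the 12-bit sub-minor opcode is taken out.

```python
# Re-purposed AUIPC+AUIPC pair: only the two U-type immediates are free.
AUIPC_IMM_BITS = 20
pair_free_bits = 2 * AUIPC_IMM_BITS  # 40

# Standard 64-bit encoding: total width, minus the length prefix, minus
# one sub-minor opcode (one of 2**12).
INSN64_BITS = 64
LENGTH_PREFIX_BITS = 7
SUB_MINOR_OPCODE_BITS = 12
native_free_bits = INSN64_BITS - LENGTH_PREFIX_BITS - SUB_MINOR_OPCODE_BITS  # 45

# How many times larger the native encoding space is.
ratio = 2 ** (native_free_bits - pair_free_bits)  # 32
```

On this reading, a single sub-minor opcode in the native 64-bit format already offers 32 times the encoding space of the fused pair.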


-- Jacob

Bruce Hoult

Jun 21, 2017, 6:54:55 PM6/21/17