48-bit LI-type instruction format (Large Immediate)

Michael Clark

Apr 7, 2017, 9:52:10 PM

to RISC-V ISA Dev

Hi All,

I was reading a discussion on comp.arch about parsing variable length instructions [1] and it made me think about reserving a subset of the 48-bit opcode space for instructions that can take a 32-bit immediate (such as CFI.SALA) and getting an experimental but very useful 48-bit instruction into RISC-V decoders.

RISC-V’s base 32-bit instruction format and prevalent 12-bit I-Type immediate sacrifice the ability to easily synthesise large immediate values; however, the variable length encoding opens up the possibility of an instruction type with a 32-bit immediate.

While LUI has a 20-bit immediate, it is limited in the bit patterns it can synthesise; it consumes one major opcode, and in position independent code is effectively only used for sign extended 32-bit constant building (LUI+ADDI). LUI is not used for constructing large immediate values as can be seen from LI expansion. On RV128, the current constant building approach is even worse.

The proposal is for a 48-bit LI-Type instruction format (Large Immediate) where the register is positioned in the same place as RVC ‘rdrs1’ field so that register decode can be shared with RVC, albeit with a different instruction length.

| imm32 | nnnn | rdrs1[4:0] | n | 011111 |

I then propose a 48-bit LI-Type instruction for large constant building (Bv1 or Cv3 extension):

Instruction: Shift Left Logical Load Large Immediate
Type: LI-Type
Format: SLLLLI rdrs1, imm32
Pseudo: rd = (rs1 << 32) ∨ imm[31:0]

Encoding: | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |

By necessity, LI-Type has the constraint rd = rs1, as this is required to fit a 32-bit immediate into a 48-bit instruction. Many RVC 16-bit encodings already have the rd = rs1 constraint with the register in the same bit position. After subtracting the 6-bit length encoding and the 32-bit immediate from a 48-bit instruction, 10 bits remain; encoding one register leaves 5 bits, so one LI-Type instruction takes 1/32 (~3%) of the 48-bit encoding space.
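The field layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bit positions are read off the `| imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |` diagram from left (bit 47) to right (bit 0), and the names `funct4`, `funct1`, `encode_slllli` and `decode_li_type` are hypothetical, not part of any specification.

```python
# Assumed LI-Type layout, read off the diagram in the post:
#   [47:16] imm32 | [15:12] funct4 | [11:7] rdrs1 | [6] funct1 | [5:0] 011111
OPCODE_48BIT = 0b011111   # 48-bit length prefix in bits [5:0]
SLLLLI_FUNCT4 = 0b1000    # bits [15:12] of the proposed SLLLLI encoding
SLLLLI_FUNCT1 = 0b0       # bit [6] of the proposed SLLLLI encoding


def encode_slllli(rdrs1: int, imm32: int) -> int:
    """Pack a register number and a 32-bit immediate into a 48-bit word."""
    if not 0 <= rdrs1 < 32:
        raise ValueError("rdrs1 must be a 5-bit register number")
    if not 0 <= imm32 < (1 << 32):
        raise ValueError("imm32 must fit in 32 bits")
    return ((imm32 << 16) | (SLLLLI_FUNCT4 << 12) | (rdrs1 << 7)
            | (SLLLLI_FUNCT1 << 6) | OPCODE_48BIT)


def decode_li_type(inst: int) -> dict:
    """Split a 48-bit LI-Type word back into its fields."""
    return {
        "opcode": inst & 0x3F,
        "funct1": (inst >> 6) & 0x1,
        "rdrs1": (inst >> 7) & 0x1F,
        "funct4": (inst >> 12) & 0xF,
        "imm32": (inst >> 16) & 0xFFFFFFFF,
    }


# Encoding-space arithmetic from the paragraph above.
free_bits = 48 - 6 - 32 - 5           # length prefix, immediate, register
fraction_per_instruction = 1 / (1 << free_bits)

word = encode_slllli(5, 0xCAFED00D)   # t0 is x5
print(hex(word), decode_li_type(word), fraction_per_instruction)
```

With these assumptions, `SLLLLI t0, 0xCAFED00D` packs to `0xcafed00d829f`, and the five leftover bits give the 1/32 share of the 48-bit space quoted above.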

A 64-bit constant (on RV64) can be loaded with two 48-bit SLLLLI instructions (12 bytes):

SLLLLI t0, 0xCAFED00D
SLLLLI t0, 0xFEEDBEEF

Compared to:

LI t0, 0xCAFED00DFEEDBEEF

Which emits 24 bytes:

10078: e57f72b7    lui t0,0xe57f7
1007c: 8072829b    addiw t0,t0,-2041
10080: 01329293    slli t0,t0,0x13
10084: bb728293    addi t0,t0,-1097
10088: 00e29293    slli t0,t0,0xe
1008c: eef28293    addi t0,t0,-273
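As a sanity check, both sequences can be simulated on a 64-bit register model. This is a minimal sketch: the helper functions model only the RV64 semantics needed here, and `slllli` implements the pseudo-code from the proposal (`rd = (rs1 << 32) | imm[31:0]`), which is not an existing instruction.

```python
MASK64 = (1 << 64) - 1


def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def lui(imm20: int) -> int:
    return sext(imm20 << 12, 32) & MASK64      # RV64 LUI sign-extends to 64 bits


def addi(x: int, imm: int) -> int:
    return (x + imm) & MASK64


def addiw(x: int, imm: int) -> int:
    return sext(x + imm, 32) & MASK64          # 32-bit add, then sign-extend


def slli(x: int, shamt: int) -> int:
    return (x << shamt) & MASK64


def slllli(x: int, imm32: int) -> int:
    # Proposed semantics: rd = (rs1 << 32) | imm[31:0]
    return ((x << 32) | (imm32 & 0xFFFFFFFF)) & MASK64


def base_isa_li() -> int:
    """The six-instruction expansion from the disassembly above."""
    t0 = lui(0xE57F7)
    t0 = addiw(t0, -2041)
    t0 = slli(t0, 0x13)
    t0 = addi(t0, -1097)
    t0 = slli(t0, 0xE)
    t0 = addi(t0, -273)
    return t0


def proposed_li(initial: int = 0) -> int:
    """Two SLLLLI instructions; the initial register value is shifted out."""
    t0 = slllli(initial, 0xCAFED00D)
    t0 = slllli(t0, 0xFEEDBEEF)
    return t0


print(hex(base_isa_li()), hex(proposed_li()))
print("bytes:", 6 * 4, "vs", 2 * 6)
```

Both routes produce 0xCAFED00DFEEDBEEF, and on RV64 the result of the two-instruction form does not depend on the previous contents of t0.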

Shift Left Logical Load Large Immediate has a 2:1 compression ratio for 64-bit constant building compared to the Base ISA. SLLLLI is also useful where timing is critical such that it is more appropriate for a constant to be streamed from I$ vs a random read from D$ which could potentially leak timing information via the cache. SLLLLI is more space efficient than AUIPC+LD. Note, there is no provision to use zero as the source given the rdrs1 constraint and we don’t want to use 6% of the 48-bit opcode space for constant building. A 16-bit C.ADDI can be used to clear the register if fewer than XLEN/32 SLLLLI instructions are used to load the value.

LI-Type is similar to U-Type in the 32-bit encoding space which only has 3 instructions (LUI, AUIPC and JAL).

Is 2:1 compression for loading large constants worth 3% of the 48-bit encoding space? A quadrant based approach, as has been used in other ISAs, requires more encoding space, and likewise does separate SHIFT and ORI. SLLLLI uses a relatively small amount of encoding space for a reasonable gain.

It might be a good candidate instruction to get RISC-V decoders wired up to accept 48-bit instructions. Food for thought.

Michael.

[1] https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI%5B101-125%5D

Andrew Waterman

Apr 7, 2017, 10:12:33 PM

to Michael Clark, RISC-V ISA Dev

Some feature along these lines may make sense, but I should point out
that loading 64-bit constants from memory isn't a bad alternative.
With relaxation to gp, it takes 32 bits of code + 64 bits of data to
do so. If the same constant is referenced from multiple objects, the
constants are merged, making the incremental cost 32 bits of code.

In another thread, David Chisnall made the valid point that these
constants may require another D$ line/DTLB entry, but when this is
part of the routine code generation strategy (as it is for our GCC
port), there should be a more than commensurate reduction in I$/ITLB
pressure for some workloads.


Alex Elsayed

Apr 7, 2017, 10:14:10 PM

to isa...@groups.riscv.org

On Friday, 7 April 2017 18:52:04 PDT Michael Clark wrote:

> Hi All,
>
> I was reading a discussion on comp.arch about parsing variable length
> instructions (*1) and it made me think about reserving a subset of the
> 48-bit opcode space for instructions that can take a 32-bit immediate (such
> as CFI.SALA) and getting an experimental but very useful 48-bit instruction
> into RISC-V decoders.

"Useful" requires metrics. What will the average static instruction bytes reduction in programs be? What will the average dynamic instruction bytes reduction in programs be? What will this gain or lose in area vs. no change, macro-op fusion, etc?

In addition, regarding CFI.SALA, I believe I proposed an encoding in the NOP space that effectively took a 32-bit immediate, though admittedly using eight instruction bytes rather than six.

> RISC-V’s base 32-bit instruction format and prevalent 12-bit I-Type
> immediate sacrifices the ability to easily synthesise large immediate
> values however the variable length encoding gives rise to the potential of
> an instruction type with a 32-bit immediate.

On the other hand, how _frequently_ are large immediates synthesized? An instruction that reduces the cost of an idiom by a factor of 2 has a hugely different value depending on whether that idiom occurs in every function prologue/epilogue, vs. one that occurs one, or ten, or even a hundred times in the execution of a program.

> While LUI has a 20-bit immediate, it is limited in the bit patterns it can
> synthesise; it consumes one major opcode, and in position independent code
> is effectively only used for sign extended 32-bit constant building
> (LUI+ADDI). LUI is not used for constructing large immediate values as can
> be seen from LI expansion. On RV128, the current constant building approach
> is even worse.

Yes, but there are also rather few 128-bit constants in current use; much less ones used frequently in the course of execution. I suspect that while 64-bit constants _exist_ in somewhat greater numbers, they are similarly rarely accessed - and when they are, they are likely to be in RODATA rather than immediates.

I'm very curious about cases where a constant read would be a meaningful timing leak. AFAICT, that could only occur if _which_ constant was loaded depended on data that should not leak - which requires either secret-dependent memory access or secret-dependent control flow either way. Both are _intrinsic_ timing leaks; ones this instruction could not resolve.

> SLLLLI is more space efficient than
> AUIPC+LD. Note, there is no provision to use zero as the source given the
> rdrs1 constraint and we don’t want to use 6% of the 48-bit opcode space for
> constant building. A 16-bit C.ADDI can be used to clear the register if
> less than XLEN/32 bits are being loaded.
>
> LI-Type is similar to U-Type in the 32-bit encoding space which only has 3
> instructions (LUI, AUIPC and JAL).
>
> Is 2:1 compression for loading large constants worth 3% of the 48-bit
> encoding space? A quadrant based approach, as has been used in other ISAs,
> requires more encoding space, and likewise does separate SHIFT and ORI.
> SLLLLI uses a relatively small amount of encoding space for a reasonable
> gain.

As noted above, whether it's worth it would have to depend not just on its compression rate in isolation, but in the frequency with which it is applicable to real-world programs.

> It might be a good candidate instruction to get RISC-V decoders wired up to
> accept 48-bit instructions. Food for thought.
>
> Michael.
>
> [1]
> https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI%5B101-125%5D
> <https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI[101-125]>

Michael Clark

Apr 7, 2017, 11:06:41 PM

to Alex Elsayed, isa...@groups.riscv.org

On 8 Apr 2017, at 2:14 PM, Alex Elsayed <etern...@gmail.com> wrote:

On Friday, 7 April 2017 18:52:04 PDT Michael Clark wrote:
> Hi All,
>
> I was reading a discussion on comp.arch about parsing variable length
> instructions (*1) and it made me think about reserving a subset of the
> 48-bit opcode space for instructions that can take a 32-bit immediate (such
> as CFI.SALA) and getting an experimental but very useful 48-bit instruction
> into RISC-V decoders.

"Useful" requires metrics. What will the average static instruction bytes reduction in programs be? What will the average dynamic instruction bytes reduction in programs be? What will this gain or lose in area vs. no change, macro-op fusion, etc?

In addition, regarding CFI.SALA, I believe I proposed an encoding in the NOP space that effectively took a 32-bit immediate, though admittedly using eight instruction bytes rather than six.

The NOP encoding is interesting, but if 48-bit decode with 32-bit immediate is available, this would simplify things. There is mention that many of the ISA extensions will use 48-bit encodings, so this is based on the assumption that 48-bit encodings become prevalent.

> RISC-V’s base 32-bit instruction format and prevalent 12-bit I-Type
> immediate sacrifices the ability to easily synthesise large immediate
> values however the variable length encoding gives rise to the potential of
> an instruction type with a 32-bit immediate.

On the other hand, how _frequently_ are large immediates synthesized? An instruction that reduces the cost of an idiom by a factor of 2 has a hugely different value depending on whether that idiom occurs in every function prologue/epilogue, vs. one that occurs one, or ten, or even a hundred times in the execution of a program.

Good question. I guess it varies based on the ISA and compiler. i.e. whether the ISA can inline a constant in less space than a load from a constant pool. There will be fewer if the compiler emits constants to a constant pool (as does RISC-V gcc) versus encoding them in an instruction immediate. It cuts both ways. Constants that are used once save space with a more compact inline encoding (than AUIPC+LD) but use a little more space if the constant is referenced more than once. The temporal locality of an inline constant in icache is better in some situations. It would need some modelling and experimentation.

x86 has mov r, imm64; however, x86 also has the incredibly difficult problem of figuring out the instruction length during decode in order to decode instructions in parallel. RISC-V models with 3 different instruction lengths (16, 32, 48), with the instruction length indicator at a fixed position in the first half word, make parallel decode of variable length instructions much more tractable… In principle.
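To illustrate why a length indicator at a fixed position helps, here is a minimal sketch of a length decoder that looks only at the first 16-bit parcel. It follows the variable-length encoding scheme described in the RISC-V user-level ISA specification of the time as I understand it; the function name is hypothetical.

```python
def instruction_length_bits(first_parcel: int) -> int:
    """Return the instruction length in bits, given its first 16-bit parcel."""
    p = first_parcel & 0xFFFF
    if p & 0b11 != 0b11:
        return 16                      # bits [1:0] != 11: compressed (RVC)
    if p & 0b11111 != 0b11111:
        return 32                      # bits [4:2] != 111
    if p & 0b111111 == 0b011111:
        return 48                      # bits [5:0] == 011111
    if p & 0b1111111 == 0b0111111:
        return 64                      # bits [6:0] == 0111111
    nnn = (p >> 12) & 0b111            # bits [6:0] == 1111111 from here on
    if nnn != 0b111:
        return 80 + 16 * nnn           # 80 to 176 bits
    raise ValueError("nnn == 111 is reserved for instructions of 192 bits or more")


for parcel in (0x0001, 0x0013, 0x829F, 0x003F, 0x007F):
    print(hex(parcel), instruction_length_bits(parcel))
```

Because every case is decided by the low seven bits plus bits [14:12] of one parcel, a decoder can evaluate this function on every 16-bit parcel in a fetch group in parallel and then select the valid instruction starts.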

Admittedly timing leaks are for data dependent access to arrays of constants, however any code that uses an internal constant can work more predictably in constant time with streaming icache access versus random dcache access. It seems easier to reason about the constant time property of a function that has immediate constants. The other property is that the code can operate without LOAD, which gives the code a property that is distinct from position independence, and that is data memory independence. i.e. the code can be running out of ROM and caches while the memory controller is down. e.g. firmware.

> SLLLLI is more space efficient than
> AUIPC+LD. Note, there is no provision to use zero as the source given the
> rdrs1 constraint and we don’t want to use 6% of the 48-bit opcode space for
> constant building. A 16-bit C.ADDI can be used to clear the register if
> less than XLEN/32 bits are being loaded.
>
> LI-Type is similar to U-Type in the 32-bit encoding space which only has 3
> instructions (LUI, AUIPC and JAL).
>
> Is 2:1 compression for loading large constants worth 3% of the 48-bit
> encoding space? A quadrant based approach, as has been used in other ISAs,
> requires more encoding space, and likewise does separate SHIFT and ORI.
> SLLLLI uses a relatively small amount of encoding space for a reasonable
> gain.

As noted above, whether it's worth it would have to depend not just on its compression rate in isolation, but in the frequency with which it is applicable to real-world programs.

> It might be a good candidate instruction to get RISC-V decoders wired up to
> accept 48-bit instructions. Food for thought.
>
> Michael.
>
> [1]
> https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI%5B101-125%5D
> <https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI[101-125]>


Michael Clark

Apr 7, 2017, 11:29:23 PM

to Andrew Waterman, RISC-V ISA Dev

> On 8 Apr 2017, at 2:12 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>
> Some feature along these lines may make sense, but I should point out
> that loading 64-bit constants from memory isn't a bad alternative.
> With relaxation to gp, it takes 32 bits of code + 64 bits of data to
> do so. If the same constant is referenced from multiple objects, the
> constants are merged, making the incremental cost 32 bits of code.

Yes. It increases icache footprint compared to GP relative accesses.

However now the routine depends on gp. This is fine for most code but there are some cases where code may want to avoid LOAD and to not depend on register state. Firmware. That however is a special case so perhaps not the best argument. There is of course also the limit on the size of the gp relative pool.

> In another thread, David Chisnall made the valid point that these
> constants may require another D$ line/DTLB entry, but when this is
> part of the routine code generation strategy (as it is for our GCC
> port), there should be a more than commensurate reduction in I$/ITLB
> pressure for some workloads.

Perhaps synthesising constants while executing in place from X-only marked pages. That is a little bit obscure but currently takes lots of space.

It would probably need a compiler option to tune constant loads. Compilers for ISAs that have large inline constants appear to use them from disassembly, including imm64. Surprising that assigning 8 individual byte values to an array ended up as a mov imm64.

Actually there is a big downside of a naked imm32 which is the potential for embedding code in an immediate. It may make sense to arrange the bits so that the immediate can’t fully alias with an instruction. Will have to think about the embeddings of 48-bit encodings if the instruction pointer is pointed somewhere where it shouldn’t be.

Stefan O'Rear

Apr 8, 2017, 12:37:38 AM

to Andrew Waterman, Michael Clark, RISC-V ISA Dev

On Fri, Apr 7, 2017 at 7:12 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:
> Some feature along these lines may make sense, but I should point out
> that loading 64-bit constants from memory isn't a bad alternative.
> With relaxation to gp, it takes 32 bits of code + 64 bits of data to
> do so. If the same constant is referenced from multiple objects, the
> constants are merged, making the incremental cost 32 bits of code.

Since I have a compiler sitting around that I know how to modify, I
instrumented Go to log all 64-bit constant generation (it is currently
using per-function literal pools with no coalescing, 16 bytes per
constant) and ran it against something small, something large, and the
biggest thing I had handy.

test/helloworld.go:
195 216 text bytes, 171 references to 67 unique constants

Savings for SLLLLI : 0.35%
Global literal pool: 0.43%
Global literal pool + relaxation: 0.78%

cmd/compile:
4 031 208 text bytes, 1543 references to 453 unique constants

SLLLLI: 0.15%
Global literal pool: 0.22%
Global literal pool + GP relaxation: 0.37%

k8s.io/kubernetes/cmd/hyperkube:
28 662 416 text bytes, 5148 references to 1037 unique constants, 4473
references to the 512 most popular

SLLLLI: 0.07%
Global literal pool: 0.11%
Global literal pool + GP relaxation: 0.18%

(incidentally most of the 64-bit constants look like they came about
because of div → mul transformation in the middle end …)
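The post does not spell out how the percentages were computed. The sketch below is one cost model that happens to reproduce all nine figures to two decimals; every per-reference byte cost in it is an assumption inferred from the thread (16 bytes per reference today, 12 for two SLLLLI, 8 bytes of code for AUIPC+LD, 4 bytes for a gp-relative LD, and a gp window holding the 512 most popular 8-byte constants), not something the author stated.

```python
# Assumed byte costs (inferred, not given in the post).
BASELINE_PER_REF = 16      # per-function literal pool, "16 bytes per constant"
SLLLLI_PER_REF = 12        # two 48-bit instructions
POOL_CODE_PER_REF = 8      # assumed AUIPC+LD to reach a global pool
RELAXED_CODE_PER_REF = 4   # assumed single gp-relative LD after relaxation
CONSTANT_BYTES = 8         # one 64-bit pool entry per unique constant


def savings_percent(text_bytes, refs, unique, relaxed_refs=None):
    """Return (SLLLLI, global pool, global pool + relaxation) savings in %."""
    if relaxed_refs is None:
        relaxed_refs = refs            # every reference is within reach of gp
    baseline = refs * BASELINE_PER_REF
    slllli = refs * SLLLLI_PER_REF
    pool = refs * POOL_CODE_PER_REF + unique * CONSTANT_BYTES
    relaxed = (relaxed_refs * RELAXED_CODE_PER_REF
               + (refs - relaxed_refs) * POOL_CODE_PER_REF
               + unique * CONSTANT_BYTES)
    return tuple(round(100 * (baseline - cost) / text_bytes, 2)
                 for cost in (slllli, pool, relaxed))


print(savings_percent(195_216, 171, 67))                         # helloworld
print(savings_percent(4_031_208, 1543, 453))                     # cmd/compile
print(savings_percent(28_662_416, 5148, 1037, relaxed_refs=4473))  # hyperkube
```

Under this model the gap between SLLLLI and a relaxed global pool comes entirely from sharing: each extra reference to an already pooled constant costs 4 bytes instead of 12.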

-s

Michael Clark

Apr 8, 2017, 2:34:25 AM

to Stefan O'Rear, Andrew Waterman, RISC-V ISA Dev

I guess it can’t be justified based on space saving compared to GP relaxation.

On the surface a 64-bit SLLLLI constant requires streaming 96-bits of icache vs streaming 32-bits icache + 64-bits of random access to dcache so they are equivalent for the best case in bytes transferred considering a GP relative load. Also, 48-bit instructions add a lot of complexity to instruction decode for OoO implementations.

The primary thinking was functions that (don’t) depend on LOAD random access from global memory. e.g. constant time. Either way it’s coming from memory, but the kernel can be temporally coherent. i.e. streamed, and not subject to jitter from “way dependent” data path cache access patterns; essentially global memory access versus the temporal instruction stream.

Constant time synthesis of large constants; avoiding LOAD, for various reasons in X-only pages.

Bruce Hoult

Apr 8, 2017, 9:51:57 AM

to Michael Clark, Stefan O'Rear, Andrew Waterman, RISC-V ISA Dev

That's assuming you *have* a dcache. That is not necessarily the case: microcontrollers such as the E310-G000 in my HiFive1 board (the only RISC-V hardware currently on open sale) have a 16 KB icache but no dcache. There is 16 KB of SRAM scratch for data, but program constants will come from the flash. Loading a 32-bit value from the flash takes on the order of 1000 ns, while loading from icache or SRAM takes 2 clock cycles, which at the default 256 MHz is about 8 ns.

Thus, on this hardware, constants located in the instruction stream are vastly preferable from a speed point of view.

Note: literal pools at the end of the function/BB/whatever do not help -- they take up valuable space in the icache, but loading them as data is not satisfied from the icache.

Michael Clark

Apr 8, 2017, 5:53:30 PM

to Bruce Hoult, Stefan O'Rear, Andrew Waterman, RISC-V ISA Dev

Yes. Even on larger CPUs, a LOAD might be a dozen cycles coming from L2 or many dozens of cycles if coming from RAM. Once the word is in L1 the speed will be closer to constant, but not quite constant, as streaming dcache LOADs cause the constant pool cache way to be evicted to the victim buffer and then ping pong back and forth into L1 depending on the streaming data. Loop jitter. Less determinism.

If it were OpenCL we would have __constant and we would know that the constant is coming from fast scratchpad SRAM versus global memory (DRAM). Perhaps a RISC-V OpenCL implementation might point gp to SRAM.

A constant in an immediate instruction allows separation between what the programmer intends to be in the instruction path versus the data path. I was thinking of functions that want to avoid dcache footprint. In fact we could even move a synthesised DP value into the FPU register file without touching the data path.

Interestingly clang 4.0 and gcc 7 will collapse byte assignments into a 64-bit immediate. This is new. This is not done in gcc 6.3 but it is done in gcc 7 so the immediate must have some advantage on certain architectures, or it is a bug.

- https://godbolt.org/g/u3PkM2

large immediate test

extern char test(char *a);

int main()
{
    char a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    test(a);
}

x86-64 clang 4.0

main: # @main
push rax
movabs rax, 578437695752307201
mov qword ptr [rsp], rax
mov rdi, rsp
call test(char*)
xor eax, eax
pop rcx
ret

x86-64 gcc 7 snapshot

main:
sub rsp, 24
movabs rax, 578437695752307201
lea rdi, [rsp+8]
mov QWORD PTR [rsp+8], rax
call test(char*)
xor eax, eax
add rsp, 24
ret
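The `movabs` immediate in the clang 4.0 and gcc 7 listings above is just the eight array bytes packed in little-endian order, which a short Python check confirms (a sketch; the variable names are illustrative):

```python
# The eight initialisers of `char a[8]` from the test program.
ARRAY_BYTES = bytes([1, 2, 3, 4, 5, 6, 7, 8])

# x86-64 is little-endian, so a[0] ends up in the least significant byte.
imm64 = int.from_bytes(ARRAY_BYTES, "little")

print(imm64, hex(imm64))
```

A single 64-bit store of that value initialises the whole array, which is what replaces the eight byte stores emitted by gcc 6.3.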

x86-64 gcc 6.3

main:
sub rsp, 24
mov rdi, rsp
mov BYTE PTR [rsp], 1
mov BYTE PTR [rsp+1], 2
mov BYTE PTR [rsp+2], 3
mov BYTE PTR [rsp+3], 4
mov BYTE PTR [rsp+4], 5
mov BYTE PTR [rsp+5], 6
mov BYTE PTR [rsp+6], 7
mov BYTE PTR [rsp+7], 8
call test(char*)
xor eax, eax
add rsp, 24
ret

riscv gcc 7.0.1 experimental

.option nopic
.section .sdata,"aw",@progbits
.align 3
.LC0:
.byte 1
.byte 2
.byte 3
.byte 4
.byte 5
.byte 6
.byte 7
.byte 8
.text
.align 1
.globl main
.type main, @function
main:
add sp,sp,-32
sd ra,24(sp)
sd s0,16(sp)
add s0,sp,32
lui a5,%hi(.LC0)
ld a5,%lo(.LC0)(a5)
sd a5,-24(s0)
add a5,s0,-24
mv a0,a5
call test
li a5,0
mv a0,a5
ld ra,24(sp)
ld s0,16(sp)
add sp,sp,32
jr ra
.size main, .-main
.ident "GCC: (GNU) 7.0.1 20170202 (experimental)"

Stefan O'Rear

Apr 8, 2017, 5:58:24 PM

to Michael Clark, Bruce Hoult, Andrew Waterman, RISC-V ISA Dev

On Sat, Apr 8, 2017 at 2:53 PM, Michael Clark <michae...@mac.com> wrote:
> Interestingly clang 4.0 and gcc 7 will collapse byte assignments into a
> 64-bit immediate. This is new. This is not done in gcc 6.3 but it is done in
> gcc 7 so the immediate must have some advantage on certain architectures, or
> it is a bug.
>
> - https://godbolt.org/g/u3PkM2
>
> large immediate test
>
> extern char test(char *a);
>
> int main()
> {
> char a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};
> test(a);
> }

Advantage compared to what? a is a non-const object so passing test a
pointer to rodata would not be standards-conforming.

-s

Michael Clark

Apr 8, 2017, 6:07:10 PM

to Stefan O'Rear, Bruce Hoult, Andrew Waterman, RISC-V ISA Dev

I was referring to the use of a 64-bit immediate. The compiler could copy the constant on the RHS into the array, like it does with -O0

e.g. https://godbolt.org/g/aZi5zp

Jacob Bachmeyer

Apr 8, 2017, 6:16:03 PM

to Alex Elsayed, isa...@groups.riscv.org

Alex Elsayed wrote:
>
> On Friday, 7 April 2017 18:52:04 PDT Michael Clark wrote:
>
> > Hi All,
>
> >
>
> > I was reading a discussion on comp.arch about parsing variable length
>
> > instructions (*1) and it made me think about reserving a subset of the
>
> > 48-bit opcode space for instructions that can take a 32-bit
> immediate (such
>
> > as CFI.SALA) and getting an experimental but very useful 48-bit
> instruction
>
> > into RISC-V decoders.
>
>
>
> "Useful" requires metrics. What will the average static instruction
> bytes reduction in programs be? What will the average dynamic
> instruction bytes reduction in programs be? What will this gain or
> lose in area vs. no change, macro-op fusion, etc?
>
>
>
> In addition, regarding CFI.SALA, I believe I proposed an encoding in
> the NOP space that effectively took a 32-bit immediate, though
> admittedly using eight instruction bytes rather than six.
>

My suggested CFI.SALA was a 64-bit instruction and was also a direct
branch. This allows it to be used to make dense jump tables which
increases the number of "slots" within reach of a single base register
value. I later half-suggested an encoding for PROPOSE/ACCEPT that used
two 14-bit half cookies that could be hidden in U-type NOPs using about
1/16 of either AUIPC x0 or LUI x0, but this would omit the "how did we
get here?" option for accepting an indirect jump and would be much less
useful in jump tables, due to its larger size.

-- Jacob

Stefan O'Rear

Apr 8, 2017, 6:34:28 PM

to Michael Clark, Bruce Hoult, Andrew Waterman, RISC-V ISA Dev

On Sat, Apr 8, 2017 at 3:07 PM, Michael Clark <michae...@mac.com> wrote:
> I was referring to the use of a 64-bit immediate. The compiler could copy the constant on the RHS into the array, like it does with -O0
>
> e.g. https://godbolt.org/g/aZi5zp

It's fairly clear that if you have 64-bit immediates you should use them.

It's far less clear that if you don't have them you should add them.
The benefits for general code are rather marginal.

It is especially less clear if your ISA is explicitly optimized for simplicity.

-s

Jacob Bachmeyer

Apr 8, 2017, 7:58:28 PM

to Michael Clark, RISC-V ISA Dev

Michael Clark wrote:
> The proposal is for a 48-bit LI-Type instruction format (Large
> Immediate) where the register is positioned in the same place as RVC
> ‘rdrs1’ field so that register decode can be shared with RVC, albeit
> with a different instruction length.
>
> | imm32 | nnnn | rdrs1[4:0] | n | 011111 |
>
>
> I then propose a 48-bit LI-Type instruction for large constant
> building (Bv1 or Cv3 extension):
>
> Instruction: Shift Left Logical Load Large Immediate
> Type: LI-Type
> Format: SLLLLI rdrs1, imm32
> Pseudo: rd = (rs1 << 32) ∨ imm[31:0]
>
> Encoding: | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |
>
>
> By necessity, LI-Type has the constraint where rd = rs1, as is
> necessary to fit a 32-bit immediate into a 48-bit instruction. Many
> RVC 16-bit encodings already have the rd = rs1 constraint with the
> register in the same bit position. There are 10-bits remaining after
> subtraction of the 6-bit length encoding for a 48-bit instruction, so
> encoding one register and a 32-bit immediate leaves 5-bits thus one
> LI-Type instruction takes 1/32th (~3%) of the 48-bit encoding space.

I like the concept; something similar was one of the first ideas I had
for improving RISC-V, but ran into some problems related to preventing
immediate values from being mistaken as code after an incorrect branch.

The first problem I have with this is the name: all other shift
instructions in RISC-V have variable shift amounts. I almost asked
about hardware complexity, then noticed that SLLLLI shifts by a fixed 32
bits. Further, the mnemonic, while cute, is confusingly similar to the
very different SLLI in baseline RISC-V. I suggest "Incremental Load
Immediate" or "ILI".

The second problem is one that previously had me suggesting a new page
table format to add an XLEN-bit "long immediate obfuscation" value that
would be XORed with every long immediate in that page. While this would
make attacks on JITs infeasible, it would also open several cans of
worms of its own. Probably, the best solution is to make ILI a 64-bit
instruction containing a 32-bit immediate.

To prevent the abuse of immediate values, we need to expand them such
that any parcel containing bits of the value cannot be executed as a
"useful" instruction by an attacker. If we require bits 15,14,13, and 0
of every value-carrying parcel to be zero, we can store 12 bits per
parcel and those parcels can only be executed as C.NOP, C.ADDI, C.SLLI,
or C.SLLI64. Hopefully, this will be a sufficient defense against
abuse. There was a choice between permitting C.SLLI/C.SLLI64 or
C.ADDI4SPN: C.SLLI appears less dangerous because (like C.ADDI) it can
only read and modify a single register, while C.ADDI4SPN copies a value
derived from the stack pointer (!) to another register.

Concretely: (further shuffling of the immediate bits might be useful)

inst[63:61] = parcel3[15:13] = 3'b000 # RVC constraint
inst[ 60 ] = parcel3[ 12 ] = 1'b0 # extra bit
inst[59:49] = parcel3[11: 1] = imm32[31:21] # immediate fragment
inst[ 48 ] = parcel3[ 0 ] = 1'b0 # RVC constraint
inst[47:45] = parcel2[15:13] = 3'b000 # RVC constraint
inst[44:33] = parcel2[12: 1] = imm32[20: 9] # immediate fragment
inst[ 32 ] = parcel2[ 0 ] = 1'b0 # RVC constraint
inst[31:29] = parcel1[15:13] = 3'b000 # RVC constraint
inst[28:20] = parcel1[12: 4] = imm32[ 8: 0] # common imm[8:0]
inst[19:17] = parcel1[ 3: 1] = 3'b000 # extra bits
inst[ 16 ] = parcel1[ 0 ] = 1'b0 # RVC constraint
inst[ 15 ] = parcel0[ 15 ] = 1'b0 # extra bit
inst[14:12] = parcel0[14:12] = 3'b000 # common funct3
inst[11: 7] = parcel0[11: 7] = rs1/rd # common RVC source/destination register
inst[ 6: 0] = parcel0[ 6: 0] = 7'b0111111 # 64-bit instruction

This form touches 1/8 of the 64-bit encoding space (if only funct3 is
considered as the multiplex point), but contains 20 bits that
must-be-zero for the instruction to be recognized and therefore actually
consumes 2^-20 of the 64-bit encoding space. Due to the RVC
constraints, the high 3 bits of the common funct7 field are also zero;
combining this with funct3 reduces the encoding space touched to 1/64.
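The bit assignment above can be checked mechanically. The following is a minimal sketch that packs and unpacks the suggested 64-bit ILI layout as four 16-bit parcels and verifies the stated constraint (bits 15, 14, 13 and 0 of every value-carrying parcel are zero); the function names are hypothetical and the layout is taken directly from the table.

```python
RVC_MUST_BE_ZERO = 0b1110_0000_0000_0001   # bits 15, 14, 13 and 0
OPCODE_64BIT = 0b0111111                   # 64-bit length prefix, bits [6:0]


def encode_ili(rd: int, imm32: int) -> list:
    """Return [parcel0, parcel1, parcel2, parcel3] for the suggested layout."""
    if not 0 <= rd < 32 or not 0 <= imm32 < (1 << 32):
        raise ValueError("rd must be 5 bits and imm32 must be 32 bits")
    parcel0 = (rd << 7) | OPCODE_64BIT             # funct3 and bit 15 are zero
    parcel1 = (imm32 & 0x1FF) << 4                 # imm32[8:0]   -> bits [12:4]
    parcel2 = ((imm32 >> 9) & 0xFFF) << 1          # imm32[20:9]  -> bits [12:1]
    parcel3 = ((imm32 >> 21) & 0x7FF) << 1         # imm32[31:21] -> bits [11:1]
    return [parcel0, parcel1, parcel2, parcel3]


def decode_ili(parcels: list) -> tuple:
    """Recover (rd, imm32) from the four parcels."""
    p0, p1, p2, p3 = parcels
    rd = (p0 >> 7) & 0x1F
    imm32 = (((p1 >> 4) & 0x1FF)
             | (((p2 >> 1) & 0xFFF) << 9)
             | (((p3 >> 1) & 0x7FF) << 21))
    return rd, imm32


# Must-be-zero bits per parcel, counted from the table: 4 + 7 + 4 + 5.
must_be_zero_bits = 4 + 7 + 4 + 5
print(must_be_zero_bits, 2.0 ** -must_be_zero_bits)

parcels = encode_ili(5, 0xFFFFFFFF)
print([hex(p) for p in parcels])
print(all(p & RVC_MUST_BE_ZERO == 0 for p in parcels[1:]))
```

Even with an all-ones immediate, the three value-carrying parcels keep their top three bits and bit 0 clear, and the 20 fixed zero bits account for the 2^-20 share of the 64-bit encoding space.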

A 64-bit constant can be loaded on RV64 with two 64-bit ILI
instructions, totaling 16 bytes, which while longer than the 12 bytes
that 48-bit instructions would require, is still less than the 24 bytes
that the baseline ISA requires. I expect the improvements on RV128 to
be much larger.

Also note that there is no need to be able to shift from zero--XLEN/32
ILI instructions will completely replace the previous contents of the
register. Shorter constants can be loaded using LUI/ADDI/ILI sequences,
since LUI clears its destination register.

Another alternative could be a 32-bit ILI instruction that shifts in
12-bit units, but there is no room in OP-IMM and SLLI-by-12/ADDI would
be an equivalent macro-op fusion pair.

> Shift Left Logical Load Large Immediate has a 2:1 compression ratio
> for 64-bit constant building compared to the Base ISA. SLLLLI is also
> useful where timing is critical such that it is more appropriate for a
> constant to be streamed from I$ vs a random read from D$ which could
> potentially leak timing information via the cache. SLLLLI is more
> space efficient than AUIPC+LD. Note, there is no provision to use zero
> as the source given the rdrs1 constraint and we don’t want to use 6%
> of the 48-bit opcode space for constant building. A 16-bit C.ADDI can
> be used to clear the register if less than XLEN/32 bits are being loaded.
>
> LI-Type is similar to U-Type in the 32-bit encoding space which only
> has 3 instructions (LUI, AUIPC and JAL).
>
> Is 2:1 compression for loading large constants worth 3% of the 48-bit
> encoding space? A quadrant based approach, as has been used in other
> ISAs, requires more encoding space, and likewise does separate SHIFT
> and ORI. SLLLLI uses a relatively small amount of encoding space for a
> reasonable gain.

The 64-bit ILI that I suggest offers a slightly reduced 3:2 compression
at a hard cost of about one millionth of the 64-bit encoding space and
creates far fewer opportunities for mischief with JIT runtimes. A
softer cost is that about 1.5% of the 64-bit encoding space must be
planned around this instruction, if only the must-be-zero bits from the
common funct7 and funct3 fields are considered.

To adapt some text from the earlier proposal: Is 3:2 compression for
loading large constants worth affecting 1.5% of the 64-bit encoding
space while actually consuming a millionth of the 64-bit encoding
space? This approach also curtails the potential for mischief by
jumping into the middle of a long load-immediate instruction that the
48-bit form offers.

> It might be a good candidate instruction to get RISC-V decoders wired
> up to accept 48-bit instructions. Food for thought.

The combination of 16/32/64-bit instructions, all aligned on any 16-bit
boundary (as needed to both support RVC and any standard 64-bit
instruction set extension) makes the additional complexity to support
48-bit instruction length negligible.

-- Jacob

Michael Clark

Apr 8, 2017, 8:40:46 PM

to jcb6...@gmail.com, RISC-V ISA Dev

The 48-bit LI-Type encoding could be adjusted such that some immediate bits are shuffled around. I don’t think it is worth worrying about. Various instruction embeddings are unfortunately going to be the common case with variable length instructions. If anything, it makes CFI enabled by default more important. With CFI enabled, the instruction embedding problem essentially goes away. I noticed that Clang’s CFI works by calling a function to verify function pointers before indirect calls and apparently the performance overhead is not too bad, such that it will be enabled on production builds of mainstream software:

- https://clang.llvm.org/docs/ControlFlowIntegrity.html

It seems likely that Clang CFI may be enabled by default on future Chrome builds. However, it seems that hardware support may provide only a small performance improvement. Interesting nevertheless. Perhaps all indirect calls can just AUIPC+JALR to ‘verify_vtable’. Chrome on RISC-V is definitely going to exceed the +/-1MiB JAL range for static function calls and its constant pool is also possibly larger than 2^12 (gp relative).

In any case it will be interesting when RISC-V has 16-bit, 32-bit, 48-bit, and 64-bit instructions, as per the variable length instruction coding scheme…

I find the break in the prefix code for 80-bit to 192-bit instructions to be quite fascinating. The coding strategy changes from a 7-bit prefix for 64-bit instructions to a 7-bit prefix and a 3-bit multiplier to encode the number of additional 16-bit words in the instruction. This segment count compression scheme would seem to indicate that very long instruction words (80-bit to 192-bit) may become more frequent the longer the instructions get; otherwise, if they were expected to be less frequent, one would expect to see a continuation of the longer prefix scheme.

RISC-V instruction decoders that work on >= 80-bit instructions will clearly need a different strategy than 16-bit, 32-bit, 48-bit, and 64-bit instructions. It will be interesting to see how they end up being used…

Michael Clark

Apr 8, 2017, 9:30:22 PM

to jcb6...@gmail.com, RISC-V ISA Dev

On 9 Apr 2017, at 11:58 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Michael Clark wrote:
The proposal is for a 48-bit LI-Type instruction format (Large Immediate) where the register is positioned in the same place as RVC ‘rdrs1’ field so that register decode can be shared with RVC, albeit with a different instruction length.

   | imm32 | nnnn | rdrs1[4:0] | n | 011111 |

I then propose a 48-bit LI-Type instruction for large constant building (Bv1 or Cv3 extension):

   Instruction: Shift Left Logical Load Large Immediate
   Type: LI-Type
   Format: SLLLLI rdrs1, imm32
   Pseudo: rd = (rs1 << 32) ∨ imm[31:0]

   Encoding: | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |

By necessity, LI-Type has the constraint where rd = rs1, as is necessary to fit a 32-bit immediate into a 48-bit instruction. Many RVC 16-bit encodings already have the rd = rs1 constraint with the register in the same bit position. There are 10-bits remaining after subtraction of the 6-bit length encoding for a 48-bit instruction, so encoding one register and a 32-bit immediate leaves 5-bits thus one LI-Type instruction takes 1/32th (~3%) of the 48-bit encoding space.
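
To make the bit layout concrete, here is a minimal sketch of packing and unpacking the proposed SLLLLI encoding. It assumes the field diagram reads from most significant to least significant bit, so imm32 sits in bits 47:16, nnnn in 15:12, rdrs1 in 11:7 (the RVC rd/rs1 position), n in bit 6 and the length prefix in bits 5:0. The function names are hypothetical and not part of any toolchain.

```python
LEN48_PREFIX = 0b011111  # low six bits marking a 48-bit instruction
SLLLLI_NNNN = 0b1000     # 'nnnn' selector from the proposed encoding
SLLLLI_N = 0             # single 'n' bit from the proposed encoding

def encode_slllli(rdrs1, imm32):
    """Pack SLLLLI rdrs1, imm32 into a 48-bit integer."""
    if not 0 <= rdrs1 < 32:
        raise ValueError("rdrs1 must be a 5-bit register number")
    return ((imm32 & 0xFFFFFFFF) << 16
            | SLLLLI_NNNN << 12
            | rdrs1 << 7
            | SLLLLI_N << 6
            | LEN48_PREFIX)

def decode_li_type(insn):
    """Split a 48-bit LI-Type word back into its fields."""
    return {
        "prefix": insn & 0x3F,
        "n": (insn >> 6) & 0x1,
        "rdrs1": (insn >> 7) & 0x1F,
        "nnnn": (insn >> 12) & 0xF,
        "imm32": (insn >> 16) & 0xFFFFFFFF,
    }

# Example: SLLLLI t0, 0xCAFED00D (t0 is x5).
word = encode_slllli(5, 0xCAFED00D)
print(hex(word), decode_li_type(word))
```

The five bits left over after the register and immediate (nnnn plus n) are what make one LI-Type instruction cost 1/32 of the 48-bit space.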

I like the concept; something similar was one of the first ideas I had for improving RISC-V, but ran into some problems related to preventing immediate values from being mistaken as code after an incorrect branch.

The first problem I have with this is the name: all other shift instructions in RISC-V have variable shift amounts. I almost asked about hardware complexity, then noticed that SLLLLI shifts by a fixed 32 bits. Further, the mnemonic, while cute, is confusingly similar to the very different SLLI in baseline RISC-V. I suggest "Incremental Load Immediate" or "ILI".

ILI is an okay name.

The 32-bit shift is to avoid taking up encoding space for halves or quadrants (like the half with ARM’s movt which shifts a 16-bit immediate to bits 31:16). A constant offset and constant position for the immediate doesn’t take much area. If an instruction had quadrants for loading 128-bit values it would take up 12% of the 48-bit opcode space and the wiring would be more complex as the immediate would need to be muxed into multiple positions. Circuit wise, the ILI instruction should be relatively simple:

BTW ARM makes the same justification for movw/movt, which is avoiding cache misses on the data side, which is likely why compiler engineers use imm64 on x86:

- https://community.arm.com/processors/b/blog/posts/how-to-load-constants-in-assembly-for-arm-architecture

I think 3%, or one 48-bit LI-Type instruction (5-bit register and 32-bit immediate), is not too bad given the functionality. I can imagine use cases for building large constants where we explicitly want to avoid dcache footprint and are doing so in X-only memory so we can’t use LOAD. We don’t currently build large constants with instructions because it is extremely inefficient to do so with SLLI and 12-bit ADDI. LUI is of marginal use when we are, say, trying to load a 128-bit constant in X-only memory, for use in a trampoline.

The second problem is one that previously had me suggesting a new page table format to add an XLEN-bit "long immediate obfuscation" value that would be XORed with every long immediate in that page. While this would make attacks on JITs infeasible, it would also open several cans of worms of its own. Probably, the best solution is to make ILI a 64-bit instruction containing a 32-bit immediate.

JITs can use CFI techniques. CFI.SALA.

Jacob Bachmeyer

Apr 8, 2017, 10:21:47 PM

to Michael Clark, RISC-V ISA Dev

Michael Clark wrote:
>> On 9 Apr 2017, at 11:58 AM, Jacob Bachmeyer <jcb6...@gmail.com

Fair enough, but I suggest a 64-bit form of ILI, which is still denser
than the current approach and reduces the risk of the immediate being
abused.

>> The second problem is one that previously had me suggesting a new
>> page table format to add an XLEN-bit "long immediate obfuscation"
>> value that would be XORed with every long immediate in that page.
>> While this would make attacks on JITs infeasible, it would also open
>> several cans of worms of its own. Probably, the best solution is to
>> make ILI a 64-bit instruction containing a 32-bit immediate.
>
> JITs can use CFI techniques. CFI.SALA.

Convenient, but CFI.SALA will almost certainly not be in the same
extension as ILI, so there are good reasons to limit the abuse potential
of ILI.

-- Jacob

Jacob Bachmeyer

Apr 8, 2017, 10:57:28 PM

to Michael Clark, RISC-V ISA Dev

Michael Clark wrote:
>> On 9 Apr 2017, at 11:58 AM, Jacob Bachmeyer <jcb6...@gmail.com

It is not worth worrying about yet--further bit shuffling would be to
minimize hardware complexity, not to change or protect semantics. Any
shuffling of immediate bits can be compensated for by an attacker.

> Various instruction embeddings are unfortunately going to be the
> common case with variable length instructions. If anything, it makes
> CFI enabled by default more important. With CFI enabled, the
> instruction embedding problem essentially goes away.

CFI is great and all, but will almost certainly not be in the same
extension as ILI. Therefore, implementations may have both, either, or
neither. This is a very good reason to limit the abuse potential of
ILI. I seem to recall a sandbox escape that relied on a MOV reg,imm64,
although I cannot find it at the moment as it is buried under so many
other sandbox escapes. Put simply, user-provided constants are a unique
hazard, although I suppose a JIT could mask them with random values at
the cost of an extra temporary. (Generate nonce N; emit "load (A XOR
N); load N; XOR (A XOR N), N -> A"; discard N) Note that such a trace,
needing two 48-bit ILI operations per 32-bit fragment (one for masked
data; one for nonce) and the XOR to combine them (possibly a C.XOR) is
far longer than a single 64-bit "hardened" ILI instruction per 32-bit
fragment.
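
The masking trick in the parenthetical above can be sketched as follows. This is only an illustration of the idea, not code from any real JIT: the function names are hypothetical, the nonce source (`secrets.randbits`) is my choice, and the byte counts simply follow the accounting in the paragraph (6-byte 48-bit ILI, 2-byte C.XOR, 8-byte 64-bit ILI).

```python
import secrets

MASK32 = 0xFFFFFFFF

def blind_fragment(a, nonce=None):
    """Return (A XOR N, N): the two immediates a JIT would emit
    instead of the user-provided 32-bit fragment A."""
    if nonce is None:
        nonce = secrets.randbits(32)  # fresh nonce per fragment
    return (a ^ nonce) & MASK32, nonce & MASK32

def unblind(masked, nonce):
    """What the emitted XOR computes at run time: (A XOR N) XOR N = A."""
    return (masked ^ nonce) & MASK32

# Code size per 32-bit fragment, following the paragraph above.
ILI48_BYTES, C_XOR_BYTES, ILI64_BYTES = 6, 2, 8
blinded_bytes = 2 * ILI48_BYTES + C_XOR_BYTES  # masked data, nonce, XOR
hardened_bytes = ILI64_BYTES                   # one 64-bit ILI

masked, nonce = blind_fragment(0xCAFED00D)
assert unblind(masked, nonce) == 0xCAFED00D
print(blinded_bytes, "bytes blinded vs", hardened_bytes, "bytes hardened")
```

Neither emitted immediate is chosen by the attacker alone, which is the point of the nonce; the cost is the extra temporary and the larger footprint.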

> I noticed that Clang’s CFI works by calling a function to verify
> function pointers before indirect calls and apparently the performance
> overhead is not too bad, such that it will be enabled on production
> builds of mainstream software:
>
> - https://clang.llvm.org/docs/ControlFlowIntegrity.html
>
> It seems that it is likely that Clang CFI may be enabled by default on
> future Chrome builds. However it seems that hardware support may only
> provide a small performance improvement. Interesting nevertheless.
> Perhaps all indirect calls can just AUIPC+JALR ‘verify_vtable' for
> indirect calls. Chrome on RISC-V is definitely going to exceed the
> +/-1MiB JAL range for static function calls and its constant pool is
> also possibly larger than 2^12 (gp relative).
>
> In any case it will be interesting when RISC-V has 16-bit, 32-bit,
> 48-bit, and 64-bit instructions, as per the variable length
> instruction coding scheme…
>
> I find the break in the prefix code for 80-bit to 192-bit instructions
> to be quite fascinating. The coding strategy changes from a 7-bit
> prefix for 64-bit instructions to a 7-bit prefix and a 3-bit
> multiplier to encode the number of additional 16-bit words in the
> instruction. This segment count compression scheme would seem to
> indicate that the probability of very long instruction words (80-bit
> to 192-bit) may increase in frequency the longer the instructions
> gets, otherwise one would expect to see a continuation of the longer
> prefix scheme, if they were expected to be less frequent.

Or it is a case of taking advantage of existing fields--that 3-bit
length counter is the funct3 field in shorter instructions. I would
suggest taking the same approach with the funct7 field for
instructions >176 bits: encode the number of parcels beyond 11 in the
7 bits that form the funct7 field in a baseline R-type instruction. I
see it more as individual bits in first parcel(s) becoming less
important as the overall instruction length increases--the prefix
scheme helps to pack more information into the shorter instructions.
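
For reference, the length-prefix scheme under discussion can be sketched as a small decoder over the first 16-bit parcel. This follows the variable-length encoding table as I understand it from the user-level ISA specification of the time; the function name is hypothetical, and the funct7 extension suggested above for instructions longer than 176 bits is not modelled (that case just returns None).

```python
def instruction_length_bits(parcel):
    """Instruction length in bits, determined from the first 16-bit
    parcel; None for the reserved encoding of 192 bits and longer."""
    if parcel & 0b11 != 0b11:
        return 16                      # xxxxxxxxxxxxxxaa, aa != 11
    if parcel & 0b11100 != 0b11100:
        return 32                      # xxxxxxxxxxxbbb11, bbb != 111
    if parcel & 0b111111 == 0b011111:
        return 48                      # xxxxxxxxxx011111
    if parcel & 0b1111111 == 0b0111111:
        return 64                      # xxxxxxxxx0111111
    # Low seven bits are all ones: bits 14:12 (the funct3 position)
    # count the additional 16-bit parcels beyond 80 bits.
    nnn = (parcel >> 12) & 0b111
    if nnn != 0b111:
        return 80 + 16 * nnn           # 80 to 176 bits
    return None                        # reserved for >= 192 bits

for first_parcel in (0x0001, 0x0013, 0x001F, 0x003F, 0x007F, 0x607F, 0x707F):
    print(hex(first_parcel), instruction_length_bits(first_parcel))
```

The first four cases are pure prefix matching; only the last step reads a separate count field, which is the change in strategy being debated.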

> RISC-V instruction decoders that work on >= 80-bit instructions will
> clearly need a different strategy than 16-bit, 32-bit, 48-bit, and
> 64-bit instructions. It will be interesting to see how they end up
> being used…

I disagree. It is largely a simple matter of a larger decode buffer.
Much more interesting is the question of what instructions will actually
need that many bits. (Chained operations are one possibility that came
up in another thread.)

-- Jacob

Michael Clark

Apr 8, 2017, 11:09:43 PM

to jcb6...@gmail.com, RISC-V ISA Dev

There are interesting CFI-related issues with variable length instruction encodings.

I question whether it is realistic to reserve bits in longer instruction formats to restrict possible embeddings of smaller instructions as by-products of immediate values and/or registers or any other function selectors. It may place onerous restrictions on the design of larger instruction formats and even if we were to restrict embeddings to say just one C quadrant, there are still potentially dangerous embeddings that can adjust register contents, so one questions whether the endeavour is worthwhile.

One could make a reasonable argument that CFI is a necessary state of the art protection for variable length instruction sets given the current state of the art in attack vectors on other architectures with variable length instructions, assuming one chooses this level of protection. Many ISAs have simple immediate encodings or constants in text which could embed instructions, and CFI doesn’t necessitate ISA extensions; however, ISA extensions can potentially reduce the overhead. It’s likely RISC-V will gain Clang’s CFI when the LLVM port is complete.

No matter how careful the design, there are going to be executable embeddings, unless more bits than the longest length prefix are removed from the add-on half-words such that any longer encoding does not have executable embeddings. This is not practical. I think CFI is more practical than trying to avoid these embeddings. rs1 in a 32-bit instruction is already in the position where it can select a 16-bit opcode so there are already likely to be some interesting RVC embeddings. Would have to do some analysis…

Jacob Bachmeyer

Apr 9, 2017, 1:20:56 AM

to Michael Clark, RISC-V ISA Dev

Michael Clark wrote:
>> On 9 Apr 2017, at 2:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> Various instruction embeddings are unfortunately going to be the common case with variable length instructions. If anything, it makes CFI enabled by default more important. With CFI enabled, the instruction embedding problem essentially goes away.
>>>
>> CFI is great and all, but will almost certainly not be in the same extension as ILI. Therefore, implementations may have both, either, or neither. This is a very good reason to limit the abuse potential of ILI. I seem to recall a sandbox escape that relied on a MOV reg,imm64, although I cannot find it at the moment as it is buried under so many other sandbox escapes. Put simply, user-provided constants are a unique hazard, although I suppose a JIT could mask them with random values at the cost of an extra temporary. (Generate nonce N; emit "load (A XOR N); load N; XOR (A XOR N), N -> A"; discard N) Note that such a trace, needing two 48-bit ILI operations per 32-bit fragment (one for masked data; one for nonce) and the XOR to combine them (possibly a C.XOR) is far longer than a single 64-bit "hardened" ILI instruction per 32-bit fragment.
>>
>
> There are interesting CFI-related issues with variable length instruction encodings.
>
> I question whether it is realistic to reserve bits in longer instruction formats to restrict possible embeddings of smaller instructions as by-products of immediate values and/or registers or any other function selectors. It may place onerous restrictions on the design of larger instruction formats and even if we were to restrict embeddings to say just one C quadrant, there are still potentially dangerous embeddings that can adjust register contents, so one questions whether the endeavour is worthwhile.
>

It is not necessary in general. 32-bit long immediate values are a
special hazard because of the opportunity to embed complete instructions
in user-provided data. This is a particular hazard for JIT and dynamic
binary translation systems. Shuffling bits in the immediate may make
hardware decoding easier, but provides no security benefit. The
provided immediate should be broken up in order to limit the possible
mischief. Note that if the suggested 64-bit form of ILI is considered
to touch 1/64 of the 64-bit encoding space (multiplexing on funct3 and
the top three bits of funct7) then 14 bits of brownfield opcode space
(funct3 = 3'b000, funct7 = 7'b000xxxx; ILI is only recognized if another
14 bits in the 64-bit word are all zero) is available around the 64-bit
ILI instruction that does not meet the RVC constraints. The use of this
space to encode other operations that do not provide such easy mischief
is perfectly fine.

That all executable embeddings in 64-bit ILI are 16-bit instructions was
an intentional choice in order to more easily control the possible
interpretations.

> One could make a reasonable argument that CFI is a necessary state of the art protection for variable length instruction sets given the current state of the art in attack vectors on other architectures with variable length instructions, assuming one chooses this level of protection. Many ISAs have simple immediate encodings or constants in text which could embed instructions and CFI doesn’t necessitate ISA extensions, however ISA extensions can potentially reduce the overhead. It’s likely RISC-V will gain Clang’s CFI when the LLVM port is complete.
>

Such measures still must be used and it is far too easy to either
neglect them or to decide to not use them, possibly with good reason but
more probably with no good reason. Baking CFI into the RISC-V baseline
will not happen. Since ILI is a uniquely abusable operation, it should
have special restrictions. Defense-in-depth.

> No matter how careful the design, there are going to be executable embeddings, unless more bits than the longest length prefix are removed from the add-on half-words such that any longer encoding does not have executable embeddings. This is not practical. I think CFI is more practical than trying to avoid these embeddings. rs1 in a 32-bit instruction is already in the position where it can select a 16-bit opcode so there are already likely to be some interesting RVC embeddings. Would have to do some analysis…
>

Executable embeddings are generally not a problem in and of themselves.
They become a problem when an attacker can produce them at will and
control what they will do. The position of rs1 in a 32-bit instruction
is not a cause for concern, since a JIT or translator can freely control
the value of the rs1 field. An attacker can predict rs1 and build
exploitable operations only with precise knowledge of JIT internals.
CFI is good, yes, but it will not be in the RISC-V baseline, so
implementations may very well have ILI, various JIT features, and not
have CFI support. (Could ILI and CFI.SALA both be in the J extension?
CFI is particularly important in JIT environments. Hmmmm.)

Again, 64-bit ILI is structured the way it is precisely because 32-bit
(and longer) immediate values are uniquely abusable.

-- Jacob

Rogier Brussee

Apr 9, 2017, 2:29:43 PM

to RISC-V ISA Dev

Just to clarify, but shouldn't a 64-bit load be

C.LI t0, 0
SLLLLI t0, 0xCAFED00D
SLLLLI t0, 0xFEEDBEEF

I have supposed the first li is an RVC instruction, so this is 14 bytes. But mutatis mutandis the sequence generated by the compiler would have two RVC slli instructions as well, so that would be 20 bytes rather than 24.

The one place where I could imagine large "constants" being really common is function calls, and I can completely imagine a 48-bit jal/j with an immediate of 31 bits or so, with only ra or zero as the link register, possibly with an extra bit to select between absolute and relative.

On Saturday, 8 April 2017 03:52:10 UTC+2, michaeljclark wrote:

Alex Elsayed

Apr 9, 2017, 2:34:33 PM

to isa...@groups.riscv.org

No, the C.LI is entirely unnecessary, because all of the bits that were in t0 before the SLLLLI sequence would be shifted off the upper end.

(On RV32, only a single SLLLLI would be used; on RV128, four SLLLLI suffice. In all cases, all bits that were once in t0 are gone by the end of the sequence.)
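
A small simulation makes the point easy to check. The sketch below models the proposed semantics, rd = (rs1 << 32) | imm[31:0] truncated to XLEN bits, and starts from arbitrary register contents; the helper names are hypothetical.

```python
def slllli(reg, imm32, xlen):
    """One SLLLLI step: shift left by 32, OR in the immediate,
    keep only the low XLEN bits."""
    return ((reg << 32) | (imm32 & 0xFFFFFFFF)) & ((1 << xlen) - 1)

def build_constant(value, xlen, initial):
    """Build an XLEN-bit constant with XLEN/32 SLLLLI steps, most
    significant 32-bit fragment first, from any initial register value."""
    reg = initial & ((1 << xlen) - 1)
    steps = xlen // 32
    for i in reversed(range(steps)):
        reg = slllli(reg, (value >> (32 * i)) & 0xFFFFFFFF, xlen)
    return reg, steps

# RV64: two steps, stale register contents do not survive.
result, steps = build_constant(0xCAFED00DFEEDBEEF, 64, 0xDEADBEEFDEADBEEF)
assert result == 0xCAFED00DFEEDBEEF and steps == 2
print(hex(result), "in", steps, "instructions,", steps * 6, "bytes")
```

With 6-byte instructions, the RV64 case comes to 12 bytes and needs no clearing instruction in front.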

> > <https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI[101-125]>


Bruce Hoult

Apr 9, 2017, 5:19:13 PM

to Rogier Brussee, RISC-V ISA Dev

On Sun, Apr 9, 2017 at 9:29 PM, Rogier Brussee <rogier....@gmail.com> wrote:

Just to clarify, but shouldn't a 64-bit load be

C.LI t0, 0
SLLLLI t0, 0xCAFED00D
SLLLLI t0, 0xFEEDBEEF

I have supposed the first li is an RVC instruction, so this is 14 bytes. But mutatis mutandis the sequence generated by the compiler would have two RVC slli instructions as well, so that would be 20 bytes rather than 24.

The one place where I could imagine large "constants" being really common is function calls, and I can completely imagine a 48-bit jal/j with an immediate of 31 bits or so, with only ra or zero as the link register, possibly with an extra bit to select between absolute and relative.

Floating point code is likely to be full of them.

Andrew Waterman

Apr 9, 2017, 5:31:25 PM

to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev

You could make a case for an FLI instruction, which takes e.g. a sign,
a few exponent bits, and a few significand bits; but if the constant
generation requires a move from the integer unit, it's typically more
profitable to use gp-relative accesses to a constant pool instead.


Rogier Brussee

Apr 10, 2017, 3:10:57 AM

to RISC-V ISA Dev

Good point!

So the comparison would be 12 vs 20 bytes.

On Sunday, 9 April 2017 20:34:33 UTC+2, Alex Elsayed wrote:
