Bruce Hoult <bruce...@gmail.com> writes:
>There is a gain for 64 bit literals though. Mitch's machine would need
>32 + 64 bits (96), while the base RV64I instruction set needs up to 6x
>32 bit instructions (192 bits) in the general case.
IIRC on Alpha general 64-bit literals are loaded with one 32-bit
instruction, from the constant table. So in terms of code size and
I-cache it is a win (but I think that the obsession of many posters
here with code size is misguided). Costs: 1) D-cache pressure (and there
the utilization of the cache line may be significantly lower than in
the I-cache). 2) It consumes a register as pointer to the constant
table. 3) It consumes additional instructions when calling/returning
between compilation units.
Also, when I looked at the code generated for the big constants in my
programs, gcc had often managed to produce a short sequence of
instructions instead; somebody must have invested quite a bit of work
in that.
>This is reduced with the C extension as the final shift and OR can
>with suitable register allocation always be 16 bit instructions (so
>160 bits). Specific constants can be smaller as one or more of the
>LUIs or ADDIs may be able to be encoded as 16 bit instructions or even
>omitted.
>
>The B extension will reduce the general case to 5 instructions, using
>PACK instead of shift/or.
An instruction that does
dest = lit16|(src<<16)
would allow doing it in 128 bits (four 32-bit instructions). Only one
new instruction would be needed in the ISA (use the zero register as
src for the first instruction of a sequence). Maybe a second one with
dest = (~lit16)|(src<<16)
to make the sequences for negative constants (those with all-ones
upper bits, i.e., -2^48<=n<0) shorter.
Alternatively, with 22-bit literals and
dest = lit22|(dest<<22)
it could be done in 96 bits (three instructions, since 3x22 = 66 >= 64
bits). You would need a second instruction for
the initial literal:
dest = sext(lit22)
So why does neither AArch64 nor RISC-V have such an instruction (nor
did Alpha, which was also designed from the start as a 64-bit
architecture)? Maybe the
sequential nature of the sequences is the reason; but is it a real
problem?
1) In single-issue implementations it is not.
2) In OoO implementations, it should not be, either: With good branch
prediction (the usual case) the front end runs far ahead of the
execution of instructions on the critical data-flow path. These
constant sequences would depend on nothing and their instructions
would be executed soon after being decoded, and their results would be
available early enough to avoid delaying the instructions on the
critical path. Also, instructions from these sequences can be combined
in the decoder.
3) Superscalar in-order implementations pose the biggest problem, but
even there it does not seem to be particularly bad: a) The compiler
can interleave instructions from constant sequences with other
instructions to provide ILP. b) The decoder can combine these
sequences into a single instruction. However, one probably has to
decide in advance which way such implementations should go because
dealing with it in the decoder probably becomes much harder if the
instructions are interleaved.