>> There is a gain for 64 bit literals though. Mitch's machine would need
>> 32 + 64 bits (96), while the base RV64I instruction set needs up to
>> 6x 32 bit instructions (192 bits) in the general case.
>
> IIRC on Alpha general 64-bit literals are loaded with one 32-bit
> instruction, from the constant table. So in terms of code size and
> I-cache it is a win (but I think that the obsession of many posters
> here with code size is misguided). Costs: 1) D-cache pressure (and there
> the utilization of the cache line may be significantly lower than in
> the I-cache). 2) It consumes a register as pointer to the constant
> table. 3) It consumes additional instructions when calling/returning
> between compilation units.
>
> Also, when I looked at the code generated for the big constants in my
> programs, gcc often had managed to produce a short sequence of
> instructions instead; somebody must have invested quite a bit of work
> into that.
>
Yeah; loading constants from memory kinda sucks...
>> This is reduced with the C extension as the final shift and OR can with
>> suitable register allocation always be 16 bit instructions (so 160 bits).
>> Specific constants can be smaller as one or more of the LUIs or ADDIs may
>> be able to be encoded as 16 bit instructions or even omitted.
>>
>>
>> The B extension will reduce the general case to 5 instructions, using
>> PACK instead of shift/or.
>
> An instruction that does
>
> dest = lit16|(src<<16)
>
> would allow doing it in 128 bits. Only one instruction would be
> needed (use zero as src for the first one). Maybe a second one with
>
> dest = (~lit16)|(src<<16)
>
> to make the sequences for -2^95<=n<0 shorter.
>
BJX2 basically has these already...
Essentially, this constant-loading mechanism was what the ISA was
originally designed around.
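For reference, a rough C sketch of the idea (the helper name and exact
semantics are illustrative assumptions, not the actual BJX2 encoding):
building a general 64-bit constant takes four of these shift-or steps,
i.e. 4 x 32-bit ops, or 128 bits of code, matching the figure quoted above.

  #include <stdint.h>
  #include <stdio.h>

  /* Model of the quoted primitive: dest = lit16 | (src << 16).
     Helper name is made up for illustration. */
  static uint64_t shift_or16(uint64_t src, uint16_t lit16)
  {
      return (uint64_t)lit16 | (src << 16);
  }

  int main(void)
  {
      uint64_t r = 0;               /* use zero as src for the first one */
      r = shift_or16(r, 0x0123);
      r = shift_or16(r, 0x4567);
      r = shift_or16(r, 0x89AB);
      r = shift_or16(r, 0xCDEF);
      printf("0x%016llX\n", (unsigned long long)r);
      /* prints 0x0123456789ABCDEF */
      return 0;
  }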
Its role/future is less certain now that the ISA has "jumbo ops", which
can more directly encode larger immediate values. But the shift-or
mechanism still makes sense to keep around, given that jumbo ops are a
fairly expensive feature and are therefore kept optional.
The jumbo ops have a "jumbo prefix" which glues 22 bits onto the
following op, allowing cases of:
  OP Imm33s, Rn
  OP Rm, Imm32s, Rn
  OP (Rm, Disp31), Rn
As well as a 64-bit constant load (can be encoded in 96 bits), and a
32-bit constant load in 64 bits.
The constant load cases don't save much space (loading a 32-bit constant
in two 32-bit ops was already possible), but they do allow the constant
to be loaded in a single clock cycle.
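As a hedged sketch of how the prefix's 22 bits might get glued onto the
following op's immediate (the 22+11 split yielding Imm33s is an assumption
for illustration; the real BJX2 bit layout may differ):

  #include <stdint.h>
  #include <stdio.h>

  /* Sketch: a jumbo prefix contributes 22 extra immediate bits, assumed
     here to sit above an 11-bit immediate in the base op (33 bits total). */
  static int64_t glue_imm33s(uint32_t prefix_bits22, uint32_t base_bits11)
  {
      uint64_t raw = ((uint64_t)(prefix_bits22 & 0x3FFFFF) << 11)
                   | (base_bits11 & 0x7FF);
      /* sign-extend from bit 32 */
      return (int64_t)(raw << 31) >> 31;
  }

  int main(void)
  {
      /* example: all-ones 33-bit immediate decodes as -1 */
      printf("%lld\n", (long long)glue_imm33s(0x3FFFFF, 0x7FF));
      return 0;
  }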
> Alternatively, with 22-bit literals and
>
> dest = lit22|(dest<<22)
>
> it could be done in 96 bits. You would need a second instruction for
> the initial literal:
>
> dest = sext(lit22)
>
However, 22 bits is a bit steep if one also needs to provide space for a
destination register.
There is, however, an instruction to load a 25-bit sign-extended value
into R0, which was/is a common case for immediates that don't fit in
whatever the base instruction provides, but do happen to fit into a
25-bit immediate.
In early design stages, there was IIRC an op for:
  Rn = (R0<<16)|Imm16;
But it no longer exists, as it wouldn't have saved much.
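For illustration, a small C model of those two operations (the 25-bit
sign-extended load into R0 and the removed (R0<<16)|Imm16 form); the
helper names are made up:

  #include <stdint.h>
  #include <stdio.h>

  /* Sketch of the 25-bit sign-extended load into R0. */
  static int64_t ldi25s(uint32_t imm25)
  {
      return (int64_t)((uint64_t)(imm25 & 0x1FFFFFF) << 39) >> 39;
  }

  /* Sketch of the removed op: Rn = (R0 << 16) | Imm16. */
  static uint64_t shori_r0(uint64_t r0, uint16_t imm16)
  {
      return (r0 << 16) | imm16;
  }

  int main(void)
  {
      int64_t  r0 = ldi25s(0x1FFFFFF);          /* -1 after sign extension */
      uint64_t rn = shori_r0((uint64_t)r0, 0x1234);
      printf("%lld 0x%llX\n", (long long)r0, (unsigned long long)rn);
      /* prints: -1 0xFFFFFFFFFFFF1234 */
      return 0;
  }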
> So why do neither Aarch64 nor RISC-V have it (and not Alpha, which was
> also designed from the start as a 64-bit architecture)? Maybe the
> sequential nature of the sequences is the reason; but is it a real
> problem?
>
> 1) In single-issue implementations it is not.
>
> 2) In OoO implementations, it should not be, either: With good branch
> prediction (the usual case) the front end runs far ahead of the
> execution of instructions on the critical data-flow path. These
> constant sequences would depend on nothing and their instructions
> would be executed soon after being decoded, and their results would be
> available early enough to avoid delaying the instructions on the
> critical path. Also instructions from these sequences can be combined
> in the decoder.
>
> 3) Superscalar in-order implementations pose the biggest problem, but
> even there it does not seem to be particularly bad: a) The compiler
> can interleave instructions from constant sequences with other
> instructions to provide ILP. b) The decoder can combine these
> sequences into a single instruction. However, one probably has to
> decide in advance which way such implementations should go because
> dealing with it in the decoder probably becomes much harder if the
> instructions are interleaved.
>
Can't say for certain.
What I can note is that my CPU cores are either single-issue/scalar or
VLIW-style (which can still run scalar code, albeit slower). There is
partial compatibility to let VLIW-style code be usable on a scalar core,
but as-is, "jumbo ops" kind of mess this up.
It is possible, however, that jumbo ops could be supported on a scalar
core via a different mechanism, eg:
  The prefix ops are decoded as NOPs (but update some internal state);
  The decoder for the following op takes note that a prefix was present.
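A rough sketch of that decode approach in C (the structure and the opcode
check are hypothetical, just to show the shape of it):

  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
      bool     jumbo_pending;   /* a prefix was seen on the previous op */
      uint32_t jumbo_bits;      /* the 22 bits it carried               */
  } DecodeState;

  static bool is_jumbo_prefix(uint32_t op)
  {
      /* placeholder predicate; the real opcode check is ISA-specific */
      return (op >> 24) == 0xFE;
  }

  static void decode_one(DecodeState *st, uint32_t op)
  {
      if (is_jumbo_prefix(op)) {
          /* executes as a NOP, but the decoder latches the extra bits */
          st->jumbo_pending = true;
          st->jumbo_bits    = op & 0x3FFFFF;
          return;
      }
      if (st->jumbo_pending) {
          /* fold st->jumbo_bits into this op's immediate before issue */
          st->jumbo_pending = false;
      }
      /* ... normal scalar decode/issue of 'op' goes here ... */
  }

  int main(void)
  {
      DecodeState st = {0};
      decode_one(&st, 0xFE123456);   /* prefix */
      decode_one(&st, 0x00000001);   /* following op sees the pending bits */
      return 0;
  }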
In my case, I have noted that performance still seems to be affected
somewhat by memory subsystem performance.
Tried boosting the L1 and L2 cache sizes:
  Doom engine games saw the biggest improvement in performance;
  Quake, still slow, but appears to have a pretty good hit ratio;
  ROTT wasn't affected much, but also has the worst hit/miss ratio.
Cache size increases:
  L1 I$: 2K -> 8K
  L1 D$: 2K -> 16K
  L2: 64K -> 128K
Caches are still direct mapped though; this is about the biggest I can
make L2 without going over the FPGA's resource budget.
As-is: Doom variants (Doom 1/2 and Hexen) seem to be getting the best
performance at present.
ROTT seems to get notably worse performance (reason not yet entirely
obvious, *).
So, eg, some approximate "average-ish" framerates:
  Doom:  ~18-22 (acceptable)
  ROTT:  ~ 7-10 (kinda sucks/annoying)
  Quake: ~ 3-4  (mostly unplayable)
Combined (L1 D$ + L2) hit ratio:
  Quake: ~ 98.0% hit;
  Doom:  ~ 97.6% hit;
  ROTT:  ~ 91.3% hit.
Average MIPS values ATM (at 50MHz), % cycles waiting for mem-access:
  Quake: ~ 19.6 / ~ 54%
  Doom:  ~ 25.7 / ~ 42%
  ROTT:  ~ 10.0 / ~ 77%
*: There are a few areas of concern, but nothing yet seems to be
significant enough to explain why it is getting less than half the
framerate of Doom. I already went and rewrote a lot of the inner
renderer loops in ASM (these don't really seem to be the issue).
But it seems like something in the background is being problematic (well,
along with it still being rather buggy and crash-prone).
Note that the numbers are based on modeling the cache behavior in my
emulator, which seems to be generally pretty close (except that ROTT
seems to perform even worse when running on the FPGA hardware;
framerates aren't really all that much better than Quake...).
Of the 3D engines I have tested, ROTT also seems to be doing the worst in
terms of cache hit/miss ratio, but this wasn't really improved much by
making the caches bigger.
Much of what misses L1 in ROTT goes all the way to DRAM (~ 30% L2 hit).
Quake seems to have an ~ 66% L2 hit rate.
Doom gets ~ 83% L2 hit (but has a lower L1 hit rate vs Quake).
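As a sanity check on those numbers, and assuming the usual definition
combined = L1 + (1 - L1)*L2 (my assumption about how "combined" is being
computed), the implied L1 hit rates can be backed out like this:

  #include <stdio.h>

  /* combined = l1 + (1 - l1) * l2  =>  l1 = (combined - l2) / (1 - l2) */
  static double implied_l1(double combined, double l2)
  {
      return (combined - l2) / (1.0 - l2);
  }

  int main(void)
  {
      /* posted figures: combined (L1 D$ + L2) hit ratio and L2 hit rate */
      printf("Quake implied L1: %.1f%%\n", 100.0 * implied_l1(0.980, 0.66));
      printf("Doom  implied L1: %.1f%%\n", 100.0 * implied_l1(0.976, 0.83));
      printf("ROTT  implied L1: %.1f%%\n", 100.0 * implied_l1(0.913, 0.30));
      return 0;
  }

FWIW, the implied values (roughly 94% for Quake vs 86% for Doom) are at
least consistent with the earlier note that Doom has a lower L1 hit rate
than Quake.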
Ironically, ROTT is also the only one of the engines still using an
8-bit / 256-color renderer (mostly because the engine was doing palette
animation tricks which would break otherwise).
TODO: Maybe port Wolfenstein 3D and compare if it behaves similarly to ROTT.
...