Basically, it had specified:
32 and 64 bit loads may be 2 or 3 cycles, depending on various
factors; 8 and 16 bit loads were 3 cycles.
Though, the SiFive cores appear to be aligned-only internally (with
unaligned cases triggering a severe performance penalty).
>
>> I could add a "fast path" to the L1 cache where, if the memory access
>> satisfied certain requirements, it would be reduced to 2 cycle latency:
>> Aligned-Only, 32 or 64 bit Load;
>> Normal RAM access (not MMIO or similar);
>> Does not trigger a "read-after-write" dependency;
>> ...
>> This case allowing for cheaper memory access logic which doesn't kill
>> the timing (if the result is forwarded directly to the pipeline).
>
> The above was a question posed to me while interviewing with HP in 1988.
>
> The right answer is:: "Do nothing that harms the frequency of the
> pipeline".
> {{Which you may or may not be doing to yourself}}
>
Reducing the latency in this way isn't ideal for LUT cost or timing,
but it's not like I can get my core much faster than 50 MHz anyway,
so...
Supporting a subset of aligned-only 32/64 bit accesses with a
shortcut does at least offer a performance advantage (and is more
viable than trying to get the general case down to 2 cycles, which
would almost certainly blow the timing constraints).
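In C terms, the shortcut's eligibility condition amounts to something
like the following sketch (the function name, arguments, and hazard
flags are illustrative stand-ins for the actual Verilog logic, not
taken from the core itself):

```c
#include <stdint.h>

/* Illustrative sketch: a load only takes the 2-cycle fast path if it
   is a naturally aligned 32/64-bit load to normal RAM, with no
   read-after-write hazard pending. All names here are assumptions. */
static int l1_fast_path_ok(uint64_t addr, int size_bytes,
                           int is_mmio, int raw_hazard)
{
    if (size_bytes != 4 && size_bytes != 8)
        return 0;                 /* only 32 or 64 bit loads qualify */
    if (addr & (uint64_t)(size_bytes - 1))
        return 0;                 /* must be naturally aligned */
    if (is_mmio || raw_hazard)
        return 0;                 /* normal RAM, no RAW dependency */
    return 1;                     /* eligible for 2-cycle latency */
}
```

Everything else falls through to the normal 3-cycle path, which keeps
the fast-path logic cheap enough to forward directly to the pipeline.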
Though, yeah, the L1 shortcut and "fast ALU" do add roughly 4k LUTs
to the cost of the CPU core. I suspect some of this cost is that the
register-forwarding path seems to mass-duplicate any combinatorial
logic connected to it (but the only way to avoid this would be to go
back to a 2-cycle ALU and 3-cycle Load, so, ...).
> The second correct right answer is:: "Do nothing that adds 1 to the
> exponent of test vector complexity". {{Which you invariably are doing to
> yourself}}
>
Well, if there is one good point to messing around with core
mechanisms of the CPU or pipeline, it is that if I screw something
up, the core will typically blow up pretty much immediately in
simulation, making it easier to debug.
It is much harder to identify bugs which may take hours of simulation
time before they manifest (or, say, an unidentified bug where, after
several days of running the Quake demo loop, Quake will seemingly try
to jump to a NULL address and crash; this bug seemingly does not
manifest in the emulator).
Have also observed that the C version of my RP2 decoder breaks in
both the simulation and the emulator in RV64 mode; however, the same
bug may also appear in an x86-64 build with GCC, and seems to depend
on the optimization level and on if/when some variables are zeroed. I
think this may be more a case of "something in the code is playing
badly with GCC" though (but I have not yet identified any "smoking
gun" in terms of UB; using "memcpy()" in place of pointer derefs does
not fix the issue, but it was the source of me realizing that GCC
inlines memcpy on RV64 using byte load/store).
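For reference, the two access styles in question look something like
this (function names are mine, for illustration): a direct deref of a
possibly misaligned pointer is UB in C, while the memcpy() form is
well-defined for any alignment, which is why GCC on RV64 (without
assuming unaligned-load support) may lower it to byte loads/stores:

```c
#include <stdint.h>
#include <string.h>

/* Direct deref: fine on aligned data, UB if p is misaligned. */
static uint32_t load_u32_deref(const void *p)
{
    return *(const uint32_t *)p;
}

/* memcpy form: well-defined regardless of alignment; on targets
   without unaligned loads, GCC may inline this as byte load/store
   plus shifts rather than a single 32-bit load. */
static uint32_t load_u32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```

On an aligned value both forms agree; the divergence only shows up in
the generated code (and in what happens on misaligned inputs).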
The bug seemingly goes away with "-O0" in GCC, but then Doom is
unbearably slow (runs at single-digit speeds). A partial workaround
for the RV case for now is to use the original uncompressed Doom
WADs.
But, yeah, my core could be simpler...
Now supporting the common superset of both BJX2 and RV64G (excluding
privileged spec) probably doesn't exactly help.
Though, as noted, despite now extending to RV64G, the BJX2 core still
does not have separate FPRs; instead the decoder just sorta maps
RV64's FPRs to R32-R63 ...
Some of this does reveal things it might have made sense to do
differently in retrospect, say:
Treating plain ALU ops, compare ops, and conversion ops as 3
different entities (as opposed to all being mostly lumped under the
ALU umbrella, which leads to needing separate ALU modules for Lanes
1/2/3, due to the ALU having a lot of logic in Lane 1 that is N/A for
Lanes 2 and 3, ...).
Say:
  ALU: does exclusively ADD/SUB, AND/OR/XOR,
    and closely related operations.
  CMP: does integer and FPU comparison.
    Ideally with more orthogonal handling of SR.T or GPR output;
    as-is, the output-handling part is a little messy.
  CNV: does type conversion (likely always 2 cycle).
    Don't really need 1-cycle FP-SIMD convert or RGB555 pack/unpack, ...
  MOV: does register-MOV-like operations:
    MOV Reg, Reg
    MOV Imm, Reg
    EXTS.L and EXTU.L
    These need to be 1 cycle;
    most other converter ops can remain 2 cycle.
As-is, probably my BJX2 ISA design is bigger and more complex than ideal.
Might have been better if some things were more orthogonal, but
eliminating some cases in favor of orthogonal alternatives requires
having an architectural zero register (with its own pros/cons).
But, my redesign attempts tend to be prone to losing PrWEX, which
although not highly used, is at least "still useful".
Some amount of the listing is used up by cruft from short-lived
experimental features.
For example, the 48-bit ALU ops turned out to be a bit of a dud:
both unexpectedly expensive for the CPU core, and not offering much
of a performance advantage over the prior workarounds using 64-bit
ALU ops (such as doing a 64-bit subtract and then sign-extending the
result from 48 to 64 bits).
Granted, this does still leave the annoyance that one either uses
zero-extended pointers in C, or needs to manually work-around the
tagging if bounds-checking is enabled, and leaves a mismatch between
bounds-checked and non-bounds-checked code.
Where, say, relative pointer comparison, no bounds checking:
  CMPQGT R5, R4
With bounds-checking:
  SUB    R4, R5, R2
  MOVST  R2, R2      //48-bit sign extension
  CMPQGT 0, R2
Vs:
  CMPPGT R5, R4      //Ignoring high 16 bits
But, despite the overhead of 2 extra ops, the relative performance
impact on code seems to be fairly modest.
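In C terms, the bounds-checked sequence amounts to the following
(helper names are mine; the sign-extension trick assumes an
arithmetic right shift on signed types, as GCC/Clang provide):

```c
#include <stdint.h>

/* Sign-extend a value from 48 to 64 bits (the MOVST step): shift the
   low 48 bits up in unsigned arithmetic (well-defined), then shift
   back down with an arithmetic right shift. */
static int64_t sext48(uint64_t x)
{
    return (int64_t)(x << 16) >> 16;
}

/* Relative "greater-than" on 48-bit tagged pointers: do the full
   64-bit subtract, then sign-extend from 48 bits so the tag bits in
   the high 16 bits do not affect the comparison. */
static int ptr_gt_48(uint64_t p, uint64_t q)
{
    return sext48(p - q) > 0;
}
```

Which makes it clearer why CMPPGT (ignore the high 16 bits directly)
collapses the three-op sequence back down to one instruction.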
Though, as-is, a similar annoyance comes up if comparing function
pointers, which remain tagged even without bounds-checking. Did tweak
the rules for some ops though such that at least function pointers will
always give the same value for the same CPU mode (so == and != work as
expected).
This does mean there is wonk if one wants to use relative comparisons
of function pointers between ISA modes, or to use a function pointer
as a base address to access memory, but these are mostly non-issues
in practice.
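If one did need mode-agnostic comparison, the obvious software-side
fix is to mask off the tag before comparing. This sketch assumes the
tag lives in the high 16 bits (consistent with the 48-bit address
handling above); the mask and layout are my assumptions, not the
actual BJX2 encoding:

```c
#include <stdint.h>

/* Assumed layout: low 48 bits = address, high 16 bits = tag/mode.
   This is illustrative only. */
#define FPTR_ADDR_MASK 0x0000FFFFFFFFFFFFull

/* Compare two tagged function pointers by target address only,
   ignoring any ISA-mode tag bits. */
static int fptr_same_target(uint64_t fp1, uint64_t fp2)
{
    return (fp1 & FPTR_ADDR_MASK) == (fp2 & FPTR_ADDR_MASK);
}
```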
>> Basically, in this case, the L1D$ has an alternate output that is
>> directed to EX2 with a flag that encodes whether the value is valid.
>> It does not replace the logic in EX3, mostly because (unless something
>> has gone terribly wrong), both should always give the same output value.
>
>
>> Also an alternate "fast case ALU", which reduces ALU to 1-cycle for a
>> few common cases:
>> ADD{S/U}L, SUB{S/U}L
>> ADD/SUB if the input values fall safely into signed 32-bit range.
>> Currently +/- 2^30, as this can't overflow the signed 32-bit.
>> Skips 64-bit mostly because low-latency 64-bit ADD is harder.
>> AND/OR/XOR
>> These handle full 64-bit though.
>
Ironically, making ADDS.L and ADDU.L 1 cycle was originally part of
the intention, but it took a while to get to it.
ADDS.L basically does a sign-extending 32-bit ADD, where RV64's
equivalent is the ADDW instruction; ADDU.L is zero-extending.
Apparently RV64 has ADD.UW as part of BitManip (Zba).
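In C terms, the semantics look like this (function names are mine;
note that Zba's ADD.UW is not quite the same as a zero-extending
32-bit add: per the spec it computes rd = rs2 + zext32(rs1), adding
the full 64-bit rs2):

```c
#include <stdint.h>

/* ADDW / ADDS.L: 32-bit add, result sign-extended to 64 bits. */
static int64_t addw(int64_t a, int64_t b)
{
    return (int64_t)(int32_t)((uint32_t)a + (uint32_t)b);
}

/* ADDU.L: 32-bit add, result zero-extended to 64 bits. */
static uint64_t addu_l(uint64_t a, uint64_t b)
{
    return (uint64_t)((uint32_t)a + (uint32_t)b);
}

/* Zba ADD.UW for comparison: zero-extend rs1 only, add full rs2. */
static uint64_t add_uw(uint64_t rs1, uint64_t rs2)
{
    return rs2 + (uint64_t)(uint32_t)rs1;
}
```

The unsigned-arithmetic casts in addw() sidestep signed-overflow UB
while producing the same wrap-and-sign-extend result the hardware op
gives.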
Likely, may add (from BitManip):
  Zba: more or less maps over, though semantics are not exact.
    Could add SHnADD, mapped to LEA, but semantics are not exact (*1).
  Zbb: a large part maps over.
  Zbkb: partly maps.
  Zbs/Zbkc/Zbkx: don't map.
*1: The main potential problem case would be if GCC tried to use them
as generic 64-bit ALU ops, which would break if mapped over to the
LEA.x logic. The description in the spec seems to imply that they
could also be used for 64-bit ALU, though.
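For reference, the spec-level semantics of SHnADD are a full 64-bit
shift-and-add (function names are mine), which is exactly why a
mapping onto address-generation logic with a narrower internal width
could break if GCC uses them as plain ALU ops:

```c
#include <stdint.h>

/* Zba SHnADD, n = 1..3: rd = rs2 + (rs1 << n), with the whole 64-bit
   rs1 shifted (no truncation of the high bits). */
static uint64_t sh1add(uint64_t rs1, uint64_t rs2)
{
    return rs2 + (rs1 << 1);
}
static uint64_t sh3add(uint64_t rs1, uint64_t rs2)
{
    return rs2 + (rs1 << 3);
}
```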
Though, there is concern over a lot of edge cases where my
implementation of RV64 differs from the RISC-V spec (in some areas,
these differences were necessary to "make it work", with the
presumption that "GCC won't likely notice"; but this does depend some
on how exactly GCC uses the ISA, *2).
*2:
  Does not seem to make use of the contents of link-register values;
  Does not seem to use A's AMOxx instructions for normal output
    (RV64IMA and RV64G output is the same as if A were not present);
  Does not seem to use FMADD/FMSUB for F/D (absent "-ffast-math"),
    so it doesn't yet seem to matter that they are absent;
  ...
RV64 does bring its own wonk, as the handling of unsigned 32-bit
values is not entirely consistent (namely, whether they are sign- or
zero-extended to 64 bits).
The C ABI specifies sign-extend-everything, but the RISC-V ISA itself
assumes zero-extended unsigned values, and deals with this wonk in
some cases by having explicit "ignore all the high-order bits"
instruction forms, including for cases where BJX2 only provided for
signed 64-bit inputs (since "unsigned int" maps cleanly into signed
64-bit range if one assumes zero-extended values, making dedicated
32-bit unsigned instructions unnecessary in these cases).
Would have been "better" had the RV64 ABI spec specified zero-extended
unsigned values, avoiding this particular bit of wonk...
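The two widenings in question, spelled out in C (function names are
mine): the RV64 C ABI keeps 32-bit values sign-extended in 64-bit
registers even for "unsigned int", so 0x80000000 arrives as
0xFFFFFFFF80000000 rather than the 0x0000000080000000 that unsigned
arithmetic would want:

```c
#include <stdint.h>

/* What the RV64 ABI does with a 32-bit value in a 64-bit register:
   sign-extend, regardless of the C type's signedness. */
static int64_t widen_sext(uint32_t v)
{
    return (int64_t)(int32_t)v;
}

/* What unsigned 32-bit math "wants": zero-extension. */
static uint64_t widen_zext(uint32_t v)
{
    return (uint64_t)v;
}
```

For values below 2^31 the two agree, which is why the inconsistency
only bites on the upper half of the unsigned range.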
Though, some cases I am not going to bother with. If you pass a
negative input to a Float->UnsignedInt conversion, I am more inclined
to be like "Meh, whatever, the result will come out negative I guess"
(in my case, the Float->Int converter ops do not have range-clamped
outputs in general, and will just sorta produce a modulo output if
values go out of range; this responsibility was assumed to be left to
software).
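The "modulo output" behavior can be sketched in C like so (function
name is mine; note that a direct cast of a negative double to an
unsigned type is UB per the C standard, which is part of why hardware
behavior is allowed to vary here in the first place):

```c
#include <stdint.h>

/* Non-clamping Float->UnsignedInt: convert to signed 64-bit (well
   defined in C for in-range values), then reinterpret as unsigned,
   so negative inputs wrap modulo 2^64 rather than saturating. */
static uint64_t f64_to_u64_modulo(double f)
{
    return (uint64_t)(int64_t)f;
}
```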