On 10/10/2024 7:32 AM, Bruce Hoult wrote:
> On Thu, Oct 10, 2024 at 7:24 PM Guy Lemieux <guy.l...@gmail.com> wrote:
>> A variable shifter often works with log(XLEN) layers, where each layer
>> shifts by 1b, 2b, 4b, 8b, etc, respectively.
>
> Yes, my previous message was written under this assumption.
>
> The point is that this doesn't -- on all implementations I know of --
> take log(XLEN) clock cycles, but rather all log(XLEN) stages cascade
> within 1 clock cycle. This is the case because integer add (which is
> far more common and important) also takes O(log(XLEN)) layers of logic
> to propagate the carry (in the best implementation), and the big O
> constant is relatively similar in both cases.
>
> So, yes, there is no advantage to power of two shifts, despite each
> layer of the shift network (optionally) shifting by a power of two.
>
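As a side note, the layered shifter described above can be modelled roughly as follows. This is only a sketch, assuming XLEN=32 and a logical left shift; the function name is made up for illustration. In hardware the five layers are combinational MUX stages that all settle within one clock cycle; the loop here only mirrors the structure, not the timing.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit logarithmic (barrel) left shifter.
 * Layer k either passes the value through or shifts it by 2^k,
 * selected by bit k of the shift amount. */
uint32_t barrel_shl32(uint32_t x, unsigned amt)
{
    amt &= 31u;                          /* RV32 uses only the low 5 bits */
    for (unsigned k = 0; k < 5; k++) {   /* log2(32) = 5 layers */
        if ((amt >> k) & 1u)
            x <<= (1u << k);             /* shift by 1, 2, 4, 8 or 16 */
    }
    return x;
}
```

A shift by 13, for example, enables the 1b, 4b and 8b layers and bypasses the 2b and 16b layers.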
Hmm, there could be an RV32I- or RV32E-, with additional cost cutting:
  Only constant shifts are allowed, and only by powers of 2;
    The shift amount is treated as part of the opcode;
  For operations like BEQ and friends, Rs2 is required to be X0;
    Effectively, only comparing Rs1 with zero.
  Unaligned load/store is disallowed;
  JAL and JALR may only have X0 and X1 as a destination;
  ...
The smallest CPU core I have pulled off in the past was around 4 kLUT,
and used an SH-2-like subset of the SuperH ISA:
  No integer multiply;
  No variable shift;
  ...
On such a core, variable shifts would be implemented by branching into a
table of 1-bit shift operations.
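The idea can be sketched in C with a fall-through switch, which a compiler will typically turn into a computed jump into a run of single-bit shifts. This is a rough illustration assuming a 32-bit left shift, not the actual code sequence used on that core; the function name is invented here.

```c
#include <stdint.h>

/* Hypothetical software variable shift for a core with only 1-bit shifts.
 * Entering at "case n" executes exactly n one-bit shifts, because every
 * case deliberately falls through to the ones below it. */
uint32_t shl_by_table(uint32_t x, unsigned amt)
{
    switch (amt & 31u) {
    case 31: x <<= 1; case 30: x <<= 1; case 29: x <<= 1; case 28: x <<= 1;
    case 27: x <<= 1; case 26: x <<= 1; case 25: x <<= 1; case 24: x <<= 1;
    case 23: x <<= 1; case 22: x <<= 1; case 21: x <<= 1; case 20: x <<= 1;
    case 19: x <<= 1; case 18: x <<= 1; case 17: x <<= 1; case 16: x <<= 1;
    case 15: x <<= 1; case 14: x <<= 1; case 13: x <<= 1; case 12: x <<= 1;
    case 11: x <<= 1; case 10: x <<= 1; case  9: x <<= 1; case  8: x <<= 1;
    case  7: x <<= 1; case  6: x <<= 1; case  5: x <<= 1; case  4: x <<= 1;
    case  3: x <<= 1; case  2: x <<= 1; case  1: x <<= 1;
    case  0: break;
    }
    return x;
}
```

The cost is one indirect branch plus up to 31 shift instructions, so this trades speed for not having a shifter in hardware.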
In past attempts, I was not able to get RV32I quite this small, at least
assuming a core that is pipelined and uses full-width registers.
My past attempts at a basic RV64I style core were closer to 9 kLUT, with
around 7 kLUT for 32-bit.
For a smaller core, things like shift are fairly expensive, ...
Also expensive:
  Supporting misaligned load/store;
    Can nearly double the cost of the L1 D$.
  Dealing efficiently with things like memory RAW/WAW hazards;
    Forwarding = expensive;
    Stall = slow.
  ...
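To illustrate the misaligned case: if the hardware only does aligned accesses, a misaligned 32-bit load has to be stitched together from two aligned words, which is roughly the work a misalignment-capable L1 D$ has to do internally (two adjacent words or lines at once). A minimal sketch, assuming little-endian memory modelled as an array of aligned 32-bit words; the function and parameter names are illustrative only.

```c
#include <stdint.h>

/* Hypothetical misaligned 32-bit load built from aligned word reads.
 * `words` models little-endian memory as aligned 32-bit words and
 * `byte_addr` is the (possibly misaligned) byte address to load from. */
uint32_t load32_unaligned(const uint32_t *words, uint32_t byte_addr)
{
    uint32_t idx = byte_addr >> 2;            /* index of the low word   */
    unsigned sh  = (byte_addr & 3u) * 8u;     /* bit offset in that word */
    uint32_t lo  = words[idx];

    if (sh == 0)
        return lo;                 /* aligned: one access, no shift by 32 */

    uint32_t hi = words[idx + 1];  /* second aligned access */
    return (lo >> sh) | (hi << (32u - sh));
}
```

Note that the aligned case needs one access while the misaligned case needs two plus a variable shift and merge.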
Though, at this point, most of the semi-mainline FPGA dev-boards
(excluding ICE40 and similar) don't come with anything much smaller than
an XC7S25 or XC7A35T, which can handle such a core (though, bigger core
does mean less space left for peripheral logic).
However, most of the still-available boards with these FPGAs also tend
to lack external RAM modules, which severely limits their utility
(without external RAM, one can't do much more than with a small
microcontroller).
And, for boards with an XC7S50 or XC7A100T or similar, one can afford a
bigger core.
And, many tiny cores don't seem particularly useful for most purposes.
Dunno about ASIC space though.
Well, vs my current core:
  64 bit, SIMD, etc;
  64x 64-bit registers;
    Split into two sets of 32 for RV;
  3 lanes, 6R3W register file;
  SIMD is done in the main registers in my case;
  Two ISAs;
  Roughly 8 decoders:
    3x for my own ISA;
    3x for RV;
    1 for the 16-bit variant of my ISA;
    1 for RVC.
Which weighs in at around 40 kLUT.
As-is, the decoders weigh in at roughly 18% of the LUT cost, but could
try something to reduce this (I am internally debating possible ways to
reduce decoder cost, to reduce the amount of internal MUX'ing needed,
but decided not to go into the specifics here).
Note that the 3rd lane doesn't see much traffic from actual
instructions, but the cost difference between 2 and 3 wide wasn't that
large, and the 3rd lane still serves a use for providing extra register
ports for some cases. Functionally, all it really does at this point is
basic ALU ops (ADD/SUB/AND/OR/XOR, constant load, sign/zero extension).
As-is, much of the cost difference seems to be in the decoders and
register file.
Things like FPU and FP-SIMD also eat a lot of LUTs, but the single
biggest consumer of the LUT budget is the L1 D$ (around 27% of the total
LUT budget for the core).
Note that this is with the L1 D$ still only supporting a single memory
port...
...