Could be; if I made the pipeline a little longer, I could possibly make
timing a little easier.
Costs:
More branch latency (if I extend earlier stages);
More interlocks and forwarding (if I extend later stages).
Figuring out the length in IF made it easier to advance the PC than if I
had done it in D1, since there is a 1-cycle delay between presenting a PC
address and getting back the fetch results for that address.
As noted, with my ISA (BJX2):
0zzz_zzzz: 16b
10zz_zzzz: 16b
110z_zzzz: 16b
1110_0zzz: 32b (OP?T / OP?F)
1110_10zz: 32b (OP?T / OP?F)
1111_00zz: 32b / 32b (Scalar / WEX2)
1111_01zz: 32b / 64b (Scalar / WEX2)
1111_100z: 32b / 32b (Scalar / WEX2)
1111_101z: 32b / 64b (Scalar / WEX2)
111z_11zz: 48b
BJX2 Lite uses a simpler rule:
0zz: 16b
10z: 16b
110: 16b
111: 32b
For Lite, 48-bit encodings are disallowed, and WEX is ignored (WEX
encodings always operate in scalar mode).
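The two tables above could be sketched as C length decoders (a rough model, not the actual Verilog; function names are made up here, and only the top 8 bits of the first 16-bit instruction word are examined, MSB first as in the patterns):

```c
#include <stdint.h>

/* Full BJX2 rule: returns the nominal scalar length in bits.
   The 32b-vs-64b WEX2 distinction is ignored in this sketch. */
static int bjx2_len(uint8_t top)
{
    if ((top >> 5) != 0x07)           /* 0zzz / 10zz / 110z prefixes */
        return 16;
    if (((top >> 2) & 0x03) == 0x03)  /* 111z_11zz */
        return 48;
    return 32;                        /* remaining 1110 / 1111 forms */
}

/* BJX2 Lite rule: only the top 3 bits matter, no 48-bit forms. */
static int bjx2_lite_len(uint8_t top)
{
    return ((top >> 5) == 0x07) ? 32 : 16;
}
```

Note that the full rule still only needs bits 7:2 of one byte, which is part of why doing it in IF is cheap.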
Length determination with WEX2 in wider configurations is more
complicated. The current cache-system design could theoretically support
up to 4-wide, though going 3- or 4-wide is unlikely at present given the
cost.
>>
>> In the WEX case, current prototypes assume 64 bits (2x 32).
>>
>> Decode is split into 2 stages, as register fetch seems to be kinda
>> problematic in terms of timing.
>>
>> Accessing memory works by sending a request in E1 and waiting for it to
>> complete in E2 (E2 stalls the pipeline until it completes). Currently
>> accessing the L1 D$ takes longer than this, so memory access always
>> stalls a few cycles.
>>
>> Similarly, the approach for both the multiplier and FPU was "stall until
>> done". As with memory access, E2 will stall until the operation completes.
>>
>>
>>
>>> I lobbed integer multiply and divide over in the FMAC unit as it was already
>>> doing FMUL and FDIV.
>>>
>>
>> I half-considered this, but there is the problem in my case that integer
>> and DP multiply produce/consume different ranges of bits (the DP
>> multiplier would effectively need to produce a full-width output for
>> this to work, rather than discarding the low-order results).
>
> Also note:
> I aligned the int and FP pipes so they can share a RF write port!
> Now one needs one result write port for every 1.3 instructions issued,
> and in the simplest machines there are no conflicts for write ports for several variants of 1-wide to 2-wide designs.
>
OK. The current ISA still uses separate GPRs and FPRs.
Currently, the FPU's execute stages operate in-parallel with the main
integer unit, with FPR fetch and writeback also occurring in parallel.
I had evaluated possibilities for merging them, but noted it wouldn't
save all that much ISA-wise; could either save resources or make timing
harder. Changing this would be a pretty big breaking-change though.
In the Lite profile, FPU is optional and FP arguments would be passed in
GPRs in the C ABI either way.
>>
>> Similarly, the width of the results from the 32-bit integer multiplier
>> are wider than those from the DP multiplier.
>
> It is an interesting trade-off: 53×53 performing only FMUL and FMAC, OR
> 57×57 adding FDIV and FSQRT, OR 64×64 adding IMUL, IDIV, and REM.
> We can all agree that if we have a 64×64 int multiplier we can perform all of the calculations.
> We can all agree that 53×53 is just over 50% of 64×64.
> We should be able to agree that there are external factors pushing a design in either direction.
Yeah.
54*54->54: 6 DSPs (high result, discard low bits of result)
32*32->64: 4 DSPs (full-width, 64b output)
64*64->128: 16 DSPs
A direct 64*64->128 multiplier is a bit steep (both expensive and slow).
Probably, one would need to do a 64*64->64 multiplier which can produce
either the low or high results (could be done with 10 DSPs).
Then an aligned high-result could potentially also be used to implement
a DP multiply.
(A3,A2,A1,A0)*(B3,B2,B1,B0), with each AiBj sub-product landing at bit
position 16*(i+j):
Low result (63:0):
  <<48: A3B0, A0B3, A2B1, A1B2 (low halves only)
  <<32: A2B0, A0B2, A1B1
  <<16: A1B0, A0B1
  << 0: A0B0
High result (127:64):
  <<96: A3B3
  <<80: A3B2, A2B3
  <<64: A2B2, A3B1, A1B3
  <<48: A3B0, A0B3, A2B1, A1B2 (high halves, plus carry-in from the low sum)
So, effectively, they are partly mirrors.
So, the sub-multipliers could be fed different inputs, and the adder
chains would be assembled differently.
DSP cost would be 10 DSPs (so is the same here), however the cost in
terms of LUTs and latency would likely be a little worse (due mostly to
the more complex adder chains needed).
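The sub-product split can be sanity-checked with a small software model: a 64*64->128 multiply built from the sixteen 16x16 AiBj terms, standing in for the DSP blocks (purely illustrative; the real adder chains would be structured for timing, not as a loop):

```c
#include <stdint.h>

/* 64x64->128 via 16-bit limbs: (A3,A2,A1,A0)*(B3,B2,B1,B0).
   Each A[i]*B[j] is one 16x16 sub-product (one DSP-ish unit);
   column k sums all terms with i+j == k, propagating carries. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t A[4], B[4], col[8];
    for (int i = 0; i < 4; i++) {
        A[i] = (a >> (16 * i)) & 0xFFFF;
        B[i] = (b >> (16 * i)) & 0xFFFF;
    }
    uint64_t carry = 0;
    for (int k = 0; k < 8; k++) {
        uint64_t acc = carry;
        for (int i = 0; i < 4; i++) {
            int j = k - i;
            if (j >= 0 && j < 4)
                acc += A[i] * B[j];   /* one 16x16 sub-product */
        }
        col[k] = acc & 0xFFFF;
        carry  = acc >> 16;
    }
    *lo = col[0] | (col[1] << 16) | (col[2] << 32) | (col[3] << 48);
    *hi = col[4] | (col[5] << 16) | (col[6] << 32) | (col[7] << 48);
}
```

A low-only or high-only variant would simply drop the columns (and sub-products) it does not need, which is where the 10-DSP figure comes from.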
The amount of additional plumbing between the integer unit and FPU could
possibly also be problematic (vs the FPU using its own multiplier).
>>>>
>>>> With superscalar, it is necessary to use a more complicated
>>>> organization (no longer a simple scalar pipeline; eg, it needs a more
>>>> obvious split between the decode and execute stages).
>>>>
>>>>
>>>> If we know in advance, the pipeline can look just like a normal scalar
>>>> pipeline, just with a few extra parts glued on.
>>>
>>> Yes.
>>>>
>>>> Main costs are mostly the alternate execute unit, additional register
>>>> ports, and the added pipeline machinery.
>>>
>>> One should note:
>>> Branch unit probably does not need a RF Write port 97% of the time.
>>> Store does not ever need an RF write port--and here is where I generally
>>> steal a RF Read port when an instruction either consumes a constant or
>>> is a 1-operand instruction (branch for example).
>>>
>>> So one should be able to get to ~1.3 IPC with a 3R1W RF; even before
>>> considering forwarding of results back as operands.
>>
>> A 2-wide WEX was presumed to use 4R2W:
>> 2 read ports per lane, 1 write port per lane.
>> Ops which need 3R would effectively disallow the second lane.
>
> Even if that lane was a 1-op
> Even if that lane was a 1-op plus constant?
>
Yes, generally.
It is possible it could be made to work, but would require being more
clever, and only work for certain ops (depending on which resources they
use), ...
Many ops expand out to be wider internally; eg, 1R ops might internally
decode as 2R ops, etc. Similarly, immediate values are fed through the
GPR read ports (immediates are treated internally as a special register,
allowing the "Reg, Reg, Reg" and "Reg, Imm, Reg" cases and similar to
use the same logic in the EX stages).
Simple case is to make 3R ops effectively disallow Lane 2.
Less clear how this would work with a 3-wide WEX, eg:
ADD R2, R3, R4 | MOV.L R6, (R9, R7)
The MOV.L would use 3R, thus not allowing the ADD in Lane 2, but it
could fit in Lane 3 (if it exists).
Either the core would need to be "smart" and detect this case (using
Lane 3 rather than Lane 2, giving the MOV.L both Lanes 1 and 2), or the
assembler would need to do something silly, like:
ADD R2, R3, R4 | NOP3 R6 | MOV.L R6, (R9, R7)
In this case, the 'NOP3' in Lane 2 being simply a placeholder to give
Lane 1 its 3rd source register.
An alternative is always giving Lane 1 all 3 read ports, but this would
require 5R2W for a 2-wide WEX (vs 4R2W).
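As a sketch of the port bookkeeping (the struct and names are made up here; the real decode logic is obviously wider than this), a 2-wide bundle check against a 4R2W file might look like:

```c
/* Hypothetical 4R2W legality check for a 2-wide WEX bundle.
   Each lane nominally owns 2 read ports and 1 write port; an op
   needing 3 reads must borrow a port from the other lane, which
   in its simplest form is the "3R ops disallow Lane 2" rule. */
typedef struct {
    int nreads;   /* GPR read ports consumed (immediates count too) */
    int nwrites;  /* GPR write ports consumed */
} wex_op;

static int bundle2_legal(wex_op lane1, wex_op lane2)
{
    if (lane1.nreads + lane2.nreads > 4)   /* read-port budget */
        return 0;
    if (lane1.nwrites + lane2.nwrites > 2) /* write-port budget */
        return 0;
    return 1;
}
```

Under this check a 3R store next to an ordinary 2R1W op fails (5 reads), but next to a 1R op it would fit, which is the "more clever" case mentioned above.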
To what extent to allow (or not allow) FPU ops with WEX is TBD, things like:
ADD R7, R8, R9 | FMUL FR3, FR4, FR5
Or similar are probably OK, as long as the FPU op stays in Lane 1.
>>>>>> A goal was to minimize the "cleverness" needed by the core (the C
>>>>>> compiler and/or ASM programmer would be responsible for all this).
>>>>>
>>>>> Am not a fan of compiler cleverness :) CEDAR Audio had to code in assembler (1993) because the TI DSP compiler just could not cope. With only 50 MFLOPs to play with (12.5 mhz, 2x pipeline, 2x FPUs, one for odd regs one for even, FMAC) the budget was only 1000 cycles per audio sample, and compiler inefficiency just could not be tolerated.
>>>>>
>>>>> Have been wary of VLIW compiler "cleverness" ever since.
>>>>>
>>>>
>>>>
>>>> Either the compiler does it, or the processor needs to be able to do so.
>>>>
>>>> For a core where detecting inter-instruction dependencies (as-needed for
>>>> an in-order superscalar) is too expensive, it can be a viable
>>>> alternative to scalar.
>>>
>>> If you do this::
>>> Be aware that you will want to implement the following tokens::
>>>
>>> This result forwards and writes
>>> This result forwards but write was elided [phantom]
>>> This operand comes from slot[k]
>>>
>>> and depending on how aggressive branch prediction is going to be:
>>>
>>> This operand comes from slot[k] when taken or slot[j] otherwise.
>>
>> While things have to be valid in scalar order, they also need to be
>> valid if executed in parallel.
>>
>>
>> Dependencies between results in different lanes within a block will be
>> effectively undefined.
>>
>> Eg:
>> ADD R5, R7, R9 | ADD R9, R6, R10
>
> I always allowed this to mean last use as operand R9 and write of R9
>
I had ended up going with AT&T ordering for BJX2 ASM, where in this
case the instructions would give different results under scalar vs
parallel execution; the combination is thus disallowed.
In WEX, the registers as-read are typically those from the prior cycle;
with a register write within the same cycle not being visible until the
following cycle.
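The difference can be shown with a toy model of the two execution orders for "ADD R5, R7, R9 | ADD R9, R6, R10" (AT&T-style "ADD Rs, Rt, Rd" assumed; registers are just C variables here, and both functions return the final R10):

```c
/* Scalar order: lane 2 sees the R9 just written by lane 1. */
static long r10_scalar(long r5, long r6, long r7, long r9)
{
    r9 = r5 + r7;        /* ADD R5, R7, R9 */
    return r9 + r6;      /* ADD R9, R6, R10: reads the new R9 */
}

/* WEX order: both lanes read prior-cycle register values;
   lane 1's write only becomes visible the following cycle. */
static long r10_wex(long r5, long r6, long r7, long r9)
{
    long new_r9 = r5 + r7;   /* lane 1: ADD R5, R7, R9 */
    long r10    = r9 + r6;   /* lane 2: reads the old R9 */
    (void)new_r9;            /* visible next cycle, not here */
    return r10;
}
```

Since the two orders disagree, the assembler rejects the bundle rather than giving it a defined meaning.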
As noted, it does currently retain the use of forwarding/interlocks, and
behaves the same as scalar code regarding control flow.
While an argument could possibly be made for improving performance with
WEX by adding delay-slot branch instructions, these would have the
severe drawback of exhibiting different behavior in scalar mode (and
disallowing WEX in the delay slot would eliminate its main use-case).
Similar would apply to allowing ops with "delayed writeback" (where the
result of the operation would not be required to be visible until
several clock cycles later, but would otherwise allow execution to
proceed as if it were a single-cycle op).
IOW: Doing some stuff similar to the TMS320C6x family or similar...