On 11/27/2020 6:17 AM, Anton Ertl wrote:
> Kyle Hayes <kyle....@gmail.com> writes:
>> Could a TTA be made OoO?
>
> Sure, you can implement a TTA instruction set on an OoO
> microarchitecture, but I fail to see the point. The point of TTA is
> to go beyond VLIW by making even more of the microarchitecture
> explicit: It makes register file read and write ports explicit, and
> the latches and busses between the functional units. And it does that
> so that the compiler rather than the hardware allocates these
> resources.
>
Yeah...
VLIW-style ISAs generally don't seem particularly well suited to OoO
IMO, since pretty much the whole point of OoO is for the CPU to manage
all this stuff itself; in that case it makes more sense to stick with a
simpler RISC-like ISA design.
> The compiler schedules instructions such that each resource is not
> used more than once during a cycle. Now if you then perform
> instructions out-of-order, you need hardware to ensure that two
> instructions do not use the same resource in the same cycle. So you
> would eventually get the worst of both worlds: The compiler has to do
> all this work and comply with the restrictions, and then the hardware
> has to do it again.
>
> Register names seem to be a pretty good way to specify data flow, with
> the only disadvantage that it does not specify the death of a value
> (except by overwriting the register). The Mill's Belt is another way.
> A stack (or several) is another way. Hardware seems to do ok for
> allocating buses and register ports.
>
I recently had a thought along these lines: a VLIW with almost no
decode logic, ..., but the idea I came up with for doing so would have
pretty terrible code density.
But, yeah:
Bundle-based, likely either 128 or 256 bits;
Register IDs map directly to internal registers;
Say: R0..R63: GPRs, R64..R127: SPRs and CRs
R64: PC, R65: LR, R66: GBR, ...
The register fields map directly to hardware ports;
No pipeline interlocks, ...;
Probably a branch delay slot;
Almost no multi-cycle ops.
Likely, stalls for memory ops could be an explicit instruction.
In effect, memory ops would be an instruction sequence.
FPU ops would also be split up.
Latency would need to be explicit.
Haven't really expanded it out all that much, as this seems unlikely to
be particularly viable.
But, likely (256 bit bundle with 80 bit ops):
12 bits, opcode (6b major, 6b minor)
21 bits, register (2R,1W)
2 bits, predicate
12 bits, op-specific
33 bits, Immed
...
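To make the layout concrete, here is a minimal sketch of pulling the fields out of one 80-bit lane. The field *order* and LSB-first packing are my assumptions (the post only gives widths), as is splitting the 21-bit register field into three 7-bit IDs for 128 registers:

```c
#include <stdint.h>

/* Assumed packing of one 80-bit op, LSB-first:
 *   [11:0]  opcode (6b major, 6b minor)
 *   [32:12] registers (2 read + 1 write, 7 bits each)
 *   [34:33] predicate
 *   [46:35] op-specific
 *   [79:47] 33-bit immediate
 * Uses the GCC/Clang unsigned __int128 extension since the lane
 * spans more than 64 bits. */
typedef struct {
    uint16_t opcode;        /* 12 bits */
    uint8_t  rs1, rs2, rd;  /* 7 bits each */
    uint8_t  pred;          /* 2 bits */
    uint16_t opext;         /* 12 bits */
    uint64_t imm;           /* 33 bits */
} DecodedOp;

/* 'lo' holds lane bits 0..63, 'hi' holds bits 64..79. */
static DecodedOp decode_lane(uint64_t lo, uint16_t hi)
{
    unsigned __int128 op = ((unsigned __int128)hi << 64) | lo;
    DecodedOp d;
    d.opcode = (uint16_t)(op & 0xFFF);
    d.rs1    = (uint8_t)((op >> 12) & 0x7F);
    d.rs2    = (uint8_t)((op >> 19) & 0x7F);
    d.rd     = (uint8_t)((op >> 26) & 0x7F);
    d.pred   = (uint8_t)((op >> 33) & 0x3);
    d.opext  = (uint16_t)((op >> 35) & 0xFFF);
    d.imm    = (uint64_t)((op >> 47) & 0x1FFFFFFFFULL);
    return d;
}
```

The point of the exercise is that every field lands on a fixed position, so "decode" is little more than wire routing in hardware.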
A version with 128-bit bundles could be similar (3x 40 bits), but would
effectively fold immediate values into ops (they would take up a lane).
40 bit op:
12 bits, opcode (6b major, 6b minor)
21 bits, register (2R,1W)
2 bits, predicate
5 bits, op-specific
40 bit imm:
6 bits, op (NOP/IMMED)
1 bit , flag (NOP or IMMED)
33 bits, Immed
In this case, ops with immediate values could encode which lane contains
their immediate (IMM1, IMM2, IMM3, IMM12, IMM23).
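As a sketch of how an immediate-carrying lane might be recognized and read out, assuming LSB-first packing and (my assumption) major opcode 0 for NOP/IMMED, with paired lanes (IMM12/IMM23) concatenating into a 66-bit value:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 40-bit lane layout, LSB-first:
 *   [5:0]  opcode (NOP/IMMED)
 *   [6]    flag (0 = NOP, 1 = IMMED)
 *   [39:7] 33-bit immediate payload
 */
#define OP_NOP_IMM 0x00u   /* assumed major opcode for NOP/IMMED */

static bool lane_is_imm(uint64_t lane40)
{
    return ((lane40 & 0x3F) == OP_NOP_IMM) && ((lane40 >> 6) & 1);
}

static uint64_t lane_imm(uint64_t lane40)
{
    return (lane40 >> 7) & 0x1FFFFFFFFULL;   /* 33-bit payload */
}

/* IMM12/IMM23: two adjacent lanes pair into a 66-bit immediate,
 * low half in the first lane (needs unsigned __int128 to hold it). */
static unsigned __int128 paired_imm(uint64_t lo_lane, uint64_t hi_lane)
{
    return (unsigned __int128)lane_imm(lo_lane)
         | ((unsigned __int128)lane_imm(hi_lane) << 33);
}
```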
The major opcode would select which unit handles the op, and the minor
opcode selects which operation that unit should perform.
Eg:
NOP/IMM (Does Nothing or Holds Immed)
LOAD/STORE (Mem Op)
ALUA (ALU Op, One Cycle)
ALUB (ALU Op, Two Cycles)
MUL (Integer Multiplier)
SHAD (Integer Shift Unit)
BRAN (Branch Unit)
FPU (Floating Point Op)
...
...
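The major-opcode-to-unit mapping above could be as simple as the enum ordering itself; the specific numbering here is my guess, since the post doesn't assign values:

```c
/* Hypothetical numbering: the 6-bit major opcode picks the functional
 * unit directly; the minor opcode (not modeled here) picks the
 * operation within that unit. */
typedef enum {
    U_NOP_IMM,  /* does nothing, or holds an immediate */
    U_LDST,     /* memory op */
    U_ALUA,     /* one-cycle ALU op */
    U_ALUB,     /* two-cycle ALU op */
    U_MUL,      /* integer multiplier */
    U_SHAD,     /* integer shift unit */
    U_BRAN,     /* branch unit */
    U_FPU       /* floating-point op */
} Unit;

static Unit unit_for_major(unsigned major6)
{
    /* Unassigned major opcodes fall back to NOP here. */
    return (major6 <= U_FPU) ? (Unit)major6 : U_NOP_IMM;
}
```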
I don't expect it would save that much, though: most of the cost does
not appear to be in the instruction decoder.
Despite its seemingly large size, most of the decoder logic gets
synthesized into lookup tables (which, it turns out, the FPGA's LUTs
are pretty effective at).
In other news, I have managed to get some significant speedups in my
memory bus. I have also dropped the bus to 50MHz; it seems the bus
clock speed was not the main issue, and 50MHz makes timing easier.
From my fiddling, it seems effectively possible to get more memory
bandwidth by running the bus at 50 MHz, since the slower clock allows
using more complex cache logic with less forwarding than at 100MHz.
So, I have sort of ended up switching the various internal bus
interfaces over to 50MHz (albeit with the DDR controller operating at a
higher clock frequency internally).
It took a fair chunk of debugging to get it all semi-stable again.
Most issues were due to timing-related edge cases, many of which were in
turn due to prior "make this stuff pass timing checks" hacks. Turns out
there were a lot of edge cases which could either corrupt memory or
deadlock the bus in certain situations.
I beat on it for a while, and hopefully got it a little less buggy.
That, and I have also been able to revive the "full duplex mode", which
also helps with performance.
I did partly redesign the mechanism, though, to be a little more
general: it now relies on bus signaling to make full-duplex operations
work, rather than the L1 needing to be clever (and try to figure out
itself what scenarios the L2 could deal with). The original form of the
mechanism was fairly brittle and also could not deal with the MMU,
which was a motivation for this design change.
The basic premise remains the same, namely that in the L1<->L2
interface, cache lines may be passed in both directions in a single
operation by using two different address fields (one for the cache line
being loaded, and another for the cache line being stored).
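A rough sketch of what such a two-address bus operation might look like; the field names, the 64-byte line size, and the combined fill+writeback helper are all my assumptions (the post only describes the two-address idea):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding of one full-duplex L1<->L2 operation: two
 * address fields let a single transaction fetch one cache line while
 * writing back another. */
typedef struct {
    uint64_t load_addr;   /* line being loaded from L2 into L1 */
    uint64_t store_addr;  /* dirty line being written back to L2 */
    uint8_t  store_valid; /* set when a writeback rides along */
    uint8_t  data[64];    /* outgoing line on request, incoming on response */
} BusOp;

/* A miss that also needs to evict a dirty victim issues one combined
 * op, saving a round trip versus separate store-then-load requests. */
static BusOp make_fill(uint64_t miss_addr, uint64_t victim_addr,
                       int victim_dirty)
{
    BusOp op;
    memset(&op, 0, sizeof op);
    op.load_addr   = miss_addr;
    op.store_addr  = victim_dirty ? victim_addr : 0;
    op.store_valid = (uint8_t)(victim_dirty != 0);
    return op;
}
```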
Also, interestingly, with these changes, the estimated power use also
went down (from ~ 0.9W to 0.6W), possibly because "slower" logic needs
less power or something.
Not sure, but I am starting to suspect that 50MHz may be closer to the
local optimum operating speed of the FPGA or something...