
TTA (transport triggered architecture) CPUs?


Kyle Hayes

Nov 25, 2020, 10:38:02 PM
Recently, someone on this newsgroup (sorry, I can't find the post to
note who it was!) sent out a link to the Maxim MAXQ. I knew about a
project in the Netherlands (TU Delft I think) that used TTA processors,
but the MAXQ was the first I have seen that was commercially offered.

Since then I found mention that early forms of Arm GPUs were also TTA
processors, but I have not found any more information on them.

In thinking about TTA processors, I realized that you can (sort of)
think of each "instruction" as defining an edge in a dataflow graph.
Clearly this is mixed in with temporal behavior too.
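That dataflow-edge view can be made concrete: a TTA program is just a list of moves, "source port -> destination port", where writing to a unit's trigger port starts its operation. A minimal sketch (the port names and the single-adder machine here are my own invention, not any real TTA):

```python
# Minimal TTA-style "move machine" sketch: each instruction is an edge
# in the dataflow graph, moving a value from one port to another.
# Writing to the trigger port (alu.add_trig) starts the operation.
program = [
    ("imm.4",      "alu.op1"),       # constant 4 -> ALU operand latch
    ("reg.r2",     "alu.add_trig"),  # r2 -> trigger port: computes op1 + r2
    ("alu.result", "reg.r3"),        # result -> r3
]

def run(program, regs):
    latch = {}
    for src, dst in program:
        # Resolve the source port.
        if src.startswith("imm."):
            val = int(src[len("imm."):])
        elif src.startswith("reg."):
            val = regs[src[len("reg."):]]
        else:
            val = latch[src]             # e.g. "alu.result"
        # Resolve the destination port.
        if dst == "alu.op1":
            latch["alu.op1"] = val
        elif dst == "alu.add_trig":      # trigger: compute and latch result
            latch["alu.result"] = latch["alu.op1"] + val
        elif dst.startswith("reg."):
            regs[dst[len("reg."):]] = val
    return regs
```

With regs = {"r2": 5}, running the program leaves 9 (4 + 5) in r3; the temporal behavior shows up in the ordering of the moves.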

Could a TTA be made OoO? It seems like it could, ignoring whether this
is a good idea or not. Much like the Mill, it would seem unlikely that
you could have direct machine code compatibility between different
implementations at different performance levels. Perhaps I am simply
not creative enough to see it though.

Any pointers to other TTA products or projects appreciated!

Best,
Kyle

MitchAlsup

Nov 26, 2020, 10:52:27 AM
On Wednesday, November 25, 2020 at 9:38:02 PM UTC-6, Kyle Hayes wrote:
> Recently, someone on this newsgroup (sorry, I can't find the post to
> note who it was!) sent out a link to the Maxim MAXQ. I knew about a
> project in the Netherlands (TU Delft I think) that used TTA processors,
> but the MAXQ was the first I have seen that was commercially offered.
>
> Since then I found mention that early forms of Arm GPUs were also TTA
> processors, but I have not found any more information on them.
>
> In thinking about TTA processors, I realized that you can (sort of)
> think of each "instruction" as defining an edge in a dataflow graph.
> Clearly this is mixed in with temporal behavior too.
>
> Could a TTA be made OoO?

This only requires renamable ports to the function units and the renamer
to generate dynamic port assignments; and a function unit "picker".

Anton Ertl

Nov 27, 2020, 7:49:55 AM
Kyle Hayes <kyle....@gmail.com> writes:
>Could a TTA be made OoO?

Sure, you can implement a TTA instruction set on an OoO
microarchitecture, but I fail to see the point. The point of TTA is
to go beyond VLIW by making even more of the microarchitecture
explicit: It makes register file read and write ports explicit, and
the latches and busses between the functional units. And it does that
so that the compiler rather than the hardware allocates these
resources.

The compiler schedules instructions such that each resource is not
used more than once during a cycle. Now if you then perform
instructions out-of-order, you need hardware to ensure that two
instructions do not use the same resource in the same cycle. So you
would eventually get the worst of both worlds: The compiler has to do
all this work and comply with the restrictions, and then the hardware
has to do it again.

Register names seem to be a pretty good way to specify data flow, with
the only disadvantage that it does not specify the death of a value
(except by overwriting the register). The Mill's Belt is another way.
A stack (or several) is another way. Hardware seems to do ok for
allocating buses and register ports.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Nov 28, 2020, 3:12:52 AM
On 11/27/2020 6:17 AM, Anton Ertl wrote:
> Kyle Hayes <kyle....@gmail.com> writes:
>> Could a TTA be made OoO?
>
> Sure, you can implement a TTA instruction set on an OoO
> microarchitecture, but I fail to see the point. The point of TTA is
> to go beyond VLIW by making even more of the microarchitecture
> explicit: It makes register file read and write ports explicit, and
> the latches and busses between the functional units. And it does that
> so that the compiler rather than the hardware allocates these
> resources.
>

Yeah...

VLIW-style ISAs (in general) don't seem particularly well suited to
OoO, IMO, since pretty much the whole point of OoO is for the CPU to
manage all this stuff; it makes more sense to stick with a simpler
RISC-like ISA design in that case.


> The compiler schedules instructions such that each resource is not
> used more than once during a cycle. Now if you then perform
> instructions out-of-order, you need hardware to ensure that two
> instructions do not use the same resource in the same cycle. So you
> would eventually get the worst of both worlds: The compiler has to do
> all this work and comply with the restrictions, and then the hardware
> has to do it again.
>
> Register names seem to be a pretty good way to specify data flow, with
> the only disadvantage that it does not specify the death of a value
> (except by overwriting the register). The Mill's Belt is another way.
> A stack (or several) is another way. Hardware seems to do ok for
> allocating buses and register ports.
>

I recently had a thought along these lines: a VLIW with almost no
decode logic, ..., but the idea I came up with would have pretty
terrible code density.

But, yeah:
Bundle-based, likely either 128 or 256 bits;
Register ID's map directly to internal registers;
Say: R0..R63: GPRs, R64..R127: SPRs and CRs
R64: PC, R65: LR, R66: GBR, ...
The register fields map directly to hardware ports;
No pipeline interlocks, ...;
Probably a branch delay slot;
Almost no multi-cycle ops.
Likely, stalls for memory ops could be an explicit instruction.
In effect, memory ops would be an instruction sequence.
FPU ops would also be split up.
Latency would need to be explicit.


Haven't really expanded it out all that much, as this seems unlikely to
be particularly viable.

But, likely (256 bit bundle with 80 bit ops):
12 bits, opcode (6b major, 6b minor)
21 bits, register (2R,1W)
2 bits, predicate
12 bits, op-specific
33 bits, Immed
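For what it's worth, the field widths above (12 + 21 + 2 + 12 + 33 = 80) can be written down as a bit-packing sketch; the LSB-first field order and the 3x7-bit split of the register field are my assumptions, since the post only gives widths:

```python
# Field widths from the layout above: 12 opcode, 21 register (assumed
# 3 x 7-bit for 2R+1W over R0..R127), 2 predicate, 12 op-specific,
# 33 immediate = 80 bits. LSB-first order is an assumption.
def pack_op(opcode, rs1, rs2, rd, pred, spec, imm):
    assert opcode < (1 << 12) and pred < (1 << 2)
    assert spec < (1 << 12) and imm < (1 << 33)
    assert max(rs1, rs2, rd) < (1 << 7)      # register IDs R0..R127
    reg = rs1 | (rs2 << 7) | (rd << 14)      # 21-bit register field
    op = (opcode
          | (reg  << 12)
          | (pred << 33)
          | (spec << 35)
          | (imm  << 47))
    assert op < (1 << 80)                    # fits in one 80-bit lane
    return op
```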

...


A version with 128-bit bundles could be similar (3x 40 bits), but would
effectively fold immediate values into ops (they would take up a lane).

40 bit op:
12 bits, opcode (6b major, 6b minor)
21 bits, register (2R,1W)
2 bits, predicate
5 bits, op-specific
40 bit imm:
6 bits, op (NOP/IMMED)
1 bit , flag (NOP or IMMED)
33 bits, Immed

In this case, ops with immediate values could encode which lane contains
their immediate (IMM1, IMM2, IMM3, IMM12, IMM23).

The major/minor opcode would select which unit handles the op, and the
minor selects which operation that unit should perform.

Eg:
NOP/IMM (Does Nothing or Holds Immed)
LOAD/STORE (Mem Op)
ALUA (ALU Op, One Cycle)
ALUB (ALU Op, Two Cycles)
MUL (Integer Multiplier)
SHAD (Integer Shift Unit)
BRAN (Branch Unit)
FPU (Floating Point Op)
...
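The major/minor split amounts to a two-level dispatch; a sketch (the unit numbering and which 6 bits are "major" are my assumptions, since the post lists the units but not their codes):

```python
# Hypothetical numbering of the functional units listed above;
# the actual opcode assignments are invented for illustration.
UNITS = ["NOP/IMM", "LOAD/STORE", "ALUA", "ALUB",
         "MUL", "SHAD", "BRAN", "FPU"]

def route_op(op):
    major = op & 0x3F          # selects which unit handles the op
    minor = (op >> 6) & 0x3F   # selects the operation within that unit
    unit = UNITS[major] if major < len(UNITS) else "NOP/IMM"
    return unit, minor
```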

...

I don't expect it would save that much though:
Most of where the cost is going does not appear to be in the instruction
decoder.

Despite its seemingly large size, most of the decoder logic is magically
transformed into lookup tables (which, it turns out, the FPGA's LUTs are
pretty effective at).




In other news, I have been getting some significant speedups in my
memory bus (I have also dropped the bus to 50MHz, as it seems the bus
clock speed was not the main issue, and 50MHz makes timing easier).

From my fiddling, I have noted that it is effectively possible to get
more memory bandwidth by running the bus at 50 MHz, which allows using
more complex cache logic with less forwarding than at 100MHz.

So, I have sort of ended up switching the various internal bus
interfaces over to 50MHz (albeit with the DDR controller operating at a
higher clock frequency internally).

It took a fair chunk of debugging to get it all semi-stable again.

Most issues were due to timing-related edge cases, many of which were in
turn due to prior "make this stuff pass timing checks" hacks. Turns out
there were a lot of edge cases which could either corrupt memory or
deadlock the bus in certain situations.

I beat on it for a while, and hopefully got it a little less buggy.


That, and I have also been able to revive the "full duplex mode", which
also helps with performance.

I did partly redesign the mechanism to be a little more general,
relying on bus signaling to make full-duplex operations work rather
than the L1 needing to be clever (and try to figure out itself which
scenarios the L2 could deal with). The original form of the mechanism
was fairly brittle and also could not deal with the MMU, which was a
motivation for this design change.

The basic premise remains the same, namely that in the L1<->L2
interface, cache lines may be passed in both directions in a single
operation by using two different address fields (one for the cache line
being loaded, and another for the cache line being stored).
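As a sketch of that interface (field and function names are illustrative, not the actual BJX2 signals): a single request carries both addresses, and the L2 absorbs the writeback and returns the requested line in one exchange.

```python
from dataclasses import dataclass

@dataclass
class DuplexOp:
    """One full-duplex L1<->L2 exchange, as described above."""
    load_addr: int     # address of the cache line being loaded
    store_addr: int    # address of the cache line being evicted
    store_data: bytes  # contents of the evicted (dirty) line

def l2_service(l2_lines, op):
    """Toy L2 (dict of addr -> 16-byte line): one call models one
    full-duplex operation, moving lines in both directions."""
    l2_lines[op.store_addr] = op.store_data           # absorb writeback
    return l2_lines.get(op.load_addr, b"\x00" * 16)   # return new line
```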


Also, interestingly, with these changes, the estimated power use also
went down (from ~ 0.9W to 0.6W), possibly because "slower" logic needs
less power or something.

Not sure, but I am starting to suspect that 50MHz may be closer to the
local optimum operating speed of the FPGA or something...

MitchAlsup

Nov 28, 2020, 12:54:10 PM
You ALSO need to directly specify that another operand register in this
same bundle is to consume the result some instruction in this bundle
is going to produce.

You also need the ability to directly specify that after someone in this
bundle consumes said result, that it is dead and does not need to be
written.

You also need to annotate that some of the instructions are after a branch
and are discarded if the branch is taken but remain alive if the branch is
not taken--and the ability to rename both ways (multiple ways depending
on bundle width)

And since you are trying to "take" a branch every cycle (short loops),
you need an index into the cache for when the predictor says take, so
you don't have to compute the branch address.

We did all of the above in Mc 88120 and packed up to 6 instructions
into about 240-ish bits.

> The register fields map directly to hardware ports;

And to forwarding paths.

> No pipeline interlocks, ...;

You will never get rid of all of them.....

> Probably a branch delay slot;
> Almost no multi-cycle ops.

How do you do floating point?

> Likely, stalls for memory ops could be an explicit instruction.

Here is where I like the CDC 6600 solution.........

> In effect, memory ops would be an instruction sequence.

Does not have to be a sequence, the mere calculation of an address
accesses memory, and one waits on inbound data with a register
dependency and waits on a result for outbound data dependency.

> FPU ops would also be split up.

Yech!

> Latency would need to be explicit.
>
>
> Haven't really expanded it out all that much, as this seems unlikely to
> be particularly viable.
>
> But, likely (256 bit bundle with 80 bit ops):

Mc 88120 used 24-ish bits, with two 11-bit ICache indexes and a 3-bit
taken field that indicates how to take the various predictors and
determine a single fetch index. Then each register field was expanded
from 5 bits to 6 bits, and 11-bit minor opcodes moved to the (now
11-bit) major opcode field. Each instruction was placed in the packet
in the "slot" that could perform that instruction. There were 3 memory
slots, 1 FP ADD slot, 1 FP MUL slot and 1 Branch slot. Each slot had 16
reservation stations, and the execution window was 16 packets deep.

> 12 bits, opcode (6b major, 6b minor)

About right

> 21 bits, register (2R,1W)

How are you going to do FMAC ?

BGB

Nov 28, 2020, 2:30:05 PM
Or, not allow these cases; eg, no intra-bundle dependencies allowed.

I guess one possibility (to improve code density), could be to allow
multiple bundle sizes, but this partly defeats the point of using
bundles of this sort.


> You also need to annotate that some of the instructions are after a branch
> and are discarded if the branch is taken but remain alive if the branch is
> not taken--and the ability to rename both ways (multiple ways depending
> on bundle width)
>

Could be done via predicate bits in the delay slot.


> And since you are trying to "take" a branch every cycle (short loops)
> you need an index into the cache when the predictor says take so
> you don't have to compute the branch addrress.
>
> We did all of the above in Mc 88120 and packed up to 6 instructions
> into about 240-ish bits.
>

I was thinking 3x in 128 or 256 bits.

Granted, 3 ops in 256 bits would be pretty terrible in terms of code
density, but would be pretty close to the contents of the pipeline in my
existing core.

3 ops in 128 bits is at least "plausible".


>> The register fields map directly to hardware ports;
>
> And to forwarding paths.
>
>> No pipeline interlocks, ...;
>
> You will never get rid of all of them.....
>

Could at least try.

In some sense, interlocks are a byproduct of trying to both pipeline
multi-cycle ops and allow the results to be used immediately in the next
instruction.

If we simply disallow this, then no interlocks can occur; either the op
stalls the pipeline, or the result does not appear until later.
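That rule can be phrased as a static check: with no interlocks, a schedule is legal only if no op reads a register before its producer's explicit latency has elapsed. A sketch (the latency numbers are invented for illustration):

```python
# Toy check for the no-interlock rule: a result written at cycle C by an
# op with latency L may not be read before cycle C + L. With no hardware
# interlocks, the compiler must guarantee this statically.
LATENCY = {"ADD": 1, "MUL": 3, "LOAD": 2}  # illustrative latencies

def schedule_legal(ops):
    """ops: list of (issue_cycle, opname, dest_reg, src_regs),
    sorted by issue cycle."""
    ready = {}  # register -> first cycle its value is readable
    for cycle, name, dest, srcs in ops:
        for r in srcs:
            if cycle < ready.get(r, 0):
                return False     # would require a hardware interlock
        ready[dest] = cycle + LATENCY[name]
    return True
```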


>> Probably a branch delay slot;
>> Almost no multi-cycle ops.
>
> How do you do floating point?
>

Probably an op to start the FPU op, and another instruction some-odd
cycles later to store back the result (the destination register for the
first op being ZZR in this case).

FMUL R49, R61
NOP
NOP
NOP
NOP
NOP
FSTR R62


The FPU could be pipelined though, so that these can overlap:

FMUL R24, R32
FMUL R25, R33
FMUL R26, R34
FMUL R27, R35
NOP
NOP
FSTR R48 //R24*R32
FSTR R49 //R25*R33
FSTR R50
FSTR R51
...


>> Likely, stalls for memory ops could be an explicit instruction.
>
> Here is where I like the CDC 6600 solution.........
>
>> In effect, memory ops would be an instruction sequence.
>
> Does not have to be a sequence, the mere calculation of an address
> accesses memory, and one waits on inbound data with a register
> dependency and waits on a result for outbound data dependency.
>

The idea is for a "dead simple" CPU...
Otherwise, this idea is pretty much completely pointless...


>> FPU ops would also be split up.
>
> Yech!
>

Could be worse...


>> Latency would need to be explicit.
>>
>>
>> Haven't really expanded it out all that much, as this seems unlikely to
>> be particularly viable.
>>
>> But, likely (256 bit bundle with 80 bit ops):
>
> Mc 88120 used 24-ish bits, with two 11-bit ICache indexes and a 3-bit
> taken field that indicates how to take the various predictors and
> determine a single fetch index. Then each register field was expanded
> from 5 bits to 6 bits, and 11-bit minor opcodes moved to the (now
> 11-bit) major opcode field. Each instruction was placed in the packet
> in the "slot" that could perform that instruction. There were 3 memory
> slots, 1 FP ADD slot, 1 FP MUL slot and 1 Branch slot. Each slot had 16
> reservation stations, and the execution window was 16 packets deep.
>

OK.

>> 12 bits, opcode (6b major, 6b minor)
>
> About right
>

This is currently what my BJX2 core ended up with internally, seems to work.


>> 21 bits, register (2R,1W)
>
> How are you going to do FMAC ?
>

A NOP in Lane 3 is used to encode the 3rd parameter.

Something like a SIMD op could look something like:
NOPA R26, R34 | PFMULSS R25, R33 | PFMULSS R24, R32
NOP
NOP
NOP
NOP
NOP
NOP | FSTR R49 | FSTR R48


Actually, a similar trick is used for BJX2, and was a motivation for
going 3-wide rather than 2-wide. The ability to have 3 source operands
in a 2-wide bundle was an advantage.
...
Not sure if I can squeeze some more speed out of the bus.

With the current settings, was getting memcpy speeds of:
9 MB/s DRAM, 11 MB/s L2, 113 MB/s L1.

This is with a 50MHz bus, and with settings which are able to pass
timing. I had seen faster speeds in some of my experiments while trying
to debug stuff (eg, 11 and 14 MB/s, also at 50MHz), so it may be
possible to make it a little faster with some further tweaks.

I had also previously seen 12 and 18 MB/s at 100MHz, but that was with
settings which could not pass timing.


Currently have this stuff (in full-duplex operation) stable enough to
run Doom at least...

Kyle Hayes

Dec 1, 2020, 1:44:26 AM
On 11/28/20 12:12 AM, BGB wrote:
> On 11/27/2020 6:17 AM, Anton Ertl wrote:
>> Kyle Hayes <kyle....@gmail.com> writes:
>>> Could a TTA be made OoO?
>>
>> Sure, you can implement a TTA instruction set on an OoO
>> microarchitecture, but I fail to see the point.  The point of TTA is
>> to go beyond VLIW by making even more of the microarchitecture
>> explicit: It makes register file read and write ports explicit, and
>> the latches and busses between the functional units.  And it does that
>> so that the compiler rather than the hardware allocates these
>> resources.
>>
>
> Yeah...
>
> VLIW style ISA's (in general) don't seem particularly well suited for
> OoO IMO, since pretty much the whole point of OoO is for the CPU to
> manage all this stuff, so it makes more sense to stick with a simpler
> RISC-like ISA design in this case.
>
>

I think Anton's point here was not that a TTA system is a VLIW, it was
that it exposes more of the inner workings of the CPU than even a VLIW
ISA does.

Again, I am not saying it is a _good_ idea to make a TTA system OoO. I
was just asking if it is possible. From the responses I got, I think
the answer was yes.

I am interested in the overlap between TTA and data flow. Seems like
there is something there (warts and all).

Best,
Kyle

Paul A. Clayton

Dec 1, 2020, 10:45:03 AM
On Friday, November 27, 2020 at 7:49:55 AM UTC-5, Anton Ertl wrote:
> Kyle Hayes <kyle....@gmail.com> writes:
>> Could a TTA be made OoO?
> Sure, you can implement a TTA instruction set on an OoO
> microarchitecture, but I fail to see the point. The point of TTA is
> to go beyond VLIW by making even more of the microarchitecture
> explicit: It makes register file read and write ports explicit, and
> the latches and busses between the functional units. And it does that
> so that the compiler rather than the hardware allocates these
> resources.

I have wondered if a TTA-'inspired' architecture might be useful,
i.e., encoding communication more directly. Even at the level of
a core, communication is a significant factor, so performing some
routing work for instructions and results at compile time might be
helpful (*if* the communication/storage overhead for this
compile-time information is low [warm code might be the most
appropriate; cold code "does not matter" and hot code can more
easily cache hardware-performed optimizations] and the compatibility
factor is well-managed).

This would not be TTA (operations would not be encoded as
locations) but would use routes for results. Such might also be
not hostile to out-of-order execution.

Even if not entirely done at compile time, there may be formats
that better facilitate cache-install-time optimizations.