watched the video, and admittedly it is a little hard for me to fully
understand it (this not really being my main area).
one thought that I did start wondering about is if something like this
could be possible (possibly simpler?):
initially, you grab, say, 256 bits.
either everything executes all at once, or the block is split in half;
and either these sub-blocks execute at once, or they are spread over
multiple cycles.
the other part could be that, rather than execution proceeding in both
directions, it could go via an even/odd path.
for example, something sort of like:
  1  Block A Read (E)
  2  Block B Read (O)
     Block A Split 2x128 bit
  3  Sub-Block A1 Executes
     Block B Split 4x64
     Block C Read (E)
  4  Sub-Block A2 Executes
     Sub-Block B1 Executes
     Block C Direct
     Block D Read (E)
  5  Sub-Block C1 Executes
     Sub-Block B2 Executes
     Block D Split 4x64
  6  Sub-Block D1 Executes
     Sub-Block B3 Executes
     Block F Read (O)
  7  Sub-Block D2 Executes
     Sub-Block B4 Executes
     Block F Split 2x64
  8  Sub-Block D3 Executes
     Sub-Block F1 Executes
  ...
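
a rough sketch of how that listing could be generated, in Python. the timing rule is only inferred from the listing, not stated anywhere: a block is read, gets split (or passed through directly) one cycle later, and its sub-blocks then execute one per cycle, with the next block on the same path read early enough that there is no gap. the names `schedule`, `paths` and `first_read` are made up for illustration, and block F is assumed to have 2 sub-blocks (from the "2x" in the listing).

```python
from collections import defaultdict

def schedule(paths, first_read):
    """Build {cycle: [event, ...]} for per-path block lists.

    paths:      {path: [(block_name, n_sub_blocks), ...]}
    first_read: {path: cycle in which that path reads its first block}
    """
    events = defaultdict(list)
    for path, blocks in paths.items():
        read = first_read[path]
        for name, n in blocks:
            events[read].append(f"Block {name} Read ({path})")
            # assumed: one split/decode cycle, then one sub-block per cycle
            step = "Direct" if n == 1 else "Split"
            events[read + 1].append(f"Block {name} {step}")
            for k in range(n):
                events[read + 2 + k].append(f"Sub-Block {name}{k + 1} Executes")
            # the next block on this path is read early enough that its
            # first sub-block follows this block's last one with no gap
            read += n
    return dict(events)

# even path carries A, C, D; odd path carries B, F (as in the listing)
paths = {"E": [("A", 2), ("C", 1), ("D", 4)], "O": [("B", 4), ("F", 2)]}
timeline = schedule(paths, {"E": 1, "O": 2})
for cycle in sorted(timeline):
    print(cycle, ", ".join(timeline[cycle]))
```

under those assumptions the events for cycles 1 through 8 come out the same as in the listing (the order within a cycle may differ).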
further, if needed, it could be split more: say we have 4 paths X, Y,
Z, W, which are in turn interleaved (possibly with 4 different
instruction pointers).
likewise, each block would have a particular instruction form and hold N
opcodes.
possibly block layouts could be something like:
0AAXXXXX_XXXXXXXX_XXXXXXXX_XXXXXXXX_TBBXXXXX_XXXXXXXX_XXXXXXXX_XXXXXXXX
1AAXXXXX_XXXXXXXX_TBBXXXXX_XXXXXXXX_TCCXXXXX_XXXXXXXX_TDDXXXXX_XXXXXXXX
2AAXXXXX_TBBXXXXX_TCCXXXXX_TDDXXXXX_TEEXXXXX_TFFXXXXX_TGGXXXXX_THHXXXXX
3AAXTBBX_TCCXTDDX_TEEXTFFX_TGGXTHHX_TIIXTJJX_TKKXTLLX_TMMXTNNX_TOOXTPPX
4AAXXXXX_XXXXXXXX_XXXXXXXX_XXXXXXXX_0BBXXXXX_XXXXXXXX_XXXXXXXX_XXXXXXXX
5AAXXXXX_XXXXXXXX_TBBXXXXX_XXXXXXXX_1CCXXXXX_XXXXXXXX_TDDXXXXX_XXXXXXXX
6AAXXXXX_TBBXXXXX_TCCXXXXX_TDDXXXXX_2EEXXXXX_TFFXXXXX_TGGXXXXX_THHXXXXX
7AAXTBBX_TCCXTDDX_TEEXTFFX_TGGXTHHX_3IIXTJJX_TKKXTLLX_TMMXTNNX_TOOXTPPX
8-
9AAXXXXX_XXXXXXXX_1BBXXXXX_XXXXXXXX_1CCXXXXX_XXXXXXXX_1DDXXXXX_XXXXXXXX
AAAXXXXX_TBBXXXXX_2CCXXXXX_TDDXXXXX_2EEXXXXX_TFFXXXXX_2GGXXXXX_THHXXXXX
BAAXTBBX_TCCXTDDX_3EEXTFFX_TGGXTHHX_3IIXTJJX_TKKXTLLX_3MMXTNNX_TOOXTPPX
where the first nibble mostly indicates block format (encoding both a
layout and temporal split), in this case typically with 8-bit opcodes.
T would be special bits (to maintain the pattern; possibly they would
give bits to adjacent opcodes to allow a 6/10-bit opcode or similar),
and X is operand payload.
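
to make the field layout concrete, here is a small Python sketch that cuts one of the layout rows into op slots. it reads each character of a row as one nibble (64 nibbles = 256 bits) and assumes every slot is a tag nibble, then a 2-nibble opcode, then payload. the helper name `split_ops` and the idea of passing the slot width in as a parameter are made up for illustration.

```python
def split_ops(block, nibbles_per_op):
    """Cut a 64-nibble block row into per-op fields.

    Each slot is assumed to be: 1 tag nibble (the format nibble, a
    secondary tag, or T), 2 opcode nibbles, then operand payload.
    """
    nibbles = block.replace("_", "")
    if len(nibbles) != 64:
        raise ValueError("expected 64 nibbles (256 bits)")
    if nibbles_per_op < 4 or 64 % nibbles_per_op:
        raise ValueError("slot width must be at least 4 and divide 64")
    ops = []
    for i in range(0, 64, nibbles_per_op):
        slot = nibbles[i:i + nibbles_per_op]
        ops.append({"tag": slot[0], "opcode": slot[1:3], "payload": slot[3:]})
    return ops

# layout row 1 from above: four ops of 16 nibbles (64 bits) each
row = "1AAXXXXX_XXXXXXXX_TBBXXXXX_XXXXXXXX_TCCXXXXX_XXXXXXXX_TDDXXXXX_XXXXXXXX"
ops = split_ops(row, 16)
for op in ops:
    print(op["tag"], op["opcode"], len(op["payload"]) * 4, "payload bits")
```

how the T nibbles would actually lend bits to neighboring opcodes is left open here, as it is in the description.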
0-3 would be single-cycle blocks, 4-7 would take 2 cycles, and 8-B
would take 4 cycles. in split blocks, the secondary tags indicate the
instruction layout of a given sub-block.
0 would have 128 bits per op, 1 has 64 bits/op, 2 has 32 bits/op, and 3
has 16 bits/op.
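
the numbering can be read as two 2-bit fields, which a short Python sketch makes explicit. this is an assumption drawn from the layout rows rather than something stated: the high two bits of the format nibble pick the cycle count, and the low two bits pick the op size (for 4-B this matches the secondary tags shown in the rows). `decode_format` is a made-up name.

```python
BLOCK_BITS = 256

def decode_format(tag):
    """Decode the leading format nibble into timing and layout numbers.

    Returns None when an op would not fit inside one sub-block.
    """
    if not 0 <= tag <= 0xB:
        raise ValueError("format nibbles above B are not described")
    cycles = 1 << (tag >> 2)       # 0-3 -> 1, 4-7 -> 2, 8-B -> 4
    op_bits = 128 >> (tag & 3)     # 0 -> 128, 1 -> 64, 2 -> 32, 3 -> 16
    sub_block_bits = BLOCK_BITS // cycles
    if op_bits > sub_block_bits:
        return None
    return {
        "cycles": cycles,
        "sub_block_bits": sub_block_bits,
        "op_bits": op_bits,
        "ops_per_block": BLOCK_BITS // op_bits,
    }

for tag in range(0xC):
    print(format(tag, "X"), decode_format(tag))
```

under this reading, format 8 would need a 128-bit op inside a 64-bit sub-block, which would be consistent with the empty "8-" row.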
or such...