It is really easy to get confused if you try to use terminology that
doesn't apply to these kinds of machines. Here's fairly standard
terminology for wide-issue machines (such as VLIW, EPIC, and Mill):
An *operation* is the individual add, load, branch, etc. performed by a
functional unit.
An *instruction* is the logical unit of issue, comprising all operations
that *must* issue together.
A *bundle* is the unit of physical decode, comprising all operations
that *must* enter decode together.
For a legacy machine, such as x86 or RISC, operation == instruction ==
bundle, and the bundle and instruction sizes are one operation.
For a VLIW, operation <= instruction == bundle.
For an EPIC (Itanium), operation <= instruction and operation <= bundle,
but bundle != instruction (except by accident).
The Mill extends this with new concepts:
A *half-bundle* is the unit of physical decode on one side.
A *block* is a new physical unit, between operation and half-bundle,
comprising a group of operations, all of the same encoded size, and
generally corresponding to a particular kind of execution unit, such as
an ALU or a load/store unit.
So for a Mill:
An instruction == 2 * half-bundle
A half-bundle == N * block (N == 3 on the Mill, but other values are possible)
A block == M * operation (M == 0..max, where max is configured per block
for each Mill member).
Every cycle, a Mill decodes one instruction comprising two half-bundles
(either or both of which may be empty), each composed of three blocks
(any or all of which may be empty) of operations. The sum of the maximum
number of operations possible in all six blocks is the MIMD width of the
machine, and corresponds to the number of execution functional units.
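To make the hierarchy concrete, here is a minimal Python sketch. Only the
shape (one instruction, two half-bundles, three blocks per half-bundle, a
configured maximum per block) comes from the description above. The class
names, the "other" side label, and every per-block maximum are made up for
illustration and do not describe any real Mill member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str     # hypothetical label, for readability only
    max_ops: int  # configured maximum operation count for this block


@dataclass(frozen=True)
class HalfBundle:
    blocks: tuple  # N blocks; N == 3 in the description above


@dataclass(frozen=True)
class Instruction:
    halves: tuple  # exactly two half-bundles, one per side


def mimd_width(fmt):
    """Sum the per-block maxima over all blocks of both half-bundles."""
    return sum(b.max_ops for half in fmt.halves for b in half.blocks)


# Invented maxima for an imaginary family member.
exu_side = HalfBundle((Block("exu-1", 2), Block("exu-2", 4), Block("exu-3", 2)))
other_side = HalfBundle((Block("other-1", 2), Block("other-2", 1), Block("other-3", 1)))
fmt = Instruction((exu_side, other_side))

width = mimd_width(fmt)
print(width)  # 12 for the invented maxima above
```

Any block may hold anywhere from zero operations up to its maximum in a
given instruction, so the width is an upper bound on what one instruction
can carry, not what every instruction does carry.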
> From this, I was thinking that you'd need to have
> redundant encodings for some common operations for
> each encoding size -- e.g. the 11-bit operation
> encoding format would need an add instruction, but
> so would the 12- and 13-bit formats. If that is
> not so, some formats would be impoverished by not
> having these common operations.
Each kind of operation has a native block in the encoding. Thus all add
operations will always encode in block 2 of the three in the exu
half-bundle. The maximum number of operations that can exist in, and be
decoded from, that particular block corresponds to the number of adders
that the execution units have; it is not possible to encode more add
operations than you have adders, and so on. Hence there is no point to
permitting an add to be encoded in other blocks too; there would not be
an adder to give it to.
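A toy encoder shows why a second home for add would buy nothing. This is
only a sketch: the placement of add in block 2 of the exu half-bundle is
from the text above, but the capacity of two adders, the table names, and
the error type are assumptions made up for the example.

```python
class BlockFullError(ValueError):
    """Raised when a block is asked to hold more ops than it has units for."""


# Native block for each kind of operation: (side, block number).
NATIVE_BLOCK = {"add": ("exu", 2)}

# Assumed capacity: this imaginary member has two adders.
BLOCK_CAPACITY = {("exu", 2): 2}


def encode_instruction(ops):
    """Place each operation in its native block, refusing to overfill it."""
    blocks = {key: [] for key in BLOCK_CAPACITY}
    for op in ops:
        key = NATIVE_BLOCK[op]
        if len(blocks[key]) >= BLOCK_CAPACITY[key]:
            raise BlockFullError(
                f"{op}: block {key} already holds {BLOCK_CAPACITY[key]} operations"
            )
        blocks[key].append(op)
    return blocks


two_adds = encode_instruction(["add", "add"])  # fits: one add per adder
```

Asking for a third add in the same instruction raises BlockFullError; in
this model the third add simply has to go into the next instruction.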
>>> Is there any functional grouping of instructions
>>> on the basis of size? For example, do you group
>>> bit manipulation in one format, fused math ops in
>>> another format, etc.?
>>
>> Yes, grouping is based in part on size. It is also based on when we need
>> the decode, because block-1 decodes (on both sides) are available a
>> cycle before block-2 and block-3 decodes. We used to have an exu
>> block-4, which took yet another cycle, but we merged two of the blocks
>> and now have only three.
>
> Here you go again using block-1 and block-2 terminology.
> What's that?
>
I hope the definitions above make it clear. I think you may be
thinking of a *many-issue* machine (like an x86) rather than a
*wide-issue* machine (like a VLIW or a Mill). These two notions are very
different.
On a many-issue machine, if your code has sixteen adds in a row and your
instruction buffer (to use typical OOO terminology) has sixteen
positions then you decode all sixteen into the buffer and then, each
cycle, issue as many adds from the instruction buffer as you have
adders, say two at a time over the next eight cycles.
In a wide-issue machine, you simply cannot encode more operations than
the machine has functional units of that kind, and there are no
instruction buffers. If you have two adders (as in this example) and sixteen
adds to do then you encode eight different instructions, each containing
two add operations, and issue them one instruction at a time over eight
cycles.
Here both the wide-issue and the many-issue machines have taken eight
cycles to get sixteen adds through two adders. What is different is that
the wide-issue machine does not need instruction buffers; issue
immediately follows decode, and the translated operations feed directly to
the adders without needing buffering. Given that scheduling and issue on
a many-issue machine adds at least two cycles to the pipeline, and given
how much of a power and area hog the instruction buffers are, this is a big
win.
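The sixteen-adds example can be sketched in a few lines of Python. This is
a deliberately crude model with made-up function names: it counts issue
cycles only and ignores pipeline depth, dependencies, and decode bandwidth,
so it shows the equal cycle counts but not the extra pipeline stages or the
power and area cost.

```python
from collections import deque


def wide_issue_schedule(ops, units):
    """Pack ops into instructions ahead of time, at most `units` ops each."""
    return [ops[i:i + units] for i in range(0, len(ops), units)]


def many_issue_cycles(ops, units, buffer_size):
    """Decode ops into an instruction buffer, then issue up to `units` per cycle."""
    pending = deque(ops)
    buffer = deque()
    cycles = 0
    while pending or buffer:
        # Decode: fill the buffer as far as it will go.
        while pending and len(buffer) < buffer_size:
            buffer.append(pending.popleft())
        # Issue: hand at most one buffered op to each free unit.
        for _ in range(min(units, len(buffer))):
            buffer.popleft()
        cycles += 1
    return cycles


adds = ["add"] * 16
wide = wide_issue_schedule(adds, units=2)
print(len(wide))                                       # 8 instructions, one per cycle
print(many_issue_cycles(adds, units=2, buffer_size=16))  # 8 cycles
```

In the wide-issue case the packing is done before the program runs, so the
list of instructions is all the hardware ever sees; in the many-issue case
the buffer and the per-cycle selection loop are work the hardware does at
run time.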
Hope this clears things up.
Ivan