Mill instruction set encoding

Mark Thorson

unread,

Jun 4, 2013, 12:37:39 PM6/4/13

to

Because you require all instructions in a bundle
to be the same size, doesn't that mean common
instructions like move, add, branch (or skip),
no-op, etc. will need to have redundant encodings
in each size? For example, the 9-bit format
will have encodings for all of these, as will
the 10-bit format, 11-bit, etc.?

Is there any functional grouping of instructions
on the basis of size? For example, do you group
bit manipulation in one format, fused math ops in
another format, etc.? I'd guess you don't, because
you'd want to run different kinds of operations
in parallel, so you'd want them to reside in
different formats rather than all in the same
format. But that creates other problems, like
when operation A you want to run in parallel with
operation B are in different formats.

You split the instruction set into two parts
handled by two separate caches and decoders.
Is the instruction size the same in each half,
e.g. are 10-bit computation instructions always
paired with 10-bit I/O instructions?

Ivan Godard

unread,

Jun 4, 2013, 12:06:55 PM6/4/13

to

On 6/4/2013 9:37 AM, Mark Thorson wrote:
> Because you require all instructions in a bundle
> to be the same size, doesn't that mean common
> instructions like move, add, branch (or skip),
> no-op, etc. will need to have redundant encodings
> in each size? For example, the 9-bit format
> will have encodings for all of these, as will
> the 10-bit format, 11-bit, etc.?

We do not require all operations in a bundle to be the same size. We
require all operations in a block to be the same size, but the the
different blocks in a bundle may (and in practice do) have different
operation sizes.

Any particular opcode can exist in only one block. Thus exu block 2 is
the home for all add ops (and other ops that encode and time like adds).
You simply cannot encode them in any other block - there are no bit
patterns for the purpose. Likewise a branch op will only be in flow block 1.

> Is there any functional grouping of instructions
> on the basis of size? For example, do you group
> bit manipulation in one format, fused math ops in
> another format, etc.?

Yes, grouping is based in part on size. It is also based on when we need
the decode, because block-1 decodes (on both sides) are available a
cycle before block-2 and block-3 decodes. We used to have an exu
block-4, which took yet another cycle, but we merged two of the blocks
and now have only three.

I'd guess you don't, because
> you'd want to run different kinds of operations
> in parallel, so you'd want them to reside in
> different formats rather than all in the same
> format.

Remember each block can have as many operations as you have functional
units for that kind of operation. For example, Gold has eight ALUs, so
it can issue eight adds (subs, shifts, compares, etc.) at a time. So exu
block 2, where ALU ops live, can have as many as eight ALU operations in
it on a Gold. Because all ALU ops are the same size and nearly the same
format, we can use standard fixed-length (i.e. RISC-like) decodes to
pick the eight ops out of the block easily. It's isolating the blocks
that is hard, and new.

But that creates other problems, like
> when operation A you want to run in parallel with
> operation B are in different formats.

The will be in different blocks, decode separately, and issue together.

> You split the instruction set into two parts
> handled by two separate caches and decoders.
> Is the instruction size the same in each half,
> e.g. are 10-bit computation instructions always
> paired with 10-bit I/O instructions?
>

No, the bit-level encodings are completely different between the two
sides. On some members it could be that an exu block-element size is the
same as a flow side block-element size, but that would be coincidence.

MitchAlsup

unread,

Jun 4, 2013, 12:12:47 PM6/4/13

to

On Tuesday, June 4, 2013 11:06:55 AM UTC-5, Ivan Godard wrote:
> Any particular opcode can exist in only one block. Thus exu block 2 is
> the home for all add ops (and other ops that encode and time like adds).

So, if you run into a large number of integer Add/Sub/And/Or/Xor instructions
in a row, the width of the machine drops from 30+ to the width of the exu
block 2 <sub>bundle?

Mitch

Stephen Fuld

unread,

Jun 4, 2013, 12:32:22 PM6/4/13

to

As a test of my understanding, let me try to respond; then Ivan, please
correct my response for both Mitch and me. :-)

The answer is yes, sort of. But the width of the exu block sub bundle
is determined by the number of ALUs. So while you can't decode more ALU
instructions that the block contains, there wouldn't be any advantage to
being able to do so, given that you could only execute the number that
you decode.

As for the 30+ number, it isn't 30+ arbitrary instructions. You only
get that high if the mix of instructions "matches" the mix of functional
unit types in the machine. Thus, in your example, if you had a run of
hundreds of ALU type instructions in a row, and you had a machine with 8
ALUs, they you would decode, and execute, 8 instructions per cycle.
Other models of the Mill would have numbers different from 8, but
whatever the number, it applies to the number of ALUs and the number of
ALU instructions decoded per cycle.

Ivan, is that close? :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard

unread,

Jun 4, 2013, 12:35:46 PM6/4/13

to

On 6/4/2013 9:12 AM, MitchAlsup wrote:

That's one way to look at it. What you actually are dropping to is
however many hardware ALUs you have. If you have 20 adds to do and only
four adders then it's going to take five cycles to do them all, no
matter how they are encoded.

Of course, if you are adder-limited then you still have all these other
FUs and their encoding slots in which you can place other ops if you
have a use for them. Except in very power-constrained contexts, we
expect to fill a lot of the otherwise-unused machine with speculatively
executed ops. But there will always be a pinch point somewhere; you try
and configure so that typical code keeps most of the machine busy most
of the time.

Ivan

Ivan Godard

unread,

Jun 4, 2013, 12:38:46 PM6/4/13

to

Bullseye.

I wish I had read your response before I had written my own :-)

Mark Thorson

unread,

Jun 5, 2013, 10:51:55 PM6/5/13

to

Ivan Godard wrote:
>
> On 6/4/2013 9:37 AM, Mark Thorson wrote:
> > Because you require all instructions in a bundle
> > to be the same size, doesn't that mean common
> > instructions like move, add, branch (or skip),
> > no-op, etc. will need to have redundant encodings
> > in each size? For example, the 9-bit format
> > will have encodings for all of these, as will
> > the 10-bit format, 11-bit, etc.?
>
> We do not require all operations in a bundle to be the same size. We
> require all operations in a block to be the same size, but the the
> different blocks in a bundle may (and in practice do) have different
> operation sizes.

I did mean ops in a block. I was confused by the
new terminology.

> Any particular opcode can exist in only one block. Thus exu block 2 is
> the home for all add ops (and other ops that encode and time like adds).
> You simply cannot encode them in any other block - there are no bit
> patterns for the purpose. Likewise a branch op will only be in flow block 1.

I don't get part of this. I get that exu and flow
are your names for the two halves of the instruction
decoder. What do you mean by block 1 and block 2?

I assume a block is the set of operations (what
anybody else would call instructions) that issue
in one machine cycle. All of the encodings for
the instructions in a block must have the same
size (i.e. number of bits in the encoding, not
the size of the data operands).

From this, I was thinking that you'd need to have
redundant encodings for some common operations for
each encoding size -- e.g. the 11-bit operation
encoding format would need an add instruction, but
so would the 12- and 13-bit formats. If that is
not so, some formats would be impoverished by not
having these common operations.

> > Is there any functional grouping of instructions
> > on the basis of size? For example, do you group
> > bit manipulation in one format, fused math ops in
> > another format, etc.?
>
> Yes, grouping is based in part on size. It is also based on when we need
> the decode, because block-1 decodes (on both sides) are available a
> cycle before block-2 and block-3 decodes. We used to have an exu
> block-4, which took yet another cycle, but we merged two of the blocks
> and now have only three.

Here you go again using block-1 and block-2 terminology.
What's that?

Ivan Godard

unread,

Jun 5, 2013, 10:41:16 PM6/5/13

to

It is real easy to get confused if you try to use terminology which
doesn't apply to these kinds of machines. Here's fairly standard
terminology for wide-issue machines (such as VLIW, EPIC, and Mill):

an *operation* is the individual add, load, branch etc performed by a
functional unit.

an *instruction* is the logical unit of issue, comprising all operations
that *must* issue together.

a *bundle* is the unit of physical decode, comprising all operations
that *must* enter decode together.

For a legacy machine, such as x86 or RISC, operation == instruction ==
bundle, and the bundle and instruction sizes are one operation.

For a VLIW, operation <= instruction == bundle.

For an EPIC (Itanium), operation <= instruction and operation <= bundle,
but bundle != instruction (except by accident).

The Mill extends this with new concepts:

A *half-bundle* is the unit of physical decode on one side.

A *block" is a new physical unit, between operation and half-bundle,
comprising a group of operations, all of the same encoded size, and
generally corresponding to a particular kind of execution unit, such as
an ALU or a load/store unit.

So for a Mill:

An instruction == 2* half-bundle

A half-bundle == N*block (N == 3 on the Mill, but other is possible)

A block == M*operation (M == 0..max, for a configured max for each block
for each Mill member).

Every cycle, a Mill decodes one instruction comprising two half-bundles
(either or both of which may be empty), each comprised of three blocks
(any or all of which may be empty) of operations. The sum of the maximum
number of operations possible in all six blocks is the MIMD width of the
machine, and corresponds to the number of execution functional units.

> From this, I was thinking that you'd need to have
> redundant encodings for some common operations for
> each encoding size -- e.g. the 11-bit operation
> encoding format would need an add instruction, but
> so would the 12- and 13-bit formats. If that is
> not so, some formats would be impoverished by not
> having these common operations.

Each kind of operation has a native block in the encoding. Thus all add
operations will always encode in block 2 of the three in the exu
half-bundle. The maximum number of operations that can exist in, and be
decoded from, that particular block corresponds to the number of adders
that the execution units have; it is not possible to encode more add
operations than you have adders, and so on. Hence there is no point to
permitting an add to be encoded in other blocks too; there would not be
an adder to give it to.

>>> Is there any functional grouping of instructions
>>> on the basis of size? For example, do you group
>>> bit manipulation in one format, fused math ops in
>>> another format, etc.?
>>
>> Yes, grouping is based in part on size. It is also based on when we need
>> the decode, because block-1 decodes (on both sides) are available a
>> cycle before block-2 and block-3 decodes. We used to have an exu
>> block-4, which took yet another cycle, but we merged two of the blocks
>> and now have only three.
>
> Here you go again using block-1 and block-2 terminology.
> What's that?
>

I hope the definition above will make it clear. I think you may be
thinking of a *many-issue* machine (like an x86) rather than a
*wide-issue* machine (like a VLIW or a Mill). These two notions are very
different.

On a many-issue machine, if your code has sixteen adds in a row and your
instruction buffer (to use typical OOO terminology) has sixteen
positions then you decode all sixteen into the buffer and then, each
cycle, issue as many adds from the instruction buffer as you have
adders, say two at a time over the next eight cycles.

In a wide-issue machine, you simply cannot encode more operations than
the machine has function units of that kind, and there are no
instruction buffers. If you have two adders (this example) and sixteen
adds to do then you encode eight different instructions, each containing
two add operations, and issue them one instruction at a time over eight
cycles.

Here both the wide-issue and the many-issue machines have taken eight
cycles to get sixteen adds through two adders. What is different is that
the wide-issue machine does not need instruction buffers; issue
immediately follows decode, and the translated operations feed direct to
the adders without needing buffering. Given that scheduling and issue on
a many-issue machine adds at least two cycles to the pipeline, and given
how much a power and area hog the instruction buffers are, this is a big
win.

Hope this clears things up

Ivan

Quadibloc

unread,

Jun 6, 2013, 5:42:38 AM6/6/13

to

I know what I understood: all the operations in a bundle could execute
simultaneously, and have no dependencies on each other. Thus, they had
no relevant sequential order. So they were sorted by opcode length, in
order to make decoding faster; a preamble indicated how many
operations of each length there would be.

John Savard

Ivan Godard

unread,

Jun 6, 2013, 11:28:58 AM6/6/13

to

To a first approximation, yes, but there's a second and third...

Some opcodes (control flow) do have intra-bundle sequential order
dependencies, but these all are in the same block (of the flow
half-bundle as it happens) so the dependencies are expressed as
intra-block order.

Ivan