
Another code compression idea


Anton Ertl

Dec 26, 2021, 1:33:30 PM
The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
by turning three-address instructions into two-address instructions,
reducing literal fields, and/or reducing the number of addressable
registers.

Here's another idea: How about compressing by using the destination
register of the previous instruction as one of the sources?

+ Smaller code.

+ You directly know that you can forward from the previous instruction
to this one. And extending from this, you can directly encode stuff
for an ILDP architecture [kim&smith02].

- The main disadvantage I see is that the meaning of the instructions
is not encoded just in itself. This is a problem when jumping to
such an instruction (just don't allow that, i.e., have an illegal
instruction exception if you jump to such an instruction). A bigger
problem is when an interrupt or exception returns to such an
instruction; a way to deal with that may be to allow this encoding
only for instructions that cannot cause exceptions, and to delay
interrupts until the next self-contained instruction.
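The opportunity can be illustrated with a toy measurement (a sketch only; the three-address toy IR, the function name, and the instruction sequence below are invented, not taken from real compiled code):

```python
# Toy measurement of the compression opportunity: how often does an
# instruction read the destination written by the instruction just
# before it? Instructions are (dest, src1, src2) register numbers.

def count_chainable(seq):
    """Count instructions whose src1 or src2 is the previous dest."""
    chainable = 0
    prev_dest = None
    for dest, src1, src2 in seq:
        if prev_dest is not None and prev_dest in (src1, src2):
            chainable += 1
        prev_dest = dest
    return chainable

# r3 = r1+r2; r4 = r3*r5; r6 = r4+r7; r8 = r1+r2 (last one breaks the chain)
code = [(3, 1, 2), (4, 3, 5), (6, 4, 7), (8, 1, 2)]
print(count_chainable(code))  # 2 of the 4 instructions could use the short form
```

Running this over real instruction traces rather than a made-up sequence would give the kind of percentage Mitch quotes below.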

@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems of wide
superscalars with communication across the chip and
the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators
(with instructions only accessing one accumulator
(read and/or write) and one register (read or
write)); for the accumulators, their death is
specified explicitly in the instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists of
mainly one ALU data path (but also a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}

@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Dec 26, 2021, 3:00:56 PM
On Sunday, December 26, 2021 at 12:33:30 PM UTC-6, Anton Ertl wrote:
> The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
> by turning three-address instructions into two-address instructions,
> reducing literal fields, and/or reducing the number of addressable
> registers.
<
With no disrespect to Anton:
<
I wish we stopped using this terminology. Is 3-address {2 operands and
1 result}, or 3 operands with an implied result (accumulator)?
<
A few years ago, I started to use the terminology x-operand (and implying
1-result).
<
But I digress.
>
> Here's another idea: How about compressing by using the destination
> register of the previous instruction as one of the sources?
<
Works about 70% of the time, but only when code has not been scheduled.
>
> + Smaller code.
<
27 bits instead of 32 bits? One still has a lot of pruning to do to get to 16 bits.
>
> + You directly know that you can forward from the previous instruction
> to this one. And extending from this, you can directly encode stuff
> for an ILDP architecture [kim&smith02].
<
Yes, you know, you also know that the pipeline will be stalled 30% of the
time (LDs) plus however often one has multicycle calculate instructions.
>
> - The main disadvantage I see is that the meaning of the instructions
> is not encoded just in itself. This is a problem when jumping to
> such an instruction (just don't allow that, i.e., have an illegal
> instruction exception if you jump to such an instruction).
<
How do you tell a-priori?
<
> A bigger
> problem is when an interrupt or exception returns to such an
> instruction; a way to deal with that may be to allow this encoding
> only for instructions that cannot cause exceptions, and to delay
> interrupts until the next self-contained instruction.
<
There are other problems, too.

Bill Findlay

Dec 26, 2021, 4:20:07 PM
On 26 Dec 2021, Anton Ertl wrote
(in article<2021Dec2...@mips.complang.tuwien.ac.at>):

> The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
> by turning three-address instructions into two-address instructions,
> reducing literal fields, and/or reducing the number of addressable
> registers.
>
> Here's another idea: How about compressing by using the destination
> register of the previous instruction as one of the sources?

The NCR Century did something similar; the *destination* of a 1-address
instruction was taken to be the destination of the most recently
executed 2-address instruction.

--
Bill Findlay


Anton Ertl

Dec 26, 2021, 6:09:41 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Sunday, December 26, 2021 at 12:33:30 PM UTC-6, Anton Ertl wrote:
>> Here's another idea: How about compressing by using the destination
>> register of the previous instruction as one of the sources?
><
>Works about 70% of the time, but only when code has not been scheduled.

Yes, you would not use compiler instruction scheduling if you want to
make use of this compression idea.

>> + Smaller code.
><
>27-bits instead of 32-bits ? One still has a lot of pruning to do to get to 16-bits.

Sure, like all the other ways other compressed instruction subsets
use: shorter literals, only a subset of the operations, etc.
Basically, we would do something like replace src1=dest (in existing
compressed instruction subsets) with src1=olddest.

>> + You directly know that you can forward from the previous instruction
>> to this one. And extending from this, you can directly encode stuff
>> for an ILDP architecture [kim&smith02].
><
>Yes, you know, you also know that the pipeline will be stalled 30% of the
>time (LDs) plus however often one has multicycle calculate instructions.

For compiler-scheduled code, this compression simply will not work
much of the time. But for OoO, this kind of encoding might be more
useful than the src1=dest encoding.

>> - The main disadvantage I see is that the meaning of the instructions
>> is not encoded just in itself. This is a problem when jumping to
>> such an instruction (just don't allow that, i.e., have an illegal
>> instruction exception if you jump to such an instruction).
><
>How do you tell a-priori?

The hardware does not need to tell a-priori. It has a "last
destination" register in the decoder, that is set to invalid on
control-flow, and when it encounters an instruction that refers to the
last destination, when that is invalid, it produces an illegal
instruction exception.

The assembler sees if there is a label before the instruction. If
there is, the instruction is encoded not to refer to the last
destination.
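The decoder rule just described can be sketched as a toy model (illustrative only; the dict-based instruction format and all names are invented for this post):

```python
# Sketch of the decoder rule above: a "last destination" latch that is
# invalidated at control-flow targets. A compressed instruction (src1
# implied as the previous dest) at a branch target raises an
# illegal-instruction exception.

class IllegalInstruction(Exception):
    pass

def decode(program, branch_targets):
    """Expand compressed instructions; program is a list of dicts."""
    decoded = []
    last_dest = None
    for pc, insn in enumerate(program):
        if pc in branch_targets:
            last_dest = None           # latch invalidated on control flow
        if insn.get("compressed"):
            if last_dest is None:
                raise IllegalInstruction(f"compressed insn at target pc={pc}")
            insn = dict(insn, src1=last_dest)
        decoded.append(insn)
        last_dest = insn["dest"]
    return decoded

prog = [
    {"op": "add", "dest": 3, "src1": 1, "src2": 2},
    {"op": "mul", "dest": 4, "src2": 5, "compressed": True},  # src1 = r3
]
out = decode(prog, branch_targets=set())
print(out[1]["src1"])  # 3
```

The same `decode` call with `branch_targets={1}` raises the exception, matching the rule that a jump target must not use the compressed form.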

>There are other problems, too.

Please elaborate.

Ivan Godard

Dec 26, 2021, 6:26:49 PM
Bill Wulf had a scheme where each instruction had three registers in and
one out, and two opcodes, and computed (X a Y) b C.

Ivan Godard

Dec 26, 2021, 6:30:03 PM
On 12/26/2021 2:55 PM, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Sunday, December 26, 2021 at 12:33:30 PM UTC-6, Anton Ertl wrote:
>>> Here's another idea: How about compressing by using the destination
>>> register of the previous instruction as one of the sources?
>> <
>> Works about 70% of the time, but only when code has not been scheduled.
>
> Yes, you would not use compiler instruction scheduling if you want to
> make use of this compression idea.
>
>>> + Smaller code.
>> <
>> 27-bits instead of 32-bits ? One still has a lot of pruning to do to get to 16-bits.
>
> Sure, like all the other ways other compressed instruction subsets
> use: shorter literals, only a subset of the operations, etc.
> Basically, we would do something like replace src1=dest (in existing
> compressed instruction subsets) with src1=olddest.


Sounds like a belt of length one?

Anton Ertl

Dec 27, 2021, 2:28:51 AM
I thought about whether to mention Mill's belt, but knew that you
would do it for me:-)

Yes, it can be viewed like that, but

* The previous instruction still writes its result in the register set
(at least architecturally).

* The belt works across control-flow joins (with appropriate
complications); my suggestion is to not allow that, to avoid the
complication; but of course you are free to explore whether the
benefits of working across control-flow joins are worth the costs.

Quadibloc

Dec 27, 2021, 5:07:22 AM
On Sunday, December 26, 2021 at 11:33:30 AM UTC-7, Anton Ertl wrote:
> A bigger
> problem is when an interrupt or exception returns to such an
> instruction; a way to deal with that may be to allow this encoding
> only for instructions that cannot cause exceptions, and to delay
> interrupts until the next self-contained instruction.

I'd be inclined to deal with it by regarding such instructions,
along with the instruction they follow, as a _single_ instruction for
such purposes.

John Savard

Ivan Godard

Dec 27, 2021, 6:32:34 AM
Every architecture has a problem with control flow joins. All live data
must be congruent (whatever that means in the architecture) after the
join. It may be more or less easy to extend that congruence back over
the control flow, for example by using a location that is global over
all control to hold the value. However, the available locations are
limited, and the number of live values is not.

The usual solution in positional addressing (i.e. registers) is to use
global allocation, with a graph coloring allocation algorithm and spill
(to effectively unbounded memory), with fill after the join to achieve
congruence. The amount of spill depends on the global union of live
sets, not on the size of the live set at the join. More registers
reduces the coloring pressure.

The Mill solution for temporal addressing is to rename the addresses of
the live values on each path into the join to some congruent naming. The
amount of rename depends on the size of the live set at the join, not on
the sizes of the sets elsewhere even if the sets have values in common.
That is why positional needs a bigger regfile than temporal needs belt.
Rename is cheaper than spill, but wastes entropy when a spill is not
needed by positional addressing.


Quadibloc

Dec 27, 2021, 12:05:01 PM
On Monday, December 27, 2021 at 4:32:34 AM UTC-7, Ivan Godard wrote:

> Every architecture has a problem with control flow joins. All live data
> must be congruent (whatever that means in the architecture) after the
> join.

I must say that it's even hard to understand the language you're
speaking here.

Later on, you mention register rename in conventional register-based
architectures, as opposed to the Mill.

So, it's taken some time, but I *think* I've figured out what you're
saying.

In order to allow a branch into a code sequence on the Mill,
you have to save and restore the Belt, something like how
registers are saved and restored on a subroutine call. As
this involves memory, or at least cache, it's expensive, but
it's only done on the occasional times when it's needed.

You were comparing the Mill to an *out-of-order* implementation
of a conventional architecture.

Said out-of-order implementation is always renaming
registers, during all code everywhere, so that a mapping
between the actual hardware registers that contain data
and the registers in the architectural model is maintained.

This lets one branch into an existing code sequence,
because the code following the branch target refers to the
architectural registers, so it will work with the different
rename pointers from either preceding sequence.

So that already means that I have an issue with the
*first* sentence I quoted. An *in-order* implementation
of a conventional architecture - that wretchedly slow
obsolete design people used in the old days when
they used core memory and discrete transistors, or
tiny little integrated circuits that weren't much better -
does *not* have a problem with control flow joins.

And, of course, that has been the paradigm for
computer architecture design all along. Out-of-order
implementations then go to great lengths to make
a modern processor pretend, in order to conform
to its ISA so designed, to be an in-order processor
from the good old days.

Which, of course, in itself is a very strong argument
for the Mill. Or for RISC designs which because they
use 32 registers, can be in-order with the compiler
doing the rename stuff... at least that _was_ true
20 years (or more?) ago, now the RISC designs have
to be out-of-order to give competitive performance
as well.

It is good that you've escaped the constraint of
still thinking you live in the good old days. But
forgetting they ever happened may make it hard
for other people to understand you.

John Savard

BGB

Dec 27, 2021, 2:16:22 PM
This is one area where having lots of registers, and then statically
assigning all the variables to registers within the function, seems to
come in handy...


To a lesser degree, predicated instructions and paths can also help
here, because this allows the "if()" branch to be compiled in a way
which does not necessarily require spill and reload. If the register
spill and reload operations are unconditional, then the end result will
be the same regardless of whether the "then" or "else" branch was taken.


Though, as can be noted, there is still the usual tradeoff for
predication that it is primarily a benefit when the "then" or "else"
blocks are smaller than the cost of a branch mispredict.

Some people claim that predication is useless if one has a branch
predictor, but this ignores the cost of spill/reload, or that in many of
the cases where predication is likely to come up in the first place, one
is likely dealing with unpredictable branches.


> The Mill solution for temporal addressing is to rename the addresses of
> the live values on each path into the join to some congruent naming. The
> amount of rename depends on the size of the live set at the join, not on
> the sizes of the sets elsewhere even if the sets have values in common.
> That is why positional needs a bigger regfile than temporal needs belt.
> Rename is cheaper than spill, but wastes entropy when a spill is not
> needed by positional addressing.
>

Possible...

Lots of potential cost tradeoffs here.

I guess it is also likely that in "actual hardware", the relative cost
of LUTRAMs may not be as low as FPGAs make it seem.

Thomas Koenig

Dec 27, 2021, 4:18:51 PM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Monday, December 27, 2021 at 4:32:34 AM UTC-7, Ivan Godard wrote:
>
>> Every architecture has a problem with control flow joins. All live data
>> must be congruent (whatever that means in the architecture) after the
>> join.
>
> I must say that it's even hard to understand the language you're
> speaking here.
>
> Later on, you mention register rename in conventional register-based
> architectures, as opposed to the Mill.
>
> So, it's taken some time, but I *think* I've figured out what you're
> saying.
>
> In order to allow a branch into a code sequence on the Mill,
> you have to save and restore the Belt, something like how
> registers are saved and restored on a subroutine call.

Think of an if/else statement in SSA (static single assignment), the
intermediate form that is state of the art in compilers. SSA has only
a single point where each variable is defined, which makes things like
constant propagation very easy.

So,

a = 32
// use a
a = 44
// use a

is translated to

a.1 = 32
// use a.1
a.2 = 44
// use a.2

The question then is: What do you do with

if (condition) then
a = 32
else
a = 44
endif ?

The answer is the so-called "phi nodes": functions which take
their values from the respective branches, so it could look something
like

if (condition) then
a.1 = 32
else
a.2 = 44
endif

a.3 = PHI(a.1,a.2)

where the PHI selects the value from whatever branch was taken.

You translate the PHI nodes by putting the right value into the
right register (or memory location). For the Mill, it has to
involve setting the variables into the right slots on the belt,
which I can imagine is quite complex.
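The usual lowering can be sketched in a few lines (the variable names follow the example above; the function itself is illustrative, not how any particular compiler is implemented):

```python
# A minimal sketch of how a compiler discharges the PHI above: insert a
# copy of the right SSA name into one shared location at the end of each
# predecessor branch, so both paths are congruent at the join.

def lower_phi(condition):
    env = {}
    if condition:
        env["a.1"] = 32
        env["a.3"] = env["a.1"]   # copy inserted at end of then-branch
    else:
        env["a.2"] = 44
        env["a.3"] = env["a.2"]   # copy inserted at end of else-branch
    return env["a.3"]             # after the join, only a.3 is live

print(lower_phi(True), lower_phi(False))  # 32 44
```

For a register machine the shared location is a register; for the Mill it is a belt position, which is where the complexity Thomas mentions comes in.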

Ivan Godard

Dec 27, 2021, 7:14:18 PM
On 12/27/2021 1:18 PM, Thomas Koenig wrote:

<good explanation of phi omitted>

> You translate the PHI nodes by putting the right value into the
> right register (or memory location). For the Mill, it has to
> involve setting the variables into the right slots on the belt,
> which I can imagine is quite complex.

Not very complex. The logical-to-physical mapping can be thought of as
in effect a hardware array of physical numbers indexed by belt number,
say 16x6 bits on a Silver. Replace the array at a join (or call, or
return, etc.)

That's a logical picture; it doesn't have to actually work that way -
rename can be done in several different ways, as Mitch has explained at
length.
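A rough software model of that mapping array (sizes, register numbers, and values are illustrative only, loosely following the 16-entry Silver figure above):

```python
# Belt addressing goes through a small logical-to-physical array; a join
# simply installs a new array, and the physical values do not move.

BELT_LEN = 16

def read_belt(phys_regs, l2p, belt_pos):
    """Belt position -> physical register -> value."""
    return phys_regs[l2p[belt_pos]]

phys = {7: "x", 12: "y", 3: "z"}
path_a = [7, 12, 3] + [0] * (BELT_LEN - 3)   # mapping on one path into the join
path_b = [12, 7, 3] + [0] * (BELT_LEN - 3)   # a different mapping, same values

# The join picks one congruent naming: installing path_a's array makes
# belt position 0 mean the same value regardless of which path ran.
congruent = path_a
print(read_belt(phys, congruent, 0))  # x
```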

Brett

Dec 28, 2021, 12:06:58 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
> by turning three-address instructions into two-address instructions,
> reducing literal fields, and/or reducing the number of addressable
> registers.
>
> Here's another idea: How about compressing by using the destination
> register of the previous instruction as one of the sources?
>
> + Smaller code.
>
> + You directly know that you can forward from the previous instruction
> to this one. And extending from this, you can directly encode stuff
> for an ILDP architecture [kim&smith02].

You just described a register plus accumulator design, the most well known
of which is the 8086, though you can argue the intent. Very successful but
not enough registers due to tech reasons.

The biggest downside is you need byte-sized instructions to get that good
code density, which hardware guys hate.

The next step is deciding if you want a belt in front of your accumulator
so you can push two loads to ACC and then do an operation, a very common
occurrence. You save on destination register specifiers for the loads and
on source register specifiers for the integer operation.

This is a mix of push and pull ACC ops, you can go pure push or pure pull
as well.

Then you can add preferred registers for ops like X86-64 has for more
compression. Say four address registers and four integer registers for hot
data, and extension opcodes to get at the other registers. You can get near
50% compression compared to current standards, with no performance loss.

Thomas Koenig

Dec 28, 2021, 3:05:58 AM
Brett <gg...@yahoo.com> schrieb:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
>> by turning three-address instructions into two-address instructions,
>> reducing literal fields, and/or reducing the number of addressable
>> registers.
>>
>> Here's another idea: How about compressing by using the destination
>> register of the previous instruction as one of the sources?
>>
>> + Smaller code.
>>
>> + You directly know that you can forward from the previous instruction
>> to this one. And extending from this, you can directly encode stuff
>> for an ILDP architecture [kim&smith02].
>
> You just described a register plus accumulator design, the most well known
> of which is the 8086, though you can argue the intent. Very successful but
> not enough registers due to tech reasons.

It does not have to be - in an accumulator design, the value of the
accumulator has to be retained after an operation. What Anton described
can have a temporary value only, which lives only in the forwarding path.

> The biggest downside is you need byte sized instructions to get that good
> code density, which hardware guys hate.

Even a reduction from 32 to 16 bits could make a lot of sense.

Anton Ertl

Dec 28, 2021, 6:06:34 AM
Brett <gg...@yahoo.com> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> The RISC-V C (compressed) extension (and AFAIK Thumb and MIPS-16) work
>> by turning three-address instructions into two-address instructions,
>> reducing literal fields, and/or reducing the number of addressable
>> registers.
>>
>> Here's another idea: How about compressing by using the destination
>> register of the previous instruction as one of the sources?
>>
>> + Smaller code.
>>
>> + You directly know that you can forward from the previous instruction
>> to this one. And extending from this, you can directly encode stuff
>> for an ILDP architecture [kim&smith02].
>
>You just described a register plus accumulator design, the most well known
>of which is the 8086

Not quite. Architecturally the previous instruction still writes to
any register (not just a specific one like on the 8086), and the
current instruction reads from that register. Microarchitecturally,
on a simple single-pipeline implementation you can see the forwarding
as using an accumulator (the latch on the forwarding path), but you
also get that when the current instruction names the register and it
happens to be the same as the destination register of the previous
instruction.

>The biggest downside is you need byte sized instructions to get that good
>code density, which hardware guys hate.

It depends on what else you encode and how many registers you have. I
was thinking about 16-bit instructions, as a replacement for RISC-V's
compressed instructions. The benefit over those would be that you
would have a free hand at selecting the other operand and the target;
the cost would be that you could use this compression method only if
the previous instruction writes to a source register of a current
instruction.

>The next step is deciding if you want a belt in front of your accumulator
>so you can push two loads to ACC and then do an operation, a very common
>occurrence.

How common? I don't think it's common in most code I have written.

Anton Ertl

Dec 28, 2021, 6:11:32 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Does not have to be - the value of the accumulator has to be
>retained after an operation. What Anton described can have a
>temporary value only, which only lives in store forwarding.

No, the value lives on in the destination register of the previous
instruction. Of course, if the current instruction overwrites this, a
sufficiently smart microarchitecture could skip the writeback of the
previous instruction, but that's not specific to my compression idea;
and there are pitfalls wrt. interrupts and exceptions to take into
account.

Quadibloc

Dec 28, 2021, 10:30:11 AM
On Monday, December 27, 2021 at 10:06:58 PM UTC-7, gg...@yahoo.com wrote:

> You just described a register plus accumulator design, the most well known
> of which is the 8086, though you can argue the intent. Very successful but
> not enough registers due to tech reasons.

Hmm. I had never thought of the 8086 as a "register plus accumulator"
architecture; but when I look up its architecture, I see it doesn't just have
four 16-bit registers or eight 8-bit registers; the 16-bit registers aren't
just A, B, C, and D, but Accumulator, Base, Count, and Data.

As if it was designed to have automatically looping instructions to
process strings or something.

John Savard

Anton Ertl

Dec 28, 2021, 12:58:19 PM
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Monday, December 27, 2021 at 10:06:58 PM UTC-7, gg...@yahoo.com wrote:
>
>> You just described a register plus accumulator design, the most well known
>> of which is the 8086, though you can argue the intent. Very successful but
>> not enough registers due to tech reasons.
>
>Hmm. I had never thought of the 8086 as a "register plus accumulator"
>architecture;

For the programming side it mostly isn't. But on the encoding side
there are shorter encodings for certain instructions when working on
AX (or AL), somewhat like the 16-bit encodings in RISC-V C, which are
just additional encodings for the more general stuff. The 8086
architecture also has a number of instructions with implicit
registers, and AX or AL tend to be used by these instructions (e.g.,
LODSW loads into AX).

>but when I look up its architecture, I see it doesn't just have
>four 16-bit registers or eight 8-bit registers;

Eight 16-bit registers, and a number of instructions treat them alike.
However, you can use only some of them for addressing (fixed in
IA-32), so they are not general-purpose registers.

>As if it was designed to have automatically looping instructions to
>process strings or something.

The implicit-register instructions certainly were designed for
specific patterns; if your usage fits the pattern, it's cool, if not,
it's nasty.

BGB

Dec 28, 2021, 1:59:10 PM
Not to forget that SI and DI were "Source Index" and "Destination Index".

This just leaves SP and BP as Stack Pointer and Base Pointer...

It seems almost like things like "REP MOVSB" and similar were considered
to be fairly significant design features.


Though, I guess it is possible that a more modern CPU with such an
instruction could have it move 8 or 16 bytes at a time.

Say (in hardware):
    do {
        if (RCX >= 8)
            { Q = [RSI]; [RDI] = Q; RSI += 8; RDI += 8; RCX -= 8; }
        else
            { B = [RSI]; [RDI] = B; RSI++; RDI++; RCX--; }
    } while (RCX != 0);
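The control logic above is easy to check in software (a sketch only; this verifies the loop logic, not any real hardware behavior, and the function name is invented):

```python
# Software check of the widened-move loop: move 8 bytes at a time while
# at least 8 remain, then finish byte by byte.

def rep_movsb_wide(src, dst, si, di, cx):
    """Copy cx bytes from src[si:] to dst[di:], 8 at a time when possible."""
    while cx != 0:
        if cx >= 8:
            dst[di:di+8] = src[si:si+8]
            si += 8; di += 8; cx -= 8
        else:
            dst[di] = src[si]
            si += 1; di += 1; cx -= 1
    return dst

src = bytearray(range(20))
dst = bytearray(20)
rep_movsb_wide(src, dst, 0, 0, 20)   # two 8-byte moves, then 4 single bytes
print(dst == src)  # True
```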

> John Savard

Terje Mathisen

Dec 28, 2021, 2:32:37 PM
Therefore, all asm was written from the inside out, starting with the
innermost loop which got to use AX/BX/CX/DX/SI/DI as Intel intended,
then everything else had to make do with global or stack variables
unless that inner loop left anything unused. Later on, when we got
SP-relative addressing, (E)BP could be used as a 7th regular register.

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Brett

Dec 28, 2021, 8:27:42 PM
Quadibloc <jsa...@ecn.ab.ca> wrote:
> On Monday, December 27, 2021 at 10:06:58 PM UTC-7, gg...@yahoo.com wrote:
>
>> You just described a register plus accumulator design, the most well known
>> of which is the 8086, though you can argue the intent. Very successful but
>> not enough registers due to tech reasons.
>
> Hmm. I had never thought of the 8086 as a "register plus accumulator"
> architecture; but when I look up its architecture, I see it doesn't just have
> four 16-bit registers or eight 8-bit registers; the 16-bit registers aren't
> just A, B, C, and D, but Accumulator, Base, Count, and Data.

Yes, then Intel got sucked into the RISC philosophy.

It would be interesting to see how much better the design would have become
if Intel went down the implied register road instead, for better code
density.

The difference between four different loads for four registers versus more
bits for a load register specifier is not what I am talking about, as many
ops will use ACC and its shorter opcodes.

I would support 64 registers, which none of the RISCs do in the base ISA.
Best code density plus most registers plus load/store pair is the holy
trinity needed to crack a market. And you can support a fixed 32-bit
instruction mode as well, as moronic as that is.

For constant extenders you could set the high bit of each extension byte;
this wastes some entropy but makes decoding instruction boundaries near
trivial, which makes the hardware guys happier. The problem is you only get
128 base opcodes, but you don't have to have this limit on the second or
third byte if your encoding makes it easy to decide base opcode length.

Timothy McCaffrey

Jan 3, 2022, 7:13:27 PM
The 8086 was designed as an 8080 replacement and as a Z-80 competitor.
(Hence the string instructions, and SI/DI).
It was also a stop-gap until the 432 was ready (stop laughing!)

I think most of the accumulator orientation of the 8086 was because it was
designed to run 8080 code fast. (Oh look, we have lots of instructions that
do stuff with A! We better make that efficient in the 8086!)
I think support for modern programming
languages (like Pascal) was kind of an afterthought.
Interviews at the time said they used the PDP-11 as inspiration (everybody
said that about their arch.), I think that was mostly marketing. I don't think
they really understood why the PDP-11 was designed like it was.

- Tim

Timothy McCaffrey

Jan 3, 2022, 7:21:44 PM
You could just implement an expression stack. Keep the stack in your registers, have another register
that points to "top of stack" in your register file. Normal stack expressions rarely exceed 8 levels.
Let's see, then you could do:
Z = A*Y + B*X
load R1 <- A
load R2 <- B
load R3 <- X
load R4 <- Y
set tos to R30
push R1 ;R30 = A
MUL R4 ;R30 = A*Y
push R2 ;R29 = B
MUL R3 ;R29 = B*X
ADD (tos, tos+1) ;R30 = A*Y+B*X
store R30, Z

This is an off-the-cuff answer; it seems like there are more efficient encodings than what I just came up with.
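The register-file expression stack can be modeled in software to check the arithmetic (a sketch; the register numbering loosely follows the example above, and the model pushes both operands before each MUL rather than taking a register operand):

```python
# A tos index walks down a register array; push writes below tos, and a
# binary op combines the top two entries, leaving the result on top.

def eval_z(A, B, X, Y):
    regs = [0] * 32
    tos = 31                      # empty stack; first push lands in R30

    def push(v):
        nonlocal tos
        tos -= 1
        regs[tos] = v

    def binop(f):                 # pop two, push result
        nonlocal tos
        regs[tos + 1] = f(regs[tos + 1], regs[tos])
        tos += 1

    push(A); push(Y); binop(lambda a, b: a * b)   # A*Y
    push(B); push(X); binop(lambda a, b: a * b)   # B*X
    binop(lambda a, b: a + b)                     # A*Y + B*X
    return regs[tos]

print(eval_z(2, 3, 4, 5))  # 2*5 + 3*4 = 22
```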

- Tim

MitchAlsup

Jan 3, 2022, 7:27:26 PM
A CMU report indicates the 432 could have been 2× faster if they had added
a single wire between the two chips. But I digress.
>
> I think most of the accumulator orientation of the 8086 was because it was
> designed to run 8080 codes fast. (oh look, we have lots of instructions that
> do stuff with A! We better make that efficient in the 8086!)
> I think support for modern programming
> languages (like Pascal) was kind of an after thought.
<
> Interviews at the time said they used the PDP-11 as inspiration (everybody
> said that about their arch.), I think that was mostly marketing. I don't think
> they really understood why the PDP-11 was designed like it was.
<
PDP-11 was the inspiration of 68000 and 32032
>
> - Tim

Brett

Jan 3, 2022, 9:04:05 PM
With the stack in registers you are 90% of the way to a belt; all you have
to do is disallow pops, which you will, because of the nightmares they
cause for scheduling.

Brett

Jan 11, 2022, 6:36:08 PM
The start of every formula calculation has two loads.

A = (B + C) * D + whatever etc.

You load B and C first.

Makes sense to have at least a two-position belt for your accumulator.

Of course one alternative is x86 add from memory…
I have promoted this in the past for better code density, but it is
justifiably hated by the hardware guys. ;)

MitchAlsup

Jan 11, 2022, 6:59:23 PM
What if one or the other was already in a register?
>
> Makes sense to have at least a two position belt for your accumulator.
>
> Of course one alternative is x86 add from memory…
> I have promoted this in the past for better code density, but it is
> justifiably hated by the hardware guys. ;)
<
HW guys (like me) do not have a negative wrt LD-ops, since the pipeline
can be easily arranged such that inbound LDs which hit the cache appear
to have register latency. IBM 360 circa 1964, Gould 32/87, Interdata 32.
<
Where we draw the line is Op-STs (and of course, LD-op-STs.) These
cannot be accommodated with straight pipeline design. However, once
you get to reservation station machines, it no longer matters.

MitchAlsup

Jan 11, 2022, 7:11:58 PM
I think it is also fair to take issue with LD-ops--while one can configure
a pipeline to efficiently process LD-ops--the real question is should one
configure the ISA to support LD-ops (and/or ST-ops or even LD-op-STs.)
<
On this I take the position of no--but it is a fine balance between code
density and speed of execution: the former favoring -ops, the latter
favoring individual instructions. Perhaps other pressures in the design
environment tip the balance differently than my current position.
<
Also note: it seems that My 66000 ISA is achieving code densities similar
to x86-64 in what smells (at first glance) like a pure RISC design (with
variable length instructions, and a few other tricks.)

Anton Ertl

Jan 13, 2022, 12:51:29 PM
Brett <gg...@yahoo.com> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Brett <gg...@yahoo.com> writes:
>>> The next step is deciding if you want a belt in front of your accumulator
>>> so you can push two loads to ACC and then do an operation, a very common
>>> occurrence.
>>
>> How common? I don't think it's common in most code I have written.
>
>The start of every formula calculation has two loads.
>
>A = (B + C) * D + whatever etc.
>
>You load B and C first.

Reality check:

[~/tmp:127655] cat brett.c
long brett(long B, long C, long D, long whatever)
{
long A;
A = (B + C) * D + whatever;
return A;
}
[~/tmp:127656] gcc -O -c brett.c
[~/tmp:127657] objdump -d brett.o

brett.o: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <brett>:
0: 48 01 f7 add %rsi,%rdi
3: 48 0f af fa imul %rdx,%rdi
7: 48 8d 04 0f lea (%rdi,%rcx,1),%rax
b: c3 retq

No loads to be seen here.

WRT my code compression idea, the second instruction takes the output
of the first as input, and the third takes the output of the second as
input. You also see the AMD64 output=input compression at work here
in the first and second instruction.