A Linguistic Contribution to Condition Code-less Programming


Quadibloc

Apr 5, 2020, 3:18:38 AM
I am a few days too late with this post... as its title recalls that of a famous
April Fool's Day article from _Datamation_.

However, I'm not so sure this idea isn't a good and serious one.

A number of modern processors, from the DEC Alpha to RISC-V, do not have condition
codes. The PowerPC, on the other hand, tries to mitigate the problems of having
this pervasive persistent state by providing multiple condition-code fields, with
instructions indicating which one is to be used.

An idea just came to me of a way to achieve these goals:

- Avoid having condition codes existing as a persistent state, but

- Retain the full generality that the classical architectures with condition codes
provided, so that converting programs would normally not be difficult.

Here is the strange idea I came up with:

Have instructions that "look" like a conditional branch or conditional jump
instruction in every way.

But they _do_ something different.

There is a "delay slot" of sorts... what happens is this:

the following instruction must perform an arithmetic operation of a kind that
would normally set the condition codes on a conventional computer that had them;

that instruction is performed normally;

and then the branch or jump specified by the preceding instruction is taken, or
not, as its branch condition specifies with respect to the condition codes the
instruction following would have set if the architecture had them.

So you take a conventional architecture with conditional branch instructions,
and arithmetic instructions that set condition codes - and the *only change* you
make to the ISA is that the conditional branch instruction is to be put before
the instruction on the result of which it depends.

Of course, this is not really a change that makes _no_ difference.

Since condition codes as a persistent state have been eliminated, the one thing
you _can't_ do any more is put instructions that _don't_ change the condition
codes (i.e. memory load and store, unconditional jump) between the arithmetic
instruction that set a condition and the conditional branch.
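
To make the intended semantics concrete, here is a rough sketch in C rather
than hardware - the PCB opcode name, the flag set, and the helper functions
are invented purely for illustration, not taken from any real ISA:

#include <stdint.h>
#include <stdbool.h>

/* Condition codes derived from a single ALU result; nothing here
   persists past the fused branch/operate pair. */
typedef struct { bool n, z, c, v; } Flags;

static Flags flags_of_add(uint32_t a, uint32_t b, uint32_t r) {
    Flags f;
    f.n = (int32_t)r < 0;                 /* negative            */
    f.z = (r == 0);                       /* zero                */
    f.c = r < a;                          /* unsigned carry out  */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31);   /* signed overflow     */
    return f;
}

/* One fused pair: the "prepare conditional branch" (PCB) part supplies
   cond and target; the add executes normally; only then is the branch
   resolved against the flags the add would have set. */
static uint32_t exec_pcb_add(uint32_t pc_next, uint32_t target,
                             bool (*cond)(Flags),
                             uint32_t *rd, uint32_t rs) {
    uint32_t a = *rd, b = rs;
    uint32_t r = a + b;                   /* the operate step    */
    *rd = r;
    Flags f = flags_of_add(a, b, r);      /* transient, never architectural */
    return cond(f) ? target : pc_next;    /* branch decided here */
}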

John Savard

Michael Barry

Apr 5, 2020, 4:51:51 AM
On Sunday, April 5, 2020 at 12:18:38 AM UTC-7, Quadibloc wrote:
> Since condition codes as a persistent state have been eliminated, the one thing
> you _can't_ do any more is put instructions that _don't_ change the condition
> codes (i.e. memory load and store, unconditional jump) between the arithmetic
> instruction that set a condition and the conditional branch.
>
> John Savard

So, you're still holding state (condition and conditional branch address),
but it only has to persist for the following instruction? As long as the
programmer cooperates and no interrupts hit in the middle, it should work,
at least in theory. Whether or not it offers an advantage meritorious
of further study is a decision well above my pay scale ... and if this is
an April Fool's joke, you got me!

Mike B.

Anton Ertl

Apr 5, 2020, 5:00:11 AM
Quadibloc <jsa...@ecn.ab.ca> writes:
>However, I'm not so sure this idea isn't a good and serious one.
...
>Have instructions that "look" like a conditional branch or conditional jump
>instruction in every way.
>
>But they _do_ something different.
>
>There is a "delay slot" of sorts... what happens is this:
>
>the following instruction must perform an arithmetic operation of a kind that
>would normally set the condition codes on a conventional computer that had them;
>
>that instruction is performed normally;
>
>and then the branch or jump specified by the preceding instruction is taken, or
>not, as its branch condition specifies with respect to the condition codes the
>instruction following would have set if the architecture had them.
...
>Since condition codes as a persistent state have been eliminated, the one thing
>you _can't_ do any more is put instructions that _don't_ change the condition
>codes (i.e. memory load and store, unconditional jump) between the arithmetic
>instruction that set a condition and the conditional branch.

An alternative is to just let every instruction set the condition
codes, and have conventional order.

But given that ARM does ok with condition codes in both tiny and
high-performance implementations, condition codes seem to be a solved
problem. The fact that there is so much variation in this area
(unlike almost everywhere else) indicates that no solution has a
decisive advantage over the others.

The one thing where the approach of the original MIPS has the biggest
disadvantage is long addition/subtraction. Architectures with
condition codes have a carry bit for that, original MIPS and its
descendants replace this by performing more instructions. I dimly
remember that a recent MIPS revision incorporates a carry bit. Your
approach does not solve this problem.
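
For concreteness, this is roughly what the carry-less style costs per limb -
a C sketch of multi-word addition in the original-MIPS manner, with the carry
recovered by unsigned comparisons (the function and variable names are mine):

#include <stdint.h>
#include <stddef.h>

/* Multi-precision add without a carry flag: each limb's carry out is
   reconstructed with unsigned compares, costing extra instructions per
   limb compared with an add-with-carry chain. */
void add_carryless(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;      /* carry out of adding the carry-in */
        s += b[i];
        uint64_t c2 = s < b[i];       /* carry out of adding the limb     */
        dst[i] = s;
        carry = c1 | c2;
    }
}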

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Quadibloc

Apr 5, 2020, 12:06:00 PM
On Sunday, April 5, 2020 at 2:51:51 AM UTC-6, Michael Barry wrote:

> So, you're still holding state (condition and conditional branch address),
> but it only has to persist for the following instruction? As long as the
> programmer cooperates and no interrupts hit in the middle, it should work,
> at least in theory. Whether or not it offers an advantage meritorious
> of further study is a decision well above my pay scale ... and if this is
> an April Fool's joke, you got me!

I've been doing further thinking about this, and, no, it isn't a joke, even if
it's analogous to a previous April Fool's joke that I referenced.

That was the Datamation article about the COME FROM statement, which eliminated the
GO TO by making it run backwards. Here, by making the order of a jump instruction
and an operate instruction backwards, I eliminate condition codes - but for
real, instead of producing something that, as in the joke, adds no structure to
programs.

At first I thought that this scheme slows things down, since the jump takes
effect only _after_ the operation. Why couldn't I keep the conventional order of
instructions, and just have the state last for one instruction?

And then I realized that this is exactly why the jump needs to come first: otherwise
there's nothing to say "turn interrupts off" in time to prevent just the problem
you identified. By putting the jump first in the instruction stream, it can also
act, in effect, as an instruction prefix.

So the jump plus the operate are really *a single instruction*, thus preventing
the need to put these "non-persistent" condition code bits somewhere that is
persistent enough to be saved and restored by interrupt service routines.

John Savard

Quadibloc

Apr 5, 2020, 12:09:49 PM
On Sunday, April 5, 2020 at 3:00:11 AM UTC-6, Anton Ertl wrote:

> The one thing where the approach of the original MIPS has the biggest
> disadvantage is long addition/subtraction. Architectures with
> condition codes have a carry bit for that, original MIPS and its
> descendants replace this by performing more instructions. I dimly
> remember that a recent MIPS revision incorporates a carry bit. Your
> approach does not solve this problem.

My approach certainly *does* solve the problem, and perfectly:

ADD 1,2
JCC CARRY

becomes

PCBC CARRY
ADD 1,2

which does exactly the same thing -

add r2 to r1
jump conditional on carry to CARRY

becomes

prepare conditional branch to CARRY
add r2 to r1

because now without any persistent condition code bits, I can still use exactly
the same general set of conditional branch instructions as I would have used if
there were condition code bits, so I don't lose the ability to branch on carry,
and I don't need to add new instructions or add bits to operate instructions.

John Savard

Stephen Fuld

Apr 5, 2020, 12:43:53 PM
Interesting idea. You eliminate a "long" persistent state that has to
be saved across interrupts, but you still have the state that interrupts
are prevented until the completion of the next instruction. So you
have to prevent a denial-of-service attack if someone maliciously codes
a long sequence of these instructions. You can have some sort of
"inter instruction" syntax test enforced in the hardware, or a timer
that prevents interrupts from being locked out for too long, but these
add circuitry. Is this better than one or two additional
bits in the saved state? I don't know. And it doesn't address the
problem that PPC's multiple CCs were designed to solve, though I am not
sure how important that is.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

Apr 5, 2020, 1:27:46 PM
So basically you want an Add-Branch-Carry instruction,
instead of the existing Add, Compare, Branch three-instruction sequence,
or, better, the Add, CompareAndBranch two-instruction sequence,
both of which are generally useful and require no special design.

The problem with dropping the Carry flag is multi-precision
adds and subtracts, because it injects a branch sequence into a
loop where no one wants one. Making that branch sequence slightly
more efficient isn't really a help.

What might be helpful is a 3 input add, but that has implications
for minimum hardware design, read ports, renaming, operand buses, etc.

And you have totally ignored the other flag that is not
directly available by looking at the result: Overflow.


Quadibloc

Apr 5, 2020, 2:47:07 PM
On Sunday, April 5, 2020 at 11:27:46 AM UTC-6, EricP wrote:

> And you have totally ignored the other flag that is not
> directly available by looking at the result: Overflow.

I haven't ignored _any_ flags. The idea is that *every* conditional branch
instruction still works the way it did before, except now it's merged with the
operate instruction it would have followed - which now must immediately
follow it instead.

Of course, not having persistent condition codes *does* create problems with
using an overflow as a _trap_.

John Savard

Quadibloc

Apr 5, 2020, 2:48:56 PM
On Sunday, April 5, 2020 at 10:43:53 AM UTC-6, Stephen Fuld wrote:

> Interesting idea. You eliminate a "long" persistent state that has to
> be saved across interrupts, but you still have the state that interrupts
> are prevented until the completion of the next instruction. So you
> have to prevent a denial-of-service attack if someone maliciously codes
> a long sequence of these instructions.

Well, a conditional branch instruction can only merge with a following operate
instruction.

So that means interrupts are turned off only for one instruction - the jump and
the operate become a compound atomic instruction. An attempt to prefix a jump
with a jump produces an illegal instruction.
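
A sketch of that decode-time rule in C - the instruction classes and the
opcode field used to classify them are invented for illustration:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical instruction classes; the opcode layout is made up. */
enum iclass { OPERATE, BRANCH_PREFIX, OTHER };

/* Toy classifier: assume the top 4 bits of the encoding pick the class. */
static enum iclass classify(uint32_t insn) {
    switch (insn >> 28) {
    case 0x1: return BRANCH_PREFIX;
    case 0x2: return OPERATE;
    default:  return OTHER;
    }
}

/* A branch prefix must be immediately followed by a flag-setting operate
   instruction; anything else - a load, a store, an unconditional jump,
   or another branch prefix - is an illegal instruction.  The legal pair
   is fused and retires atomically, so no interrupt is taken between
   its two halves. */
static bool legal_pair(uint32_t prefix_insn, uint32_t next_insn) {
    return classify(prefix_insn) != BRANCH_PREFIX ||
           classify(next_insn) == OPERATE;
}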

John Savard

Quadibloc

Apr 5, 2020, 2:50:23 PM
On Sunday, April 5, 2020 at 10:43:53 AM UTC-6, Stephen Fuld wrote:
> You can have some sort of
> "inter instruction" syntax test enforced in the hardware

Ah, you did anticipate the solution I chose, although I didn't think of it in
those terms.

John Savard

MitchAlsup

Apr 5, 2020, 3:23:53 PM
What you have discovered is a way to add encoding "bits" to an instruction
(the conditional instruction in the shadow of the branch) without wasting
bits in the instruction (the one generating the CCs).

I have called this casting--the branch instruction casts its conditional
requirement over the arithmetic instruction that has an innate ability
to generate condition codes. I use casting for predication in My 66000
and again in the Virtual Vector Method. Casting places the encoding bits
in a different instruction than the one which will ultimately
perform what the cast bits require.

This kind of casting improves the entropy of the ISA encoding.

Przemysław Cieszyński

Apr 5, 2020, 3:39:41 PM
Change of name doesn't change the substance.
You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
Condition codes are the lesser evil.

Quadibloc

Apr 5, 2020, 4:14:18 PM
On Sunday, April 5, 2020 at 1:23:53 PM UTC-6, MitchAlsup wrote:

> What you have discovered, is a way to add encoding "bits" to an instruction
> (the conditional instruction in the shadow of the branch) without wasting
> bits in the instruction (the one generating the CCs).
>
> I have called this casting--the branch instruction casts its conditional
> requirement over the arithmetic instruction that has an innate ability
> to generate condition codes. I use casting for predication in My 66000
> and again in the Virtual Vector Method. Casting places the encoding bits
> in a different instruction than the instruction which will ultimately
> perform what the casts bits required.
>
> This kind of casting improves the entropy of the ISA encoding.

Ah, yes.

Basically, if I change an architecture that has condition code bits to one that
doesn't...

now I need, apparently, to indicate whether a particular operate instruction is
going to have a conditional branch associated with it.

But the old conditional branch instructions themselves, unlike operate
instructions, already implied the presence... of a conditional branch
instruction.

So just put the branch first, and the information saying that something special
needs to be done is presented in time - without having to re-distribute the
encodings, and without, apparently, needing to have a less efficient encoding.

So you're quite right, what I've suggested is an example of the technique you
were already using.

John Savard

Quadibloc

Apr 5, 2020, 4:21:16 PM
On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:

> Change of name doesn't change the substance.
> You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> Condition codes are the lesser evil.

Now, that is another point. This scheme, by having prefix instructions, is
forcing decoding to be sequential. If the instruction set is already CISC,
that's not a problem. If it was RISC, though, then the good properties of every
instruction being the same length, and so on, are lost.

There are, however, ways to fix that.

For example: think of a semi-CISC scheme, where each 32-bit block can consist of
either one 32-bit instruction, or two 16-bit instructions.

Let's suppose that the block starts with 0 if it consists of two 16-bit
instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
instruction.

Then if the second 16-bit instruction instead starts with 1, it's a relative
branch instruction atomically linked to the 16-bit operate instruction in the
first half. This time, there's no need to reverse the order. Decoding remains
RISC-like!
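
A sketch of that container decode in C, taking bit 31 as the "first" bit of
the 32-bit block and following the layout described above (the type and
field names are just for illustration):

#include <stdint.h>

typedef struct {
    enum { ONE_32, TWO_16, FUSED_OP_BRANCH } kind;
    uint32_t insn32;        /* valid for ONE_32                   */
    uint16_t op16, second;  /* valid for TWO_16 / FUSED_OP_BRANCH */
} Decoded;

static Decoded decode_block(uint32_t word) {
    Decoded d = {0};
    if (word & 0x80000000u) {              /* starts with 1: one 32-bit instruction */
        d.kind = ONE_32;
        d.insn32 = word;
    } else {
        d.op16   = (uint16_t)(word >> 16); /* first 16-bit instruction, starts with 0 */
        d.second = (uint16_t)word;
        if (d.second & 0x8000u)            /* second half starts with 1: a relative
                                              branch fused with the operate above */
            d.kind = FUSED_OP_BRANCH;
        else
            d.kind = TWO_16;               /* two ordinary 16-bit instructions */
    }
    return d;
}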

I didn't illustrate this kind of possibility initially, as I didn't want to make
my initial post too complicated and confusing.

John Savard

EricP

Apr 5, 2020, 4:59:41 PM
Quadibloc wrote:
> On Sunday, April 5, 2020 at 11:27:46 AM UTC-6, EricP wrote:
>
>> And you have totally ignored the other flag that is not
>> directly available by looking at the result: Overflow.
>
> I haven't ignored _any_ flags. The idea is that *every* conditional branch
> instruction still works the way it did before, except now it's merged with the
> operate instruction it would have followed - which now must immediately
> follow it instead.

It's not a branch that is "merged" into the following instruction.
It is a single instruction that has a specific behavior, AddBranchCarry.
How you choose the bit pattern for that instruction is philosophy.

You don't need to "merge branches into other instructions" because
a branch already works just fine for those other tests, except overflow.

When you look at the use cases for the carry and overflow flags,
they are not justified either.

> Of course, not having persistent condition codes *does* create problems with
> using an overflow as a _trap_.
>
> John Savard
>

The point is that no one needs this, either for carry or overflow.

When propagating a carry, one wants branch-less code.
Two instructions can do it: ADD3 adds 3 registers and writes the
truncated result to a single register, and AGC3 (Add Generate Carry)
adds 3 registers and writes just the carry bit to a register.
Similarly for subtraction: SUB3 and SGB3.
These 4 instructions replace the long-ish RISC sequences.
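
In C, the semantics I have in mind are roughly these (a sketch only; the
third input is assumed to be a 0-or-1 carry word):

#include <stdint.h>
#include <stddef.h>

/* ADD3: three-input add, truncated result only. */
static inline uint64_t add3(uint64_t a, uint64_t b, uint64_t cin) {
    return a + b + cin;
}

/* AGC3: same three inputs, but the result is just the carry out. */
static inline uint64_t agc3(uint64_t a, uint64_t b, uint64_t cin) {
    uint64_t s = a + b;
    uint64_t t = s + cin;
    return (s < a) | (t < s);
}

/* Branch-less multi-word addition built from the pair. */
static void add_wide(uint64_t *dst, const uint64_t *a,
                     const uint64_t *b, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t c = agc3(a[i], b[i], carry);  /* next carry, 0 or 1 */
        dst[i] = add3(a[i], b[i], carry);      /* truncated sum      */
        carry = c;
    }
}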

If testing for carry then compare and branch works just fine.

For overflow/wrap detection one wants to trap at the point of detection,
not branch, because that leaves the program counter pointing at the
problem instruction.

While Overflow requires a long-ish sequence to generate manually,
if various trap-on-overflow instructions are already
present then there is minimal case for additional support.
(Here I'm thinking of Add Generate Overflow or Subtract Generate
Overflow instructions, which write the overflow flag to a register.)

Ivan Godard

Apr 5, 2020, 5:27:52 PM
I thought everyone used CMOVE for extended precision? Why a branch? It's
going to predict badly.

MitchAlsup

Apr 5, 2020, 5:30:49 PM
On Sunday, April 5, 2020 at 3:21:16 PM UTC-5, Quadibloc wrote:
> On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:
>
> > Change of name doesn't change the substance.
> > You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> > Condition codes are the lesser evil.
>
> Now, that is another point. This scheme, by having prefix instructions, is
> forcing decoding to be sequential. If the instruction set is already CISC,
> that's not a problem. If it was RISC, though, then the good properties of every
> instruction being the same length, and so on, are lost.

I question whether those are actually "good" properties.

They mandate an additional 6% of instructions to exist--that is, the RISC
instruction philosophy, because of its <stubborn> fixed length, requires
106 instructions to do what a variable-length instruction set does in 100.
Most of this 6% are instructions used to 'paste' constants together. And
this data was from 32-bit only implementations--pasting 48-bit or 64-bit
constants together requires a lot more instructions, or a memory reference
to a table of constants.
>
> There are, however, ways to fix that.
>
> For example: think of a semi-CISC scheme, where each 32-bit block can consist of
> either one 32-bit instruction, or two 16-bit instructions.

I wish you would use the word 'container' where you wrote 'block'
>
> Let's suppose that the block starts with 0 if it consists of two 16-bit
> instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
> instruction.

This encoding wastes entropy.
>
> Then if the second 16-bit instruction instead starts with 1, it's a relative
> branch instruction atomically linked to the 16-bit operate instruction in the
> first half. This time, there's no need to reverse the order. Decoding remains
> RISC-like!

Parsing of instructions can be performed with marker bit technology (like
Athlon or Opteron) or it can be performed with a careful arrangement of
bits in the instruction encoding template.

Decoding of instructions has to do with what resources the instruction
utilizes and what function units are appropriate to perform the
prescribed data manipulations, not with how its length is determined.

MitchAlsup

Apr 5, 2020, 5:39:14 PM
On Sunday, April 5, 2020 at 3:21:16 PM UTC-5, Quadibloc wrote:
> On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:
>
> > Change of name doesn't change the substance.
> > You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> > Condition codes are the lesser evil.
>
> Now, that is another point. This scheme, by having prefix instructions, is
> forcing decoding to be sequential. If the instruction set is already CISC,
> that's not a problem. If it was RISC, though, then the good properties of every
> instruction being the same length, and so on, are lost.

I question whether those are 'good' properties!

The fixed length nature of RISC causes a 6% expansion of the number of
instructions executed due to having to paste constants together. And this
is from 32-bit code+32-bit address space stuff. It can only get worse
with 64-bit constants and address spaces. Having to (shudder) load
constants from a table (in memory) wastes a) instructions, b) data cache,
c) time, d) instruction cache. Pasting constants together only wastes
instruction cache.
>
> There are, however, ways to fix that.
>
> For example: think of a semi-CISC scheme, where each 32-bit block can consist of
> either one 32-bit instruction, or two 16-bit instructions.
>
> Let's suppose that the block starts with 0 if it consists of two 16-bit
> instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
> instruction.
>
> Then if the second 16-bit instruction instead starts with 1, it's a relative
> branch instruction atomically linked to the 16-bit operate instruction in the
> first half. This time, there's no need to reverse the order. Decoding remains
> RISC-like!

This wastes entropy and will eat your lunch in instruction encoding.

But there is no particular reason one cannot decode variable-length
instructions rapidly, too. AMD/Intel have figured it out; it is just
not "that hard".

What is hard is making up for the extra instructions one has to execute
if the instruction set does not allow variable length.

Quadibloc

Apr 5, 2020, 6:35:34 PM
On Sunday, April 5, 2020 at 3:30:49 PM UTC-6, MitchAlsup wrote:

> Decoding of instructions has to do with what resources the instruction
> utilizes and what function units are appropriate to perform the pre-
> scribed data manipulations; not how its length is determined.

Well, in a segment of code that contains no branches, if all the instructions
are the same length, then one can decode them all - and do further processing on
the instructions that is not precluded by dependencies - in parallel, instead of
sequentially.

That's why having all the instructions the same length is so attractive.

Even on architectures with variable-length instructions, generally speaking, long
immediate values are not provided for. (E.g., the IBM System/360, with 16-bit,
32-bit, and 48-bit instructions, has no floating-point instructions with 32-bit
immediates, let alone 64-bit immediates. It had 8-bit immediates, and successors
added 16-bit integer immediates.)

Hence, I came up with an idea, described here some time ago, where instructions
of uniform length had short pointers into unused storage in an instruction
"container", as you put it, to the immediates. (As they're pointed to, of
course, they're not "really" immediates, but they're close by in the instruction
stream, so they have the advantages of immediates instead of the drawbacks of
constants fetched from 'way over in data storage.)

One needs some way of saying how many instructions are in the container, though;
using 1 01 001 0001 coding of the first bit of each 32-bit instruction, while
efficient, would *strongly* tempt implementors to do it in a serial rather than
a parallel way. So I went off to using seven 36-bit instructions in a 256-bit
container to avoid that, along with other weirdness, and even I fear the result
is too complicated.

John Savard

Andy Valencia

Apr 5, 2020, 7:13:04 PM
MitchAlsup <Mitch...@aol.com> writes:
> Most of this 6% are instructions used to 'paste' constants together. And
> this data was from 32-bit only implementations

But can't the decoder basically treat the obvious LUI/ORI sequence
as one decoded instruction? Or is it the increased icache pressure
which causes the biggest problem?

Andy Valencia

Michael Barry

Apr 5, 2020, 9:33:35 PM
On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> added 16-bit integer immediates.)
> [...]
> Hence, I came up with an idea, described here some time ago, where
> instructions of uniform length had short pointers into unused storage in
> an instruction "container", as you put it, to the immediates. (As they're
> pointed to, of course, they're not "really" immediates, but they're close
> by in the instruction stream, so they have the advantages of immediates
> instead of the drawbacks of constants fetched from 'way over in data
> storage.)
> [...]

I believe that concept is equivalent (or nearly so) to a "literal pool",
which has been around for many decades.

I always thought of "COME FROM" as equivalent to a breakpoint, and I found
some corroborating opinions during a quick search.

Mike B.

Przemysław Cieszyński

Apr 5, 2020, 10:02:29 PM
So what about regular compressed (16-bit) operations?
A block/container has no practical advantage over a "single complex instruction".
It is better to organize it as "mandatory fusion", so you can apply the compressor
logic to independent parts of the "instruction". Anyway, you lose pipeline
simplicity.

MitchAlsup

Apr 5, 2020, 10:12:26 PM
On Sunday, April 5, 2020 at 6:13:04 PM UTC-5, Andy Valencia wrote:
Having actually done this, I think it is easier to simply provide long
constants in the ISA. The 6% instruction penalty can be ameliorated with
heroic instruction parsing techniques, but never (in my opinion) eliminated.
The I$ footprint is smallest when the constant occupies as many bits as it
needs but no more.

Certainly we can agree that pasting constants together is simply harder
than providing access to whatever constant size is required.

It is inarguable that providing large constants through Load instructions
is simply "bad".

What you allude to is amelioration, not a substitute for having properly
sized constants.

MitchAlsup

Apr 5, 2020, 10:14:47 PM
On Sunday, April 5, 2020 at 8:33:35 PM UTC-5, Michael Barry wrote:
> On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> > added 16-bit integer immediates.)
> > [...]
> > Hence, I came up with an idea, described here some time ago, where
> > instructions of uniform length had short pointers into unused storage in
> > an instruction "container", as you put it, to the immediates. (As they're
> > pointed to, of course, they're not "really" immediates, but they're close
> > by in the instruction stream, so they have the advantages of immediates
> > instead of the drawbacks of constants fetched from 'way over in data
> > storage.)
> > [...]
>
> I believe that concept is equivalent (or nearly so) to a "literal pool",
> which has been around for many decades.

Provided on machines that do not properly support constants.

And it also hurts D$ performance.

Bruce Hoult

Apr 5, 2020, 11:41:18 PM
On Monday, April 6, 2020 at 11:13:04 AM UTC+12, Andy Valencia wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> > Most of this 6% are instructions used to 'paste' constants together. And
> > this data was from 32-bit only implementations
>
> But can't the decoder basically treat the obvious LUI/ORI sequence
> as one decoded instruction?

Yes, in a mid to high end implementation. So while the number of instructions is higher (let's assume Mitch's 6%) the number of uOps isn't if merging happens.

Also I'm not sure whether Mitch meant static or dynamic instructions. Constant loads can usually be moved outside loops.


>Or is it the increased icache pressure which causes the biggest problem?

Again assuming the number of static instructions is 6% higher, the code size isn't 6% higher because the two fixed size instructions aren't twice as big as the single variable-length instruction. If I recall correctly, the basic instruction size in My 66000 is 32 bits, with any literals adding to that. If that's correct then a 32 bit instruction with appended 32 bit literal is 64 bits, exactly the same as two 32 bit MIPS or RISC-V instructions.

There is a gain for 64 bit literals though. Mitch's machine would need 32 + 64 bits (96), while the base RV64I instruction set needs up to 6x 32 bit instructions (192 bits) in the general case. This is reduced with the C extension as the final shift and OR can with suitable register allocation always be 16 bit instructions (so 160 bits). Specific constants can be smaller as one or more of the LUIs or ADDIs may be able to be encoded as 16 bit instructions or even omitted.

The B extension will reduce the general case to 5 instructions, using PACK instead of shift/or.

Aarch64 can do the general case with 4 instructions as it can insert a 16 bit literal into any of the 4 possible positions in a 64 bit register, so 128 bits of instructions total. That's four instructions instead of one, but only 33% more bits.

Quadibloc

Apr 6, 2020, 2:22:20 AM
On Sunday, April 5, 2020 at 7:33:35 PM UTC-6, Michael Barry wrote:

> I always thought of "COME FROM" as equivalent to a breakpoint, and I found
> some corroborating opinions during a quick search.

Interesting.

The way I thought of it was that it was isomorphic to the way programs were
written: COME FROM -> label, or line number; line number -> GO TO.

Which was, of course, the joke.

John Savard

Quadibloc

Apr 6, 2020, 2:28:41 AM
On Sunday, April 5, 2020 at 7:33:35 PM UTC-6, Michael Barry wrote:
> On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> > added 16-bit integer immediates.)
> > [...]
> > Hence, I came up with an idea, described here some time ago, where
> > instructions of uniform length had short pointers into unused storage in
> > an instruction "container", as you put it, to the immediates. (As they're
> > pointed to, of course, they're not "really" immediates, but they're close
> > by in the instruction stream, so they have the advantages of immediates
> > instead of the drawbacks of constants fetched from 'way over in data
> > storage.)
> > [...]
>
> I believe that concept is equivalent (or nearly so) to a "literal pool",
> which has been around for many decades.

Maybe it is _nearly_ equivalent to that. Put the literals on page zero of an
architecture like the PDP-8 (or HP 211x or Honeywell 316/516).

But what I'm doing is putting the constants *in the instruction stream*, so I'm
getting _close_ to what Mitch Alsup wants. Instead of having no overhead, I have
a three- or four-bit field in the instruction corresponding to the constant, but
that avoids having to have instructions whose lengths vary all over the place,
which would also involve overhead of a different kind.

Instead, the instructions can be decoded as if they're all the same length, even
if, in addition to immediates, the same mechanism is used to point to additional
parts of instructions that need to be longer.
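
Here is a sketch in C of what I mean; the container size, the 3-bit pointer
field, and the slot layout are all picked arbitrarily for illustration:

#include <stdint.h>

#define CONTAINER_WORDS 8   /* one instruction-fetch container, 8 x 32 bits */

/* A fixed-size instruction carries a small index that selects one of the
   constant slots packed at the tail of the same container, so the
   "immediate" stays in the instruction stream, close to its user. */
static uint32_t fetch_immediate(const uint32_t container[CONTAINER_WORDS],
                                uint32_t insn) {
    uint32_t idx = insn & 0x7u;                  /* the 3-bit pointer field    */
    return container[CONTAINER_WORDS - 1 - idx]; /* slots counted from the end */
}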

John Savard

Anton Ertl

Apr 6, 2020, 4:36:47 AM
Bruce Hoult <bruce...@gmail.com> writes:
>There is a gain for 64 bit literals though. Mitch's machine would need
>32 + 64 bits (96), while the base RV64I instruction set needs up to 6x
>32 bit instructions (192 bits) in the general case.

IIRC on Alpha general 64-bit literals are loaded with one 32-bit
instruction, from the constant table. So in terms of code size and
I-cache it is a win (but I think that the obsession of many posters
here with code size is misguided). Costs: 1) D-cache pressure (and there
the utilization of the cache line may be significantly lower than in
the I-cache). 2) It consumes a register as a pointer to the constant
table. 3) It consumes additional instructions when calling/returning
between compilation units.

Also, when I looked at the code generated for the big constants in my
programs, gcc often had managed to produce a short sequence of
instructions instead; somebody must have invested quite a bit of work
into that.

>This is reduced with the C extension as the final shift and OR can with
>suitable register allocation always be 16 bit instructions (so 160 bits).
>Specific constants can be smaller as one or more of the LUIs or ADDIs may
>be able to be encoded as 16 bit instructions or even omitted.
>
>The B extension will reduce the general case to 5 instructions, using
>PACK instead of shift/or.

An instruction that does

dest = lit16|(src<<16)

would allow it to be done in 128 bits. Only one instruction would be
needed (use zero as src for the first one). Maybe a second one with

dest = (~lit16)|(src<<16)

to make the sequences for -2^95<=n<0 shorter.

Alternatively, with 22-bit literals and

dest = lit22|(dest<<22)

it could be done in 96 bits. You would need a second instruction for
the initial literal:

dest = sext(lit22)
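
A C sketch of how the pair would assemble an arbitrary 64-bit constant in
three steps (the chunking helper is mine, and an arithmetic right shift is
assumed for the signed initial literal):

#include <stdint.h>

/* One "dest = sext(lit22)" followed by two "dest = lit22|(dest<<22)"
   steps: 3 x 22 bits covers a 64-bit value. */
static uint64_t build_const64(uint64_t value) {
    int64_t  top = (int64_t)value >> 44;        /* upper 20 bits, sign-extended */
    uint64_t mid = (value >> 22) & 0x3FFFFFu;   /* middle 22 bits               */
    uint64_t low = value & 0x3FFFFFu;           /* low 22 bits                  */

    uint64_t dest = (uint64_t)top;              /* dest = sext(lit22)           */
    dest = mid | (dest << 22);                  /* dest = lit22 | (dest << 22)  */
    dest = low | (dest << 22);                  /* dest = lit22 | (dest << 22)  */
    return dest;                                /* equals the original value    */
}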

So why do neither Aarch64 nor RISC-V have it (nor Alpha, which was
also designed from the start as a 64-bit architecture)? Maybe the
sequential nature of the sequences is the reason; but is it a real
problem?

1) In single-issue implementations it is not.

2) In OoO implementations, it should not be, either: With good branch
prediction (the usual case) the front end runs far ahead of the
execution of instructions on the critical data-flow path. These
constant sequences would depend on nothing and their instructions
would be executed soon after being decoded, and their results would be
available early enough to avoid delaying the instructions on the
critical path. Also instructions from these sequences can be combined
in the decoder.

3) Superscalar in-order implementations pose the biggest problem, but
even there it does not seem to be particularly bad: a) The compiler
can interleave instructions from constant sequences with other
instructions to provide ILP. b) The decoder can combine these
sequences into a single instruction. However, one probably has to
decide in advance which way such implementations should go because
dealing with it in the decoder probably becomes much harder if the
instructions are interleaved.

Anton Ertl

unread,
Apr 6, 2020, 4:43:23 AM4/6/20
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Sunday, April 5, 2020 at 3:00:11 AM UTC-6, Anton Ertl wrote:
>
>> The one thing where the approach of the original MIPS has the biggest
>> disadvantage is long addition/subtraction. Architectures with
>> condition codes have a carry bit for that, original MIPS and its
>> descendants replace this by performing more instructions. I dimly
>> remember that a recent MIPS revision incorporates a carry bit. Your
>> approach does not solve this problem.
>
>My approach certainly *does* solve the problem, and perfectly:
>
>ADD 1,2
>JCC CARRY
>
>becomes
>
>PCBC CARRY
>ADD 1,2
>
>which does exactly the same thing -

But it does not solve the problem.

A 256-bit addition on AMD64 looks as follows:

add rsi,r9
adc rdi,r10
adc rax,r11
adc rbx,r12

with each instruction propagating the carry bit to the next one. How
would you do that?

Terje Mathisen

Apr 6, 2020, 4:57:26 AM
Your pair of ADD3 and AGC3 instructions is just a way to split a single
ADC with two explicit and one implicit source into two instructions that
need three explicit inputs each, and which do exactly the same
underlying operation, so the only reasonable implementation would be to
detect such a pair and internally turn it into a three-source,
two-destination operation.

BTW, it also has the additional issue that the third input cannot be
general, i.e. it pretty much has to use only the bottom bit and/or
require it to be either 1 or 0: if you allow any three inputs then you
must also allow the AGC3 part to generate two-bit outputs.
>
> If testing for carry then compare and branch works just fine.
>
> For overflow/wrap detection one wants to trap at the point of detection,
> not branch, because that leaves the program counter pointing at the
> problem instruction.
>
> While Overflow requires long-ish sequence to generate manually,
> and if various trap-on-overflow instructions are already
> present then there is minimal case for additional support.
> (here I'm thinking of Add Generate Overflow or Subtract Generate
> Overflow instructions which write the overflow flag to a register.)
>
This is much more reasonable if you cannot just have the full orthogonal
set, which includes add_trap_on_overflow.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Apr 6, 2020, 5:23:00 AM
CMOVE is mostly quite slow, so for limited sizes, some of those carry
chains can be short enough to make at least the final (rare overflow)
carry better handled with a branch.

Anton Ertl

Apr 6, 2020, 5:49:18 AM
EricP <ThatWould...@thevillage.com> writes:
>For overflow/wrap detection one wants to trap at the point of detection,
>not branch, because that leaves the program counter pointing at the
>problem instruction.
>
>While Overflow requires a long-ish sequence to generate manually,
>if various trap-on-overflow instructions are already
>present then there is minimal case for additional support.
>(Here I'm thinking of Add Generate Overflow or Subtract Generate
>Overflow instructions, which write the overflow flag to a register.)

There are a few use cases for branch on overflow:

1) comparing with minint:

dec %edi
jno ...

(Decrementing minint is the only case that sets the overflow flag, so the
jno is taken exactly when %edi was not minint.) This extends to other
operand sizes, and variations can be used for maxint, and for comparing
with a range minint...minint+c, or maxint-c...maxint.

2) Forth's +LOOP is specified in a way that maps pretty well to integer
overflow and branch on overflow. E.g., in SwiftForth "5 +LOOP"
compiles to

add 5,0(%esp)
jno ...

while the C code in Gforth is as follows:

Cell olddiff = n1-nlimit;
n2=n1+n;
if (((olddiff^(olddiff+n)) &(olddiff^n)) >=0) {
/* perform the branch */
}
/* fall-through path */

with the resulting code being

mov %rax,%r9
sub 0x8(%r13),%r9
add %r14,%rax
lea (%r9,%r14,1),%rsi
xor %r9,%r14
xor %r9,%rsi
test %r14,%rsi
js fallthrough
... perform the branch

This is not completely comparable, because SwiftForth keeps a biased
value (I think olddiff+maxint) in 0(%esp), but even if we wanted to
use a similar trick in Gforth, it would only eliminate the first two
instructions.

Alex McDonald

Apr 6, 2020, 7:57:38 AM
I have never understood why ISAs insist on having literal forms of
instructions where the literal is in /this/ instruction. I know the
square root of fanny adams about the subject, but why not have the
literal with its own size opcode follow?

ldz-lit-following reg ; load literal zero extend into reg
32bits 1234567 ; opcode for 32bit literal

or

ldz-lit-following reg ; same op as above
8bits 10 ;



--
Alex

Anton Ertl

Apr 6, 2020, 8:46:10 AM
Alex McDonald <al...@rivadpm.com> writes:
>I have never understood why ISAs insist on having literal forms of
>instructions where the literal is in /this/ instruction. I know the
>square root of fanny adams about the subject, but why not have the
>literal with its own size opcode follow?

There are such instruction sets, e.g., VAX, IA-32, and b16. However,
if you do a fixed-width instruction set for easy decoding, you don't
want some data item in the middle of the instruction stream.

EricP

Apr 6, 2020, 10:39:13 AM
> .... perform the branch
>
> This is not completely comparable, because SwiftForth keeps a biased
> value (I think olddiff+maxint) in 0(%esp), but even if we wanted to
> use a similar trick in Gforth, it would only eliminate the first two
> instructions.
>
> - anton

Those are examples of doing overflow error detection using branches
because that is the only tool available to them. And it changes
the program counter away from the erroneous instruction, losing
traceability to the error.

An AddTrapOverflow instruction is the proper tool because it has no cost
and leaves the program counter pointing at the erroneous instruction.

Other than error detection I can't think of any other use for an
Overflow flag. But even if there is such an example, its frequency
of occurrence is so low that it does not warrant