
A Linguistic Contribution to Condition Code-less Programming


Quadibloc

Apr 5, 2020, 3:18:38 AM
I am a few days too late with this post... as its title recalls that of a famous
April Fool's Day article from _Datamation_.

However, I'm not so sure this idea isn't a good and serious one.

A number of modern processors, from the DEC Alpha to RISC-V, do not have condition
codes. On the other hand, the PowerPC tries to mitigate the problems of having
this pervasive persistent state by having multiple condition codes, with
instructions indicating which one is to be used.

An idea just came to me of a way to achieve these goals:

- Avoid having condition codes existing as a persistent state, but

- Retain the full generality that the classical architectures with condition codes
provided, so that converting programs would normally not be difficult.

Here is the strange idea I came up with:

Have instructions that "look" like a conditional branch or conditional jump
instruction in every way.

But they _do_ something different.

There is a "delay slot" of sorts... what happens is this:

the following instruction must perform an arithmetic operation of a kind that
would normally set the condition codes on a conventional computer that had them;

that instruction is performed normally;

and then the branch or jump specified by the preceding instruction is taken, or
not, as its branch condition specifies with respect to the condition codes the
instruction following would have set if the architecture had them.

So you take a conventional architecture with conditional branch instructions,
and arithmetic instructions that set condition codes - and the *only change* you
make to the ISA is that the conditional branch instruction is to be put before
the instruction on the result of which it depends.

Of course, this is not really a change that makes _no_ difference.

Since condition codes as a persistent state have been eliminated, the one thing
you _can't_ do any more is put instructions that _don't_ change the condition
codes (i.e. memory load and store, unconditional jump) between the arithmetic
instruction that set a condition and the conditional branch.
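
To pin the semantics down, here is a minimal C interpreter sketch of the idea.
The opcode names and the encoding are invented purely for this example; it is a
model of the behaviour, not a proposal for an actual instruction format.

#include <stdint.h>
#include <stdio.h>

enum op { OP_PBC_CARRY, OP_ADD, OP_HALT };   /* "prepare conditional branch on carry" */
struct insn { enum op op; int rd, rs, target; };

static void run(struct insn *prog, uint32_t *reg)
{
    int pc = 0, pending = 0, target = 0;
    for (;;) {
        struct insn i = prog[pc++];
        switch (i.op) {
        case OP_PBC_CARRY:          /* looks like a conditional branch...        */
            pending = 1;            /* ...but only notes the condition + target; */
            target = i.target;      /* no persistent flags anywhere              */
            continue;               /* the next instruction must be arithmetic   */
        case OP_ADD: {
            uint64_t wide = (uint64_t)reg[i.rd] + reg[i.rs];
            reg[i.rd] = (uint32_t)wide;
            if (pending) {          /* now take (or not) the branch, using the   */
                pending = 0;        /* flags this ADD would have set             */
                if (wide >> 32)
                    pc = target;
            }
            break;
        }
        case OP_HALT:
            return;
        }
    }
}

int main(void)
{
    uint32_t reg[2] = { 0xFFFFFFFFu, 1 };
    struct insn prog[] = {
        { OP_PBC_CARRY, 0, 0, 3 },  /* branch to insn 3 if the next ADD carries */
        { OP_ADD,       0, 1, 0 },  /* r0 += r1 (wraps, so the "carry" is set)  */
        { OP_HALT,      0, 0, 0 },
        { OP_HALT,      0, 0, 0 },  /* carry-taken path                         */
    };
    run(prog, reg);
    printf("r0 = %u\n", (unsigned)reg[0]);
    return 0;
}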

John Savard

Michael Barry

Apr 5, 2020, 4:51:51 AM
On Sunday, April 5, 2020 at 12:18:38 AM UTC-7, Quadibloc wrote:
> Since condition codes as a persistent state have been eliminated, the one thing
> you _can't_ do any more is put instructions that _don't_ change the condition
> codes (i.e. memory load and store, unconditional jump) between the arithmetic
> instruction that set a condition and the conditional branch.
>
> John Savard

So, you're still holding state (condition and conditional branch address),
but it only has to persist for the following instruction? As long as the
programmer cooperates and no interrupts hit in the middle, it should work,
at least in theory. Whether or not it offers an advantage meritorious
of further study is a decision well above my pay scale ... and if this is
an April Fool's joke, you got me!

Mike B.

Anton Ertl

Apr 5, 2020, 5:00:11 AM
Quadibloc <jsa...@ecn.ab.ca> writes:
>However, I'm not so sure this idea isn't a good and serious one.
...
>Have instructions that "look" like a conditional branch or conditional jump
>instruction in every way.
>
>But they _do_ something different.
>
>There is a "delay slot" of sorts... what happens is this:
>
>the following instruction must perform an arithmetic operation of a kind that
>would normally set the condition codes on a conventional computer that had them;
>
>that instruction is performed normally;
>
>and then the branch or jump specified by the preceding instruction is taken, or
>not, as its branch condition specifies with respect to the condition codes the
>instruction following would have set if the architecture had them.
...
>Since condition codes as a persistent state have been eliminated, the one thing
>you _can't_ do any more is put instructions that _don't_ change the condition
>codes (i.e. memory load and store, unconditional jump) between the arithmetic
>instruction that set a condition and the conditional branch.

An alternative is to just let every instruction set the condition
codes, and have conventional order.

But given that ARM does ok with condition codes in both tiny and
high-performance implementations, condition codes seem to be a solved
problem. The fact that there is so much variation in this area
(unlike almost everywhere else) indicates that no solution has a
decisive advantage over the others.

The one thing where the approach of the original MIPS has the biggest
disadvantage is long addition/subtraction. Architectures with
condition codes have a carry bit for that, original MIPS and its
descendants replace this by performing more instructions. I dimly
remember that a recent MIPS revision incorporates a carry bit. Your
approach does not solve this problem.
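
For concreteness, a C sketch of what the extra instructions look like: on a
carry-less ISA each limb of a multi-word add costs roughly five simple
operations instead of one add-with-carry (illustrative C, not actual MIPS code).

#include <stdint.h>

/* Multi-word add on an ISA with no carry flag (original-MIPS style),
   limb 0 being least significant.  Each limb costs an add, a compare,
   another add, another compare, and an OR, versus one ADC elsewhere. */
void add_multiword(uint32_t *dst, const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t t  = a[i] + b[i];      /* add the two limbs              */
        uint32_t c1 = t < a[i];         /* unsigned wrap => carry out     */
        uint32_t s  = t + carry;        /* add the incoming carry         */
        uint32_t c2 = s < t;            /* that add can also carry        */
        dst[i] = s;
        carry  = c1 | c2;               /* at most one of c1, c2 can be 1 */
    }
}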

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Quadibloc

Apr 5, 2020, 12:06:00 PM
On Sunday, April 5, 2020 at 2:51:51 AM UTC-6, Michael Barry wrote:

> So, you're still holding state (condition and conditional branch address),
> but it only has to persist for the following instruction? As long as the
> programmer cooperates and no interrupts hit in the middle, it should work,
> at least in theory. Whether or not it offers an advantage meritorious
> of further study is a decision well above my pay scale ... and if this is
> an April Fool's joke, you got me!

I had been doing further thinking about this, and, no, it isn't a joke. Even if
it's analogous to a previous April Fool's joke that I referenced.

This was the Datamation article about the COME FROM statement. It eliminated the
GO TO by making it go backwards. Here, by making the order of a jump instruction
and an operate instruction backwards, I eliminate condition codes - but for
real, instead of having something that adds no structure to programs as in the
joke.

Since the jump happens _after_ the operation, I thought, this scheme slows
things down. Why couldn't I keep the conventional order of instructions, and
just have the state last for one instruction?

And I realized that this was why the jump needs to come first: otherwise,
there's nothing to say "turn interrupts off" to prevent just the problem you
identified. By putting the jump first in the instruction stream, it can also act
as basically an instruction prefix.

So the jump plus the operate are really *a single instruction*, thus preventing
the need to put these "non-persistent" condition code bits somewhere that is
persistent enough to be saved and restored by interrupt service routines.

John Savard

Quadibloc

Apr 5, 2020, 12:09:49 PM
On Sunday, April 5, 2020 at 3:00:11 AM UTC-6, Anton Ertl wrote:

> The one thing where the approach of the original MIPS has the biggest
> disadvantage is long addition/subtraction. Architectures with
> condition codes have a carry bit for that, original MIPS and its
> descendants replace this by performing more instructions. I dimly
> remember that a recent MIPS revision incorporates a carry bit. Your
> approach does not solve this problem.

My approach certainly *does* solve the problem, and perfectly:

ADD 1,2
JCC CARRY

becomes

PCBC CARRY
ADD 1,2

which does exactly the same thing -

add r2 to r1
jump conditional on carry to CARRY

becomes

prepare conditional branch to CARRY
add r2 to r1

because now without any persistent condition code bits, I can still use exactly
the same general set of conditional branch instructions as I would have used if
there were condition code bits, so I don't lose the ability to branch on carry,
and I don't need to add new instructions or add bits to operate instructions.

John Savard

Stephen Fuld

Apr 5, 2020, 12:43:53 PM
Interesting idea. You eliminate a "long" persistent state, that has to
be saved across interrupts, but you still have the state that interrupts
are prevented for until the completion of the next instruction. So you
have to prevent a denial of service attack if someone maliciously codes
a long sequence of these instructions. You can have some sort of
"inter instruction" syntax test enforced in the hardware, or a timer
that prevents interrupts from being locked out for too long, but these
add additional circuitry. Is this better than one or two additional
bits in the saved state? I don't know. And it doesn't address the
problem that PPC's multiple CCs were designed to solve, though I am not
sure how important that is.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

Apr 5, 2020, 1:27:46 PM
So basically you want an Add-Branch-Carry instruction,
instead of the existing Add, Compare, Branch 3-instruction sequence,
or even better the Add, CompareAndBranch 2-instruction sequence,
both of which are generally useful and require no special design.

The problem with dropping the Carry flag is for multi-precision
adds and subtracts, because it injects a branch sequence into a
loop that no one wants. Making that branch sequence slightly
more efficient isn't really a help.

What might be helpful is a 3 input add, but that has implications
for minimum hardware design, read ports, renaming, operand buses, etc.

And you have totally ignored the other flag that is not
directly available by looking at the result: Overflow.


Quadibloc

Apr 5, 2020, 2:47:07 PM
On Sunday, April 5, 2020 at 11:27:46 AM UTC-6, EricP wrote:

> And you have totally ignored the other flag that is not
> directly available by looking at the result: Overflow.

I haven't ignored _any_ flags. The idea is that *every* conditional branch
instruction still works the way it did before, except now it's merged with the
operate instruction it would have followed - and can only immediately follow
that instruction.

Of course, not having persistent condition codes *does* create problems with
using an overflow as a _trap_.

John Savard

Quadibloc

Apr 5, 2020, 2:48:56 PM
On Sunday, April 5, 2020 at 10:43:53 AM UTC-6, Stephen Fuld wrote:

> Interesting idea. You eliminate a "long" persistent state, that has to
> be saved across interrupts, but you still have the state that interrupts
> are prevented for until the completion of the next instruction. So you
> have to prevent a denial of service attack if someone maliciously codes
> a long sequence of these instructions.

Well, a conditional branch instruction can only merge with a following operate
instruction.

So that means interrupts are turned off only for one instruction - the jump and
the operate become a compound atomic instruction. An attempt to prefix a jump
with a jump produces an illegal instruction.

John Savard

Quadibloc

Apr 5, 2020, 2:50:23 PM
On Sunday, April 5, 2020 at 10:43:53 AM UTC-6, Stephen Fuld wrote:
> You can have some sort of
> "inter instruction" syntax test enforced in the hardware

Ah, you did anticipate the solution I chose, although I didn't think of it in
those terms.

John Savard

MitchAlsup

Apr 5, 2020, 3:23:53 PM
What you have discovered is a way to add encoding "bits" to an instruction
(the conditional instruction in the shadow of the branch) without wasting
bits in the instruction (the one generating the CCs).

I have called this casting--the branch instruction casts its conditional
requirement over the arithmetic instruction that has an innate ability
to generate condition codes. I use casting for predication in My 66000
and again in the Virtual Vector Method. Casting places the encoding bits
in a different instruction than the instruction which will ultimately
perform what the cast bits require.

This kind of casting improves the entropy of the ISA encoding.

Przemysław Cieszyński

Apr 5, 2020, 3:39:41 PM
Change of name doesn't change the substance.
You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
Condition codes are the lesser evil.

Quadibloc

Apr 5, 2020, 4:14:18 PM
On Sunday, April 5, 2020 at 1:23:53 PM UTC-6, MitchAlsup wrote:

> What you have discovered is a way to add encoding "bits" to an instruction
> (the conditional instruction in the shadow of the branch) without wasting
> bits in the instruction (the one generating the CCs).
>
> I have called this casting--the branch instruction casts its conditional
> requirement over the arithmetic instruction that has an innate ability
> to generate condition codes. I use casting for predication in My 66000
> and again in the Virtual Vector Method. Casting places the encoding bits
> in a different instruction than the instruction which will ultimately
> perform what the cast bits require.
>
> This kind of casting improves the entropy of the ISA encoding.

Ah, yes.

Basically, if I change an architecture that has condition code bits to one that
doesn't...

now I need, apparently, to indicate whether a particular operate instruction is
going to have a conditional branch associated with it.

But the old conditional branch instructions themselves, unlike operate
instructions, already implied the presence... of a conditional branch
instruction.

So just put the branch first, and the information saying that something special
needs to be done is presented in time - without having to re-distribute the
encodings, and without, apparently, needing to have a less efficient encoding.

So you're quite right, what I've suggested is an example of the technique you
were already using.

John Savard

Quadibloc

Apr 5, 2020, 4:21:16 PM
On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:

> Change of name doesn't change the substance.
> You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> Condition codes are the lesser evil.

Now, that is another point. This scheme, by having prefix instructions, is
forcing decoding to be sequential. If the instruction set is already CISC,
that's not a problem. If it was RISC, though, then the good properties of every
instruction being the same length, and so on, are lost.

There are, however, ways to fix that.

For example: think of a semi-CISC scheme, where each 32-bit block can consist of
either one 32-bit instruction, or two 16-bit instructions.

Let's suppose that the block starts with 0 if it consists of two 16-bit
instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
instruction.

Then if the second 16-bit instruction instead starts with 1, it's a relative
branch instruction atomically linked to the 16-bit operate instruction in the
first half. This time, there's no need to reverse the order. Decoding remains
RISC-like!

I didn't illustrate this kind of possibility initially, as I didn't want to make
my initial post too complicated and confusing.
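
A quick decoder sketch of that container rule in C; the bit positions chosen
here are arbitrary assumptions for the example, not a worked-out encoding.

#include <stdint.h>
#include <stdio.h>

/* Decode one 32-bit container.  Top bit 1 = one 32-bit instruction;
   top bit 0 = two 16-bit halves, and if the second half starts with 1
   it is a relative branch fused to the 16-bit operate instruction in
   the first half. */
void decode_container(uint32_t c)
{
    if (c & 0x80000000u) {
        printf("one 32-bit instruction: %08x\n", (unsigned)c);
    } else {
        uint16_t first  = (uint16_t)(c >> 16);       /* 16-bit operate insn */
        uint16_t second = (uint16_t)(c & 0xFFFFu);
        if (second & 0x8000u)
            printf("fused pair: op %04x + relative branch %04x\n",
                   (unsigned)first, (unsigned)second);
        else
            printf("two 16-bit instructions: %04x, %04x\n",
                   (unsigned)first, (unsigned)second);
    }
}

int main(void)
{
    decode_container(0x80000001u);  /* single 32-bit instruction        */
    decode_container(0x1234F00Du);  /* 16-bit op + fused relative branch */
    decode_container(0x12345678u);  /* two ordinary 16-bit instructions  */
    return 0;
}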

John Savard

EricP

Apr 5, 2020, 4:59:41 PM
Quadibloc wrote:
> On Sunday, April 5, 2020 at 11:27:46 AM UTC-6, EricP wrote:
>
>> And you have totally ignored the other flag that is not
>> directly available by looking at the result: Overflow.
>
> I haven't ignored _any_ flags. The idea is that *every* conditional branch
> instruction still works the way it did before, except now it's merged with the
> operate instruction it would have followed - and can only immediately follow
> that instruction.

It's not a branch that is "merged" into the following instruction.
It is a single instruction that has a specific behavior, AddBranchCarry.
How you choose the bit pattern for that instruction is philosophy.

You don't need to "merge branches into other instructions" because
branch already works just fine for those other tests, except overflow.

When you look at the usage cases for carry and overflow flags,
they are not justified either.

> Of course, not having persistent condition codes *does* create problems with
> using an overflow as a _trap_.
>
> John Savard
>

The point is that no one needs this, either for carry or overflow.

When propagating a carry, one wants branch-less code.
Two instructions can do it: ADD3 adds 3 registers and writes the
truncated result to a single register, and AGC3 Add Generate Carry
adds 3 registers and writes just the carry bit to a register.
Similar for subtract SUB3 and SGB3.
These 4 instructions replace the long-ish RISC sequences.
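
A C sketch of how such a pair would compose into a branch-less multi-word add
(the function bodies are only a model of the intended semantics; limb order and
widths are assumptions):

#include <stdint.h>

/* ADD3: 3-input add, truncated sum only.  AGC3: same add, carry out only.
   With these, a multi-word add needs no flags register and no branches. */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}

static uint64_t agc3(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t t = a + b;
    uint64_t s = t + c;
    return (uint64_t)(t < a) + (uint64_t)(s < t);   /* 0..2 for arbitrary c */
}

/* 256-bit add as four 64-bit limbs, limb 0 least significant. */
void add256(uint64_t d[4], const uint64_t x[4], const uint64_t y[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sum = add3(x[i], y[i], carry);
        carry        = agc3(x[i], y[i], carry);
        d[i] = sum;
    }
}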

If testing for carry then compare and branch works just fine.

For overflow/wrap detection one wants to trap at the point of detection,
not branch, because that leaves the program counter pointing at the
problem instruction.

While Overflow requires long-ish sequence to generate manually,
and if various trap-on-overflow instructions are already
present then there is minimal case for additional support.
(here I'm thinking of Add Generate Overflow or Subtract Generate
Overflow instructions which write the overflow flag to a register.)

Ivan Godard

Apr 5, 2020, 5:27:52 PM
I thought everyone used CMOVE for extended precision? Why a branch? It's
going to predict badly.

MitchAlsup

Apr 5, 2020, 5:30:49 PM
On Sunday, April 5, 2020 at 3:21:16 PM UTC-5, Quadibloc wrote:
> On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:
>
> > Change of name doesn't change the substance.
> > You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> > Condition codes are the lesser evil.
>
> Now, that is another point. This scheme, by having prefix instructions, is
> forcing decoding to be sequential. If the instruction set is already CISC,
> that's not a problem. If it was RISC, though, then the good properties of every
> instruction being the same length, and so on, are lost.

I question whether those are actually "good" properties.

They mandate an additional 6% of instructions to exist--that is, the RISC
instruction philosophy requires 106 instructions, because of its <stubborn>
fixed length, to do what a variable-length instruction set does in 100 instructions.
Most of this 6% are instructions used to 'paste' constants together. And
this data was from 32-bit only implementations--pasting 48-bit or 64-bit
constants together requires a lot more instructions, or memory reference
to a table of constants.
>
> There are, however, ways to fix that.
>
> For example: think of a semi-CISC scheme, where each 32-bit block can consist of
> either one 32-bit instruction, or two 16-bit instructions.

I wish you would use the word 'container' where you wrote 'block'
>
> Let's suppose that the block starts with 0 if it consists of two 16-bit
> instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
> instruction.

This encoding wastes entropy.
>
> Then if the second 16-bit instruction instead starts with 1, it's a relative
> branch instruction atomically linked to the 16-bit operate instruction in the
> first half. This time, there's no need to reverse the order. Decoding remains
> RISC-like!

Parsing of instructions can be performed with marker bit technology (like
Athlon or Opteron) or it can be performed with a careful arrangement of
bits in the instruction encoding template.

Decoding of instructions has to do with what resources the instruction
utilizes and what function units are appropriate to perform the
prescribed data manipulations; not how its length is determined.

MitchAlsup

Apr 5, 2020, 5:39:14 PM
On Sunday, April 5, 2020 at 3:21:16 PM UTC-5, Quadibloc wrote:
> On Sunday, April 5, 2020 at 1:39:41 PM UTC-6, Przemysław Cieszyński wrote:
>
> > Change of name doesn't change the substance.
> > You want to introduce "CISC" instruction(s) having "orthogonal" encoding.
> > Condition codes are the lesser evil.
>
> Now, that is another point. This scheme, by having prefix instructions, is
> forcing decoding to be sequential. If the instruction set is already CISC,
> that's not a problem. If it was RISC, though, then the good properties of every
> instruction being the same length, and so on, are lost.

I question whether those are 'good' properties !

The fixed length nature of RISC causes a 6% expansion of the number of
instructions executed due to having to paste constants together. And this
is from 32-bit code+32-bit address space stuff. It can only get worse
with 64-bit constants and address spaces. Having to (shudder) load
constants from a table (in memory) wastes a) instructions, b) data cache,
c) time, d) instruction cache. Pasting constants together only wastes
instruction cache.
>
> There are, however, ways to fix that.
>
> For example: think of a semi-CISC scheme, where each 32-bit block can consist of
> either one 32-bit instruction, or two 16-bit instructions.
>
> Let's suppose that the block starts with 0 if it consists of two 16-bit
> instructions, both starting with 0. Or it starts with 1 if it's a single 32-bit
> instruction.
>
> Then if the second 16-bit instruction instead starts with 1, it's a relative
> branch instruction atomically linked to the 16-bit operate instruction in the
> first half. This time, there's no need to reverse the order. Decoding remains
> RISC-like!

This wastes entropy and will eat your lunch in instruction encoding.

But there is no particular reason one cannot decode variable length
instructions rapidly, too. AMD/Intel have figured it out, it is just
not "that hard".

What is hard, is making up for the extra instructions one has to execute
if the instruction set does not allow variable length.

Quadibloc

Apr 5, 2020, 6:35:34 PM
On Sunday, April 5, 2020 at 3:30:49 PM UTC-6, MitchAlsup wrote:

> Decoding of instructions has to do with what resources the instruction
> utilizes and what function units are appropriate to perform the
> prescribed data manipulations; not how its length is determined.

Well, in a segment of code that contains no branches, if all the instructions
are the same length, then one can decode them all - and do further processing on
the instructions that is not precluded by dependencies - in parallel, instead of
sequentially.

That's why having all the instructions the same length is so attractive.

Even on architectures with variable-length instructions, generally speaking long
immediate values are not provided for. (e.g. the IBM System/360, with 16-bit,
32-bit, and 48-bit instructions, has no floating-point instructions with 32-bit
immediates, let alone 64-bit immediates. It had 8-bit immediates, and successors
added 16-bit integer immediates.)

Hence, I came up with an idea, described here some time ago, where instructions
of uniform length had short pointers into unused storage in an instruction
"container", as you put it, to the immediates. (As they're pointed to, of
course, they're not "really" immediates, but they're close by in the instruction
stream, so they have the advantages of immediates instead of the drawbacks of
constants fetched from 'way over in data storage.)

One needs some way of saying how many instructions are in the container, though;
using 1 01 001 0001 coding of the first bit of each 32-bit instruction, while
efficient, would *strongly* tempt implementors to do it in a serial rather than
a parallel way. So I went off to using seven 36-bit instructions in a 256-bit
container to avoid that, along with other weirdness, and even I fear the result
is too complicated.

John Savard

Andy Valencia

Apr 5, 2020, 7:13:04 PM
MitchAlsup <Mitch...@aol.com> writes:
> Most of this 6% are instructions used to 'paste' constants together. And
> this data was from 32-bit only implementations

But can't the decoder basically treat the obvious LUI/ORI sequence
as one decoded instruction? Or is it the increased icache pressure
which causes the biggest problem?

Andy Valencia

Michael Barry

Apr 5, 2020, 9:33:35 PM
On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> added 16-bit integer immediates.)
> [...]
> Hence, I came up with an idea, described here some time ago, where
> instructions of uniform length had short pointers into unused storage in
> an instruction "container", as you put it, to the immediates. (As they're
> pointed to, of course, they're not "really" immediates, but they're close
> by in the instruction stream, so they have the advantages of immediates
> instead of the drawbacks of constants fetched from 'way over in data
> storage.)
> [...]

I believe that concept is equivalent (or nearly so) to a "literal pool",
which has been around for many decades.

I always thought of "COME FROM" as equivalent to a breakpoint, and I found
some corroborating opinions during a quick search.

Mike B.

Przemysław Cieszyński

Apr 5, 2020, 10:02:29 PM
So what about regular compressed (16 bit) operations?
Block/Container has no practical advantage over "single complex instruction". It is better to organize it as "mandatory fusion", so you can use compression logic on independent parts of the "instruction". Anyway, you lose pipeline simplicity.

MitchAlsup

Apr 5, 2020, 10:12:26 PM
On Sunday, April 5, 2020 at 6:13:04 PM UTC-5, Andy Valencia wrote:
Having actually done this, I think it is easier to simply provide long
constants in the ISA. The 6% instruction penalty can be ameliorated with
heroic instruction parsing techniques, but never (in my opinion) eliminated.
The I$ footprint is smallest when the constant occupies as many bits as it
needs but no more.

Certainly we can agree that pasting constants together is simply harder
than providing access to whatever constant size is required.

It is inarguable that providing large constants through Load instructions
is simply "bad".

What you allude to is amelioration not a substitute for having properly
sized constants.

MitchAlsup

Apr 5, 2020, 10:14:47 PM
On Sunday, April 5, 2020 at 8:33:35 PM UTC-5, Michael Barry wrote:
> On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> > added 16-bit integer immediates.)
> > [...]
> > Hence, I came up with an idea, described here some time ago, where
> > instructions of uniform length had short pointers into unused storage in
> > an instruction "container", as you put it, to the immediates. (As they're
> > pointed to, of course, they're not "really" immediates, but they're close
> > by in the instruction stream, so they have the advantages of immediates
> > instead of the drawbacks of constants fetched from 'way over in data
> > storage.)
> > [...]
>
> I believe that concept is equivalent (or nearly so) to a "literal pool",
> which has been around for many decades.

Provided on machines that do not properly support constants.

And it also hurts D$ performance.

Bruce Hoult

Apr 5, 2020, 11:41:18 PM
On Monday, April 6, 2020 at 11:13:04 AM UTC+12, Andy Valencia wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> > Most of this 6% are instructions used to 'paste' constants together. And
> > this data was from 32-bit only implementations
>
> But can't the decoder basically treat the obvious LUI/ORI sequence
> as one decoded instruction?

Yes, in a mid to high end implementation. So while the number of instructions is higher (let's assume Mitch's 6%) the number of uOps isn't if merging happens.

Also I'm not sure whether Mitch meant static or dynamic instructions. Constant loads can usually be moved outside loops.


>Or is it the increased icache pressure which causes the biggest problem?

Again assuming the number of static instructions is 6% higher, the code size isn't 6% higher because the two fixed size instructions aren't twice as big as the single variable-length instruction. If I recall correctly, the basic instruction size in My 66000 is 32 bits, with any literals adding to that. If that's correct then a 32 bit instruction with appended 32 bit literal is 64 bits, exactly the same as two 32 bit MIPS or RISC-V instructions.

There is a gain for 64 bit literals though. Mitch's machine would need 32 + 64 bits (96), while the base RV64I instruction set needs up to 6x 32 bit instructions (192 bits) in the general case. This is reduced with the C extension as the final shift and OR can with suitable register allocation always be 16 bit instructions (so 160 bits). Specific constants can be smaller as one or more of the LUIs or ADDIs may be able to be encoded as 16 bit instructions or even omitted.

The B extension will reduce the general case to 5 instructions, using PACK instead of shift/or.

Aarch64 can do the general case with 4 instructions as it can insert a 16 bit literal into any of the 4 possible positions in a 64 bit register, so 128 bits of instructions total. That's four instructions instead of one, but only 33% more bits.

Quadibloc

Apr 6, 2020, 2:22:20 AM
On Sunday, April 5, 2020 at 7:33:35 PM UTC-6, Michael Barry wrote:

> I always thought of "COME FROM" as equivalent to a breakpoint, and I found
> some corroborating opinions during a quick search.

Interesting.

The way I thought of it was that it was isomorphic to the way programs were
written: COME FROM -> label, or line number; line number -> GO TO.

Which was, of course, the joke.

John Savard

Quadibloc

Apr 6, 2020, 2:28:41 AM
On Sunday, April 5, 2020 at 7:33:35 PM UTC-6, Michael Barry wrote:
> On Sunday, April 5, 2020 at 3:35:34 PM UTC-7, Quadibloc wrote:
> > added 16-bit integer immediates.)
> > [...]
> > Hence, I came up with an idea, described here some time ago, where
> > instructions of uniform length had short pointers into unused storage in
> > an instruction "container", as you put it, to the immediates. (As they're
> > pointed to, of course, they're not "really" immediates, but they're close
> > by in the instruction stream, so they have the advantages of immediates
> > instead of the drawbacks of constants fetched from 'way over in data
> > storage.)
> > [...]
>
> I believe that concept is equivalent (or nearly so) to a "literal pool",
> which has been around for many decades.

Maybe it is _nearly_ equivalent to that. Put the literals on page zero of an
architecture like the PDP-8 (or HP 211x or Honeywell 316/516).

But what I'm doing is putting the constants *in the instruction stream*, so I'm
getting _close_ to what Mitch Alsup wants. Instead of having no overhead, I have
a three or four bit field in the instruction corresponding to the constant, but
that avoids having to have instructions whose lengths vary all over the place,
which will also involve overhead of a different kind.

Instead, the instructions can be decoded as if they're all the same length, even
if, in addition to immediates, the same mechanism is used to point to additional
parts of instructions that need to be longer.

John Savard

Anton Ertl

Apr 6, 2020, 4:36:47 AM
Bruce Hoult <bruce...@gmail.com> writes:
>There is a gain for 64 bit literals though. Mitch's machine would need 32 +
>64 bits (96), while the base RV64I instruction set needs up to 6x 32 bit
>instructions (192 bits) in the general case.

IIRC on Alpha general 64-bit literals are loaded with one 32-bit
instruction, from the constant table. So in terms of code size and
I-cache it is a win (but I think that the obsession of many posters
here with code size misguided). Costs: 1) D-cache pressure (and there
the utilization of the cache line may be significantly lower than in
the I-cache). 2) It consumes a register as pointer to the constant
table. 3) It consumes additional instructions when calling/returning
between compilation units.

Also, when I looked at the code generated for the big constants in my
programs, gcc often had managed to produce a short sequence of
instructions instead; somebody must have invested quite a bit of work
into that.

>This is reduced with the C extension as the final shift and OR can with
>suitable register allocation always be 16 bit instructions (so 160 bits).
>Specific constants can be smaller as one or more of the LUIs or ADDIs may
>be able to be encoded as 16 bit instructions or even omitted.
>
>The B extension will reduce the general case to 5 instructions, using PACK
>instead of shift/or.

An instruction that does

dest = lit16|(src<<16)

would allow doing it in 128 bits. Only one instruction would be
needed (use zero as src for the first one). Maybe a second one with

dest = (~lit16)|(src<<16)

to make the sequences for -2^95<=n<0 shorter.

Alternatively, with 22-bit literals and

dest = lit22|(dest<<22)

it could be done in 96 bits. You would need a second instruction for
the initial literal:

dest = sext(lit22)
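
Expressed in C, the 16-bit form would build a full 64-bit constant like this
(a sketch of the proposed instruction, not of anything existing):

#include <stdint.h>

/* Hypothetical instruction: dest = lit16 | (src << 16). */
static uint64_t lit16_shift(uint64_t src, uint16_t lit16)
{
    return (uint64_t)lit16 | (src << 16);
}

/* Four of them build any 64-bit constant, most significant 16 bits first. */
uint64_t build_const(uint16_t hi, uint16_t mh, uint16_t ml, uint16_t lo)
{
    uint64_t d = lit16_shift(0, hi);    /* use zero as src for the first one */
    d          = lit16_shift(d, mh);
    d          = lit16_shift(d, ml);
    d          = lit16_shift(d, lo);
    return d;                           /* == hi:mh:ml:lo */
}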

So why do neither Aarch64 nor RISC-V have it (and not Alpha, which was
also designed from the start as a 64-bit architecture)? Maybe the
sequential nature of the sequences is the reason; but is it a real
problem?

1) In single-issue implementations it is not.

2) In OoO implementations, it should not be, either: With good branch
prediction (the usual case) the front end runs far ahead of the
execution of instructions on the critical data-flow path. These
constant sequences would depend on nothing and their instructions
would be executed soon after being decoded, and their results would be
available early enough to avoid delaying the instructions on the
critical path. Also instructions from these sequences can be combined
in the decoder.

3) Superscalar in-order implementations pose the biggest problem, but
even there it does not seem to be particularly bad: a) The compiler
can interleave instructions from constant sequences with other
instructions to provide ILP. b) The decoder can combine these
sequences into a single instruction. However, one probably has to
decide in advance which way such implementations should go because
dealing with it in the decoder probably becomes much harder if the
instructions are interleaved.

Anton Ertl

unread,
Apr 6, 2020, 4:43:23 AM4/6/20
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Sunday, April 5, 2020 at 3:00:11 AM UTC-6, Anton Ertl wrote:
>
>> The one thing where the approach of the original MIPS has the biggest
>> disadvantage is long addition/subtraction. Architectures with
>> condition codes have a carry bit for that, original MIPS and its
>> descendants replace this by performing more instructions. I dimly
>> remember that a recent MIPS revision incorporates a carry bit. Your
>> approach does not solve this problem.
>
>My approach certainly *does* solve the problem, and perfectly:
>
>ADD 1,2
>JCC CARRY
>
>becomes
>
>PCBC CARRY
>ADD 1,2
>
>which does exactly the same thing -

But it does not solve the problem.

A 256-bit addition on AMD64 looks as follows:

add rsi,r9
adc rdi,r10
adc rax,r11
adc rbx,r12

with each instruction propagating the carry bit to the next one. How
would you do that?

Terje Mathisen

Apr 6, 2020, 4:57:26 AM
Your pair of ADD3 and AGC3 instructions is just a way to split a single
ADC with two explicit and one implicit source into two instructions that
need three explicit inputs each, and which do exactly the same
underlying operation, so the only rasonable implementation would be to
detect such a pair and internally turn it into a three-source and
two-destination operation.

BTW, it also has the additional issue that the third input cannot be
general, i.e. it pretty much has to use only the bottom bit and/or
require it to be either 1 or 0: If you allow any three inputs then you
must also allow the AGC3 part to generate two-bit outputs.
>
> If testing for carry then compare and branch works just fine.
>
> For overflow/wrap detection one wants to trap at the point of detection,
> not branch, because that leaves the program counter pointing at the
> problem instruction.
>
> While Overflow requires long-ish sequence to generate manually,
> and if various trap-on-overflow instructions are already
> present then there is minimal case for additional support.
> (here I'm thinking of Add Generate Overflow or Subtract Generate
> Overflow instructions which write the overflow flag to a register.)
>
This is much more reasonable if you cannot just have the full orthogonal
set, which includes add_trap_on_overflow.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Apr 6, 2020, 5:23:00 AM
CMOVE is mostly quite slow, so for limited sizes, some of those carry
chains can be short enough to make at least the final (rare overflow)
carry better handled with a branch.

Anton Ertl

Apr 6, 2020, 5:49:18 AM
EricP <ThatWould...@thevillage.com> writes:
>For overflow/wrap detection one wants to trap at the point of detection,
>not branch, because that leaves the program counter pointing at the
>problem instruction.
>
>While Overflow requires long-ish sequence to generate manually,
>and if various trap-on-overflow instructions are already
>present then there is minimal case for additional support.
>(here I'm thinking of Add Generate Overflow or Subtract Generate
>Overflow instructions which write the overflow flag to a register.)

There are a few use cases for branch on overflow:

1) comparing with minint:

dec %edi
jno ...

This extends to other operand sizes and variations can be used for
maxint, and for comparing with a range minint...minint+c, or
maxint-c...maxint.

2) Forth's +LOOP is specified in a way that maps pretty well to integer
overflow and branch on overflow. E.g., in SwiftForth "5 +LOOP"
compiles to

add 5,0(%esp)
jno ...

while the C code in Gforth is as follows:

Cell olddiff = n1-nlimit;
n2=n1+n;
if (((olddiff^(olddiff+n)) &(olddiff^n)) >=0) {
/* perform the branch */
}
/* fall-through path */

with the resulting code being

mov %rax,%r9
sub 0x8(%r13),%r9
add %r14,%rax
lea (%r9,%r14,1),%rsi
xor %r9,%r14
xor %r9,%rsi
test %r14,%rsi
js fallthrough
... perform the branch

This is not completely comparable, because SwiftForth keeps a biased
value (I think olddiff+maxint) in 0(%esp), but even if we wanted to
use a similar trick in Gforth, it would only eliminate the first two
instructions.

Alex McDonald

Apr 6, 2020, 7:57:38 AM
I have never understood why ISAs insist on having literal forms of
instructions where the literal is in /this/ instruction. I know the
square root of fanny adams about the subject, but why not have the
literal with its own size opcode follow?

ldz-lit-following reg ; load literal zero extend into reg
32bits 1234567 ; opcode for 32bit literal

or

ldz-lit-following reg ; same op as above
8bits 10 ;



--
Alex

Anton Ertl

Apr 6, 2020, 8:46:10 AM
Alex McDonald <al...@rivadpm.com> writes:
>I have never understood why ISAs insist on having literal forms of
>instructions where the literal is in /this/ instruction. I know the
>square root of fanny adams about the subject, but why not have the
>literal with its own size opcode follow?

There are such instruction sets, e.g., VAX, IA-32, and b16. However,
if you do a fixed-width instruction set for easy decoding, you don't
want some data item in the middle of the instruction stream.

EricP

Apr 6, 2020, 10:39:13 AM
> .... perform the branch
>
> This is not completely comparable, because SwiftForth keeps a biased
> value (I think olddiff+maxint) in 0(%esp), but even if we wanted to
> use a similar trick in Gforth, it would only eliminate the first two
> instructions.
>
> - anton

Those are examples of doing overflow error detection using branches
because that is the only tool available to them. And it changes
the program counter away from the erroneous instruction, losing
traceability to the error.

An AddTrapOverflow instruction is the proper tool because it has no cost
and leaves the program counter pointing at the erroneous instruction.

Other than error detection I can't think of any other use for an
Overflow flag. But even if there is such an example, its frequency
of occurrence is so low that it does not warrant support in an ISA.


already...@yahoo.com

Apr 6, 2020, 11:14:42 AM
At least on x86/Arm/SPARC the Overflow flag is used for signed integer comparisons.
That sounds like a FAR more common use than error detection.

EricP

Apr 6, 2020, 11:38:55 AM
Ivan Godard wrote:
>
> I thought everyone used CMOVE for extended precision? Why a branch? It's
> going to predict badly.

This is about what happens when one eliminates the condition code
register from an ISA. Some condition flags, sign, zero, odd/even,
are redundant with the result value so there is no loss.
Only Carry and Overflow represent a potential loss of information
and one needs to ensure equivalent functionality is provided.

One common usage for a carry flag as a distinct register
is Add With Carry, Subtract With Borrow instructions.

An Add With Carry on an ISA without a carry flag takes 5 instructions

tmp_res = A + B
result = tmp_res + carry_in
tmp_cr1 = tmp_res < A
tmp_cr2 = result < tmp_res
carry_out = tmp_cr1 | tmp_cr2

That might be too expensive for some applications.
It could be done in a single instruction that has 3 source registers
and 2 destination registers. Most ISA's would find that unacceptable.

A compromise is my suggestion of 2 instructions with 3 source registers
and 1 dest register, one to calc the sum, one to generate the carry out.

Another use of carry out is as an error indicator on unsigned overflow.
An AddTrapCarry instruction is best suited for that.

And anyone who really wants to detect a carry out and save it in a
register can use a CMP instruction, and branch on it if they like.

None of this requires merging branches into arithmetic operations.

Ivan Godard

Apr 6, 2020, 11:54:15 AM
Another data point:

Various wide-encoding (bundled, VLIW) ISAs have encoded big literals by
repurposing other slots in the bundle. In effect the decoder provides
anonymous registers holding the un-decoded bits of instructions in all
or a defined set of the instructions in the bundle. An instruction
elsewhere in the bundle can then treat the instruction pseudo-registers
as real registers in arithmetic operations. In the only case I
personally worked on the whole bundle was really more like horizontal
microcode than a conventional VLIW bundle, but that's not relevant to
the method.

The decoder of course had to suppress treating the literal as an actual
instruction. IIRC only one slot could use the "literal register" and use
of the register did the suppress as a side effect; references to the
"literal register" in other slots read as zero, similar to the common
"zero register" technique. Extension to more than one reference, or
other means to disambiguate, seems straightforward.

Mill doesn't repurpose bits like the above; instead adjacent
instructions in the bundle can concatenate their carried literals to
make a larger one. Every slot is assigned 32+(3*morsel) bits in a
literal buffer, which bits present as zero for all bits not specified by
the instruction in the slot. Most instructions just use (or don't) the
bits in their own part of the buffer, but one opcode says "consider that
the owned bits of the instruction to my left actually extend into my
part of the buffer". Note that there is no actual data movement in this.

The "ownership extension" facility is used mostly by the Mill
instructions that take a list of things as an argument, such as call and
its argument list for example. However it is also used to build wide
literals, including quad (128 bit) literals on members with hardware
support for quad. In a worst case (Gold, with a 5-bit morsel size and a
32 entry belt) a predicated call instruction with 32 arguments and more
than one result needs 5 (predicate) + 5 (result count) + 32*5
(arguments) + 32 (address), or 202 bits, while each slot owns 32+(3*5)
or 47 bits. The instruction thus encodes as one slot with a call
operation and four slots with the "flowArgs" instruction that extends
the part of the literal buffer owned by the call.

Gold has only five flow slots (as presently configured), so a monster
call like this would prevent the bundle from having any other flow
instruction in that bundle. More typical calls use only one or two
slots, so a bundle with a call can also encode some loads, stores, or
other calls too.

Mitch hasn't talked about native quad support (that I noticed anyway),
but it looks like his encoding scheme could be naturally extended to
quad literals.

Machines with SIMD might also have a need for very big literals. So far
the Mill splat and iota hardware instructions seem to be sufficient in
the very limited exploration of SIMD we have yet done. It's not clear
whether we could extend the present Mill scheme to say 512-bit literals
for SIMD; it would take 10 flow slots at 47 bits each, and many of those
slots would be otherwise useless.

EricP

Apr 6, 2020, 12:09:37 PM
On a RISC-ish ISA the Compare instruction provides for that.

So to rephrase my question, when an ISA eliminates the condition
code register, and adds CMP rd,rs1,rs2 and AddTrapOverflow
instructions to provide equivalent or better functionality,
is there any other use for the prior Overflow condition code
flag that needs to be provided for?

Anton Ertl

Apr 6, 2020, 1:08:43 PM
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> There are a few use cases for branch on overflow:
>>
>> 1) comparing with minint:
>>
>> dec %edi
>> jno ...
>>
>> This extends to other operand sizes and variations can be used for
>> maxint, and for comparing with a range minint...minint+c, or
>> maxint-c...maxint.
>>
>> 2) Forth's +LOOP is specified in a way that maps pretty well to integer
>> overflow and branch on overflow. E.g., in SwiftForth "5 +LOOP"
>> compiles to
>>
>> add 5,0(%esp)
>> jno ...
...
>Those are examples of doing overflow error detection using branches
>because that is the only tool available to them.

No. If you compare with minint (use case 1), equality is not an
error. The exit of a counted loop (use case 2) is not an error,
either.

>Other than error detection I can't think of any other use for an
>Overflow flag.

Or you choose to ignore them.

>But even if there is such an example, its frequency
>of occurrence is so low that it does not warrant support in an ISA.

The original MIPS and its descendants seem to support your position.
However, microMIPS32 r6 has added BOVC and BNVC, which are
add-and-and-branch-on-overflow instructions. It seems that, after 30
years, they found that the needs are frequent enough to warrant
support in an ISA.

MitchAlsup

Apr 6, 2020, 1:12:35 PM
On Sunday, April 5, 2020 at 10:41:18 PM UTC-5, Bruce Hoult wrote:
> On Monday, April 6, 2020 at 11:13:04 AM UTC+12, Andy Valencia wrote:
> > MitchAlsup <?????@aol.com> writes:
> > > Most of this 6% are instructions used to 'paste' constants together. And
> > > this data was from 32-bit only implementations
> >
> > But can't the decoder basically treat the obvious LUI/ORI sequence
> > as one decoded instruction?
>
> Yes, in a mid to high end implementation. So while the number of instructions is higher (let's assume Mitch's 6%) the number of uOps isn't if merging happens.
>
> Also I'm not sure whether Mitch meant static or dynamic instructions. Constant loads can usually be moved outside loops.
>
>
> >Or is it the increased icache pressure which causes the biggest problem?
>
> Again assuming the number of static instructions is 6% higher, the code size isn't 6% higher because the two fixed size instructions aren't twice as big as the single variable-length instruction. If i recall correctly, the basic instruction size in MY 66000 is 32 bits, with any literals adding to that. If that's correct then a 32 bit instruction with appended 32 bit literal is 64 bits, exactly the same as two 32 bit MIPS or RISC-V instructions.

But MIPS has only assembled the constant by this point, not used it as
an operand as My 66000 has.
>
> There is a gain for 64 bit literals though. Mitch's machine would need 32 + 64 bits (96), while the base RV64I instruction set needs up to 6x 32 bit instructions (192 bits) in the general case. This is reduced with the C extension as the final shift and OR can with suitable register allocation always be 16 bit instructions (so 160 bits). Specific constants can be smaller as one or more of the LUIs or ADDIs may be able to be encoded as 16 bit instructions or even omitted.

Which means 64-bit codes will have more overhead than mentioned above.
>
> The B extension will reduce the general case to 5 instructions, using PACK instead of shift/or.
>
> Aarch64 can do the general case with 4 instructions as it can insert a 16 bit literal into any of the 4 possible positions in a 64 bit register, so 128 bits of instructions total. That's four instructions instead of one, but only 33% more bits.

How much nicer it is to just have constants available.

MitchAlsup

Apr 6, 2020, 1:20:53 PM
You are loading the literal for a reason--to use it. In My 66000 ISA
you grab the constant from the Istream and use it in the instruction
carrying the literal. Here, you just "got" the literal, whereas:

DIV R7,123987456873,R19

You have both gotten the literal and used it. Large immediates are almost
always used once. Displacements can be used several times.

EricP

Apr 6, 2020, 1:26:40 PM
Terje Mathisen wrote:
> EricP wrote:
>>
>> When propagating a carry, one wants branch-less code.
>> Two instructions can do it: ADD3 adds 3 registers and writes the
>> truncated result to a single register, and AGC3 Add Generate Carry
>> adds 3 registers and writes just the carry bit to a register.
>> Similar for subtract SUB3 and SGB3.
>> These 4 instructions replace the long-ish RISC sequences.
>
> Your pair of ADD3 and AGC3 instructions is just a way to split a single
> ADC with two explicit and one implicit source into two instructions that
> need three explicit inputs each, and which do exactly the same
> underlying operation, so the only reasonable implementation would be to
> detect such a pair and internally turn it into a three-source and
> two-destination operation.

Actually what I was trying to do was not have instructions that are
(a) rarely used and (b) expensive for a minimum implementation.

I felt 2 dest registers in a single instruction, whether specified
in an ISA instruction or created by fusion, has a bunch of nasty
uArch implications for rename, writeback ports and writeback
arbitration that I did not want to force on all implementations.

3 source registers is already required for a base-index store
so there is no additional cost if an ISA has that (which I would).
Also there are some bitfield extract-merge instructions that can use
3 source registers.

So 2 instructions ADD3 and AGC3 seems a reasonable compromise.

> BTW, it also has the additional issue that the third input cannot be
> general, i.e. it pretty much has to use only the bottom bit and/or
> require it to be either 1 or 0: If you allow any three inputs then you
> must also allow the AGC3 part to generate two-bit outputs.

I was only thinking of 1 carry bit but I see no reason to
limit it to that - so 2 carry bits it is!

Quadibloc

Apr 6, 2020, 1:29:02 PM
On Monday, April 6, 2020 at 8:39:13 AM UTC-6, EricP wrote:
> And it changes
> the program counter away from the erroneous instruction loosing
> traceability to the error.

Ah, so that's an argument for having a conditional jump to subroutine instruction.

John Savard

Quadibloc

Apr 6, 2020, 1:33:48 PM
On Monday, April 6, 2020 at 2:43:23 AM UTC-6, Anton Ertl wrote:

> But it does not solve the problem.
>
> A 256-bit addition on AMD64 looks as follows:
>
> add rsi,r9
> adc rdi,r10
> adc rax,r11
> adc rbx,r12
>
> with each instruction propagating the carry bit to the next one. How
> would you do that?

A 128-bit addition would work, simply by treating "add with carry" like a
conditional branch, as an instruction prefix to the plain add instruction.

Making it both a prefix and prefixable leads to the problem of excessively long
sequences that could block interrupts. So there is a problem.

John Savard

Stefan Monnier

Apr 6, 2020, 1:44:07 PM
> So 2 instructions ADD3 and AGC3 seems a reasonable compromise.

As mentioned by someone else before, the issue of condition codes has
seen many different variations in different architectures, and none seem
very clearly superior to the other in general.

Does anyone know of an architecture which stored the carry/overflow bits as
extra "metadata" bits in the registers? So you'd need, say, 34-bit
registers to hold the "32-bit data, plus overflow and carry", but those
extra bits wouldn't be sent to cache & dram (except for context
switching, of course). It seems like an extra 2 bits in each register
should be fairly cheap, and it eliminates the "implicit state" of
traditional condition codes. It also gives you multiple condition
codes, similarly to what Power does. It would still require a 3-input
instruction for "add with carry", but it'd be otherwise just as
efficient as i386's `adc` when doing a 256-bit addition.
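
A software model of what that could look like, with made-up widths (34-bit
registers as 32 data bits plus two metadata bits):

#include <stdint.h>

struct reg {                 /* 32 data bits + 2 metadata bits per register */
    uint32_t val;
    unsigned carry : 1;
    unsigned ovf   : 1;
};

/* 3-input add-with-carry: rd = ra + rb + rc.carry; rd's own metadata
   records the carry and signed overflow of this addition. */
struct reg addc(struct reg ra, struct reg rb, struct reg rc)
{
    struct reg rd;
    uint64_t wide = (uint64_t)ra.val + rb.val + rc.carry;
    rd.val   = (uint32_t)wide;
    rd.carry = (unsigned)(wide >> 32) & 1u;
    rd.ovf   = ((~(ra.val ^ rb.val)) & (ra.val ^ rd.val)) >> 31;
    return rd;
}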


Stefan

EricP

Apr 6, 2020, 1:55:32 PM4/6/20