>> There is a gain for 64 bit literals though. Mitch's machine would need
>> 32 + 64 bits (96), while the base RV64I instruction set needs up to
>> 6x 32 bit instructions (192 bits) in the general case.
>
> IIRC on Alpha general 64-bit literals are loaded with one 32-bit
> instruction, from the constant table. So in terms of code size and
> I-cache it is a win (but I think that the obsession of many posters
> here with code size is misguided). Costs: 1) D-cache pressure (and there
> the utilization of the cache line may be significantly lower than in
> the I-cache). 2) It consumes a register as pointer to the constant
> table. 3) It consumes additional instructions when calling/returning
> between compilation units.
>
> Also, when I looked at the code generated for the big constants in my
> programs, gcc often had managed to produce a short sequence of
> instructions instead; somebody must have invested quite a bit of work
> into that.
>
Yeah; loading constants from memory kinda sucks...
>> This is reduced with the C extension as the final shift and OR can with
>> suitable register allocation always be 16 bit instructions (so 160 bits).
>> Specific constants can be smaller as one or more of the LUIs or ADDIs may
>> be able to be encoded as 16 bit instructions or even omitted.
>>
>>
>> The B extension will reduce the general case to 5 instructions, using
>> PACK instead of shift/or.
>
> An instruction that does
>
> dest = lit16|(src<<16)
>
> would allow doing it in 128 bits. Only one instruction would be
> needed (use zero as src for the first one). Maybe a second one with
>
> dest = (~lit16)|(src<<16)
>
> to make the sequences for -2^95<=n<0 shorter.
>
BJX2 basically has these already...
Essentially, this constant-loading mechanism was what the ISA was
originally designed around.
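For reference, a rough C sketch of the idea (the helper name and exact
semantics are illustrative assumptions, not the actual BJX2 encoding):
building a general 64-bit constant takes four of these shift-or steps,
i.e. 4 x 32-bit ops, or 128 bits of code, matching the figure quoted above.

  #include <stdint.h>
  #include <stdio.h>

  /* Model of the quoted primitive: dest = lit16 | (src << 16).
     Helper name is made up for illustration. */
  static uint64_t shift_or16(uint64_t src, uint16_t lit16)
  {
      return (uint64_t)lit16 | (src << 16);
  }

  int main(void)
  {
      uint64_t r = 0;               /* use zero as src for the first one */
      r = shift_or16(r, 0x0123);
      r = shift_or16(r, 0x4567);
      r = shift_or16(r, 0x89AB);
      r = shift_or16(r, 0xCDEF);
      printf("0x%016llX\n", (unsigned long long)r);
      /* prints 0x0123456789ABCDEF */
      return 0;
  }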
Its role/future is less certain now that the ISA has "jumbo ops", which
can more directly encode larger immediate values. But the shift-or
mechanism still makes sense to keep around, given that jumbo ops are a
fairly expensive feature and are therefore kept optional.
The jumbo ops have a "jumbo prefix" which glues 22 bits onto the
following op, allowing cases of:
  OP Imm33s, Rn
  OP Rm, Imm32s, Rn
  OP (Rm, Disp31), Rn
As well as a 64-bit constant load (can be encoded in 96 bits), and a
32-bit constant load in 64 bits.
The constant load cases don't save much space (loading a 32-bit constant
in two 32-bit ops was already possible), but they do allow the constant
to be loaded in a single clock cycle.
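As a hedged sketch of how the prefix's 22 bits might get glued onto the
following op's immediate (the 22+11 split yielding Imm33s is an assumption
for illustration; the real BJX2 bit layout may differ):

  #include <stdint.h>
  #include <stdio.h>

  /* Sketch: a jumbo prefix contributes 22 extra immediate bits, assumed
     here to sit above an 11-bit immediate in the base op (33 bits total). */
  static int64_t glue_imm33s(uint32_t prefix_bits22, uint32_t base_bits11)
  {
      uint64_t raw = ((uint64_t)(prefix_bits22 & 0x3FFFFF) << 11)
                   | (base_bits11 & 0x7FF);
      /* sign-extend from bit 32 */
      return (int64_t)(raw << 31) >> 31;
  }

  int main(void)
  {
      /* example: all-ones 33-bit immediate decodes as -1 */
      printf("%lld\n", (long long)glue_imm33s(0x3FFFFF, 0x7FF));
      return 0;
  }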
> Alternatively, with 22-bit literals and
>
> dest = lit22|(dest<<22)
>
> it could be done in 96 bits. You would need a second instruction for
> the initial literal:
>
> dest = sext(lit22)
>
However, 22 bits is a bit steep if one also needs to provide space for a
destination register.
There is, however, an instruction to load a 25-bit sign-extended value
into R0, which was/is a common case for immediates that don't fit in
whatever the base instruction provides, but do happen to fit into a
25-bit immediate.
In early design stages, there was IIRC an op for:
  Rn = (R0<<16)|Imm16;
But it no longer exists, as it wouldn't have saved much.
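For illustration, a small C model of those two operations (the 25-bit
sign-extended load into R0 and the removed (R0<<16)|Imm16 form); the
helper names are made up:

  #include <stdint.h>
  #include <stdio.h>

  /* Sketch of the 25-bit sign-extended load into R0. */
  static int64_t ldi25s(uint32_t imm25)
  {
      return (int64_t)((uint64_t)(imm25 & 0x1FFFFFF) << 39) >> 39;
  }

  /* Sketch of the removed op: Rn = (R0 << 16) | Imm16. */
  static uint64_t shori_r0(uint64_t r0, uint16_t imm16)
  {
      return (r0 << 16) | imm16;
  }

  int main(void)
  {
      int64_t  r0 = ldi25s(0x1FFFFFF);          /* -1 after sign extension */
      uint64_t rn = shori_r0((uint64_t)r0, 0x1234);
      printf("%lld 0x%llX\n", (long long)r0, (unsigned long long)rn);
      /* prints: -1 0xFFFFFFFFFFFF1234 */
      return 0;
  }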
> So why do neither Aarch64 nor RISC-V have it (and not Alpha, which was
> also designed from the start as a 64-bit architecture)? Maybe the
> sequential nature of the sequences is the reason; but is it a real
> problem?
>
> 1) In single-issue implementations it is not.
>
> 2) In OoO implementations, it should not be, either: With good branch
> prediction (the usual case) the front end runs far ahead of the
> execution of instructions on the critical data-flow path. These
> constant sequences would depend on nothing and their instructions
> would be executed soon after being decoded, and their results would be
> available early enough to avoid delaying the instructions on the
> critical path. Also instructions from these sequences can be combined
> in the decoder.
>
> 3) Superscalar in-order implementations pose the biggest problem, but
> even there it does not seem to be particularly bad: a) The compiler
> can interleave instructions from constant sequences with other
> instructions to provide ILP. b) The decoder can combine these
> sequences into a single instruction. However, one probably has to
> decide in advance which way such implementations should go because
> dealing with it in the decoder probably becomes much harder if the
> instructions are interleaved.
>
Can't say for certain.
What I can note is that my CPU cores are either single-issue/scalar or
VLIW-style (which can still run scalar code, albeit slower). There is
partial compatibility to let VLIW-style code be usable on a scalar core,
but as-is, "jumbo ops" kind of mess this up.
It is possible, however, that jumbo ops could be supported on a scalar
core via a different mechanism, eg:
  The prefix ops are decoded as NOPs (but update some internal state);
  The decoder for the following op takes note that a prefix was present.
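A rough sketch of that decode approach in C (the structure and the opcode
check are hypothetical, just to show the shape of it):

  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
      bool     jumbo_pending;   /* a prefix was seen on the previous op */
      uint32_t jumbo_bits;      /* the 22 bits it carried               */
  } DecodeState;

  static bool is_jumbo_prefix(uint32_t op)
  {
      /* placeholder predicate; the real opcode check is ISA-specific */
      return (op >> 24) == 0xFE;
  }

  static void decode_one(DecodeState *st, uint32_t op)
  {
      if (is_jumbo_prefix(op)) {
          /* executes as a NOP, but the decoder latches the extra bits */
          st->jumbo_pending = true;
          st->jumbo_bits    = op & 0x3FFFFF;
          return;
      }
      if (st->jumbo_pending) {
          /* fold st->jumbo_bits into this op's immediate before issue */
          st->jumbo_pending = false;
      }
      /* ... normal scalar decode/issue of 'op' goes here ... */
  }

  int main(void)
  {
      DecodeState st = {0};
      decode_one(&st, 0xFE123456);   /* prefix */
      decode_one(&st, 0x00000001);   /* following op sees the pending bits */
      return 0;
  }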
In my case, I have noted that performance still seems to be affected
somewhat by memory subsystem performance.
Tried boosting the L1 and L2 cache sizes:
  Doom engine games saw the biggest improvement in performance;
  Quake, still slow, but appears to have a pretty good hit ratio;
  ROTT wasn't affected much, but also has the worst hit/miss ratio.
Cache size increases:
  L1 I$: 2K -> 8K
  L1 D$: 2K -> 16K
  L2: 64K -> 128K
Caches are still direct mapped though; this is about the biggest I can
make L2 without going over the FPGA's resource budget.
As-is: Doom variants (Doom 1/2 and Hexen) seem to be getting the best
performance at present.
ROTT seems to get notably worse performance (reason not yet entirely
obvious, *).
So, eg, some approximate "average-ish" framerates:
  Doom:  ~18-22 (acceptable)
  ROTT:  ~ 7-10 (kinda sucks/annoying)
  Quake: ~ 3-4  (mostly unplayable)
Combined (L1 D$ + L2) hit ratio:
  Quake: ~ 98.0% hit;
  Doom:  ~ 97.6% hit;
  ROTT:  ~ 91.3% hit.
Average MIPS values ATM (at 50MHz), % cycles waiting for mem-access:
  Quake: ~ 19.6 / ~ 54%
  Doom:  ~ 25.7 / ~ 42%
  ROTT:  ~ 10.0 / ~ 77%
*: There are a few areas of concern, but nothing yet seems to be
significant enough to explain why it is getting less than half the
framerate of Doom. I already went and rewrote a lot of the inner
renderer loops in ASM (these don't really seem to be the issue).
But it seems like something in the background is being problematic (well,
along with it still being rather buggy and crash-prone).
Note that the numbers are based on modeling the cache behavior in my
emulator, which seems to be generally pretty close (except that ROTT
seems to perform even worse when running on the FPGA hardware;
framerates aren't really all that much better than Quake...).
Of the 3D engines I have tested, ROTT also seems to be doing the worst in
terms of cache hit/miss ratio, but this wasn't really improved much by
making the caches bigger.
Much of what misses L1 in ROTT goes all the way to DRAM (~ 30% L2 hit).
Quake seems to have an ~ 66% L2 hit rate.
Doom gets ~ 83% L2 hit (but has a lower L1 hit rate vs Quake).
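As a sanity check on those numbers, and assuming the usual definition
combined = L1 + (1 - L1)*L2 (my assumption about how "combined" is being
computed), the implied L1 hit rates can be backed out like this:

  #include <stdio.h>

  /* combined = l1 + (1 - l1) * l2  =>  l1 = (combined - l2) / (1 - l2) */
  static double implied_l1(double combined, double l2)
  {
      return (combined - l2) / (1.0 - l2);
  }

  int main(void)
  {
      /* posted figures: combined (L1 D$ + L2) hit ratio and L2 hit rate */
      printf("Quake implied L1: %.1f%%\n", 100.0 * implied_l1(0.980, 0.66));
      printf("Doom  implied L1: %.1f%%\n", 100.0 * implied_l1(0.976, 0.83));
      printf("ROTT  implied L1: %.1f%%\n", 100.0 * implied_l1(0.913, 0.30));
      return 0;
  }

FWIW, the implied values (roughly 94% for Quake vs 86% for Doom) are at
least consistent with the earlier note that Doom has a lower L1 hit rate
than Quake.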
Ironically, ROTT is also the only one of the engines still using an
8-bit / 256-color renderer (mostly because the engine was doing palette
animation tricks which would break otherwise).
TODO: Maybe port Wolfenstein 3D and compare if it behaves similarly to ROTT.
...