Modern ignorance ---- apologies!

gareth evans

unread,

May 10, 2021, 5:12:25 PM5/10/21

to

Did my assembler apprenticeship 40 years ago on a PDP11/20,
long before any cache or pipelining.

In, say, the ARMv8 architecture, with its pipelining and
branch prediction, does one have to pad out instructions
following a branch so that the pipeline gets flushed?

Also, if the branch is not taken, should there be a string of NOPs
inlineso that if the branch is taken, there have not been
any speculative instructions already executed that would
affect the logic of a program?

MitchAlsup

unread,

May 10, 2021, 5:19:36 PM5/10/21

to

On Monday, May 10, 2021 at 4:12:25 PM UTC-5, gareth evans wrote:
> Did my assembler apprenticeship 40 years ago on a PDP11/20,
> long before any cache or pipelining.
>
> In, say, the ARMv8 architecture, with its pipelining and
> branch prediction, does one have to pad out instructions
> following a branch so that the pipeline gets flushed?
<

No, no modern machine is requiring NoOp padding.

>
> Also, if the branch is not taken, should there be a string of NOPs

> inline so that if the branch is taken, there have not been

> any speculative instructions already executed that would
> affect the logic of a program?
<

Only MIPS, SPARC, 88K, and Alpha had branch delay slots.

EricP

unread,

May 10, 2021, 6:12:40 PM5/10/21

to

Not Alpha - the ISA guarantees no branch or load delay slots.

Stephen Fuld

unread,

May 10, 2021, 6:24:25 PM5/10/21

to

AMD 29000 had branch delay slots.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Anton Ertl

unread,

May 11, 2021, 5:12:36 AM5/11/21

to

MitchAlsup <Mitch...@aol.com> writes:
>Only MIPS, SPARC, 88K, and Alpha had branch delay slots.

Alpha hadn't. 29K did AFAIK had. Various signal processors (IIRC
from TI, and Trimedia from Philips) elevated branch delay slot to an
art form: the had several branch delay slots, and several nops, some
of which where for filling multiple delay slots.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

unread,

May 11, 2021, 6:45:32 AM5/11/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> MitchAlsup <Mitch...@aol.com> writes:
>>Only MIPS, SPARC, 88K, and Alpha had branch delay slots.
>
> Alpha hadn't. 29K did AFAIK had. Various signal processors (IIRC
> from TI, and Trimedia from Philips) elevated branch delay slot to an
> art form: the had several branch delay slots, and several nops, some
> of which where for filling multiple delay slots.

IIRC, the 801 had branch instructions with and without delay slots,
so the compiler could chose.

Quadibloc

unread,

May 11, 2021, 8:10:49 AM5/11/21

to

On Tuesday, May 11, 2021 at 4:45:32 AM UTC-6, Thomas Koenig wrote:

> IIRC, the 801 had branch instructions with and without delay slots,
> so the compiler could chose.

So if useful instructions can be placed in them, the one with delay
slots is used; otherwise, no need to waste memory on no-operation
instructions.

John Savard

Ivan Godard

unread,

May 11, 2021, 4:31:10 PM5/11/21

to

On 5/11/2021 2:09 AM, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> Only MIPS, SPARC, 88K, and Alpha had branch delay slots.
>
> Alpha hadn't. 29K did AFAIK had. Various signal processors (IIRC
> from TI, and Trimedia from Philips) elevated branch delay slot to an
> art form: the had several branch delay slots, and several nops, some
> of which where for filling multiple delay slots.
>
> - anton
>

Well put, Anton.

I was on the Trimedia architecture team. There were three branch delay
slots, and you could put branches in the delay slots, which gave you
what amounted to an EXEC instruction. Asm for it was indeed an art form.

MitchAlsup

unread,

May 11, 2021, 5:50:55 PM5/11/21

to

It gave you a semi-literate EXC instruction, but if the EXCed instruction was
a branch, you just lost control of where you were.

Ivan Godard

unread,

May 11, 2021, 7:54:03 PM5/11/21

to

Isn't that true of any EXEC? I don't know what other systems did about
that - do you?

Stephen Fuld

unread,

May 11, 2021, 9:55:36 PM5/11/21

to

On the Univac 1100 series there is an Execute instruction. If the
instruction executed is a jump, control is transferred to the "jump to"
address. Furthermore, if the executed instruction is a call type,
control is transferred as before, but the return address is set to the
instruction after the Execute instruction.

And, BTW, the effective address computed by the Execute instruction (the
address of the instruction being Executed), was went through normal
address calculation, including optional register contents addition, etc.

BGB

unread,

May 14, 2021, 1:24:04 PM5/14/21

to

Adding to this list:
SuperH, TMS320, ...

I skipped out on them because:
They suck;
They would have exposed awkward implementation details (*);
Much of the time, they would just end up with NOPs anyways;
...

*: Say, one has an ISA where the bundling, handling of larger
instruction encodings, ...

May be handled as one of:
Read and decoded, and executed, all at once;
Read, decoded, and executed one instruction word at a time;
...

When I designed my ISA, I had a few design goals:
A single-wide core should still be able to execute wide-execute code;
A wider core should be able to execute code meant for a narrower core;
A core should be able to ignore the bundle encoding and use superscalar
if it wants;
...

Delay slots would have caused this to fall on its face.

For example, there is a single-wide profile for BJX2, which (provided
the same instructions are supported) can execute code intended to use WEX.

However, jumbo encodings use a different mechanism:
Jumbo prefixes are executed one at a time, initially behaving like NOPs;
On the final instruction, the larger immediate just sorta appears out of
nowhere.

Similarly, the emulator also checks and "lints" bundles for being
well-formed, and will decode instructions as "BREAK" if they violate
bundling rules (and thus lead to a behavioral divergence). This is
intended mostly to try to sanity-check the compiler output, but
sometimes detects ASM bugs as well (I recently tightened up a few of the
rules after I had noticed some evidence of behavioral divergence in the
Verilog implementation).

There are some features I had looked at which if-supported can also make
for possible issues:
Supporting a second (Load Only) memory-port in Lane 2;
Allowing scalar FPU ops to be executed in Lane 2 (*2);
...

Mostly in that, if used, these operations would break on a core which
does not support them (this code could only be safely executed in scalar
mode).

*2: I have experimentally allowed this. It does allow for a speedup in
floating-point intensive code, generally by allowing memory access or
similar to happen in parallel with FPU operations or "narrow" SIMD
operations (2xFP32 or 4xFP16). Still not formally allowed though.

Likewise, 3-wide code would have broken on a 2-wide core, and it would
have been possible to compose code on 2-wide which would have broken on
a 3-wide core. However, the differences in cost and capabilities favored
eliminating the 2-wide case entirely (the 3-wide was only slightly more
expensive; and was more capable in terms of what it could do with 2-wide
code).

Wider doesn't look likely at the moment, and it is likely if a 4+ core
were done, it would probably be based on OoO superscalar rather than VLIW.

I suspect for the most part, 3-wide is already mostly at the limits of
what level of usable ILP tends to exist for an in-order design, and that
going much wider would be essentially pointless.

...

Nevermind if DRAM bandwidth is still an issue:
Despite the much faster bus (and faster L2 speeds), the bandwidth to
external DRAM is still apparently only ~ 1/4 what is claimed for 72-pin
SIMM RAM on a 486 (seeing stuff claiming a 486 could memcpy 50MB/s in
external DRAM).

Thomas Koenig

unread,

May 14, 2021, 1:32:02 PM5/14/21

to

BGB <cr8...@gmail.com> schrieb:

> On 5/10/2021 4:19 PM, MitchAlsup wrote:
>> On Monday, May 10, 2021 at 4:12:25 PM UTC-5, gareth evans wrote:
>>> Did my assembler apprenticeship 40 years ago on a PDP11/20,
>>> long before any cache or pipelining.
>>>
>>> In, say, the ARMv8 architecture, with its pipelining and
>>> branch prediction, does one have to pad out instructions
>>> following a branch so that the pipeline gets flushed?
>> <
>> No, no modern machine is requiring NoOp padding.
>>>
>>> Also, if the branch is not taken, should there be a string of NOPs
>>> inline so that if the branch is taken, there have not been
>>> any speculative instructions already executed that would
>>> affect the logic of a program?
>> <
>> Only MIPS, SPARC, 88K, and Alpha had branch delay slots.
>>
>
> Adding to this list:
> SuperH, TMS320, ...

Don't forget HP-PA.

>
>
> I skipped out on them because:
> They suck;

I worked on HP-PA machines for a while, it was not too bad in
instructions per cycle.

MitchAlsup

unread,

May 14, 2021, 2:11:59 PM5/14/21

to

On Friday, May 14, 2021 at 12:24:04 PM UTC-5, BGB wrote:
> On 5/10/2021 4:19 PM, MitchAlsup wrote:
> > On Monday, May 10, 2021 at 4:12:25 PM UTC-5, gareth evans wrote:
> >> Did my assembler apprenticeship 40 years ago on a PDP11/20,
> >> long before any cache or pipelining.
> >>
> >> In, say, the ARMv8 architecture, with its pipelining and
> >> branch prediction, does one have to pad out instructions
> >> following a branch so that the pipeline gets flushed?
> > <
> > No, no modern machine is requiring NoOp padding.
> >>
> >> Also, if the branch is not taken, should there be a string of NOPs
> >> inline so that if the branch is taken, there have not been
> >> any speculative instructions already executed that would
> >> affect the logic of a program?
> > <
> > Only MIPS, SPARC, 88K, and Alpha had branch delay slots.
> >
> Adding to this list:
> SuperH, TMS320, ...
>
>
> I skipped out on them because:
> They suck;
> They would have exposed awkward implementation details (*);
> Much of the time, they would just end up with NOPs anyways;
> ...
>
>
> *: Say, one has an ISA where the bundling, handling of larger
> instruction encodings, ...
<

Say one decided that this wastes more entropy that one wants to allow:

>
> May be handled as one of:
> Read and decoded, and executed, all at once;
> Read, decoded, and executed one instruction word at a time;
> ...
<

The Std RISC-like ISA allows for wide implementations where the compiler
can be rather ignorant of how wide the machine happens to be.

>
> When I designed my ISA, I had a few design goals:
> A single-wide core should still be able to execute wide-execute code;
> A wider core should be able to execute code meant for a narrower core;
> A core should be able to ignore the bundle encoding and use superscalar
> if it wants;
<

Is it NOT simpler simply not to have wide execute encoding, this lets narrow
machines execute as they see fit, and for machines that are significantly
wide, decide for themselves how to bundle instructions ?

> ...
>
> Delay slots would have caused this to fall on its face.
<

Absofriggenlutely ! Delay slots are a bad idea that in the heat of battle 1st
generation RISC could not afford to lose the perf they apparently bought,
but that created a large burden on all future implementatnions.

MitchAlsup

unread,

May 14, 2021, 2:12:47 PM5/14/21

to

HP-PA was not nearly as bad as the machine that HP jumped onto
after HP-PA !!

Thomas Koenig

unread,

May 14, 2021, 3:21:26 PM5/14/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

Which was one of the worst aspects of Itanium.

It killed off too many RISC architectures. There would eventually
have been a consolidation, but x86_64 was not the right architecture
to consolidate to...

BGB

unread,

May 14, 2021, 3:52:19 PM5/14/21

to

Possibly, though most of the "hard" parts of the work for the compiler
still apply to optimizing code for an in-order superscalar.

Namely, the compiler needs to figure out which instructions it can
shuffle around, and try to get them in an order that minimizes
interlocks and dependencies between instructions.

After this, scanning along and putting instructions into bundles is more
an "icing on the cake" thing. Compiler mostly has to verify that the
instructions can be safely executed in parallel according to the ISA
rules, and then flag those that can.

>>
>> When I designed my ISA, I had a few design goals:
>> A single-wide core should still be able to execute wide-execute code;
>> A wider core should be able to execute code meant for a narrower core;
>> A core should be able to ignore the bundle encoding and use superscalar
>> if it wants;
> <
> Is it NOT simpler simply not to have wide execute encoding, this lets narrow
> machines execute as they see fit, and for machines that are significantly
> wide, decide for themselves how to bundle instructions ?

A wider machine can ignore the wide-execute encoding and do its own
bundling if it wants.

It costs ~ 1 bit of encoding space, but doesn't add any additional
complexity for an implementation that just wants to ignore it.

The wide execute encoding is more intended for processors that are
effectively too limited to be able to figure this out themselves (yet
still capable enough to be able to afford it at all).

Besides the 3-wide profile, the other profile is currently 1 wide.

This is mostly because the 3-wide profile has little hope of fitting
into an XC7S25 or similar.

However, on the XC7S50 and XC7A100, I can to do this, but not that much
more than this.

I have not yet tried putting it in an ECP5 or similar (but, dev boards
with Lattice FPGAs seem to be a lot harder to find on Amazon than ones
with Xilinx or Altera FPGAs).

>> ...
>>
>> Delay slots would have caused this to fall on its face.
> <
> Absofriggenlutely ! Delay slots are a bad idea that in the heat of battle 1st
> generation RISC could not afford to lose the perf they apparently bought,
> but that created a large burden on all future implementatnions.

Yeah.

MitchAlsup

unread,

May 14, 2021, 5:42:21 PM5/14/21

to

We actually found this not to be true in the case of the Mc 88120.
Pretty much anything compiled for Mc 88100 ran really well on the
'120--in fact we did not have access to a '120 compiler and used
the '100 simulator to pipe instructions to the '120 simulator.
<
Now could a bit more have been obtained--sure--but we were getting
5.99 ipc out of MATRIX300, and 2.05 ipc out of XLISP ! This seemed
to be "enough" for a 1991 design 1994 chip.

>
> Namely, the compiler needs to figure out which instructions it can
> shuffle around, and try to get them in an order that minimizes
> interlocks and dependencies between instructions.
<

We had the HW to do this--with a bit of annotation in the register
specification fields and the reservation station design. In a packet,
register specifiers ended up 6-bits wide, if the HoB was 0 the field
specified a register which was read from the file, forwarding,...
If the HoB was 1, we knew that the operand was delivered from
within the packet, and the field encoded the "slot" in the machine
that would deliver said result. We did similar tricks on the destination
register to say the result was delivered and became dead within the
packed.
<
The packet builder built packets AFTER the instructions retired from
the machine, so we have architectural (observed) order. unconditional
branches simply did not exist in the packet. Instruction under a shadow
of a branch were so annotated along with their position {Then-clause or
Else-clause}.

>
> After this, scanning along and putting instructions into bundles is more
> an "icing on the cake" thing. Compiler mostly has to verify that the
> instructions can be safely executed in parallel according to the ISA
> rules, and then flag those that can.
> >>
> >> When I designed my ISA, I had a few design goals:
> >> A single-wide core should still be able to execute wide-execute code;
> >> A wider core should be able to execute code meant for a narrower core;
> >> A core should be able to ignore the bundle encoding and use superscalar
> >> if it wants;
> > <
> > Is it NOT simpler simply not to have wide execute encoding, this lets narrow
> > machines execute as they see fit, and for machines that are significantly
> > wide, decide for themselves how to bundle instructions ?
<
> A wider machine can ignore the wide-execute encoding and do its own
> bundling if it wants.
<

But you burned those bits ! and thus wasted entropy.

>
> It costs ~ 1 bit of encoding space, but doesn't add any additional
> complexity for an implementation that just wants to ignore it.
>
>
> The wide execute encoding is more intended for processors that are
> effectively too limited to be able to figure this out themselves (yet
> still capable enough to be able to afford it at all).
>
> Besides the 3-wide profile, the other profile is currently 1 wide.
>
> This is mostly because the 3-wide profile has little hope of fitting
> into an XC7S25 or similar.
>
> However, on the XC7S50 and XC7A100, I can to do this, but not that much
> more than this.
<

I have been following this for a couple of years, and the 3-wide nature
appears to be due to the low clock frequency an a desire to run DOOM
and QUAKE at acceptable frame rates.

mac

unread,

Jun 3, 2021, 9:40:58 AM6/3/21

to

> Which was one of the worst aspects of Itanium.

> It killed off too many RISC architectures. There would eventually
> have been a consolidation, but x86_64 was not the right architecture
> to consolidate to...

So it *was* a success.

Quadibloc

unread,

Jun 4, 2021, 12:59:02 PM6/4/21

to

On Tuesday, May 11, 2021 at 7:55:36 PM UTC-6, Stephen Fuld wrote:

> On the Univac 1100 series there is an Execute instruction. If the
> instruction executed is a jump, control is transferred to the "jump to"
> address. Furthermore, if the executed instruction is a call type,
> control is transferred as before, but the return address is set to the
> instruction after the Execute instruction.

That is the way to do it. That behaves as if the Execute instruction _is_
the instruction at the effective address. A branch, therefore, quite properly
loses control - but a subroutine call does not, the return is still to the code
sequence where the Execute instruction was.

So my question would be, were there instructions that got it wrong, so that
one couldn't use the Execute instruction on a Jump to Subroutine instruction
without getting a useless and dangerous result?

John Savard

Quadibloc

unread,

Jun 4, 2021, 1:01:09 PM6/4/21

to

On Friday, May 14, 2021 at 1:21:26 PM UTC-6, Thomas Koenig wrote:

> It killed off too many RISC architectures. There would eventually
> have been a consolidation, but x86_64 was not the right architecture
> to consolidate to...

So true, but there was little choice then. Now we have a second chance,
ARM.

John Savard

Quadibloc

unread,

Jun 4, 2021, 1:08:40 PM6/4/21

to

On Friday, May 14, 2021 at 3:42:21 PM UTC-6, MitchAlsup wrote:
> On Friday, May 14, 2021 at 2:52:19 PM UTC-5, BGB wrote:

> > A wider machine can ignore the wide-execute encoding and do its own
> > bundling if it wants.

> But you burned those bits ! and thus wasted entropy.

Now _there's_ a place where my Concertina II architecture shines.

It wastes entropy _elsewhere_. Lots of it, no doubt. Having full base-index
addressing like a CISC, and banks of 32 registers like a RISC, which shouldn't
even be _possible_, it has to resort to "every trick in the book" to make the
instructions fit at all.

So there's no room for putting one bit at the front of each instruction as a
'break' bit!

Yet, I offer a _fancy_ scheme of wide-execute encoding, which adds not just
one, but *three*, bits to every instruction! How do I do this? Well, I take a tiny
sliver of the opcode space, and use it for a program block header - only if I use
a block header do I have the bits available to do the wide-execute encoding.

John Savard

Quadibloc

unread,

Jun 4, 2021, 1:13:23 PM6/4/21

to

Not from a _sales_ point of view, but from a _strategic_ point of view,
yes. However, IBM was big and powerful enough that it managed to
keep the Power PC around, and indeed, Oracle kept SPARC around.

If, therefore, those designs had so much technical merit that they
were threats to x86-64 *on that basis*, Itanium would also have been
a strategic failure. However, the world doesn't work that way. Instead,
ARM is the only serious threat to x86 dominance - because it found
a niche - smartphones - big enough to finance development of the
architecture to the extent that there are implementations across a
range of performance levels, some worthy of the desktop and server
space.

So ARM exists as a challenger *for the same reason* that x86 is the
undisputed champion... it has a base of installed software.

John Savard

MitchAlsup

unread,

Jun 4, 2021, 1:20:29 PM6/4/21

to

On Friday, June 4, 2021 at 12:08:40 PM UTC-5, Quadibloc wrote:
> On Friday, May 14, 2021 at 3:42:21 PM UTC-6, MitchAlsup wrote:
> > On Friday, May 14, 2021 at 2:52:19 PM UTC-5, BGB wrote:
>
> > > A wider machine can ignore the wide-execute encoding and do its own
> > > bundling if it wants.
>
> > But you burned those bits ! and thus wasted entropy.
> Now _there's_ a place where my Concertina II architecture shines.
>
> It wastes entropy _elsewhere_. Lots of it, no doubt. Having full base-index
> addressing like a CISC, and banks of 32 registers like a RISC, which shouldn't
> even be _possible_,
<

My 66000 fit this one in::
a) Mem Rd,[Rbase+IMM16] is one 32-bit instruction
b) Mem Rd,[Rbase+Rindex<<scale] is one 32-bit instruction
c) Mem Rd,[Rbase+Rindex<<scale+imm32] is a 64-bit instruction.

<
> it has to resort to "every trick in the book" to make the
> instructions fit at all.
>
> So there's no room for putting one bit at the front of each instruction as a
> 'break' bit!
<

This is where PRED and CARRY and VEC-LOOP come in.........

MitchAlsup

unread,

Jun 4, 2021, 1:20:59 PM6/4/21

to

Not much of a choice:
a) mud pie
b) mud pudding
>
> John Savard

MitchAlsup

unread,

Jun 4, 2021, 1:22:24 PM6/4/21

to

This seems, to me, to be the correct thought train on that subject.

MitchAlsup

unread,

Jun 4, 2021, 1:24:33 PM6/4/21

to

On Friday, June 4, 2021 at 12:13:23 PM UTC-5, Quadibloc wrote:
> On Thursday, June 3, 2021 at 7:40:58 AM UTC-6, mac wrote:
> > > Which was one of the worst aspects of Itanium.
>
> > > It killed off too many RISC architectures. There would eventually
> > > have been a consolidation, but x86_64 was not the right architecture
> > > to consolidate to...
>
> > So it *was* a success.
> Not from a _sales_ point of view, but from a _strategic_ point of view,
> yes. However, IBM was big and powerful enough that it managed to
> keep the Power PC around, and indeed, Oracle kept SPARC around.
>
> If, therefore, those designs had so much technical merit that they
> were threats to x86-64 *on that basis*, Itanium would also have been
> a strategic failure. However, the world doesn't work that way. Instead,
> ARM is the only serious threat to x86 dominance - because it found
> a niche - smartphones -
<

The niche started out to be "anything but PCs" not smartphones.

<
> big enough to finance development of the
> architecture to the extent that there are implementations across a
> range of performance levels, some worthy of the desktop and server
> space.
<

Cubic dollars beats clever architecture every time (except the first.)

John Dallman

unread,

Jun 4, 2021, 1:48:25 PM6/4/21

to

In article <3af33fa9-ee4e-4ab5...@googlegroups.com>,

jsa...@ecn.ab.ca (Quadibloc) wrote:

> However, IBM was big and powerful enough that it managed to
> keep the Power PC around, and indeed, Oracle kept SPARC around.

Oracle didn't acquire Sun until 2009-10, by which time it was clear that
Itanium wasn't going to become dominant.

IBM and Sun both embraced Itanium in its early days, but kept their own
architectures going. The Solaris port to IA-64 was announced and
cancelled; IBM was part of Project Monterey, but that was also cancelled.

The most obvious victims of Itanium were DEC with Alpha and SGI with
MIPS.

John

MitchAlsup

unread,

Jun 4, 2021, 2:52:55 PM6/4/21

to

All in all, it would have been less expensive to simply buy SGI and DEC.......
>
> John

Chris M. Thomasson

unread,

Jun 4, 2021, 4:33:17 PM6/4/21

to

On 5/10/2021 2:19 PM, MitchAlsup wrote:
> On Monday, May 10, 2021 at 4:12:25 PM UTC-5, gareth evans wrote:
>> Did my assembler apprenticeship 40 years ago on a PDP11/20,
>> long before any cache or pipelining.
>>
>> In, say, the ARMv8 architecture, with its pipelining and
>> branch prediction, does one have to pad out instructions
>> following a branch so that the pipeline gets flushed?
> <
> No, no modern machine is requiring NoOp padding.
>>
>> Also, if the branch is not taken, should there be a string of NOPs
>> inline so that if the branch is taken, there have not been
>> any speculative instructions already executed that would
>> affect the logic of a program?
> <
> Only MIPS, SPARC, 88K, and Alpha had branch delay slots.
>

side note... Putting a MEMBAR instruction in a branch delay slot on the
SPARC was a no-no!

unread,

Jun 5, 2021, 12:14:16 AM6/5/21

to

On 6/4/2021 5:39 PM, Quadibloc wrote:
> On Friday, June 4, 2021 at 3:58:07 PM UTC-6, Quadibloc wrote:
>
>> then how come ARM has a Thumb Mode which can be used to
>> write whole programs made out only of 16-bit instructions, and I
>> have been unable to approach this?
>
> My pure 16-bit instruction set is _almost_ complete, but it does
> leave *one* thing out that ARM's Thumb Mode includes. You
> have to switch to 32-bit instructions to do a subroutine jump.
>

I had followed after Thumb2 here...
0zzz..Dzzz: 16-bit (more or less)
Ezzz/ Fzzz: 32-bit (or 32+ bits)

There is a certain advantage to being able to freely mix 16 and 32 bit
instructions without needing some sort of mode-change.

In ASM code, it is also possible to use 16-bit ops alongside WEX
bundles, ... However, this isn't done by the main codegen mostly because
the "WEXifier" can't deal with 16-bit encodings (this would require the
compiler to take a different approach, and effectively split the backend
into multiple pieces; adding an intermediate stage which represents ASM
instructions in the form of arrays or linked-lists or similar).

I did experiment some with some 24-bit ops, but my current leaning is
"they aren't worth it".

Also there is a property with the ISA at present that, if one starts
disassembling at a random location within the middle of a 32-bit
instruction, then typically the decoded instruction stream will realign
itself within 1 or 2 instructions (or within 0 instructions by looking
at the prior 2 instruction words). The breaks between instructions are
also fairly easy to determine in a hex dump. Similarly, E or F in the
high order bits of the 2nd word in a 32-bit op is relatively uncommon.

However, if one throws 24 bit ops into the mix, this property goes out
the window (and a misaligned instruction-stream is basically confetti).

> While that would be easy enough to remedy, since such a subroutine
> jump would specify a particular register as the location of the return
> address, that would distort the architecture, essentially warping the
> choice of calling conventions.
>

I used a Link Register (LR).

Implicitly, some contexts are using R1 as a secondary/stand-in Link
Register, and some instructions like "JMP R1" have been defined to
behave as-if R1 were the link register.

Though, in this case, this is mostly related to a recent semantics tweak:
LR(47: 0): Contains the saved PC address;
LR(63:48): Contains some captured bits from SR.

Implicitly, PC now implicitly mirrors these same bits in the same layout
as LR.

RTS and RTSU will restore these bits into SR.

Likewise, "JMP R1" will also restore these bits, wheres "JMP Rn" with
any other register will ignore the high-order bits from the register
(and keep whatever is already in these bits).

The need for a secondary link register mostly comes up in prolog and
epilog compression, and some other related forms of short-fragment code
reuse.

The reason for preserving these bits is that it resolves some issues
with the semantics.

John Dallman

unread,

Jun 5, 2021, 5:07:57 AM6/5/21

to

In article <2021Jun...@mips.complang.tuwien.ac.at>,

an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> HP-PA.
>
> But it seems to me that SGI and DEC/Compaq had problems independent
> of IA-64. IA-64 was a welcome escape hatch to get rid of the
> no-longer-wanted legacy. I have not followed HP-PA enough to make
> similar guesses for that, or to make an opposite guess.

Itanium was the planned replacement for PA-RISC. When HP approached Intel
to partner on the project, their outline architecture was called "PA-RISC
v3".

HP felt they had some good ideas, but that fully developing a new
architecture good for a few decades would be too expensive to be
supported by the HP-UX and MPE businesses. They were quite correct about
the latter point. Itanium replacing PA-RISC as the HP-UX platform was one
of the things that went more or less as planned.

John

Tom Gardner

unread,

Jun 5, 2021, 5:14:33 AM6/5/21

to

Having the Itanic as the forward path allowed HP salesmen to
keep the PA-RISC customers (e.g. telecom) on board, and dissuade
them from defecting to alternatives.

For a while :)

gareth evans

unread,

Jun 5, 2021, 6:01:28 AM6/5/21

to

On 04/06/2021 23:39, Quadibloc wrote:
> ... since such a subroutine

> jump would specify a particular register as the location of the return

> address, ...

Shades of the nightmare that is / was the 1802 microprocessor where
you changed the register that was the current program counter,
a contrivance that suggests to me that that the designer of that
ISA had very limited programming experience, maybe simple programs
with no nested subroutines!

Quadibloc

unread,

Jun 5, 2021, 8:51:29 AM6/5/21

to

On Friday, June 4, 2021 at 5:34:02 PM UTC-6, MitchAlsup wrote:

> Architecture is as much about "what to leave out" as "what to leave in" !!

Well, I'm leaving that as an exercise for others.

That is, as an example, you are quite correct that VLIW is lightweight, and
so my mega-CISC instruction set, when it's complete, would be a poor fit.

But the spec is meant for partial implementations - by defining the opcodes for
different kinds of machines, though, they can all have the same opcodes for the
instructions they have in common.

So leaving out is an exercise for the implementor...

John Savard

Quadibloc

unread,

Jun 5, 2021, 8:56:22 AM6/5/21

to

On Saturday, June 5, 2021 at 3:14:33 AM UTC-6, Tom Gardner wrote:

> Having the Itanic as the forward path allowed HP salesmen to
> keep the PA-RISC customers (e.g. telecom) on board, and dissuade
> them from defecting to alternatives.

> For a while :)

Well, one can hardly fault HP for not having a crystal ball.

And with Xeon E7 v2, Intel brought over the special RAS features
that were confined to the Itanium to their x86 line. So HP has a
reasonable alternative to migrate to which is likely to be around
for some time.

If they don't like that, and are unhappy with their relationship to
Intel, there's always ARM.

John Savard

John Dallman

unread,

Jun 5, 2021, 9:02:19 AM6/5/21

to

In article <c47c2758-dd6d-4867...@googlegroups.com>,

jsa...@ecn.ab.ca (Quadibloc) wrote:

> And with Xeon E7 v2, Intel brought over the special RAS features
> that were confined to the Itanium to their x86 line. So HP has a
> reasonable alternative to migrate to which is likely to be around
> for some time.

It's been clear for some time that's what they're going to do, replacing
HP-UX with Linux.

> If they don't like that, and are unhappy with their relationship to
> Intel, there's always ARM.

They sacrificed their ability to get meaningfully cross with Intel about
Itanium when they sold their in-house Itanium design team to Intel.

John

Quadibloc

unread,

Jun 5, 2021, 9:07:25 AM6/5/21

to

On the ARM, a _fixed_ register is the program counter.

On my architecture, the program counter is the program counter, and not
one of the general registers.

But there is no stack.

Subroutine calls save the return address in one of the registers according
to whatever calling convention the user may choose. I imagine that a common
choice will be to place the return address in a register that the called program
uses as a base register, so that a return can be made by a conditional jump
instruction with that register as the base, a displacement of zero, and a condition
of always.

This is how it was done on the IBM System/360.

My architecture, however, _differs_ from the IBM System/360 in the following
respect: there are families of addressing modes which use different registers
as base registers.

Integer general registers 25 through 31 are the registers that may be used
as base registers with 16-bit displacements.

Integer general register 24 may be used as a base register with a 15-bit
displacement.

Integer general registers 9 through 15 may be used as base registers with
12-bit displacements.

Integer general registers 17 through 23 may be used as base registers with
20-bit displacements.

Integer general register 16 may be used as a base register with a 9-bit
displacement; this is the one used for memory-reference instructions in
16-bit instruction only code.

The idea is that instructions have three-bit fields to indicate a base register.

The use of register 24 as a base register allows programs using 12-bit
displacements for shorter address constants within instructions to have
a main 32,767 byte data area, following the scheme IBM used on the
unique 360/20 computer.

Basically, a few general registers are used as base registers in a typical
program, leaving most of the 32 integer general registers available for
computation. (Registers 1 through 7 are available for indexing.)

But any given integer general register may be used as a base register for
a memory area of *only one size*, to avoid certain kinds of confusion.

John Savard

unread,

Jun 5, 2021, 11:11:56 AM6/5/21

to

j...@cix.co.uk (John Dallman) writes:
>In article <2021Jun...@mips.complang.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> HP-PA.
>>
>> But it seems to me that SGI and DEC/Compaq had problems independent
>> of IA-64. IA-64 was a welcome escape hatch to get rid of the
>> no-longer-wanted legacy. I have not followed HP-PA enough to make
>> similar guesses for that, or to make an opposite guess.
>
>Itanium was the planned replacement for PA-RISC.

So it was the planned escape hatch to get rid of the no-longer-wanted
legacy? :-)

>When HP approached Intel
>to partner on the project, their outline architecture was called "PA-RISC
>v3".

And for Intel it also became the planned escape hatch to get rid of
the no-longer wanted legacy.

>HP felt they had some good ideas, but that fully developing a new
>architecture good for a few decades would be too expensive to be
>supported by the HP-UX and MPE businesses. They were quite correct about
>the latter point. Itanium replacing PA-RISC as the HP-UX platform was one
>of the things that went more or less as planned.

So I guess they also saw the writing on the wall that they would not
be able to afford the ever-increasing costs of designing
implementations of their private architecture. And indtead they fell
into the trap of designing an even more cost-intensive architecture.

John Levine

unread,

Jun 5, 2021, 11:19:31 AM6/5/21

to

According to Quadibloc <jsa...@ecn.ab.ca>:

>On the ARM, a _fixed_ register is the program counter.

That's what the PDP-11 did. I don't know if it was the first architecture to put the PC
in a register but it was certainly the most famous.

It made relative addressing the same as indexed addressing, using the PC as the index register.
It had a (R)+ and @(R)+ modes, use the register as the address and then inrement it by the address and
then in the second case, use that as the indirect address.
Hence (PC)+ was an immediate operand and @(PC)+ was absolute addressing.

Considering how expensive transistors were in 1969, it was a cute hack to simplify the architecture
and remove what otherwise would have been special cases.

The IBM 360 made you use a general register to address your code. In
retrospect that was a mistake which they fixed by adding relative
branches much later.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

unread,

Jun 5, 2021, 3:07:28 PM6/5/21

to

On Saturday, June 5, 2021 at 10:19:31 AM UTC-5, John Levine wrote:
> According to Quadibloc <jsa...@ecn.ab.ca>:
> >On the ARM, a _fixed_ register is the program counter.
> That's what the PDP-11 did. I don't know if it was the first architecture to put the PC
> in a register but it was certainly the most famous.
>
> It made relative addressing the same as indexed addressing, using the PC as the index register.
> It had a (R)+ and @(R)+ modes, use the register as the address and then inrement it by the address and
> then in the second case, use that as the indirect address.
> Hence (PC)+ was an immediate operand and @(PC)+ was absolute addressing.
<

It was a cute trick, but made pipelineing difficult.

John Levine

unread,

Jun 5, 2021, 4:46:05 PM6/5/21

to

According to MitchAlsup <Mitch...@aol.com>:

>On Saturday, June 5, 2021 at 10:19:31 AM UTC-5, John Levine wrote:
>> According to Quadibloc <jsa...@ecn.ab.ca>:
>> >On the ARM, a _fixed_ register is the program counter.
>> That's what the PDP-11 did. I don't know if it was the first architecture to put the PC
>> in a register but it was certainly the most famous.
>>
>> It made relative addressing the same as indexed addressing, using the PC as the index register.
>> It had a (R)+ and @(R)+ modes, use the register as the address and then inrement it by the address and
>> then in the second case, use that as the indirect address.
>> Hence (PC)+ was an immediate operand and @(PC)+ was absolute addressing.
><
>It was a cute trick, but made pipelineing difficult.

Was there ever a PDP-11 that was pipelined? This was quite a while ago.

Also, the address modes were all fixed places in the first word of the
instruction so I would think it'd be simple enough to recognize those
two and take the words out of the instruction stream. It has to do
roughly the same thing for the index words in indexed address modes.

Quadibloc

unread,

Jun 5, 2021, 5:01:23 PM6/5/21

to

On Saturday, June 5, 2021 at 7:16:34 AM UTC-6, Stephen Fuld wrote:

> In general, ISTM that having different sets of registers with different
> capabilities is not a good idea. You invariably want more of one type
> than there are and don't use all of another type. It also complicates
> the compiler.

That is certainly true. However, I don't think that what I have done is
_too_ bad in *that* respect, although it could be bad for other reasons.

There are 32 general registers. You will want to use maybe two or three
of them as base registers in normal code, so as not to take away too many
registers from normal use.

All that changes here is that, depending on the memory model you use,
those base registers may be drawn from a different group of 8 out of
the 32 registers.

John Savard

Paul A. Clayton

unread,

Jun 5, 2021, 5:12:21 PM6/5/21

to

On Friday, June 4, 2021 at 1:20:59 PM UTC-4, MitchAlsup wrote:
> On Friday, June 4, 2021 at 12:01:09 PM UTC-5, Quadibloc wrote:

>> On Friday, May 14, 2021 at 1:21:26 PM UTC-6, Thomas Koenig wrote:
>>
>>> It killed off too many RISC architectures. There would eventually
>>> have been a consolidation, but x86_64 was not the right architecture
>>> to consolidate to...

>> So true, but there was little choice then. Now we have a second chance,
>> ARM.
> <
> Not much of a choice:
> a) mud pie
> b) mud pudding

Are you aware that AArch64 (64-bit ARM ISA) is a relatively clean RISC and that ARM is no longer developing high-performance A-profile cores that support the 32-bit ISAs? (I think an abstraction layer would be better than a traditional ISA, but the business case for such is weaker. My 66000 is better than AArch64, but AArch64 does not seem like mud pudding.)

George Neuner

unread,

Jun 5, 2021, 5:17:37 PM6/5/21

to

And when the architecture (or convention) doesn't define something
important - like which register to use for a stack - you can end up
with competing and incompatible ABIs.

Recall the joys of trying to mix code from various 68K Macintosh
toolchains that differed in using A5 or A6 for the stack.

George

Stefan Monnier

unread,

Jun 5, 2021, 5:21:49 PM6/5/21

to

> Are you aware that AArch64 (64-bit ARM ISA) is a relatively clean RISC and
> that ARM is no longer developing high-performance A-profile cores that
> support the 32-bit ISAs?

Actually, it seems they also dropped support for the 32bit ISA in the
A5x line of CPUs (at least in the Cortex-A510).

Stefan

MitchAlsup

unread,

Jun 5, 2021, 6:14:14 PM6/5/21

to

On Saturday, June 5, 2021 at 3:46:05 PM UTC-5, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
> >On Saturday, June 5, 2021 at 10:19:31 AM UTC-5, John Levine wrote:
> >> According to Quadibloc <jsa...@ecn.ab.ca>:
> >> >On the ARM, a _fixed_ register is the program counter.
> >> That's what the PDP-11 did. I don't know if it was the first architecture to put the PC
> >> in a register but it was certainly the most famous.
> >>
> >> It made relative addressing the same as indexed addressing, using the PC as the index register.
> >> It had a (R)+ and @(R)+ modes, use the register as the address and then inrement it by the address and
> >> then in the second case, use that as the indirect address.
> >> Hence (PC)+ was an immediate operand and @(PC)+ was absolute addressing.
> ><
> >It was a cute trick, but made pipelineing difficult.
> Was there ever a PDP-11 that was pipelined? This was quite a while ago.
>
> Also, the address modes were all fixed places in the first word of the
> instruction so I would think it'd be simple enough to recognize those
> two and take the words out of the instruction stream. It has to do
> roughly the same thing for the index words in indexed address modes.
<

Not nearly as hard to pipeline as VAX, but you could have a mem=mem op mem
instruction (2 mem addresses)m along with 3 register updates (not in the same
instruction, though).
<
So to achieve 1 I/C instruction rates, you had to be setup to perform 2 memory
reads and 1 memory write per cycle and up to 2 register results per cycle.
<
Basically, PDP-11, VAX, and Intel 432 taught us what NOT TO DO ! More so
VAX and 432 than PDP-11.
<
The converse is that S/360 is "Not bad at all" in a pipeline sense.
<
PDP-11 taught us that memory mapped I/O was the thing to do and in a sense
was the father of PCI.....

MitchAlsup

unread,

Jun 5, 2021, 6:23:24 PM6/5/21

to

On Saturday, June 5, 2021 at 4:01:23 PM UTC-5, Quadibloc wrote:
> On Saturday, June 5, 2021 at 7:16:34 AM UTC-6, Stephen Fuld wrote:
>
> > In general, ISTM that having different sets of registers with different
> > capabilities is not a good idea. You invariably want more of one type
> > than there are and don't use all of another type. It also complicates
> > the compiler.
> That is certainly true. However, I don't think that what I have done is
> _too_ bad in *that* respect, although it could be bad for other reasons.
>
> There are 32 general registers. You will want to use maybe two or three
> of them as base registers in normal code, so as not to take away too many
> registers from normal use.
<

/*
*******************************************************************
* Kernel 13 -- 2-D PIC (Particle In Cell)
*******************************************************************
* DO 13 L= 1,Loop
* DO 13 ip= 1,n
* i1= P(1,ip)
* j1= P(2,ip)
* i1= 1 + MOD2N(i1,64)
* j1= 1 + MOD2N(j1,64)
* P(3,ip)= P(3,ip) + B(i1,j1)
* P(4,ip)= P(4,ip) + C(i1,j1)
* P(1,ip)= P(1,ip) + P(3,ip)
* P(2,ip)= P(2,ip) + P(4,ip)
* i2= P(1,ip)
* j2= P(2,ip)
* i2= MOD2N(i2,64)
* j2= MOD2N(j2,64)
* P(1,ip)= P(1,ip) + Y(i2+32)
* P(2,ip)= P(2,ip) + Z(j2+32)
* i2= i2 + E(i2+32)
* j2= j2 + F(j2+32)
* H(i2,j2)= H(i2,j2) + 1.0
* 13 CONTINUE
*/

MitchAlsup

unread,

Jun 5, 2021, 6:24:27 PM6/5/21

to

While I value your opinion, you, too, will come to a different conclusion in a few years.

BGB

unread,

Jun 5, 2021, 6:55:44 PM6/5/21

to

OK.

I don't have a call/return stack, but do have a special case in the
branch predictor to deal with RTSU. It is possible though that RTS and
RTSU could be merged though (with the relevant pipeline checks being
handled in hardware rather than by the compiler).

In my case it is R1 partly because that is what had ended up being used
already in this case. R1 was defined as being freely stomped and was not
otherwise used by the C ABI, but was also not used much as a stomp
register in practice, and some of the cases for which it existed as such
originally no longer exist.

While similar behavior could have also been applied to R0, the use of R0
was more often as an escape-case for encoding out-of-range branches.
Though, this later case is likely to end up subsumed into the newer "JMP
Abs48" encoding.

It can also be noted though that using R0 as a stomp register is itself
greatly reduced with the existence of a jumbo prefix.

And, at this point the main properties which R0 and R1 have are:
Special cases for Load/Store ops (eg: PC Rel, GBR Rel);
Not allowed with MOV.X or other 128-bit ops;
R0 and R1 are fixed registers for some ops (CPUID, LDTLB, ...);
R1 may be used as a stand-in for LR (or as a secondary LR);
...

BGB

unread,

Jun 5, 2021, 9:36:28 PM6/5/21

to

On 6/5/2021 2:07 PM, MitchAlsup wrote:
> On Saturday, June 5, 2021 at 10:19:31 AM UTC-5, John Levine wrote:
>> According to Quadibloc <jsa...@ecn.ab.ca>:
>>> On the ARM, a _fixed_ register is the program counter.
>> That's what the PDP-11 did. I don't know if it was the first architecture to put the PC
>> in a register but it was certainly the most famous.
>>
>> It made relative addressing the same as indexed addressing, using the PC as the index register.
>> It had a (R)+ and @(R)+ modes, use the register as the address and then inrement it by the address and
>> then in the second case, use that as the indirect address.
>> Hence (PC)+ was an immediate operand and @(PC)+ was absolute addressing.
> <
> It was a cute trick, but made pipelineing difficult.

The MSP430 also did this.

To me, it seemed functionally equivalent to a variable-length
instruction, just with a more awkward encoding.

AFAIK, the MSP430 was pipelined.

Quadibloc

unread,

Jun 6, 2021, 2:31:22 AM6/6/21

to

Your point is, perhaps, that experience has already proven that the people
responsible for ARM are sufficiently market-driven that as the years go on,
they will not resist the temptation to add features to future iterations of
their architecture?

John Savard

Michael S

unread,

Jun 6, 2021, 6:34:56 AM6/6/21

to

Does it mean that you expect that in few years (how many?) the best available (== shipping, sold and bought, *not* paper) high-performance CPU would be neither mud pie nor mud pudding nor mud pastry (i.e. POWER) ?

John Dallman

unread,

Jun 6, 2021, 6:37:23 AM6/6/21

to

In article <5c085551-bbe4-4e26...@googlegroups.com>,

jsa...@ecn.ab.ca (Quadibloc) wrote:

> Your point is, perhaps, that experience has already proven that the
> people responsible for ARM are sufficiently market-driven that as
> the years go on, they will not resist the temptation to add features
> to future iterations of their architecture?

They are engaged in doing this. ARM v8 has had levels v8.1 to v8.5, and
now v9 added.

A hopefully-significant difference is that they aren't doing this in the
manner of 32-bit ARM, where there were large numbers of unconnected
additions. Those were intended for embedded use, where broad software
compatibility was seen as irrelevant.

With 64-bit ARM, the baseline instruction set is quite capable, and the
additions are compatible: v8.3 includes v8.1 and v8.2, so software people
only have to cope with changes along a single path, AFAIK.

John

MitchAlsup

unread,

Jun 6, 2021, 10:39:23 AM6/6/21

to

No my point is/was that the words "relative" and "clean" are not appropriate
in describing AArch64.
>
> John Savard

Quadibloc

unread,

Jun 6, 2021, 11:07:03 AM6/6/21

to

On Sunday, June 6, 2021 at 8:39:23 AM UTC-6, MitchAlsup wrote:

> No my point is/was that the words "relative" and "clean" are not appropriate
> in describing AArch64.

Ah. So you find it to be _already_ as bad as x86, and just think that he will
eventually see that too. But then I would ask - how is it that it can be as bad
as x86 without being as _obviously_ bad as x86?

John Savard

BGB

unread,

Jun 6, 2021, 12:31:15 PM6/6/21

to

Weighing in with my opinions here...

It seems almost inevitable that an ISA will end up messy in some areas
absent frequent redesign efforts and which break binary compatibility.

Adding features, and trying to avoid breaking existing binaries, will
mean that new features don't necessarily mesh cleanly with older
features, ...

So, eg, AArch64 deals with 32 and 64 bit operations in a roughly similar
way to x86-64, but then has funkiness to deal with sign or zero
extending inputs (rather than keeping the values themselves in a sign or
zero extended form).

It also uses condition-code flags, which personally I am not
particularly a fan of (I consider the "1 bit predicate" to be a
different system).

One other drawback with most traditional RISC designs is that to run
multiple ops in parallel, it is necessary for the hardware to be able to
figure out whether or not parallel execution is possible, which is more
difficult and more resource intensive than encoding it explicitly.

Though it is a tradeoff, in that now compiled code needs to care about
things like how wide the target machine is, ...

So, a wider machine would likely need to fall back to using the
superscalar approach if compiling code specifically for that width is
not viable.

I was almost doing OK in my current ISA, though recent efforts towards
expanding the GPR space to R0..R63 haven't been super clean.

But, why? Mostly because when writing things like rasterizer and
edge-walking loops in my OpenGL rasterizer (in ASM) I was frequently
running into issues with running out of GPRs and needing to fall back to
(gasp) spilling variables to memory.

Initially, I did an extension which only worked on SIMD ops, which
worked well enough:
These ops only used even pairs, meaning one can do R0..R63 in a 5-bit
register field. However, this was still fairly limiting.

I had recently tried the Op24 experiment, where I carved off some of the
remaining (unused) 16-bit encoding space for 24-bit instructions (7zzz
and 9zzz). They are kinda terrible though, so I decided not to have them
in the main ISA.

More recently, I have reused the same encoding space for 'XGPR' encodings:
7wnm-ZeoZ (F0zz)
9wnm-Zeii (F2zz / F1zz)

Where the 'w' field is (wnmo):
w: WEX bit;
nm: Bit 5 for Rm/Rn
o: Bit 5 of Ro, or F2/F1 selector

Which expands the main GPR space, but with a few drawbacks:
It is not possible to predicate these instructions;
ADD?T R47, R31, R55 // N/E
The non-contiguous encoding space is fairly ugly;
Encodings are mutually exclusive with Op24.

At present, I will consider them an optional feature.

It is possible that predicated forms could have been added which also
encode the full R0..R63 range, but this would basically "eat" the entire
16-bit encoding space (and effectively require the instruction decoder
to be modal, which is undesirable).

MitchAlsup

unread,

Jun 6, 2021, 4:59:42 PM6/6/21

to

It is not as bad as x86 which is significantly better than 432.
<
But it is far from as clean as <say> MIPS in the R3000 days.
>
> John Savard

MitchAlsup

unread,

Jun 6, 2021, 5:20:08 PM6/6/21

to

On Sunday, June 6, 2021 at 11:31:15 AM UTC-5, BGB wrote:
> On 6/6/2021 5:34 AM, Michael S wrote:
> > On Sunday, June 6, 2021 at 1:24:27 AM UTC+3, MitchAlsup wrote:
> >> On Saturday, June 5, 2021 at 4:12:21 PM UTC-5, Paul A. Clayton wrote:
> >>> On Friday, June 4, 2021 at 1:20:59 PM UTC-4, MitchAlsup wrote:
> >>>> On Friday, June 4, 2021 at 12:01:09 PM UTC-5, Quadibloc wrote:
> >>>>> On Friday, May 14, 2021 at 1:21:26 PM UTC-6, Thomas Koenig wrote:
> >>>>>
> >>>>>> It killed off too many RISC architectures. There would eventually
> >>>>>> have been a consolidation, but x86_64 was not the right architecture
> >>>>>> to consolidate to...
> >>>>> So true, but there was little choice then. Now we have a second chance,
> >>>>> ARM.
> >>>> <
> >>>> Not much of a choice:
> >>>> a) mud pie
> >>>> b) mud pudding
> >>>
> >>> Are you aware that AArch64 (64-bit ARM ISA) is a relatively clean RISC and that ARM is no longer developing high-performance A-profile cores that support the 32-bit ISAs? (I think an abstraction layer would be better than a traditional ISA, but the business case for such is weaker. My 66000 is better than AArch64, but AArch64 does not seem like mud pudding.)
> >> <
> >> While I value your opinion, you, too, will come to a different conclusion in a few years.
> >
> >
> > Does it mean that you expect that in few years (how many?) the best available (== shipping, sold and bought, *not* paper) high-performance CPU would be neither mud pie nor mud pudding nor mud pastry (i.e. POWER) ?
> >
> Weighing in with my opinions here...
>
> It seems almost inevitable that an ISA will end up messy in some areas
> absent frequent redesign efforts and which break binary compatibility.
<

This is one of the BIG reasons My 66000 has about 30% of the major OpCode space
unallocated at present. This gives plenty of room to grow, without having to squirrel
things into odd corners.

>
> Adding features, and trying to avoid breaking existing binaries, will
> mean that new features don't necessarily mesh cleanly with older
> features, ...
<

This is another BIG reason I ended up preferring instruction modifiers over instructions
when extending the ISA to cover seldom used features (CARRY is the prime example).

>
>
> So, eg, AArch64 deals with 32 and 64 bit operations in a roughly similar
> way to x86-64, but then has funkiness to deal with sign or zero
> extending inputs (rather than keeping the values themselves in a sign or
> zero extended form).
<

I completely agree

>
> It also uses condition-code flags, which personally I am not
> particularly a fan of (I consider the "1 bit predicate" to be a
> different system).
<

I completely agree, here, too.

>
>
> One other drawback with most traditional RISC designs is that to run
> multiple ops in parallel, it is necessary for the hardware to be able to
> figure out whether or not parallel execution is possible, which is more
> difficult and more resource intensive than encoding it explicitly.
<

When I built the "wide" machine, we configured the instruction cache to
have more bits per instruction to encode the intra instruction dependencies.
If an instruction in the packet consumed a result as an operand, the register
specifier had the HoB set and the lower part of the field pointed at the
instruction which would deliver said result. When an instruction produces
a result that did not survive the end of the packet, we marked it as dead
so we did not even allocate a destination register for it.
<
This isolated all of this fairly hard BigO( n^3-to-n^3 ) into the packet builder;
which had execution window cycles to build packets and write the instruction
cache. When running out-of-Icache we only decoded 1 instruction per cycle
and used the inter-packet dependency logic to deal with dependencies. We
remembered these dependencies while the instruction(!) executed, and then
we used the info when packing instructions into the packet.
<
When running in-packet, you get 6-8 instructions per access and can transfer
control to 2 (or 3) different target addresses for the next fetch. There is no
arithmetic in the selection process, merely a "tag" from the packet and "take"
(or "agree") bits from the predictor(s).
<
Were I to do this today, I would do 2 instruction per cycle running out-of Icache.

>
> Though it is a tradeoff, in that now compiled code needs to care about
> things like how wide the target machine is, ...
<

The major thing the compiler should be concerned with is generating the
fewest instructions possible to calculate the semantics of the program;
AND the fewest control transfers possible (i.e., use PRED instead of Branch).

>
> So, a wider machine would likely need to fall back to using the
> superscalar approach if compiling code specifically for that width is
> not viable.
<

I want code compiled for a 1-wide in-order machine to run within spitting
distance (10%-ish) of the best code possible for the Great Big Out-of-Order
Machine.

>
>
>
>
> I was almost doing OK in my current ISA, though recent efforts towards
> expanding the GPR space to R0..R63 haven't been super clean.
>
> But, why? Mostly because when writing things like rasterizer and
> edge-walking loops in my OpenGL rasterizer (in ASM) I was frequently
> running into issues with running out of GPRs and needing to fall back to
> (gasp) spilling variables to memory.
<

MitchAlsup

unread,

Jun 7, 2021, 4:58:49 PM6/7/21

to

On Monday, June 7, 2021 at 3:07:21 PM UTC-5, Quadibloc wrote:
> On Monday, June 7, 2021 at 11:49:08 AM UTC-6, MitchAlsup wrote:
>
> > My 66000 has no (zero, nada, zilch) supervisor instructions.
<
> No doubt it uses some other mechanism for allowing the operating
> system to enforce stuff on user processes, such as doing everything
> by means of memory mapping. (How you avoid having supervisor
> instructions to *manage the memory map*, though, is perhaps the
> question.)
<

If your MMU tables allow you to read an write, then you can control
something--like enabling a thread to run, disabling a thread from
running,..... There will be some threads in the system with permission
to read and write your {or his or hers} MMU tables. These control access
to memory and thus to all things contained in memory--and all registers
and thread state have a "place" in memory. HW is more like a cache
in spilling and filling machine registers.

>
> My only potential concern might be that system software writers might
> experience difficulty in adapting to a novel approach.
<

The *visor merely has to enable the thread, and then disable itself and
the machine does the rest. No saving registers, and state, no restoring
registers and state. All state associated with a thread/task is HW
managed at context switch time and software managed at all other
times.
<
Since the *visor has access to the enable bit of the thread, it also has
access to the register file and can thus read arguments to system calls
or deliver return values back to the caller.
<
If/when a debugger assesses your register file and the thread is running,
the thread comes to a quiescent point, the register is read, and the thread
resumes. Same for thread state registers. Use the same address whether
the thread is running or quiescent.
>
> John Savard

Marcus

unread,

> But, errm, probably good enough...
<

I want to explain what I considered a clever trick in the encoding of
My 66000.
<
Normal instructions have a 6-bit Major OpCOde followed by a 5-bit
result register specifier.
<
I have a Branch on Bit (BB) instruction. In order to access all 64-bits of a
register, the result register specifier needs 6-bits, so I made the BB instruction
decode with 01100x; 0->lower 32-bits, 1->upper 32-bits. all well and good.
<
I also have a Predicate on Bit (PB) instruction, and since this instruction only
needs a 12-bit immediate I put it in the 000xxx Major OpCode space. This
instruction, too, needs access to all 64-bits of the operand register. So, I
made a restriction on the successive 12-bit immediate subspace (shifts
000111) that it not allocate the first instruction in that subspace. {And in the
documentation, I illustrate both sub-spaces on a single instruction placement
figure.}
<
So, PB0-31 is encoded as 000110 and PB32-63 is encoded 000111 matching
the decode of the BB instruction even though it is in a completely different
section of the decode space. In both cases, the minor OpCode space for shift
and predicates for PB is 000. So by makeing a tiny constraint on the shift
subspace one gains decoding simplifications unifying branches and predicates.
<
Thus once you know you are a BB or a PB you can use inst<26:21> as the bit
index.
<
Also, once the "take" or "no take" signal is computed, you can use this signal
to take the branch (or repair a misprediction) or to enable then (no take) or
else (take) clauses of the predicate shadow. Simplifying the branch resolution
logic.

BGB

unread,

Jun 8, 2021, 11:29:10 PM6/8/21

to

Hmm, not entirely sure I follow.

While I couldn't fit predication into the 32-bit XGPR ops, there was a
lot more space in Op64 land.

In my case, the idea for the expanded predication is that one of several
bits from SR could be used as the predicate for an instruction (as
opposed to an arbitrary bit from a GPR).

SR.T is the current primary predicate bit;
SR.S exists, but isn't used for much ATM, but makes the most sense as a
"second predicate";
P/Q/R/O are mostly used as the outputs of SIMD compare ops, or inputs
for SIMD Conditional-Select ops.

Things like CMPxx and TEST might also be extended to support alternate
destination bits.

MitchAlsup

unread,

Jun 9, 2021, 11:37:53 AM6/9/21

to

I contrived a way so that the branch on bit and the predicate on bit share certain
bits when they use different formats and had no way of sharing bits (other than
my contrivance) {and shouldn't have been able to share those bits}.
<
This means in practice that they instruction when executing can share gates.

Paul A. Clayton

unread,

Jun 10, 2021, 3:04:01 PM6/10/21

to

On Saturday, June 5, 2021 at 6:24:27 PM UTC-4, MitchAlsup wrote:
> On Saturday, June 5, 2021 at 4:12:21 PM UTC-5, Paul A. Clayton wrote:

[snip]

>> Are you aware that AArch64 (64-bit ARM ISA) is a relatively clean RISC and that ARM is no longer developing high-performance A-profile cores that support the 32-bit ISAs? (I think an abstraction layer would be better than a traditional ISA, but the business case for such is weaker. My 66000 is better than AArch64, but AArch64 does not seem like mud pudding.)
>
> While I value your opinion, you, too, will come to a different conclusion in a few years.

I think you overestimate my perceptiveness (not that I do not suffer from analysis paralysis) and my likelihood of exposure to information relevant to the tradeoffs of microprocessor design/manufacture and ISA. (The probability of my working with a team designing an ARM processor — even being a clerical worker associated with such a team — in the next five years seems rather close to zero.)

I am also not especially interested in looking deeply into conventional ISAs (which seem not to have acted on the increased importance of communication, even within a core). With NVIDIA buying ARM, the future of AArch64 seems murkier. (NVIDIA seems unlikely to have any reason to hurt M-profile and R-profile ARM, but there seems to be significant queasiness among other users of A-profile that AArch64 may lose market share fairly quickly. The transition has already introduced a hiring freeze, so some FUD may be reasonable.) Even a "clean" ISA can face accidental mismanagement as well as "worse is better" effects; Best-Available-Data decisions and worse engineering choices that are better economic choices are more common in computing than other areas of engineering (rate of change and network effects are significant contributors to this).

If x86 is mud pie and 32-bit ARM is mud pudding, I do not think AArch64 should be considered just as muddy as either of those (i.e., "relatively clean"). Even compared to 64-bit Power, AArch64 (from the little I have looked at either) seems cleaner (and it should be since the main legacy connection to 32-bit ARM is probably condition codes).