
Arguments for a Sane Instruction Set Architecture--5 years later


MitchAlsup

Nov 1, 2022, 6:53:02 PM
In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:

When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.
B) providing constants from the instruction stream;
...where a constant can be an immediate, a displacement, or both.
C) exact floating point arithmetics that get the Inexact flag
...correctly unmolested.
D) exception and interrupt control transfer should take no more
...than 1 cache line read followed by 4 cache line reads to the
...same page in DRAM/L3/L2 that are dependent on the first cache
...line read. Control transfer back to the suspended thread should
...be no longer than the control transfer to the exception handler.
E) Exception control transfer can transfer control directly to a
...user privilege thread without taking an excursion through the
...Operating System.
F) upon arrival at an exception handler, no state needs to be saved,
...and the "cause" of the exception is immediately available to the
...Exception handler.
G) Atomicity over a multiplicity of instructions and over a
...multiplicity of memory locations--without losing the
...illusion of real atomicity.
H) Elementary Transcendental functions are first class citizens of
...the instruction set, at least faithfully accurate, and perform
...at the same speeds as SQRT and DIV.
I) The "system programming model" is inherently:
...1) Virtual Machine
...2) Hypervisor + Supervisor
...3) multiprocessor, multithreaded
J) Simple applications can run with as little as 1 page of Memory
...Mapping overhead. An application like 'cat' can be run with
...a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
...and Register Files}
--------------------------------------------------------------------
<
I thought it might be fun to have a review of what came out of this::
<
At the time of that writing My 66000 ISA was still gestating in my
head--I was pretty much following the Mc 88000 Architecture in scope
and in format.
<
So; point by point::
<
A) negating operands prior to an arithmetic calculation.
1-operand instructions have sign control over the result and the operand
2-operand instructions have sign control over both operands
3-operand instructions have sign control over two operands
So: check
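(As a concrete illustration of what per-operand sign control buys: the four
fused multiply-add sign combinations collapse into one opcode. A minimal C
sketch using the standard fma() from <math.h>; the function names here are
illustrative only, not My 66000 mnemonics:)

    #include <math.h>

    /* The four multiply-add variants that sign control over two of the
       three operands covers with a single opcode. */
    double fmadd (double a, double b, double c) { return fma( a, b,  c); } /*  a*b + c */
    double fmsub (double a, double b, double c) { return fma( a, b, -c); } /*  a*b - c */
    double fnmadd(double a, double b, double c) { return fma(-a, b,  c); } /* -a*b + c */
    double fnmsub(double a, double b, double c) { return fma(-a, b, -c); } /* -a*b - c */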
>
B) providing constants from the instruction stream;
1-operand instructions have one <optional> immediate
2-operand instructions have one register and one <optional> immediate
3-operand instructions have two registers and one <optional> immediate
Loads have base register, index register and <optional> displacement
Stores have the same addressing, but the value being stored can be
.....either from a register or from an immediate.
Many immediates have auto-expanding characteristics::
one can FADD Rd,Rs1,#3 to add 3.0D0 using a single 1-word
instruction. 32-bit immediates for (double) FP calculations are auto-
expanded to 64-bits in operand delivery.
Similarly, integer instructions have ±5-bit immediates, signed 16-bit
immediates, 32-bit immediates and 64-bit immediates.
Memory references have 16-bit, 32-bit, and 64-bit displacements.
When Rbase = R0, IP is inserted for easy access to data relative to the
code stream.
So, big Check
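(For illustration only: the "auto-expanding" of a 32-bit FP immediate to a
64-bit operand is the usual lossless float-to-double widening. A minimal C
sketch, assuming the immediate is encoded as an IEEE-754 single; the helper
name is hypothetical:)

    #include <stdint.h>
    #include <string.h>

    /* Widen a 32-bit FP immediate from the instruction stream into the
       64-bit operand delivered to the FPU. Every float is exactly
       representable as a double, so no information is lost. */
    double widen_fp_imm32(uint32_t imm_bits)
    {
        float f;
        memcpy(&f, &imm_bits, sizeof f);   /* reinterpret the raw immediate bits */
        return (double)f;                  /* exact widening                     */
    }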
<
C) exact floating point arithmetics that get the Inexact flag
...correctly unmolested.
While CARRY provides access to these features (and the Inexact bit
.....gets set correctly), it is my current assessment that DBLE will be
.....of greater use and utility than the exact FP arithmetics.
So, little check

D) exception and interrupt control transfer should take no more
...than 1 cache line read followed by 4 cache line reads to the
...same page in DRAM/L3/L2 that are dependent on the first cache
...line read.
While the above is TRUE, it is different from what was expected. Yes, a context
switch still takes 5 cache-line reads, and a context switch can transpire
from any thread under any GuestOS to any other thread under any other
GuestOS; all of this is "perpetrated" by a "fixed function unit" far
from the cores of the chip.
This fixed function unit combines the thread being scheduled, the
customer thread asking for service, and the <appropriate> HyperVisor
data, "assembled" into a single message that effects a context switch.
<
E) Exception control transfer can transfer control directly to a
...user privilege thread without taking an excursion through the
...Operating System.
This remains elusive--while it is technically possible to set up
"state" such that the above happens, it requires that each such thread
run under its own unique GuestOS. However, one can configure a rather
normal GuestOS so that the exception dispatcher transfers control
to a user-level exception handler in 15-ish instructions.
So, medium check
<
F) upon arrival at an exception handler, no state needs to be saved,
...and the "cause" of the exception is immediately available to the
...Exception handler.
The above is TRUE and also comes with the property that multiple
exceptions can be logged onto a handler without Interrupt or
Exception disablement.
No state needs to be saved: Check
No state needs to be loaded: Check
Pertinence arrives with control: Check
Control arrives on affinitized core: Check
Control arrives at proper priority: Check
Control arrives with proper "privilege": Check
Hard Real Time supported: Maybe
Moderate Real Time Supported: Check
No extraneous excursions through OS: Check.
Overall: Big check
<
G) Atomicity over a multiplicity of instructions and over a
...multiplicity of memory locations--without losing the
...illusion of real atomicity.
Up to 8 cache lines participate in an ATOMIC event.
Multiple locations in each line may have state altered.
There is direct access to whether interference has transpired.
Software can use interference to drive down future interference.
Hardware can transfer control if ATOMICITY has been violated.
Essentially ANY atomic-primitive studied in academia or provided
by industry can be synthesized.
So, medium check
<
H) Elementary Transcendental functions are first class citizens of
...the instruction set, at least faithfully accurate, and perform
...at the same speeds as SQRT and DIV.
Transcendental functions operate at about the latency of FDIV
ln2, ln2P1, exp2, exp2M1 14 cycles
ln, ln10, exp, exp10 <and cousins> 18 cycles
sin, sinpi, cos, cospi 19 cycles {including Payne and Hanek argument reduction}
tan, atan 19 or 38 cycles
power 35 cycles
23 Transcendental instructions are available in (float) and (double) forms.
(float will be around 9 cycles)
So, reasonable check.
<
I) The "system programming model" is inherently:
...1) Virtual Machine
...2) Hypervisor + Supervisor
...3) multiprocessor, multithreaded
It is not only the above, but even moderately hard real time is built in.
Interrupts are directed at threads not cores
Deferred Procedure Calls are single instruction events
Most handler->handler control transfers do not need an excursion through
the OS scheduler.
Basically, if you have fewer than 1024 processes in a Linux system, the
lower-level scheduler consumes no cycles on a second-by-second basis.
A context switch between threads under different hypervisors takes the same
10 cycles as a context switch between threads under the same GuestOS.
Conventional machines might take 1,000 cycles for a within-GuestOS
context switch and 10,000 cycles for a between-GuestOS context switch;
given 1,000 context switches per second, this accounts for a fraction
of a percent speedup.
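(A quick back-of-the-envelope check of that last claim, assuming a 3 GHz core,
a number not taken from the post: 1,000 switches/s saving roughly 1,000 to
10,000 cycles each is 1M to 10M cycles/s, i.e. about 0.03% to 0.3% of one
core-second -- indeed a fraction of a percent.)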
So, moderate-big check
<
J) Simple applications can run with as little as 1 page of Memory
...Mapping overhead.
Achievable even when different areas {.text, .data, .bss, .stack, ...}
are separated by GB or even TB.
So, check
-------------------------------------------------------------------
That is all.

Quadibloc

Nov 1, 2022, 10:38:03 PM
On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:

> F) upon arrival at an exception handler, no state needs to be saved,

It's not surprising that the My 66000 architecture you've designed
meets your criteria for what would constitute a "sane" computer
architecture. One would hardly expect you to deliberately design
an insane one!

Of course, that _could_ be said to be exactly what I've done in my
_original_ Concertina architecture - since I just took pretty much
everything that was included in a computer architecture historically,
and threw it in. But then, it was intended to explain how computers
work, not to be used for practical work.

This particular point of what you consider sanity, however, intrigued
me enough to wish to comment.

On one of those ancient, primitive computers without multiple threads
or multiple cores, if there's an interrupt, or an exception, like divide by
zero, and it is desired, after something is done, to return to the program
that was running, obviously state has to be saved.

So it seems to me that what's really being discussed here is what you
would *refer to* as an "exception handler", or that it's an issue having
to do with operating system design, not hardware design.

I can certainly agree, *if* that is the issue being addressed here, that
the code, presumably mathematical in nature, that deals with stuff
like overflows, divides by zero, and so on and so forth, perhaps ought
to be separated from the code that does stuff like save all the state
(and somehow get access to a place in memory to _put_ the state
without destroying state in the process... that has to be assisted by
hardware in 360-like architectures that depend on base registers).

Ah, but the relation of that to hardware is no doubt what is in your
previous point:

> E) Exception control transfer can transfer control directly to a
> user privilege thread without taking an excursion through the
> Operating System.

Hmm. When I first encountered that, my reaction was, "Oh, yeah.
A way to make PL/I implementations more efficient". Now I'm
starting to see how totally unfair I was.

The "insane" thinking that pervaded computer design, of course,
was that an exception is just like an interrupt, so of course it
goes into the rarefied space of the kernel. What else has access
to the shared library and can also return to any user's program?

Hey, wait a moment, the shared library is accessible to every user,
and each user could return to one of *his own* programs, which is
all that's needed.

So since your goal is *obviously* achievable, why the insanity?

Well, another thing that could happen is to do a core dump and return
to the command prompt. So they didn't want to have _two_ ways to
deal with an exception, and a mechanism for letting the user specify
where to go when an exception happens...

well, it's okay to have one as an operating system service, but if
the user could use an unprivileged instruction to do that, when those
same exceptions could also cause going to the operating system
for a core dump instead, that would be a... massive security hole!

They weren't wrong, back then.

The IBM System/360, a huge mainframe, which was even microprogrammed,
so that it could perform instructions that did things too fancy to wire into
even such a big machine... didn't have instructions that did log or trig
functions. Back then, the idea would even have been laughed at!

And there you are. Why are most of today's architectures "insane"?

Because we've kept copying what worked back in the 1960s, without
questioning whether or not we might make changes to it in response to
technical advances.

John Savard

Quadibloc

Nov 2, 2022, 4:40:18 AM
On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:

> A) negating operands prior to an arithmetic calculation.

Although when I first heard about it, I regarded it as too ambitious,
something on a VAX-like level of CISC-ishness, I have now decided
to defer once again to your superior knowledge.

So, in addition to allowing full immediates, I have added _some_
support for it, even within the basic set of 32-bit instructions.

What I allow is:

for the basic integer and floating operations, there is a three-address
instruction format,

with the same register restrictions as 16-bit instructions (all three
registers must be in the same group of 8; now this saves 4 bits);

the two input operands - not the output result, I couldn't fit that in -
in the case of integer values, can be complemented, incremented,
or both (so replacing n by -n requires both a complement and
an increment) - in the case of floating values, the sign can be
changed, and the value can be halved, or both. (Inspired by the
forgotten S/360 halve instruction, since that is a useful thing
that's hard to do otherwise.)

So I've tried to inject a _little_ sanity into my wild and crazy ISA.

John Savard

BGB

Nov 2, 2022, 2:05:22 PM
On 11/2/2022 3:40 AM, Quadibloc wrote:
> On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:
>
>> A) negating operands prior to an arithmetic calculation.
>
> Although when I first heard about it, I regarded it as too ambitious,
> something on a VAX-like level of CISC-ishness, I have now decided
> to defer once again to your superior knowledge.
>
> So, in addition to allowing full immediates, I have added _some_
> support for it, even within the basic set of 32-bit instructions.
>

It could make sense for floating-point operations (XOR'ing the sign bits).

For integer ops, No.

IME, 'NEG' is uncommon, and negating a twos complement number has
similar latency to an adder.


> What I allow is:
>
> for the basic integer and floating operations, there is a three-address
> instruction format,
>
> with the same register restrictions as 16-bit instructions (all three
> registers must be in the same group of 8; now this saves 4 bits);
>
> the two input operands - not the output result, I couldn't fit that in -
> in the case of integer values, can be complemented, incremented,
> or both (so replacing n by -n requires both a complement and
> an increment) - in the case of floating values, the sign can be
> changed, and the value can be halved, or both. (Inspired by the
> forgotten S/360 halve instruction, since that is a useful thing
> that's hard to do otherwise.)
>
> So I've tried to inject a _little_ sanity into my wild and crazy ISA.
>

I looked recently into trying to design a 16-bit ISA primarily based
around 8 registers.

It became pretty obvious, pretty quickly, that 8 is not sufficient.


Though, I guess someone could go the route of trying to implement an x86
core...
Not exactly "sane", but could be doable in principle.

Say:
Prefixes are handled as separate special instructions;
These set internal flag bits, which are reset when decoding a non-prefix
instruction;
Possibly, there is an internal "hidden" ROM which implements any
non-core instructions, and also manages emulating a lot of the hardware
interfaces (likely exists as an internal interrupt mode, with the
x86-style interrupts, etc, being effectively faked via internal firmware);
...

Would probably be a waste of effort though.

...


Stephen Fuld

Nov 2, 2022, 2:15:16 PM
On 11/2/2022 11:05 AM, BGB wrote:
> On 11/2/2022 3:40 AM, Quadibloc wrote:
>> On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:
>>
>>> A) negating operands prior to an arithmetic calculation.
>>
>> Although when I first heard about it, I regarded it as too ambitious,
>> something on a VAX-like level of CISC-ishness, I have now decided
>> to defer once again to your superior knowledge.
>>
>> So, in addition to allowing full immediates, I have added _some_
>> support for it, even within the basic set of 32-bit instructions.
>>
>
> It could make sense for floating-point operations (XOR'ing the sign bits).
>
> For integer ops, No.
>
> IME, 'NEG' is uncommon, and negating a twos complement number has
> similar latency to an adder.

But having the capability allows you to eliminate the subtract op
code(s), which occur more frequently.

In that sense it is an entropy question of whether to use the bits for
more op codes or for additional instruction bit(s) to indicate negate.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Quadibloc

Nov 2, 2022, 2:38:15 PM
On Wednesday, November 2, 2022 at 12:15:16 PM UTC-6, Stephen Fuld wrote:

> In that sense it is an entropy question of whether to use the bits for
> more op codes or for an additional instruction bit(S) to indicate negate.

I had to resort to the expedient of including this feature only in a very restricted
form of the operate instruction, since I pretty much divide up my opcode
space in the diametric opposite of the way Mitch does it.

John Savard

MitchAlsup

Nov 2, 2022, 2:39:16 PM
On Tuesday, November 1, 2022 at 9:38:03 PM UTC-5, Quadibloc wrote:
> On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:
>
> > F) upon arrival at an exception handler, no state needs to be saved,
<
> It's not surprising that the My 66000 architecture you've designed
> meets your criteria for what would constitute a "sane" computer
> architecture. One would hardly expect you to deliberately design
> an insane one!
>
> Of course, that _could_ be said to be exactly what I've done in my
> _original_ Concertina architecture - since I just took pretty much
> everything that was included in a computer architecture historically,
> and threw it in. But then, it was intended to explain how computers
> work, not to be used for practical work.
>
> This particular point of what you consider sanity, however, intrigued
> me enough to wish to comment.
>
> On one of those ancient, primitive computers without multiple threads
> or multiple cores, if there's an interrupt, or an exception, like divide by
> zero, and it is desired, after something is done, to return to the program
> that was running, obviously state has to be saved.
<
State does get saved (and restored later) but no SW instructions are
executed in doing that.
>
> So it seems to me that what's really being discussed here is what you
> would *refer to* as an "exception handler", or that it's an issue having
> to do with operating system design, not hardware design.
<
It is HW design that OS uses.
>
> I can certainly agree, *if* that is the issue being addressed here, that
> the code, presumably mathematical in nature, that deals with stuff
> like overflows, divides by zero, and so on and so forth, perhaps ought
> to be separated from the code that does stuff like save all the state
> (and somehow get access to a place in memory to _put_ the state
> without destroying state in the process... that has to be assisted by
> hardware in 360-like architectures that depend on base registers).
<
It is a structure indexed by both Hypervisor index and GuestOS thread
index. The Base register (if you will) is accessible only in configuration
space and is not modified unless a GuestOS crashes.
>
> Ah, but the relation of that to hardware is no doubt what is in your
> previous point:
> > E) Exception control transfer can transfer control directly to a
> > user privilege thread without taking an excursion through the
> > Operating System.
>
> Hmm. When I first encountered that, my reaction was, "Oh, yeah.
> A way to make PL/I implementations more efficient". Now I'm
> starting to see how totally unfair I was.
>
> The "insane" thinking that pervaded computer design, of course,
> was that an exception is just like an interrupt, so of course it
> goes into the rarefied space of the kernel. What else has access
> to the shared library and can also return to any user's program?
>
> Hey, wait a moment, the shared library is accessible to every user,
> and each user could return to one of *his own* programs, which is
> all that's needed.
>
> So since your goal is *obviously* achievable, why the insanity?
<
It was solvable but would require another level of indirection, one
that I chose not to encode and perform.
>
> Well, another thing that could happen is do a core dump and return
> to the command prompt. So they didn't want to have _two_ ways to
> deal with an exception, and a mechanism for letting the user specify
> where to go when an exception happens...
>
> well, it's okay to have one as an operating system service, but if
> the user could use an unprivileged instruction to do that, when those
> same exceptions could also cause going to the operating system
> for a core dump instead, would be a... massive security hole!
<
Which brings up the question of "what is a core dump" today: a) memory,
b) core-state?
>
> They weren't wrong, back then.
>
> The IBM System/360, a huge mainframe, which was even microprogrammed,
> so that it could perform instructions that did things too fancy to wire into
> even such a big machine... didn't have instructions that did log or trig
> functions. Back then, the idea would even have been laughed at!
>
> And there you are. Why are most of today's architectures "insane"?
<
The "sea of control registers" privileged architecture; for example.
Having to use "IPIs" to communicate certain things to other 'cores'.

MitchAlsup

Nov 2, 2022, 2:42:19 PM
On Wednesday, November 2, 2022 at 1:05:22 PM UTC-5, BGB wrote:
> On 11/2/2022 3:40 AM, Quadibloc wrote:
> > On Tuesday, November 1, 2022 at 4:53:02 PM UTC-6, MitchAlsup wrote:
> >
> >> A) negating operands prior to an arithmetic calculation.
> >
> > Although when I first heard about it, I regarded it as too ambitious,
> > something on a VAX-like level of CISC-ishness, I have now decided
> > to defer once again to your superior knowledge.
> >
> > So, in addition to allowing full immediates, I have added _some_
> > support for it, even within the basic set of 32-bit instructions.
> >
> It could make sense for floating-point operations (XOR'ing the sign bits).
>
> For integer ops, No.
<
For integer Ops, I invert the operand and concatenate a 1 in as carry in.
For logicals, I invert the operand and the carry in is ignored.
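A minimal C illustration of why that works (just the arithmetic identity
a - b == a + ~b + 1; the function name is made up, this is not My 66000
encoding):

    #include <stdint.h>
    #include <assert.h>

    /* Subtraction through the adder: invert the second operand and force
       the carry-in to 1. */
    uint32_t sub_via_adder(uint32_t a, uint32_t b)
    {
        uint32_t carry_in = 1;
        return a + ~b + carry_in;          /* == a - b (mod 2^32) */
    }

    int main(void)
    {
        assert(sub_via_adder(100u, 42u) == 58u);
        assert(sub_via_adder(0u, 1u) == 0xFFFFFFFFu);   /* wraps, as expected */
        return 0;
    }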

Scott Lurndal

Nov 2, 2022, 3:03:44 PM
As opposed to? Is there a "spec" for MY 66000 somewhere?

>Having to use "IPIs" to communicate certain things to other 'cores'.

How do you communicate certain things (like a page table update) to
other concurrent threads of execution sharing a virtual address space?

When we added SMP to the Burroughs medium systems, I specified the
IPC instruction for inter-processor communications (and control);
ultimately an interrupt of some form on the destination core is required.

What is your alternative to an interrupt (whether implicit like
the IPC instruction or explicit like the intel IPI or ARM SGI)?

MitchAlsup

Nov 2, 2022, 4:02:45 PM
On Wednesday, November 2, 2022 at 2:03:44 PM UTC-5, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Tuesday, November 1, 2022 at 9:38:03 PM UTC-5, Quadibloc wrote:
>
> >> And there you are. Why are most of today's architectures "insane"?
> ><
> >The "sea of control registers" privileged architecture; for example.
<
> As opposed to? Is there a "spec" for MY 66000 somewhere?
<
Yes. The part you would be wanting is still under NDA. That is "System
Architecture." But you would likely also want "Principles of Operation"
(A.K.A. ISA), and "Software" {multi-precision arithmetic, ATOMICs, ABI,...}
to make sense of the former.
<
> >Having to use "IPIs" to communicate certain things to other 'cores'.
<
> How do you communicate certain things (like a page table update) to
> other concurrent threads of execution sharing a virtual address space?
<
TLBs are coherent, so writing to the page tables causes the right things
to happen automagically.
>
> When we added SMP to the Burroughs medium systems, I specified the
> IPC instruction for inter-processor communications (and control);
> ultimately an interrupt of some form on the destination core is required.
<
Ultimately a thread of sufficient privilege running on said core is required;
I completely agree. But control does not have to pass through any interrupt
service routine, or interrupt service dispatcher for control to arrive; nor does
there need to be any APIC (or similar) structure involved. In fact, no excursion
through the OS is required other than running of the worker thread in an
address space where he can do what is required.
<
But, in addition, control can arrive with a small message in the same registers
used by ABI to send arguments and receive results from subroutines. So, it
does not smell "that much" like an interrupt, but more like a messaging
apparatus.
>
> What is your alternative to an interrupt (whether implicit like
> the IPC instruction or explicit like the intel IPI or ARM SGI)?
<
Queued Messages
<
You can argue that these use essentially the same mechanism that an
interrupt would use and that the distinction is only one of spelling. I,
respectfully, disagree. An interrupt is a 1-way mechanism, my messages
are a 2-way mechanism (call-thread-entry-point(arguments) and later
the called-thread performs return {caller-results} ) closing the loop.

Ivan Godard

Nov 2, 2022, 6:08:53 PM
On 11/2/2022 12:03 PM, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Tuesday, November 1, 2022 at 9:38:03 PM UTC-5, Quadibloc wrote:
>
>>> And there you are. Why are most of today's architectures "insane"?
>> <
>> The "sea of control registers" privileged architecture; for example.
>
> As opposed to? Is there a "spec" for MY 66000 somewhere?
>
>> Having to use "IPIs" to communicate certain things to other 'cores'.
>
> How do you communicate certain things (like a page table update) to
> other concurrent threads of execution sharing a virtual address space?
>
> When we added SMP to the Burroughs medium systems, I specified the
> IPC instruction for inter-processor communications (and control);
> ultimately an interrupt of some form on the destination core is required.


That was called HEYU on the B6500 - I was always fond of that mnemonic.

Tim Rentsch

Nov 2, 2022, 7:35:20 PM
Quadibloc <jsa...@ecn.ab.ca> writes:

> The IBM System/360, a huge mainframe, which was even
> microprogrammed, so that it could perform instructions that did
> things too fancy to wire into even such a big machine... [...]

Some models of System/360 used microcode, but not all.

Furthermore microcode was used principally to accommodate a large
range on the price/performance spectrum. Different models of
System/360 (that had microcode) had different micro architectures.
It wasn't that the instruction set was too fancy; rather it was
cheaper, or more cost effective, in some cases, to implement with
microcode than directly in hardware.

Quadibloc

Nov 2, 2022, 7:52:42 PM
On Wednesday, November 2, 2022 at 4:08:53 PM UTC-6, Ivan Godard wrote:

> That was called HEYU on the B6500 - I was always fond of that mnemonic.

As long as you're not fond of the sort of content that is available on the
cable TV service called "hayu", I'm not going to quarrel with your taste!

John Savard

Quadibloc

Nov 2, 2022, 7:56:01 PM
On Wednesday, November 2, 2022 at 5:35:20 PM UTC-6, Tim Rentsch wrote:

> Some models of System/360 used microcode, but not all.

Yes, you're quite correct. For purposes of discussion, since I wasn't
really addressing computer history, I didn't want to note the exceptions
(the model 75 and the model 91 and the other models derived from it,
the 95 and 195).

> Furthermore microcode was used principally to accommodate a large
> range on the price/performance spectrum. Different models of
> System/360 (that had microcode) had different micro architectures.
> It wasn't that the instruction set was too fancy; rather it was
> cheaper, or more cost effective, in some cases, to implement with
> microcode than directly in hardware.

Yes, but microcode certainly made a more complex instruction set
possible, without much in the way of added cost, if it was desirable.
And at the time, putting log and trig functions in the hardware simply
didn't occur to anyone as being appropriate.

John Savard

MitchAlsup

Nov 2, 2022, 8:08:19 PM
On Wednesday, November 2, 2022 at 6:35:20 PM UTC-5, Tim Rentsch wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
>
> > The IBM System/360, a huge mainframe, which was even
> > microprogrammed, so that it could perform instructions that did
> > things too fancy to wire into even such a big machine... [...]
>
> Some models of System/360 used microcode, but not all.
<
As far as I know, only 360/91 was NOT microprogrammed.
Although there are rumors of a few 360/95s that made it out the door.
The vast majority of /95s were System 370.
>
> Furthermore microcode was used principally to accommodate a large
> range on the price/performance spectrum. Different models of
> System/360 (that had microcode) had different micro architectures.
<
The missing key word is "vastly" as in vastly different microarchitectures.
One (360/20) only had one 8-bit adder.
<
> It wasn't that the instruction set was too fancy; rather it was
> cheaper, or more cost effective, in some cases, to implement with
> microcode than directly in hardware.
<
Things like edit-and-mark, Decimal-add {sub, mul, div}, Start-IO,...

Quadibloc

Nov 3, 2022, 12:02:31 AM
On Wednesday, November 2, 2022 at 6:08:19 PM UTC-6, MitchAlsup wrote:

> As far as I know, only 360/91 was NOT microprogrammed.
> Although there are rumors of a few 360/95s that made it out the door.
> The vast majority of /95s were System 370.

IBM originally planned to make the Model 75 (then the Model 70) microcoded.
But they couldn't find a fast enough memory for the microcode to meet its
goals, so they had to make it hardwired.

The Model 91 was hardwired. Most Model 195s were System 370, but some
360/195s did make it out the door. As for the Model 95, that was the version
of the 91 that had thin-film memory. Those weren't sold generally; two went
to NASA.

John Savard

Bill Findlay

Nov 3, 2022, 1:50:24 PM
On 2 Nov 2022, Ivan Godard wrote
(in article <tjuppi$18r3j$1...@dont-email.me>):
Yes, but it's not as fun as EIEIO!

--
Bill Findlay

Tim Rentsch

Nov 4, 2022, 9:00:14 AM
MitchAlsup <Mitch...@aol.com> writes:

> On Wednesday, November 2, 2022 at 6:35:20 PM UTC-5, Tim Rentsch wrote:
>
>> Quadibloc <jsa...@ecn.ab.ca> writes:
>>
>>> The IBM System/360, a huge mainframe, which was even
>>> microprogrammed, so that it could perform instructions that did
>>> things too fancy to wire into even such a big machine... [...]
>>
>> Some models of System/360 used microcode, but not all.
>
> As far as I know, only 360/91 was NOT microprogrammed.

My understanding is that the System/360 models 44 and 75 were
both directly wired rather than using microcode. (Also the
model 91, as you say.)

Incidentally, the concept of microcode (and also the term, I
believe) predates the IBM System/360 by at least a decade, IIRC.

Michael S

Nov 4, 2022, 9:13:35 AM
ED/EDMK also hardwired? :shocked:

Tim Rentsch

Nov 4, 2022, 9:19:09 AM
Quadibloc <jsa...@ecn.ab.ca> writes:

> On Wednesday, November 2, 2022 at 5:35:20 PM UTC-6, Tim Rentsch wrote:
>
>> Some models of System/360 used microcode, but not all.
>
> Yes, you're quite correct. For purposes of discussion, since I wasn't
> really addressing computer history, I didn't want to note the exceptions
> (the model 75 and the model 91 and the other models derived from it,
> the 95 and 195).

I strive to be accurate (or at least not inaccurate) in every
statement I make. I know some people don't, but I do, because I
think it's important not to pretend the inconvenient exceptions
don't exist. "Don't tell lies" was the phrase used, as I vividly
recall from a memorable lecture in Physics 1, now more than 50
years ago (and it seems like yesterday).

>> Furthermore microcode was used principally to accommodate a large
>> range on the price/performance spectrum. Different models of
>> System/360 (that had microcode) had different micro architectures.
>> It wasn't that the instruction set was too fancy; rather it was
>> cheaper, or more cost effective, in some cases, to implement with
>> microcode than directly in hardware.
>
> Yes, but microcode certainly made a more complex instruction set
> possible, without much in the way of added cost, if it was desirable.

The System/360 instruction set was possible both with and without
using microcode. Using microcode wasn't necessary to make the
instruction set possible, or even feasible, only to make it more
cost effective at various points on the price/performance spectrum.

John Levine

Nov 4, 2022, 2:51:34 PM
According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>>>> The IBM System/360, a huge mainframe, which was even
>>>> microprogrammed, so that it could perform instructions that did
>>>> things too fancy to wire into even such a big machine... [...]
>>>
>>> Some models of System/360 used microcode, but not all.
>>
>> As far as I know, only 360/91 was NOT microprogrammed.
>
>My understanding is that the System/360 models 44 and 75 were
>both directly wired rather than using microcode. (Also the
>model 91, as you say.)

And the later models 95 and 195. The /91 did not have the decimal
arithmetic instructions but it did have CVB CVD ED and EDMK. The /44
only had a subset intended for scientific and realtime work, so no
storage-to-storage instructions at all.

The 360/85 was microprogrammed, a faster reimplementation of the /65
with a cache that turned out to be nearly as fast as the far more
complex /91.

>Incidentally, the concept of microcode (and also the term, I
>believe) predates the IBM System/360 by at least a decade, IIRC.

Maurice Wilkes invented it in about 1951. Nearly all of the good ideas
in computing were invented a long time ago, most before 1970.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

Nov 4, 2022, 4:00:14 PM
On Friday, November 4, 2022 at 1:51:34 PM UTC-5, John Levine wrote:
> According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
> >>>> The IBM System/360, a huge mainframe, which was even
> >>>> microprogrammed, so that it could perform instructions that did
> >>>> things too fancy to wire into even such a big machine... [...]
> >>>
> >>> Some models of System/360 used microcode, but not all.
> >>
> >> As far as I know, only 360/91 was NOT microprogrammed.
> >
> >My understanding is that the System/360 models 44 and 75 were
> >both directly wired rather than using microcode. (Also the
> >model 91, as you say.)
> And the later models 95 and 195. The /91 did not have the decimal
> arithmetic instructions but it did have CVB CVD ED and EDMK. The /44
> only had a subset intended for scientific and realtime work, so no
> storage-to-storage instructions at all.
>
> The 360/85 was microprogrammed, a faster reimplementation of the /65
> with a cache that turned out to be nearly as fast as the far more
> complex /91.
<
A good cache (that is lower latency memory) almost always beats more
interleaving (with its longer latency).
<
> >Incidentally, the concept of microcode (and also the term, I
> >believe) predates the IBM System/360 by at least a decade, IIRC.
<
> Maurice Wilkes invented it in about 1951. Nearly all of the good ideas
> in computing were invented a long time ago, most before 1970.
<
About the only things we have invented recently (1985-2020) are various
forms of branch-prediction and the occasional data-prediction.
(USPTO notwithstanding).

Quadibloc

Nov 5, 2022, 12:28:25 AM
On Friday, November 4, 2022 at 2:00:14 PM UTC-6, MitchAlsup wrote:
> On Friday, November 4, 2022 at 1:51:34 PM UTC-5, John Levine wrote:

> > The 360/85 was microprogrammed, a faster reimplementation of the /65
> > with a cache that turned out to be nearly as fast as the far more
> > complex /91.
> <
> A good cache (that is lower latency memory) almost always beats more
> interleaving (with its longer latency).

It is true that the 360/85 was faster than expected, while the 360/91 was
a disappointment.

However, the 360/85 was _not_ a version of the 360/65. It was a completely
different design, with a longer microcode word. However, the 370/165
was largely based on the 360/85 with additional improvements - perhaps this
is what led to the confusion. The 3033 continued the use of this particular
design.

And IBM did not ignore the fact that out-of-order execution, the innovation
of the Model 91, was still also helpful. That's why they came out with the 360/195,
which added cache, similar to what the Model 85 had, to the Model 91, combining
both features, to make a very powerful computer.

John Savard

Quadibloc

Nov 5, 2022, 12:30:58 AM
I wanted to make extra room for 64-bit instructions in the opcode space,
so I rearranged the operate instructions, reducing the number of opcode bits
by one in many of them. This, however, did not create the space I wanted to
create, so I ended up using a sliver of space that was available for the 64-bit
instructions instead.

But the small amount of extra opcode space I created did allow me to follow
Mitch a bit more closely, as now I can apply the additional operations to the
result of the main operation as well as to the two inputs.

John Savard

Thomas Koenig

Nov 5, 2022, 11:22:13 AM
Quadibloc <jsa...@ecn.ab.ca> schrieb:

> But the small amount of extra opcode space I created did allow me to follow
> Mitch a bit more closely, as now I can apply the additional operations to the
> result of the main operation as well as to the two inputs.

My 66000 does not apply the operations to the result, only to the
source operands. Applying the operation to the result is redundant.

!(a&b) is equivalent to (!a)|(!b), !(a^b) is equivalent to (!a)^(!b),
-(-a-b) is a+b, and so on.

a...@littlepinkcloud.invalid

Nov 5, 2022, 1:17:10 PM
Bill Findlay <findl...@blueyonder.co.uk> wrote:
> On 2 Nov 2022, Ivan Godard wrote
>>
>> That was called HEYU on the B6500 - I was always fond of that mnemonic.
>
> Yes, but it's not as fun as EIEIO!

There's a (possibly apocryphal) story from Intel, where the 8086 team
were told not to use SEX as the mnemonic for the sign-extension
instruction. They were forced to call it CBW instead.

They got their revenge later with the 8051, where the mnemonics for
logical OR and AND are ORL and ANL. For some reason Intel management
never noticed...

Andrew.

MitchAlsup

Nov 5, 2022, 1:32:32 PM
Note:
<
We were surprised that we got FPCER register into Mc 88100.
{Floating Point Control and Exception Register}
>
> Andrew.

Tim Rentsch

Nov 6, 2022, 3:47:45 AM
John Levine <jo...@taugh.com> writes:

> According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>
>>>>> The IBM System/360, a huge mainframe, which was even
>>>>> microprogrammed, so that it could perform instructions that did
>>>>> things too fancy to wire into even such a big machine... [...]
>>>>
>>>> Some models of System/360 used microcode, but not all.
>>>
>>> As far as I know, only 360/91 was NOT microprogrammed.
>>
>> My understanding is that the System/360 models 44 and 75 were
>> both directly wired rather than using microcode. (Also the
>> model 91, as you say.)
>
> And the later models 95 and 195. The /91 did not have the decimal
> arithmetic instructions but it did have CVB CVD ED and EDMK. The /44
> only had a subset intended for scientific and realtime work, so no
> storage-to-storage instrutions at all.

I guess that means the 360/44 is literally the world's first
Reduced Instruction Set Computer.

Tim Rentsch

Nov 6, 2022, 3:56:34 AM
Thomas Koenig <tko...@netcologne.de> writes:

> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>
>> But the small amount of extra opcode space I created did allow
>> me to follow Mitch a bit more closely, as now I can apply the
>> additional operations to the result of the main operation as
>> well as to the two inputs.
>
> My 66000 does not apply the operations to the result, only to
> the source operands. Applying the operation to the result is
> redundant.
>
> !(a&b) is equivalent to (!a)|(!b),

Presumably you mean ~(a&b) is equivalent to (~a)|(~b).

> !(a^b) is equivalent to (!a)^(!b),

and here is meant ~(a^b) is equivalent to (a)^(~b), not
to (~a)^(~b).
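For what it's worth, a brute-force check of the corrected identities over all
8-bit operands (plain C, illustration only), which also shows that (~a)^(~b)
collapses back to a^b rather than to ~(a^b):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        for (unsigned a = 0; a < 256; a++) {
            for (unsigned b = 0; b < 256; b++) {
                uint8_t x = (uint8_t)a, y = (uint8_t)b;
                assert((uint8_t)~(x & y) == (uint8_t)(~x | ~y));  /* De Morgan          */
                assert((uint8_t)~(x ^ y) == (uint8_t)(x ^ ~y));   /* corrected identity */
                assert((uint8_t)(~x ^ ~y) == (uint8_t)(x ^ y));   /* not ~(x ^ y)       */
            }
        }
        return 0;
    }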

Quadibloc

Nov 6, 2022, 10:06:17 AM
That depends on whether you think ^ stands for AND or for
XOR. I take it he went for AND, and you went for XOR.

John Savard

Quadibloc

Nov 6, 2022, 10:09:03 AM
Oh, silly me. AND, unlike XOR, is not symmetric with respect to
true and false, and so if he did think that ^ was AND, then (!a)^(!b)
would be equivalent to !(a|b) where | represents OR.

John Savard

Stephen Fuld

Nov 6, 2022, 11:06:14 AM
I get the humor, but it isn't quite true. Storage to storage
instructions were an optional, extra cost feature on most (all?) of the
original S/360 models. I believe so was floating point (IBM format, of
course). There may have been others.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Anne & Lynn Wheeler

Nov 6, 2022, 10:02:19 PM
Quadibloc <jsa...@ecn.ab.ca> writes:
> It is true that the 360/85 was faster than expected, while the 360/91 was
> a disappointment.
>
> However, the 360/85 was _not_ a version of the 360/65. It was a completely
> different design, with a longer microcode word. However, the 370/165
> was largely based on the 360/85 with additional improvements - perhaps this
> is what led to the confusion. The 3033 continued the use of this particular
> design.
>
> And IBM did not ignore the fact that out-of-order execution, the innovation
> of the Model 91 was still also helpful. That's why they came out with the 360/195
> which added cache, similar to what the Model 85 had, to the Model 91, combining
> both features, to make a very powerful computer.

Tomasulo's algorithm
https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm

some archived afc posts mentioning Tomasulo
http://www.garlic.com/~lynn/2015c.html#91 Critique of System/360, 1967
http://www.garlic.com/~lynn/2008g.html#38 Sad news of Bob Tomasulo's passing
http://www.garlic.com/~lynn/2003j.html#18 why doesn't processor reordering instructions affect most
http://www.garlic.com/~lynn/2000b.html#15 How many Megaflops and when?

from 23apr1981, piece of ("tandem memos") email from POK hardware group:

Date: 04/23/81 09:57:42
To: wheeler

your ramblings concerning the corp(se?) showed up in my reader
yesterday. like all good net people, i passed them along to 3 other
people. like rabbits interesting things seem to multiply on the
net. many of us here in pok experience the sort of feelings your mail
seems so burdened by: the company, from our point of view, is out of
control. i think the word will reach higher only when the almighty $$$
impact starts to hit. but maybe it never will. its hard to imagine one
stuffed company president saying to another (our) stuffed company
president i think i'll buy from those inovative freaks down the
street. '(i am not defending the mess that surrounds us, just trying
to understand why only some of us seem to see it).

bob tomasulo and dave anderson, the two poeple responsible for the
model 91 and the (incredible but killed) hawk project, just left pok
for the new stc computer company. management reaction: when dave told
them he was thinking of leaving they said 'ok. 'one word. 'ok. ' they
tried to keep bob by telling him he shouldn't go (the reward system in
pok could be a subject of long correspondence). when he left, the
management position was 'he wasn't doing anything anyway. '

in some sense true. but we haven't built an interesting high-speed
machine in 10 years. look at the 85/165/168/3033/trout. all the same
machine with treaks here and there. and the hordes continue to sweep
in with faster and faster machines. true, endicott plans to bring the
low/middle into the current high-end arena, but then where is the
high-end product development?

... snip ...

trout is code name for what was released mid-80s as 3090
https://www.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3090.html

some here about "Tandem Memos" in recent post
https://www.linkedin.com/pulse/john-boyd-ibm-wild-ducks-lynn-wheeler/

trivia: in later half of 70s, I had gotten involved with effort to do 16
processor 370 multiprocessor and we had con'ed the 3033 processor
engineers to work on it in their spare time (lot more interesting than
remapping 168-3 logic to 20% faster chips) ... going great guns until
somebody told the head of POK that it could be decades before the POK
favorite son operation system had effective 16-way support (POK doesn't
ship 16-way until turn of the century). The head of POK then invites
some of us to never visit POK again and tells the 3033 processor
engineers, nose to 3033 grindstone. Once the 3033 is out the door, they
start on trout/3090.

quick&dirty 3033/3081 projects kicked off in parallel after Future
System implodes
http://www.jfsowa.com/computer/memo125.htm
http://people.cs.clemson.edu/~mark/fs.html

note there is also ACS/360 ... end of effort when executives were afraid
it would advance the state-of-art too fast and IBM would loose control
of the market; following also mentions features that show up more than
20yrs later with ES/9000.
https://people.cs.clemson.edu/~mark/acs_end.html

shortly after joining IBM, the 370/195 group wanted me to help with
hyperthreading the ... two instruction streams simulating two processor
multiprocessor (see multithreading patents near bottom of the "acs_end"
webpage, before the es/9000 side-bar near the bottom). The issue was 195
pipeline stalled/drained with conditional branches ... so most codes
only ran at half 195 rated speed ... it was felt two instruction streams
(running at half rated speed) could maintain "peak" throughput.

caveat: MVT two-processor 360/65MP (and later two-processor OS/VS2) only
claimed to have 1.2-1.5 throughput of single processor (because of the
multiprocessor overhead).

370/195 two-istream effort got canceled when it was decided to make all
370s "virtual memory" and it was considered not worthwhile to add
virtual memory to the 195. trivia: a decade ago, mainframe customer
asked me if I could track down decision to make all 370s "virtual
memory" ... basically OS/360 MVT storage management was so bad that
regions sizes had to be specified four times larger than actual used ...
as a result typical 1mbyte 370/165 could only concurrently run four
regions ... insufficient to achieve throughput to justify the
machine. Mapping MVT to virtual memory could increase the number of
concurrently executing regions by a factor of four with little or no
paging. Old archived afc post with pieces of the email exchange (with somebody
that reported to the IBM executive making the decision):
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory

other parts of long winded thread in afc:
http://www.garlic.com/~lynn/2011d.html#71 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011d.html#72 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011d.html#74 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011d.html#81 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011d.html#82 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#1 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#2 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#3 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#4 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#5 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#8 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#10 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#11 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#12 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#13 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#14 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#20 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#22 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#25 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#26 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#27 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#29 Multiple Virtual Memory
http://www.garlic.com/~lynn/2011e.html#42 Multiple Virtual Memory

--
virtualization experience starting Jan1968, online at home since Mar1970

John Levine

Nov 7, 2022, 5:54:59 AM
According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>> I guess that means the 360/44 is literally the world's first
>> Reduced Instruction Set Computer.
>
>I get the humor, but it isn't quite true. Storage to storage
>instructions were an optional, extra cost feature on most (all?) of the
>original S/360 models.

No, on models other than the /44, most of the storage-to-storage
instructions were standard. The optional parts were decimal arithmetic
and floating point. In practice I think everyone got decimal arithmetic
and on all but the smallest models, they got both.

For example, MVC which moved a string of bytes from one place to
another was standard for every model other than the /44. The /44 really
was reduced compared to other 360s.

The /20 had a commercial-oriented subset of the 360 with only half the
registers, and those only 16 bits. It was a clever hack to come up
with a much smaller mostly upward compatible computer, and sold very
well. Unlike the /44 it was heavily microcoded, so the /44 was more like
later RISCs: leave out the stuff that is hard to make fast.

Stephen Fuld

Nov 7, 2022, 12:34:06 PM
On 11/7/2022 2:54 AM, John Levine wrote:
> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>>> I guess that means the 360/44 is literally the world's first
>>> Reduced Instruction Set Computer.
>>
>> I get the humor, but it isn't quite true. Storage to storage
>> instructions were an optional, extra cost feature on most (all?) of the
>> original S/360 models.
>
> No, on models other than the /44, most of the storage-to-storage
> instructions were standard. The optional parts were decimal arithmetic
> and floating point. In practice I think everyone got decimal arithmetic
> and on all but the smallest models, they got both.

You're right. I was thinking of the decimal arithmetic instructions, but
used the wrong name. Sorry. :-(

Thanks for the correction.

Tim Rentsch

Nov 15, 2022, 6:39:46 PM
Based on the first example, it seems clear that & is used
for AND and | is used for OR. Presumably that rules out
using ^ for AND.

BGB

Nov 18, 2022, 2:32:46 PM
On 11/4/2022 8:19 AM, Tim Rentsch wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
>
>> On Wednesday, November 2, 2022 at 5:35:20 PM UTC-6, Tim Rentsch wrote:
>>
>>> Some models of System/360 used microcode, but not all.
>>
>> Yes, you're quite correct. For purposes of discussion, since I wasn't
>> really addressing computer history, I didn't want to note the exceptions
>> (the model 75 and the model 91 and the other models derived from it,
>> the 95 and 195).
>
> I strive to be accurate (or at least not inaccurate) in every
> statement I make. I know some people don't, but I do, because I
> think it's important not to pretend the inconvenient exceptions
> don't exist. "Don't tell lies" was the phrase used, as I vividly
> recall from a memorable lecture in Physics 1, now more than 50
> years ago (and it seems like yesterday).
>

Mostly similar.

Sometimes there are edge cases, like features/etc that apply to versions
of code I am working on, but not to the version that is up on GitHub
(there is usually a certain amount of lag here, as I at least try to
verify everything is "mostly working" before uploading to GitHub).

I don't really have the resources personally to maintain a separate
experimental and release branch though, so stuff tends to be inherently
experimental sometimes.


Sometimes, stuff is either misremembered or a guesstimate, so not
everything can be taken as absolute.

Or, something seems one way at first, but then tends not to be so with
further testing.

Recent example, the Imm5fp / E3.F2 instructions seemed promising at
first; but it turns out that most of the constants that fit the Imm5fp
pattern did not occur in instructions that could actually make use of
it. Say, if most of the matching values were for things like "x=1.0;"
and similar, and only around 10-20% of "y=x+c;" instructions matched
this pattern (and can only result in around a 1% delta for the size of
Quake).


Or some things are left ambiguous, but tend to have reasonable answers.

Like, someone can work out my DOB and similar if they want, ..., but
would still prefer it not be shown publicly.



Well, similar with a lot of other stuff, like ethnic background,
religious views, etc. Granted, one can say "white" and
"non-denominational Christian" (and/or "agnostic leaning theist" /
"theism leaning agnostic", though "openly non YEC" (1), or similar) and
this is basically "close enough".


1: The YEC position (eg, 6 day creation with a 6k year history) having
way too many issues to really take it seriously. They seem to believe
that their interpretation is mandatory, but like, it also seems
reasonable to assume that much of Genesis is mostly allegorical and the
historical account of events starts in Exodus.

So, say:
Supernatural intervention into the existence of life and humans:
Probably.
History from a few thousand years ago resembling "The Flintstones":
Probably not.
Direct supernatural intervention into the life of individuals:
Probably rare in general.

There are too many issues if one assumes that everyone's life is
micromanaged. It seems much more consistent with observation to assume a
lack of any sort of direct intervention in most cases.


For most things, also makes sense to take an "Occam's razor" approach to
the level of supernatural intervention required in any given scenario
(with most things tending to operate primarily in terms of physical
mechanisms).

Similarly, it seems that if/when it does occur, it would (in most cases)
be mostly in the form of skewing probabilities or similar.
So, one can't really be sure whether or not it was an intervention, or
if it would have happened that way regardless. In this case, one would
need to look for significant deviations from statistical probability.


I prefer not to take a hard stance on any of the specifics though.

It also seems reasonable to allow for disagreement as to if or how much
any of this occurs.

...


>>> Furthermore microcode was used principally to accommodate a large
>>> range on the price/performance spectrum. Different models of
>>> System/360 (that had microcode) had different micro architectures.
>>> It wasn't that the instruction set was too fancy; rather it was
>>> cheaper, or more cost effective, in some cases, to implement with
>>> microcode than directly in hardware.
>>
>> Yes, but microcode certainly made a more complex instruction set
>> possible, without much in the way of added cost, if it was desirable.
>
> The System/360 instruction set was possible both with and without
> using microcode. Using microcode wasn't necessary to make the
> instruction set possible, or even feasible, only to make it more
> cost effective at various points on the price/performance spectrum.

Seems reasonable.


MitchAlsup

Nov 18, 2022, 3:27:44 PM
On Friday, November 18, 2022 at 1:32:46 PM UTC-6, BGB wrote:
> On 11/4/2022 8:19 AM, Tim Rentsch wrote:
> > Quadibloc <jsa...@ecn.ab.ca> writes:
> >
> >> On Wednesday, November 2, 2022 at 5:35:20 PM UTC-6, Tim Rentsch wrote:
> >>
> >>> Some models of System/360 used microcode, but not all.
> >>
> >> Yes, you're quite correct. For purposes of discussion, since I wasn't
> >> really addressing computer history, I didn't want to note the exceptions
> >> (the model 75 and the model 91 and the other models derived from it,
> >> the 95 and 195).
> >
> > I strive to be accurate (or at least not inaccurate) in every
> > statement I make. I know some people don't, but I do, because I
> > think it's important not to pretend the inconvenient exceptions
> > don't exist. "Don't tell lies" was the phrase used, as I vividly
> > recall from a memorable lecture in Physics 1, now more than 50
> > years ago (and it seems like yesterday).
> >
> Mostly similar.
<
When a computer architect lies to himself is the point where many projects
turn south.....
>
> Sometimes there are edge cases, like features/etc that apply to versions
> of code I am working on, but not to the version that is up on GitHub
> (there is usually a certain amount of lag here, as I at least try to
> verify everything is "mostly working" before uploading to GitHub).
>
> I don't really have the resources personally to maintain a separate
> experimental and release branch though, so stuff tends to be inherently
> experimental sometimes.
>
>
> Sometimes, stuff is either misremembered or a guesstimate, so not
> everything can be taken as absolute.
>
> Or, something seems one way at first, but then tends not to be so with
> further testing.
>
> Recent example, the Imm5fp / E3.F2 instructions seemed promising at
> first; but it turns out that most of the constants that fit the Imm5fp
> pattern did not occur in instructions that could actually make use of
> it. Say, if most of the matching values were for things like "x=1.0;"
> and similar, and only around 10-20% of "y=x+c;" instructions matched
> this pattern (and can only result in around a 1% delta for the size of
> Quake).
<
Given that I have imm5 constants available, converting imm5 into
FP32 or FP64 takes only a couple handfuls of gates. So, I can actually
encode and use::
<
FCMP Rt,Rx,#1
<
With the same effect as
<
FCMP Rt,Rx,#1.0D0
>
However, usage is skewed: positives occur about 8× more often than
negatives, and nobody uses 13, 17, or 19; so I am looking for a more
effective mapping function from the 5-bit field I have to the values I want.
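
Going back to the widening itself, a rough illustration of why it is only
a couple handfuls of gates (a minimal C sketch, not the actual hardware;
the function name and the explicit zero case are just for illustration):

#include <stdint.h>
#include <string.h>

static double imm5_to_double(unsigned imm5)     /* imm5 in 0..31 */
{
    if (imm5 == 0)
        return 0.0;
    int msb = 0;                                /* position of leading 1 */
    for (unsigned v = imm5; v > 1; v >>= 1)
        msb++;
    uint64_t exp  = 1023 + (uint64_t)msb;       /* biased exponent */
    uint64_t frac = ((uint64_t)imm5 << (52 - msb)) & ((1ULL << 52) - 1);
    uint64_t bits = (exp << 52) | frac;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;                                   /* imm5_to_double(3) == 3.0 */
}

So FCMP Rt,Rx,#1 really does see 1.0D0: the widening is just a priority
encode plus field placement, with no adder or multiplier in the path.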
>
> Or some things are left ambiguous, but tend to have reasonable answers.
<
The thing that took me the longest in my career to master was writing
such that a competent, industrious engineer could not misunderstand
what I was specifying.
>
> Like, someone can work out my DOB and similar if they want, ..., but
> would still prefer it not be shown publicly.
>
>
>
> Well, similar with a lots of other stuff, like ethnic background,
> religious views, etc. Granted, one can say "white" and
> "non-denominational Christian" (and/or "agnostic leaning theist" /
> "theism leaning agnostic", though "openly non YEC" (1), or similar) and
> this is basically "close enough".
>
>
> 1: The YEC position (eg, 6 day creation with a 6k year history) having
> way too many issues to really take it seriously. They seem to believe
> that their interpretation is mandatory, but like, it also seems
> reasonable to assume that much of Genesis is mostly allegorical and the
> historical account of events starts in Exodus.
>
> So, say:
> Supernatural intervention into the existence of life and humans:
> Probably.
<
There is no evidence, so I would put this in "almost assuredly not".
<
> History from a few thousand years ago resembling "The Flintstones":
> Probably not.
<
There is no evidence, so I would put this in "almost assuredly not".
<
> Direct supernatural intervention into the life of individuals:
> Probably rare in general.
<
There is no evidence, so I would put this in "almost assuredly not".
<
>
> There are too many issues if one assumes that everyone's life is
> micromanaged. It seems much more consistent with observation to assume a
> lack of any sort of direct intervention in most cases.
<
If you have ever known someone who took their own life (either rapidly or slowly)
you would never visualize anyone micromanaging them in any way shape or form.
>
>
> For most things, also makes sense to take an "Occam's razor" approach to
> the level of supernatural intervention required in any given scenario
> (with most things tending to operate primarily in terms of physical
> mechanisms).
<
Agreed.
>
> Similarly, it seems that if/when it does occur, it would (in most cases)
> be mostly in the form of skewing probabilities or similar.
> So, one can't really be sure whether or not it was an intervention, or
> if it would have happened that way regardless. In this case, one would
> need to look for significant deviations from statistical probability.
>
If god wanted to show that she existed, it would be simple:: just have
no babies born with any birth defects whatsoever, for an entire month,
across the entire planet.
>
> I prefer not to take a hard stance on any of the specifics though.
>
> It also seems reasonable to allow for disagreement as to if or how much
> any of this occurs.
>
> ...
> >>> Furthermore microcode was used principally to accommodate a large
> >>> range on the price/performance spectrum. Different models of
> >>> System/360 (that had microcode) had different micro architectures.
> >>> It wasn't that the instruction set was too fancy; rather it was
> >>> cheaper, or more cost effective, in some cases, to implement with
> >>> microcode than directly in hardware.
> >>
> >> Yes, but microcode certainly made a more complex instruction set
> >> possible, without much in the way of added cost, if it was desirable.
> >
> > The System/360 instruction set was possible both with and without
> > using microcode.
<
By and large, this is exactly what separates 360 from VAX.

BGB

unread,
Nov 19, 2022, 4:27:51 AM11/19/22
to
On 11/18/2022 2:27 PM, MitchAlsup wrote:
> On Friday, November 18, 2022 at 1:32:46 PM UTC-6, BGB wrote:
>> On 11/4/2022 8:19 AM, Tim Rentsch wrote:
>>> Quadibloc <jsa...@ecn.ab.ca> writes:
>>>
>>>> On Wednesday, November 2, 2022 at 5:35:20 PM UTC-6, Tim Rentsch wrote:
>>>>
>>>>> Some models of System/360 used microcode, but not all.
>>>>
>>>> Yes, you're quite correct. For purposes of discussion, since I wasn't
>>>> really addressing computer history, I didn't want to note the exceptions
>>>> (the model 75 and the model 91 and the other models derived from it,
>>>> the 95 and 195).
>>>
>>> I strive to be accurate (or at least not inaccurate) in every
>>> statement I make. I know some people don't, but I do, because I
>>> think it's important not to pretend the inconvenient exceptions
>>> don't exist. "Don't tell lies" was the phrase used, as I vividly
>>> recall from a memorable lecture in Physics 1, now more than 50
>>> years ago (and it seems like yesterday).
>>>
>> Mostly similar.
> <
> When a computer architect lies to himself is the point where many projects
> turn south.....

Probably true.


>>
>> Sometimes there are edge cases, like features/etc that apply to versions
>> of code I am working on, but not to the version that is up on GitHub
>> (there is usually a certain amount of lag here, as I at least try to
>> verify everything is "mostly working" before uploading to GitHub).
>>
>> I don't really have the resources personally to maintain a separate
>> experimental and release branch though, so stuff tends to be inherently
>> experimental sometimes.
>>
>>
>> Sometimes, stuff is either misremembered or a guesstimate, so not
>> everything can be taken as absolute.
>>
>> Or, something seems one way at first, but then tends not to be so with
>> further testing.
>>
>> Recent example, the Imm5fp / E3.F2 instructions seemed promising at
>> first; but it turns out that most of the constants that fit the Imm5fp
>> pattern did not occur in instructions that could actually make use of
>> it. Say, if most of the matching values were for things like "x=1.0;"
>> and similar, and only around 10-20% of "y=x+c;" instructions matched
>> this pattern (and can only result in around a 1% delta for the size of
>> Quake).
> <
> Given that I have imm5 constants available, I can convert imm5 into
> FP32 or FP 64 is only a couple handfuls of gates. So, I can actually
> encode and use::
> <
> FCMP Rt,Rx,#1
> <
> With the same effect as
> <
> FCMP Rt,Rx,#1.0D0


Int -> Float in this case is pretty easy.


I tried several options initially:
E2.F3, E3.F2
Tweaking the exponent bias;
...

It seems my initial guess, E3.F2 with a bias of 3, was already pretty
near the optimal case for this...

However, the hit rate was still low enough to make it "not likely worth
the cost".

Things were failing *both* on the limited dynamic range, and not having
enough fractional bits.


Comparatively, some 2RI encodings with S.E5.F4 fared a fair bit better
here...

So:
FADD Imm10fp, Rn
FMUL Imm10fp, Rn

Despite the limitation of only working when the source and destination
register were equal, they still hit a lot more often.


Some FpImm E5.F4 instructions "could make sense", except for a big ugly
issue: There isn't really any encoding space left for this (Unless I
start using the F3 or F9 blocks). The potential gains from this don't
seem big enough to justify burning the F9 block.



>>
> However, usage is skewed 8× more positives than negatives, nobody uses
> 13, 17, or 19; so I am looking for a more effective mapping function from
> the 5-bit field I have to the values I want.

Yeah, could try something different, say:
00..0F: E2.F2
0.250, 0.3125, 0.375, 0.4375
0.500, 0.625 , 0.750, 0.875
-, 1.250 , 1.500, 1.750
-, 2.500 , -, 3.500
10..1F:
Map integer values from 0..15.

Mostly as it looked like small integer values represented a fairly
common case here.

Eg:
y=x+5.0;


>>
>> Or some things are left ambiguous, but tend to have reasonable answers.
> <
> The think that took me the longest in my career to master was writing
> such that a competent, industrious engineer could not misunderstand
> what I was specifying.

Fair enough.

Trying to document stuff in a semi-competent way is easier said than done.

A lot of my code also leaves something to be desired.


>>
>> Like, someone can work out my DOB and similar if they want, ..., but
>> would still prefer it not be shown publicly.
>>
>>
>>
>> Well, similar with a lots of other stuff, like ethnic background,
>> religious views, etc. Granted, one can say "white" and
>> "non-denominational Christian" (and/or "agnostic leaning theist" /
>> "theism leaning agnostic", though "openly non YEC" (1), or similar) and
>> this is basically "close enough".
>>
>>
>> 1: The YEC position (eg, 6 day creation with a 6k year history) having
>> way too many issues to really take it seriously. They seem to believe
>> that their interpretation is mandatory, but like, it also seems
>> reasonable to assume that much of Genesis is mostly allegorical and the
>> historical account of events starts in Exodus.
>>
>> So, say:
>> Supernatural intervention into the existence of life and humans:
>> Probably.
> <
> There is no evidence, so I would put this in "almost assuredly not".
> <

Well, possibly not in the "poof, everything came from nothing all at
once sense".

But, say, maybe poking at molecules to get things going, occasionally
poking at the DNA and triggering mutations, ...

Though, granted, if done this way, the end result would be pretty much
indistinguishable from evolution.


Granted, this more leaves it as an untestable assertion, rather than
something that flies in the face of existing evidence.


But, yeah, in this version of events, you still have all of the early
proto-hominids (and the dinosaurs still died off 65 million years ago, ...).



>> History from a few thousand years ago resembling "The Flintstones":
>> Probably not.
> <
> There is no evidence, so I would put this in "almost assuredly not".
> <

Yeah.

In this case, it isn't just lack of evidence, but "a fair bit of counter
evidence".


Well, at least if one discounts all the stuff with people claiming that
archeologists keep digging up giant humanoid skeletons (from a race of
giants who supposedly hunted and ate dinosaurs and similar), and that the
Smithsonian was secretly confiscating and destroying all of this stuff
to maintain their own narrative of the history of the world, ...

Idea is basically that when The Flood came along, it basically drowned
both the giant humanoids and the dinosaurs; where Noah and his family
were normal (non giant) humans. Well, and also after this event, human
lifespan was reduced from 1000 to 100 years, etc...

But, yeah, some of this stuff seems a bit crazy...

But, yeah, absent some large-scale conspiracy, this one is probably busted.


Both geology and physics seem to come down more on the side of favoring
a time-frame of billions of years, rather than thousands...

...



>> Direct supernatural intervention into the life of individuals:
>> Probably rare in general.
> <
> There is no evidence, so I would put this in "almost assuredly not".
> <

It would be both rare and easily missed.

As noted, the interventions would be likely rare enough that most people
would never see anything.


For many that do, they would be mostly in the form of random events, so
a person would have no way of really knowing for certain whether or not
anything had actually happened.


So, not so much "big miraculous events", so much as on the scale of "if
a person rolls a d20, is it possible that the d20 could have been poked
mid-throw such that it lands on a different number than it would have
otherwise?..." (and they will not prod events in any way that would make
it obviously different from that of random chance).


In this case, big miraculous events would be "very rare", confined
mostly to one place and time, and then maybe "hardly anything" for the
next several millennia (except apparently in specific times and places).



>>
>> There are too many issues if one assumes that everyone's life is
>> micromanaged. It seems much more consistent with observation to assume a
>> lack of any sort of direct intervention in most cases.
> <
> If you have ever known someone who took their own life (either rapidly or slowly)
> you would never visualize anyone micromanaging them in any way shape or form.

Such is the issue.

Admittedly, I haven't seen anything outside of what could be attributed
to chance or my own thoughts (or other psychological or neurological
artifacts).

A lot of people claim to have seen stuff though... Or to being directly
guided.


Some of us can just sort of hope they are going in the right general
direction, or wonder whether or not anything they are doing has meaning.



I had seen "some weird stuff", but most of it doesn't really "line up"
with traditional religious descriptions.

A few effects:
Occasionally encountering shadowy doppelgangers of myself;
Occasional "jumps", either forwards or backwards a few minutes, or from
one location to another, ...;
Occasional "wave like" effects which seemingly pass through the space
around me;
...

These effects come and go, and most likely have a neurological
explanation. I don't know what it is (I have not found a description of a
condition with these particular effects).


Have also noted that some types of optical illusions don't work
correctly on me (such as "Hollow Face" and "Motion Induced Blindness",
...). In some other cases of ambiguous images, rather than seeing one
thing or another, I will see both versions at the same time.

Some of this seems to match up with descriptions of schizophrenia,
though with the apparent difference that (normally) people with this
condition expect that the things they are seeing are real, rather than
due to neurological effects?...

Granted, not entirely sure how most people experience the world.


Could write more, but don't really want to go too much into this.


>>
>>
>> For most things, also makes sense to take an "Occam's razor" approach to
>> the level of supernatural intervention required in any given scenario
>> (with most things tending to operate primarily in terms of physical
>> mechanisms).
> <
> Agreed.

Yeah.
If one says the world is pretty much entirely physical processes, I will
agree.

Open question is more if anything exists beyond this.

I suspect this may be the case, but will not claim to have any real
evidence for this.


Lots of other people claim to have a lot more evidence and experiences
than I have here.

Though, to say that there is nothing, would also need evidence.


>>
>> Similarly, it seems that if/when it does occur, it would (in most cases)
>> be mostly in the form of skewing probabilities or similar.
>> So, one can't really be sure whether or not it was an intervention, or
>> if it would have happened that way regardless. In this case, one would
>> need to look for significant deviations from statistical probability.
>>
> If god wanted to show that she existed, it would be simple:: just have
> no babies born, for an entire month, across the entire planet, with any
> birth defects whatsoever.

Or at least do something to make it "sufficiently obvious".

It doesn't necessarily need to be proven to the world as a whole, but
something smaller-scale to make their presence known would be something
at least.

But, we can't be so lucky...

Thomas Koenig

unread,
Nov 19, 2022, 10:14:34 AM11/19/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

> Given that I have imm5 constants available, I can convert imm5 into
> FP32 or FP 64 is only a couple handfuls of gates. So, I can actually
> encode and use::
><
> FCMP Rt,Rx,#1
><
> With the same effect as
><
> FCMP Rt,Rx,#1.0D0
>>
> However, usage is skewed 8× more positives than negatives, nobody uses
> 13, 17, or 19; so I am looking for a more effective mapping function from
> the 5-bit field I have to the values I want.

I ran the GSL (Gnu Scientific Library) through Brian's My 66000
compiler and grepped through the floating point results. (I ran
across a couple of ICEs, for which I filed issues on github).

Here are the statistics for the first 32 eight-byte constants.
First column is cumulative percentage, second is percentage for the
constant, third is the constant in hex, fourth is the constant in
a more readable number. There are also two four-byte constants
in there, which is why the table has 34 entries :-)

20.88 20.88 3FF0000000000000 1.000000000000000e+00
35.56 14.68 0000000000000000 0.000000000000000e+00
41.19 5.63 3FE0000000000000 5.000000000000000e-01
46.35 5.16 BFF0000000000000 -1.000000000000000e+00
50.44 4.08 3CB0000000000000 2.220446049250313e-16
53.50 3.06 3CC0000000000000 4.440892098500626e-16
55.63 2.14 0000000 0.000000000000000e+00
57.42 1.79 4008000000000000 3.000000000000000e+00
58.69 1.27 4010000000000000 4.000000000000000e+00
59.89 1.21 4000000000000000 2.000000000000000e+00
60.91 1.01 43F0000000000000 1.844674407370955e+19
61.78 0.87 400921FB54442D18 3.141592653589793e+00
62.63 0.85 3FD0000000000000 2.500000000000000e-01
63.42 0.79 BFE0000000000000 -5.000000000000000e-01
64.16 0.74 4018000000000000 6.000000000000000e+00
64.83 0.66 5FEFFFFFFFFFFFFF 1.340780792994260e+154
65.47 0.64 3F800000 1.000000000000000e+00
66.08 0.61 C000000000000000 -2.000000000000000e+00
66.63 0.55 7FF8000000000000 nan
67.12 0.49 2000000000000000 1.491668146240041e-154
67.56 0.44 0010000000000000 2.225073858507201e-308
68.00 0.44 4024000000000000 1.000000000000000e+01
68.43 0.43 4014000000000000 5.000000000000000e+00
68.83 0.40 C008000000000000 -3.000000000000000e+00
69.16 0.34 3FD5555555555555 3.333333333333333e-01
69.49 0.33 3FB999999999999A 1.000000000000000e-01
69.81 0.32 7FF0000000000000 inf
70.11 0.30 4A511B0EC57E649A 1.000000000000000e+50
70.39 0.28 4022000000000000 9.000000000000000e+00
70.65 0.26 401921FB54442D18 6.283185307179586e+00
70.89 0.24 4020000000000000 8.000000000000000e+00
71.13 0.24 3FE5555555555555 6.666666666666666e-01
71.35 0.22 3FD3333333333333 3.000000000000000e-01
71.56 0.21 3FF8000000000000 1.500000000000000e+00

There are 14003 floating point constants (as determined by
grep '#0x[A-F0-9]' *.s | wc -l , which may be inaccurate) for 530193
total lines (I did not count statements), so more than 2.4% (and
probably less than 3%) of statements have floating point constants
for this heavy mathematical code.

(The fact that 1e50 is in there might be due to some check for
overflow).

If anybody wants the full table (or wants me to post more), just
drop me an e-mail.

BGB

unread,
Nov 19, 2022, 12:59:37 PM11/19/22
to
The downside with an ad-hoc table approach is that it may work with one
program, but do much worse for another.

That or some mechanism is needed for the table to be reprogrammable,
which would make it a fair bit more expensive.


So, would likely need to (if anything) come up with a table that can be
both constant, and hopefully has a "reasonably good" hit rate in most
other cases. Or, at least, gets a hit rate good enough to make it "worth
having".

As currently implemented in my case, in any case, the table would be
limited to values which can be represented exactly as Binary16, which
mostly excludes 1/3, 2/3, 1/10, ...


This is partly because the decoder puts a Binary16 into the Imm field,
and a special IMM16F pseudo-register is used which does a
Binary16->Double conversion during register fetch (the decoder stage
can't output full 64-bit constants directly without spreading it across
multiple lanes).

But, in general, at least Binary16 does well, and allows loading an FP
constant via a single 32-bit instruction word.


The imm10fp instructions basically shifted the 10-bit value left by 6 bits, say:
s.eeeee.ffff
Is decoded as:
s.eeeee.ffff000000

...
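
For reference, that decode amounts to something like the following C
sketch (normals only; the field layout follows the s.eeeee.ffff
description above, and the function names are just for illustration):

#include <stdint.h>
#include <string.h>

static uint16_t imm10fp_to_half(unsigned imm10)   /* s.eeeee.ffff */
{
    return (uint16_t)(imm10 << 6);                /* s.eeeee.ffff000000 */
}

static double half_to_double(uint16_t h)          /* normals only */
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint64_t exp  = ((h >> 10) & 0x1F) - 15 + 1023;  /* rebias 15 -> 1023  */
    uint64_t frac = (uint64_t)(h & 0x3FF) << 42;     /* 10 -> 52 frac bits */
    uint64_t bits = sign | (exp << 52) | frac;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;    /* half_to_double(imm10fp_to_half(0x0F0)) == 1.0 */
}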

Thomas Koenig

unread,
Nov 19, 2022, 3:51:25 PM11/19/22
to
BGB <cr8...@gmail.com> schrieb:
So, I have a proposal for you. If you have 16 bits available, use
a format which has

- a sign
- a three-bit exponent with suitable bias
- eight bits of mantissa
- a four-bit periodic digit, part of the mantissa

(with a hidden leading bit, of course).

The periodic hex digit would just be copied (and rounded,
if necessary).

This will not give you 1/7 and 1/9, but 1/3 and 1/5 would be
representable, and you would get most of the constants in the
table above.

No irrational constants, though.
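
In case the bit-fiddling is not obvious, the expansion could look
roughly like this in C (a minimal sketch under the fields above; the
bias of 3 and the packing order are assumptions, and the final-bit
rounding mentioned above is omitted):

#include <stdint.h>
#include <string.h>

static double imm16_periodic_to_double(uint16_t imm)
{
    uint64_t sign = (uint64_t)(imm >> 15) << 63;
    uint64_t e3   = (imm >> 12) & 0x7;        /* 3-bit exponent       */
    uint64_t m8   = (imm >> 4) & 0xFF;        /* 8 explicit frac bits */
    uint64_t p4   = imm & 0xF;                /* periodic hex digit   */

    uint64_t frac = m8 << 44;                 /* top 8 of 52 bits     */
    for (int sh = 40; sh >= 0; sh -= 4)       /* replicate the digit  */
        frac |= p4 << sh;

    uint64_t exp  = e3 - 3 + 1023;            /* assumed bias of 3    */
    uint64_t bits = sign | (exp << 52) | frac;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

For example, imm = 0x1555 (e3=1, m8=0x55, p4=0x5) decodes to
3FD5555555555555, matching the 1/3 entry in the table above; a value
like 0.1 would additionally need the rounding step.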

MitchAlsup

unread,
Nov 19, 2022, 5:02:13 PM11/19/22
to
In My case, if programming in ASM, one simply enunciates the constant
and the assembler chooses the smallest instruction with the correct semantics.
If it is the compiler, well, he knows which form to use.
<
In any case; all manifestations are 1 instruction: issue-execute-retire,
so the only variable is cache <footprint> performance.
<
So, for instance, say the ASM programmer wrote::
<
FADD R8,R7,#3.14159265358926.....
<
If that constant were in the #imm5 set for FP calculations, the ASM would
choose a single word instruction, otherwise it would choose a 3-word
instruction (1 instruction and 2-word constant).
>
> That or some mechanism is needed for the table to be reprogrammable,
> which would make it a fair bit more expensive.
>
>
> So, would likely need to (if anything) come up with a table that can be
> both constant, and hopefully has a "reasonably good" hit rate in most
> other cases. Or, at least, gets a hit rate good enough to make it "worth
> having".
<
It is for reasons like this that my starting point is numbers from -16..+15
{that is a signed 5-bit immediate}. However the negative range is seldom
used (12%) so it is likely to make more sense doing +0..+31 as a second
stick in the sand {unsigned 5-bit immediate}.

MitchAlsup

unread,
Nov 19, 2022, 5:05:12 PM11/19/22
to
On Saturday, November 19, 2022 at 2:51:25 PM UTC-6, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
> > As currently implemented in my case, in any case, the table would be
> > limited to values which can be represented exactly as Binary16, which
> > mostly excludes 1/3, 2/3, 1/10, ...
> So, I have a proposal for you. If you have 16 bits available, use
> a format which has
<
Unfortunately I have only 5-bits, 32-bits, and 64-bits as choices.
>
> - a sign
> - a three-bit exponent with suitable bias
> - eight bit of mantissa
> - a four-bit periodic digit, part of the mantissa
>
> (with a hidden leading bit, of course).
>
> The periodic hex digit would just be copied (and rounded,
> if necessary).
>
> This will not give you 1/7 and 1/9, but 1/3 and 1/5 would be
> representable, and you would get most of the constants in the
> table above.
<
This is a cute (and valuable) trick, as my ISA develops I will see if it
ends up valuable to My 66000.
>
> No irrational constants, tough.

Quadibloc

unread,
Nov 19, 2022, 5:58:56 PM11/19/22
to
On Saturday, November 19, 2022 at 10:59:37 AM UTC-7, BGB wrote:

> That or some mechanism is needed for the table to be reprogrammable,
> which would make it a fair bit more expensive.

Of course, though, the main reason for the... expense... would be that
multiple programs are normally running concurrently on a modern CPU,
and of course they wouldn't all program the table the same way.

So the table could be treated like registers, to be saved and restored at
context switches. The expense in cycles would be more significant than
the expense in transistors.

Of course, though, if the table of constants is _only_ being used to make
instructions more compact, and it isn't required to make table accesses
fast and efficient, one could put the table in _memory_ and just make a
pointer to the table a new register. Think TI 9900.

John Savard

MitchAlsup

unread,
Nov 19, 2022, 6:14:33 PM11/19/22
to
I think that rates an elegant YECH.
>
> John Savard

BGB

unread,
Nov 19, 2022, 11:58:01 PM11/19/22
to
On 11/19/2022 4:05 PM, MitchAlsup wrote:
> On Saturday, November 19, 2022 at 2:51:25 PM UTC-6, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>>> As currently implemented in my case, in any case, the table would be
>>> limited to values which can be represented exactly as Binary16, which
>>> mostly excludes 1/3, 2/3, 1/10, ...
>> So, I have a proposal for you. If you have 16 bits available, use
>> a format which has
> <
> Unfortunately I have only 5-bits, 32-bits, and 64-bits as choices.

Approx stats (32-bit space):
3R / 3RI Imm5: ~ 52 spots left (out of 168 initial)
3RI Imm9 : 2 spots left (56 initial)
2R : ~ 256 left out of 384.
2RI Imm10: 28 spots left (64 initial)
2RI Imm16: 7 spots left (16 initial)

This is for the F0, F1, F2, and F8 blocks.
The F3 and F9 blocks are still available.

The F3 block is intended for User or Implementation-Specific encodings.


There is a fair bit more space in terms of 64 or 96 bit encodings, but
this is harder to classify.

Some existing ops add 4b of opcode for Op64 encodings, but if they don't
need an Imm11 or Rp port, it is possible an extra 15 bits could be added
to the opcode.


>>
>> - a sign
>> - a three-bit exponent with suitable bias
>> - eight bit of mantissa
>> - a four-bit periodic digit, part of the mantissa
>>
>> (with a hidden leading bit, of course).
>>
>> The periodic hex digit would just be copied (and rounded,
>> if necessary).
>>
>> This will not give you 1/7 and 1/9, but 1/3 and 1/5 would be
>> representable, and you would get most of the constants in the
>> table above.
> <
> This is a cute (and valuable) trick, as my ISA develops I will see if it
> ends up valuable to My 66000.

Yeah, possibly could be useful.

So, looks like (for Quake):
1105 FP constants present:
~ 806 Fp16 (32b encoding)
~ 178 Fp32 (64b encoding)
~ 121 Fp64 (96b encoding)

It could likely cut down on the latter 2 cases.


>>
>> No irrational constants, tough.

Thomas Koenig

unread,
Nov 20, 2022, 7:07:48 AM11/20/22
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Saturday, November 19, 2022 at 2:51:25 PM UTC-6, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>> > As currently implemented in my case, in any case, the table would be
>> > limited to values which can be represented exactly as Binary16, which
>> > mostly excludes 1/3, 2/3, 1/10, ...
>> So, I have a proposal for you. If you have 16 bits available, use
>> a format which has
><
> Unfortunately I have only 5-bits, 32-bits, and 64-bits as choices.
>>
>> - a sign
>> - a three-bit exponent with suitable bias
>> - eight bit of mantissa
>> - a four-bit periodic digit, part of the mantissa
>>
>> (with a hidden leading bit, of course).
>>
>> The periodic hex digit would just be copied (and rounded,
>> if necessary).
>>
>> This will not give you 1/7 and 1/9, but 1/3 and 1/5 would be
>> representable, and you would get most of the constants in the
>> table above.
><
> This is a cute (and valuable) trick,

Thanks!

>as my ISA develops I will see if it
> ends up valuable to My 66000.

Maybe another alternative: You could do the same sort of thing
extending a 32-bit float to a 64-bit float, with the last 12 bits.
This would give you x/4095 = x/(3*3*5*7*13) as rational numbers
for 64-bit constants. That should be enough for most rational
constants, and would save four bytes per constant.
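
The same trick in the float-to-double direction might look roughly like
this (a sketch only; normals only, and the rounding of the last few bits
is again ignored):

#include <stdint.h>
#include <string.h>

static double f32_periodic_to_double(float f)          /* normals only  */
{
    uint32_t s;
    memcpy(&s, &f, sizeof s);
    uint64_t sign = (uint64_t)(s >> 31) << 63;
    int      e    = (int)((s >> 23) & 0xFF) - 127 + 1023;  /* rebias    */
    uint64_t frac = (uint64_t)(s & 0x7FFFFF) << 29;    /* 23 -> 52 bits */
    uint64_t p12  = s & 0xFFF;                         /* 12-bit period */
    frac |= (p12 << 17) | (p12 << 5) | (p12 >> 7);     /* continue it   */
    uint64_t bits = sign | ((uint64_t)e << 52) | frac;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

That is, the low 12 bits of the Binary32 fraction are treated as one
period and repeated into the extra 29 fraction bits of the Binary64
result, which yields the x/4095 family described above.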

Stephen Fuld

unread,
Nov 21, 2022, 12:26:53 PM11/21/22
to
First, thanks for doing this. Data is almost always better than
intuition. I do agree with BGB that more data, especially from
different areas (e.g. graphics, embedded) would be useful. However, I
think the idea of allowing a user changeable lookup just isn't worth the
cost.

But taking the data we have, using the five bit field as an index to a
x32 ROM would allow encoding over 70% of the constants.  From my
calculations and this data, I get that a simple five bit unsigned
integer would get about 45%.  If that is right, the ROM saves about 25%
of the constants that would otherwise require an extra 32 bit word.
This is about 350 words. Since we have no execution statistics (and
this is a library and we have no information on how frequently each
routine is called), we can't talk about potential i-cache savings, etc.

Note that there are several other plausible alternatives. I don't know
how much each of these cost/save in terms of gate complexity/timing, so
someone else would have to evaluate those.

1. The data confirms Mitch's (and I suspect many other's) idea that
small negative constants (e.g. -1) occur more than many "modest" sized
positive constants (e.g. 29). You could take advantage of that by
either treating the five bits as a signed number, or, as part of
instruction decode, subtracting a small constant from the value (e.g. 7)
to allow an asymmetric range (e.g. -7 to 24).

2.    If the area cost of the ROM is large, you could use the high order
bit of the five bit field to indicate the low order 4 bits are an index
to a x16 ROM.  The high order bit equal zero would indicate that those
four bits are the value.  You would lose about 3% of the constants but save
a lot of ROM. Of course, you could combine this with #1 above to have a
few more high use values (e.g. -1) encoded as the "value", thus freeing
some space in the ROM for other values.

Overall, I see these solutions as small wins. Are they worth it? That
isn't for me to say.
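
For what it is worth, option 2 above is only a few lines when modelled
in C (the ROM contents here are purely illustrative, not a proposal; any
real ROM would be filled from frequency data like the table earlier in
the thread):

#include <stdint.h>

static const double imm4_rom[16] = {
    -1.0, -0.5, 0.5, 0.25, 0.125, 1.5, 2.5, 3.5,
    0.1, 0.01, 100.0, 1000.0,
    3.141592653589793, 6.283185307179586, 1.0e-16, 1.0e16
};

static double decode_imm5(unsigned imm5)      /* imm5 in 0..31 */
{
    if (imm5 & 0x10)
        return imm4_rom[imm5 & 0xF];          /* ROM lookup          */
    return (double)imm5;                      /* literal value 0..15 */
}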

BGB

unread,
Nov 21, 2022, 3:55:07 PM11/21/22
to
Yeah.
Constant table, "Yeah, a few LUTs, meh whatever".

User-defined table, now one needs to deal with saving/restoring this
table on context switches, along with a bunch of extra LUTs to allow a
mechanism to read/write the tables' constants (either via MMIO or by
adding special instructions or similar).


> But taking the data we have, using the five bit field as an index to a
> x32 ROM would allow encoding over 70% of the constants.  From my
> calculations and this data, I get that a simple five bit unsigned
> integer would get about about 45%.  If that is right, the ROM saves
> about 25% of the otherwise requiring an extra 32 bit word constants.
> This is about 350 words.  Since we have no execution statistics (and
> this is a library and we have no information on how frequently each
> routine is called), we can't talk about potential i-cache savings, etc.
>
> Note that there are several other plausible alternatives.  I don't know
> how much each of these cost/save in terms of gate complexity/timing, so
> someone else would have to evaluate those.
>
> 1.    The data confirms Mitch's (and I suspect many other's) idea that
> small negative constants (e.g. -1) occur more than many "modest" sized
> positive constants (e.g. 29).  You could take advantage of that by
> either treating the five bits as a signed number, or, as part of
> instruction decode, subtracting a small constant from the value (e.g. 7)
> to allow an asymmetric range (e.g. -7 to 24).
>

Could reuse an idea from earlier, but maybe reorganize it slightly and
add some negative values:
Yeah, could try something different, say:
00..0F:
Map integer values from 0..15.
10..1F: E2.F2
0.250, 0.3125, 0.375, 0.4375
0.500, 0.625 , 0.750, 0.875
-1.000, 1.250 , 1.500, 1.750
-2.000, 2.500 ,-3.000, 3.500

Might need to run stats and compare this with the previous schemes, and
see if it is "more useful".


> 2.    If the area cost of the ROM is large, you could use the high order
> bit of the five fit field to indicate the low order 4 bits are an index
> to a x16 ROM.  The high order bit equal zero would indicate to the those
> four bits as a value.  You would lose about 3% of the constants but save
> a lot of ROM.  Of course, you could combine this with #1 above to have a
> few more high use values (e.g. -1) encoded as the "value", thus freeing
> some space in the ROM for other values.
>
> Overall, I see these solutions as small wins.  Are they worth it?  That
> isn't for me to say.
>

Hard to say...


In general, the FP Imm feature adds around 2% (~ 1.3k LUT) to the total
LUT cost, which is a little steep given its effects are mostly:
Makes binary ~ 1% smaller;
Seems to have no real visible effect on performance.


I had recently been off fiddling with neural-net stuff:
* Training NNs to filter an image (kinda meh on this front)
** It is hard to (significantly) beat "ye olde bicubic".
** The NN is (at best) a few orders of magnitude slower.

Had recently reworked it to operate in terms of 4-wide Binary16 vectors,
which can be mapped relatively efficiently to BJX2.

It is currently evaluating 8 neurons at a time as this is what would
best fit into the pipeline. If I saved/restored some registers, and
allowed bundling FP16 SIMD ops (and/or allowed 8-wide Binary16 SIMD), it
might make more sense to evaluate 16 neurons at a time.

Not really much practical use at the moment (still not particularly
great at reversing JPEG artifacts or similar).


Ironically, this combination of semi-efficient mapping, and the
relatively high cost of emulating Binary16 operations in software, turns
into a situation where the BJX2 core would beat my desktop PC in terms
of performance (in an absolute sense), despite my PC having around 74x
the clock speed.

Though, my desktop PC would still win for most scenarios not involving
emulating Binary16 ops in software...


For a net with 8x8 pixel input, 3 hidden layers, and an output layer,
each hidden layer holding 32 neurons (fully connected, so 4096 weights
in the first layer, 1024 weights in each following layer).

This net would be invoked 3 times per pixel (once for each component).


As a blob of BJX2 ASM, it is ~ 15k clock cycles per pixel (~5k per net),
with LP-FPU enabled (would be ~ 120k cycles without LP-FPU).

On my PC, it is more around 1.5 million clock cycles per pixel (mostly
dominated by the FP16 math ops).


Have also been running tests with 4 hidden layers, but it appears that
this doesn't add much beyond making it slower (whereas 3 hidden layers
beats 2 in terms of accuracy). Similarly, 32 neurons per layer seems to
have an advantage over 16 neurons per layer.

The training algorithm (basically a genetic algorithm) has the ability
to adapt the activation per group of 4 neurons, currently:
Linear (y=x);
ReLU (y=(x>=0)?x:0);
SSQRT (y=(x>=0)?sqrt(x):-sqrt(-x));
USQRT (y=(x>=0)?sqrt(x):0).

It appears SSQRT and USQRT seem to be the most popular options here.
SSQRT mostly approximates the TANH function.
USQRT is partway between Sigmoid and Heaviside.

In both cases, the SQRT operators do an approximated version of a square
root. Linear does nothing; ReLU is basically "Set output to 0 if the
sign bit is set.".
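
In compilable C terms, applied per group of 4 neurons as described above
(sqrtf here stands in for the hardware's cheaper approximate square
root; the enum and function names are just for illustration):

#include <math.h>

typedef enum { ACT_LINEAR, ACT_RELU, ACT_SSQRT, ACT_USQRT } act_t;

static float activate(act_t a, float x)
{
    switch (a) {
    case ACT_LINEAR: return x;
    case ACT_RELU:   return (x >= 0) ? x : 0;
    case ACT_SSQRT:  return (x >= 0) ? sqrtf(x) : -sqrtf(-x);
    default:         return (x >= 0) ? sqrtf(x) : 0;   /* ACT_USQRT */
    }
}

static void activate_group4(act_t a, float v[4])   /* one group of 4 neurons */
{
    for (int i = 0; i < 4; i++)
        v[i] = activate(a, v[i]);
}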


Training the nets is not exactly fast.
It was faster and easier with FP8, since FP8 can map the operators to a
lookup table. With FP16, it is necessary to emulate the math.

Training the net as Binary32 and then converting to Binary16 would not
work, since if the net's values are not constrained during the training
phases, they will tend to go down paths that take them outside of the
dynamic range of Binary16.

Some of this does make an argument for BF16, which would be faster to
emulate on a PC, but is not natively supported by the SIMD operators on
BJX2.


Some of this does make a possible incentive though for, say:
FExx_xxxx_FExx_xxxx_F88n_xxxx PLDCX.H Imm64, Xn
Which loads 4x Binary16 into a 4x Single vector.


This operator could potentially make it viable to use 4x Single ops,
though the current encodings for the Low-Precision FPU ops would burn
the I$ pretty bad in this case (spitting out the nets as giant blobs of
ASM manages to hit the I$ pretty hard).
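
In software terms, my reading of the intended PLDCX.H semantics would be
roughly the following (normals only; the lane order is an assumption):

#include <stdint.h>
#include <string.h>

static float half_to_float(uint16_t h)                /* normals only     */
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (uint32_t)(((h >> 10) & 0x1F) - 15 + 127);
    uint32_t frac = (uint32_t)(h & 0x3FF) << 13;      /* 10 -> 23 bits    */
    uint32_t bits = sign | (exp << 23) | frac;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static void pldcx_h(uint64_t imm64, float v[4])       /* PLDCX.H Imm64, Xn */
{
    for (int i = 0; i < 4; i++)                       /* lane order assumed */
        v[i] = half_to_float((uint16_t)(imm64 >> (16 * i)));
}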


...

Thomas Koenig

unread,
Nov 22, 2022, 9:01:32 AM11/22/22
to
BGB <cr8...@gmail.com> schrieb:
Did we just re-invent the coefficient array? :-)

John Dallman

unread,
Nov 22, 2022, 9:34:49 AM11/22/22
to
In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
tko...@netcologne.de (Thomas Koenig) wrote:
> BGB <cr8...@gmail.com> schrieb:
> > User-defined table, now one needs to deal with saving/restoring
> > this table on context switches, along with a bunch of extra LUTs
> > to allow a mechanism to read/write the tables' constants (either
> > via MMIO or by adding special instructions or similar).
>
> Did we just re-invent the coefficient array? :-)

Yup. Along with a new chunk of context that has to be passed to the
compiler, so that it knows what constants are in the table, and a new way
of making shared libraries mutually incompatible (compiled with different
tables).

John

MitchAlsup

unread,
Nov 22, 2022, 12:26:03 PM11/22/22
to
Sooner (!!) or later (boo) you will realize that constants should be universally
available.
<
>
> John

Stephen Fuld

unread,
Nov 22, 2022, 12:39:09 PM11/22/22
to
Let me be clear. I think that a loadable constant table is a losing
proposition. The benefits are far outweighed by the costs.

But a fixed table seems to have a modest benefit. For the code above,
it eliminates about 25% of the 32 bit inline constants. That would be
reduced by about 5% if you used a modified 5 bit value, either
subtracting a small offset, or treating it as a signed number.

Again, it would be useful to have a larger corpus of code to modify the
values, it's hard to see how adding new code would significantly reduce
the benefit, and might increase it some.

So there is benefit. The question is the cost of the ROM and the gates
to access it. I can't evaluate that.

MitchAlsup

unread,
Nov 22, 2022, 12:48:03 PM11/22/22
to
On Tuesday, November 22, 2022 at 11:39:09 AM UTC-6, Stephen Fuld wrote:

> Let me be clear. I think that a loadable constant table is a loosing
> proposition. The benefits are far outweighed by the costs.
>
> But a fixed table seems to have a modest benefit. For the code above,
> it eliminates about 25% of the 32 bit inline constants. That would be
> reduced by about 5% if you used a modified 5 bit value, either
> subtracting a small offset, or treating it as a signed number.
>
> Again, it would be useful to have a larger corpus of code to modify the
> values, it's hard to see how adding new code would significantly reduce
> the benefit, and might increase it some.
>
> So there is benefit. The question is the cost of the ROM and the gates
> to access it. I can't evaluate that.
<
Take the cost of directly converting a 5-bit field into a 32-bit or 64-bit FP
constant as x gates.
Then::
The cost of a ROM which supplies 8-bits of result to be surrounded by
strings of 0's or 1's is close to 2×x.
The cost of a ROM which supplies 15-bits so that constants of the form
1/3, 1/5, 1/7 can be encoded is close to 3×x.
The cost of a ROM that spits out all 32-bits is close to 8×x.
<
So, those are the individual costs, but these shrink into insignificance
when one actually looks at PARSE and DECODE and the miscellaneous
cruft they have to do cycle by cycle.

BGB

unread,
Nov 22, 2022, 3:37:57 PM11/22/22
to
On 11/22/2022 11:39 AM, Stephen Fuld wrote:
> On 11/22/2022 6:33 AM, John Dallman wrote:
>> In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
>> tko...@netcologne.de (Thomas Koenig) wrote:
>>> BGB <cr8...@gmail.com> schrieb:
>>>> User-defined table, now one needs to deal with saving/restoring
>>>> this table on context switches, along with a bunch of extra LUTs
>>>> to allow a mechanism to read/write the tables' constants (either
>>>> via MMIO or by adding special instructions or similar).
>>>
>>> Did we just re-invent the coefficient array? :-)
>>
>> Yup. Along with a new chunk of context that has to be passed to the
>> compiler, so that it knows what constants are in the table, and a new way
>> of making shared libraries mutually incompatible (compiled with different
>> tables).
>
> Let me be clear.  I think that a loadable constant table is a loosing
> proposition.  The benefits are far outweighed by the costs.
>

Agreed.

This is why my ideas were (if anything) that the table should be fixed
and defined as part of the ISA.

However, making it "useful enough to offset the cost" is more of an
issue here.

I initially tried adding it as my initial stats implied it would have
been usable more often than it seems to be in practice:
Initial estimate: ~ 50%
Observed: ~ 10-15%


> But a fixed table seems to have a modest benefit.  For the code above,
> it eliminates about 25% of the 32 bit inline constants.  That would be
> reduced by about 5% if you used a modified 5 bit value, either
> subtracting a small offset, or treating it as a signed number.
>
> Again, it would be useful to have a larger corpus of code to modify the
> values, it's hard to see how adding new code would significantly reduce
> the benefit, and might increase it some.
>
> So there is benefit.  The question is the cost of the ROM and the gates
> to access it.  I can't evaluate that.
>

Not too big for this part.


In my case, there is a much bigger cost associated with adding a
Binary16 to Binary64 converter into the ID2 / "Register Fetch" stage,
which is sort of needed here for FPU Immediate cases as having a
converter in the CONV operation or similar isn't really usable in this
case (this is even with it only being limited to certain register ports).

This was because the "initial" form of FpImm added:
"OP Imm10fp, Rn" (FADD/FMUL/FCMPxx)
Op64 "OP Rm, Imm16fp, Rn" for FADD/FSUB/FMUL

Both of which can use Binary16 for the immediate (but operate on
Binary64 in registers).

Though, in this case, the cost isn't so much about the cost of the
converter per-se, but rather it seems that adding *any* SPRs or other
"new" registers to this path is unreasonably expensive (I suspect
because Vivado mass-clones this logic for feeding the FF's leading into
the various function units).

Likewise, Vivado has no real way of realizing that most of these FU's
have no reason to care about FPU Immediate values or similar (and some
of these paths lead to additional Binary16 converters, etc).



Had before observed in the past that one can shave LUTs in some cases by
being like:
if(some_nonsense_features)
begin
tOutputValue = 64'hXXXX_XXXX_XXXX_XXXX;
end
But, while this helps reduce LUT cost, it can lead to different behavior
between Vivado and Verilator (using '0000' instead is not quite as
effective at reducing LUT cost, but can at least keeps behavior more
consistent).

Though, this is uncommon for the most part (was mostly used in cases
like masking off trying to do "LDTEX" or an "FMOV.S" load into a Control
Register or similar, which is "obviously invalid").


On the input side, it would take a form more like:
assign tRegValRs = tRegIdRsIsValid ? regValRs : 64'h0000_0000_0000_0000;

But, this sort of thing is a lot less common (and more hit/miss as to
whether it saves LUTs or makes things more expensive).

Partly this is a case though of playing whack-a-mole with which FUs can
accept inputs from which types of registers (and adding logic to detect
and handle these cases, ...).

Then of course, all sorts of wacky stuff that "could exist" apart from a
lack of instructions which could encode the behavior in question (and
Vivado synthesis having possibly no way of realizing "Yeah, no
instruction exists which can encode this behavior").

...

Paul A. Clayton

unread,
Nov 26, 2022, 8:56:30 AM11/26/22
to
Stephen Fuld wrote:
> On 11/22/2022 6:33 AM, John Dallman wrote:
>> In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
>> tko...@netcologne.de (Thomas Koenig) wrote:
>>> BGB <cr8...@gmail.com> schrieb:
>>>> User-defined table, now one needs to deal with saving/restoring
>>>> this table on context switches, along with a bunch of extra LUTs
>>>> to allow a mechanism to read/write the tables' constants (either
>>>> via MMIO or by adding special instructions or similar).
>>>
>>> Did we just re-invent the coefficient array? :-)
>>
>> Yup. Along with a new chunk of context that has to be passed to the
>> compiler, so that it knows what constants are in the table, and
>> a new way
>> of making shared libraries mutually incompatible (compiled with
>> different
>> tables).
>
> Let me be clear.  I think that a loadable constant table is a
> loosing proposition.  The benefits are far outweighed by the costs.

I am skeptical that a small table of constants would be
worthwhile. Even a small table of modifiable values does not seem
likely to be very useful (if in thread local storage such can be
the equivalent of additional registers with more restricted use —
program-global might have some uses but introduces synchronization
complexity, sharing across a broader scope [system/virtual machine
or group of programs] may be theoretically interesting).

In terms of save/restore, having such as a cache indexed early in
the pipeline could avoid some of the issues. If modification is
infrequent, a small store queue might suffice even for a deeply
out-of-order design. (A context switch would cause cache misses
for the entire cache, which would be bad if the feature was
heavily used. The laziness/eagerness of prefetching would be an
"interesting" problem.) This is similar to Todd Austin's Knapsack
cache where an offset range from a global pointer provided zero
cycle loads by being indexed like the register file, directly with
an index easily extracted from the instruction. (I liked the
Knapsack Cache concept, but I suspect modern software does not map
well onto such — not just structuring data such that one wants to
use a structure base pointer for all references [not practical for
a Knapsack Cache] but probably having less commonly used global
data due to larger data sets and other factors.)

Within a modest-body-size loop, a cache of such a cache (i.e., a
store queue) might be practical allowing frequent modification.

I have previously mentioned the possibility of providing fast
loads with IP-related addresses, using an offset from a truncated
instruction pointer (e.g., negative offset relative to 16KiB,
256KiB, etc. regions to allow — in a multiple address space
environment — each process to have different constant collection
without having to duplicate entire pages of system-constant code
[in a single address space system, one could not get a free unique
pointer from the page table, but a special global pointer might be
included in the context and a means to convert indexes in
different code blocks to unique indexes into the global region —
but that seems likely to get complicated quickly with shared
libraries et al.]).

Such storage could be in a writable page. Of course with special
access instructions, it might also be possible to have a small
section of code pages be writable. Security issues might make such
unattractive. Providing smaller protection regions would seem more
useful than a special instruction that can write into a specific
small portion of "code space" and possibly not actually more
expensive.

That was a substantial digression.

Hardware support for expanding values (not just size conversions)
might have somewhat general use potential (or not). If such was
more generally useful, its use in expanding/converting immediates
would be "free". A preprocessing stage for immediates might also
be provided to precede the more general conversion, possibly
presenting a reasonable compromise between generating desirable
values and requiring more hardware of limited use.

(I am in a sleepy state where it is difficult to distinguish
creativity from incoherence — not that these are entirely
disjoint.)

MitchAlsup

unread,
Nov 26, 2022, 2:24:57 PM11/26/22
to
On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
> Stephen Fuld wrote:
> > On 11/22/2022 6:33 AM, John Dallman wrote:
> >> In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
> >> tko...@netcologne.de (Thomas Koenig) wrote:
> >>> BGB <cr8...@gmail.com> schrieb:
> >>>> User-defined table, now one needs to deal with saving/restoring
> >>>> this table on context switches, along with a bunch of extra LUTs
> >>>> to allow a mechanism to read/write the tables' constants (either
> >>>> via MMIO or by adding special instructions or similar).
> >>>
> >>> Did we just re-invent the coefficient array? :-)
> >>
> >> Yup. Along with a new chunk of context that has to be passed to the
> >> compiler, so that it knows what constants are in the table, and
> >> a new way
> >> of making shared libraries mutually incompatible (compiled with
> >> different
> >> tables).
> >
> > Let me be clear. I think that a loadable constant table is a
> > loosing proposition. The benefits are far outweighed by the costs.
> I am skeptical that a small table of constants would be
> worthwhile.
<
I am entirely convinced that the table of constants which is not
fixed forever (ROM) is an entirely losing position.
Apply more alcohol and call me at noon.....

BGB

unread,
Nov 26, 2022, 6:00:49 PM11/26/22
to
On 11/26/2022 1:24 PM, MitchAlsup wrote:
> On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
>> Stephen Fuld wrote:
>>> On 11/22/2022 6:33 AM, John Dallman wrote:
>>>> In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
>>>> tko...@netcologne.de (Thomas Koenig) wrote:
>>>>> BGB <cr8...@gmail.com> schrieb:
>>>>>> User-defined table, now one needs to deal with saving/restoring
>>>>>> this table on context switches, along with a bunch of extra LUTs
>>>>>> to allow a mechanism to read/write the tables' constants (either
>>>>>> via MMIO or by adding special instructions or similar).
>>>>>
>>>>> Did we just re-invent the coefficient array? :-)
>>>>
>>>> Yup. Along with a new chunk of context that has to be passed to the
>>>> compiler, so that it knows what constants are in the table, and
>>>> a new way
>>>> of making shared libraries mutually incompatible (compiled with
>>>> different
>>>> tables).
>>>
>>> Let me be clear. I think that a loadable constant table is a
>>> loosing proposition. The benefits are far outweighed by the costs.
>> I am skeptical that a small table of constants would be
>> worthwhile.
> <
> I am entirely convinced that the table of constants which is not
> fixed forever (ROM) is an entirely losing position.
> <

I tried multiple variations on the idea...

Seemingly, the original idea of a naive E3.F2 encoding is still holding
first place.

So, first place option (current):
* 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
* 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
* 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
* 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
* 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
* 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
* 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
* 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)

Second place:
* 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
* 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
* 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
* 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
* 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
* 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
* 110xx: 4800, 4880, 4900, 4980 ( 8.000, 9.000, 10.000, 11.000)
* 111xx: 4A00, 4A80, 4B00, 4B80 (12.000, 13.000, 14.000, 15.000)

Third Place:
The split 0..15 + E2.F2 schemes.

Some variants had included 00000 => 0.0, but it seems that 0.0 is better
left out, as it isn't useful in practice (and 1/8 is
more useful).
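
One nice property of the first-place table: decoding it is literally
just placing the 5-bit field into the Binary16 exponent/fraction with a
fixed offset (a sketch, reusing a normals-only half-to-double widening;
names are just for illustration):

#include <stdint.h>
#include <string.h>

static double imm5_e3f2_to_double(unsigned imm5)   /* imm5 in 0..31 */
{
    uint16_t h = (uint16_t)(0x3000 | (imm5 << 8)); /* E3.F2 -> Binary16 */
    uint64_t exp  = ((h >> 10) & 0x1F) - 15 + 1023;
    uint64_t frac = (uint64_t)(h & 0x3FF) << 42;
    uint64_t bits = (exp << 52) | frac;            /* always positive */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;    /* 0 -> 0.125, 12 -> 1.0, 31 -> 28.0 */
}

The exponent bias of 3 is folded into the single constant 0x3000, which
is presumably part of why this variant stays cheap.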



In other news, I am experimenting with a feature that expands the
low-precision FPU to full Binary32 precision and allows eliminating the
split between "low precision" and "normal" 4xFp32 SIMD (effectively all
of the Binary32 SIMD can be routed through the low-precision unit).

Minor issue is that Binary32 FMUL needs a 25-bit multiply rather than a
16-bit multiply, which does not fit into a single DSP.

Doing a "naive" multiply in Vivado burns 3 DSP's per FMUL, and a bunch
of extra LUTs. Have been trying to come up with a cheaper way to
structure the multiplier.

It appears I can get the approximately correct answer by adding two 8x7
bit multipliers, but it is partly a matter of implementing these in a
way that is hopefully cheaper than the 2 extra DSPs.
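
For reference, one standard decomposition (not necessarily the two-8x7
scheme above) splits one operand so that one piece fits a DSP-sized
multiply and the remainder stays small:

#include <stdint.h>

static uint64_t mul25x25(uint32_t a, uint32_t b)   /* a, b < 2^25 */
{
    uint64_t lo = (uint64_t)a * (b & 0x3FFFF);     /* 25 x 18 piece */
    uint64_t hi = (uint64_t)a * (b >> 18);         /* 25 x 7 piece  */
    return lo + (hi << 18);                        /* exact product */
}

The 25x7 partial product is what ends up in LUTs; the question is
whether it (or something narrower and only approximately correct) is
cheaper than spending the two extra DSPs.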


My initial attempts at "cheesing it" were not accurate enough to pass my
sanity tests (which actually expect the "correct" answer for the
floating-point output in this case).

Does at least still pass timing even with the higher cost.

This mostly reduces things like "PADDX.F" from 10 cycles (non pipelined)
to 3 cycles (pipelined).

MitchAlsup

unread,
Nov 26, 2022, 6:54:50 PM11/26/22
to
On Saturday, November 26, 2022 at 5:00:49 PM UTC-6, BGB wrote:
> On 11/26/2022 1:24 PM, MitchAlsup wrote:
> > On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
> >> Stephen Fuld wrote:

> I tried multiple variations on the idea...
>
> Seemingly, the original idea of a naive E3.F2 encoding is still holding
> first place.
>
> So, first place option (current):
> * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
> * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
> * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
> * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
> * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
> * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
> * 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
> * 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)
<
BTW:: this is only 4 fractions with 8 exponents.
>
> Second place:
> * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
> * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
> * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
> * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
> * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
> * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
> * 110xx: 4800, 4880, 4900, 4980 ( 8.000, 9.000, 10.000, 11.000)
> * 111xx: 4A00, 4A80, 4B00, 4B80 (12.000, 13.000, 14.000, 15.000)
<
This is 8 fractions with 7+1 exponents.

BGB

unread,
Nov 26, 2022, 9:59:17 PM11/26/22
to
On 11/26/2022 5:54 PM, MitchAlsup wrote:
> On Saturday, November 26, 2022 at 5:00:49 PM UTC-6, BGB wrote:
>> On 11/26/2022 1:24 PM, MitchAlsup wrote:
>>> On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
>>>> Stephen Fuld wrote:
>
>> I tried multiple variations on the idea...
>>
>> Seemingly, the original idea of a naive E3.F2 encoding is still holding
>> first place.
>>
>> So, first place option (current):
>> * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
>> * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
>> * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
>> * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
>> * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
>> * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
>> * 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
>> * 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)
> <
> BTW:: this is only 4 fractions with 8 exponents.

Yeah, can't do much more with 5 bits.
Overall hit rate: 39.2% for FP constants (of all the constants).


>>
>> Second place:
>> * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
>> * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
>> * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
>> * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
>> * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
>> * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
>> * 110xx: 4800, 4880, 4900, 4980 ( 8.000, 9.000, 10.000, 11.000)
>> * 111xx: 4A00, 4A80, 4B00, 4B80 (12.000, 13.000, 14.000, 15.000)
> <
> This is 8 fractions with 7+1 exponents.

I was partially incorrect in ranking this one second; this scheme
actually seems to be getting a 39.7% hit rate.

Though, the difference isn't very big either way...



As can be noted, the "bell curve" is pretty dense in this area, though
it could use more fraction bits.

Though, an E2.F3 scheme does worse by having a larger number of
constants falling outside the dynamic range.


>>
>> Third Place:
>> The split 0..15 + E2.F2 schemes.
>

Third Place:
00..0F:
Map integer values from 0..15.
10..1F: E2.F2
0.250, 0.3125, 0.375, 0.4375
0.500, 0.625 , 0.750, 0.875
-1.000, 1.250 , 1.500, 1.750
-2.000, 2.500 ,-3.000, 3.500

Hit rate seems to be worse than the former two schemes.
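
For what it is worth, the first-place E3.F2 scheme above is cheap to decode:
the 5-bit field (eee.ff) drops straight into the top of a Binary16 bit
pattern. A minimal C sketch of that mapping (just an illustration of the
table above; the function names are made up, not the actual implementation):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Decode a 5-bit E3.F2 immediate into a Binary16 bit pattern, per the
 * "first place" table above: half = 0x3000 | (imm5 << 8).
 * E.g. 0b01100 -> 0x3C00 (1.0), 0b11111 -> 0x4F00 (28.0). */
static uint16_t imm5fp_to_half(unsigned imm5)
{
    return (uint16_t)(0x3000u | ((imm5 & 0x1Fu) << 8));
}

/* Same encoding as a numeric value: (1 + f/4) * 2^(e - 3). */
static double imm5fp_value(unsigned imm5)
{
    unsigned e = (imm5 >> 2) & 7, f = imm5 & 3;
    return (1.0 + f / 4.0) * ldexp(1.0, (int)e - 3);
}

int main(void)
{
    for (unsigned i = 0; i < 32; i++)
        printf("imm5=%2u -> 0x%04X (%g)\n", i,
               (unsigned)imm5fp_to_half(i), imm5fp_value(i));
    return 0;
}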



Recently, had also added an Op96 encoding for "Load a 4xFp16 vector as
4xFp32", though this gets a fairly low hit rate due to the relatively
low frequency of 128-bit vector constants (Eg: "(__vec4f) {1.0, 2.0,
3.0, 4.0}" or similar).

This was more of a "cost disappears in the noise" feature (most of the
"cost" being whether there is something else "more worthy" of using
the FExx_xxxx_FExx_xxxx_F88n_xxxx encoding space...).


One could maybe argue for "SIMD operation with a vector immediate", but
this is even more niche...

In any case, probably still doing better here than SSE where loading a
vector constant generally requires a memory load or similar...


Terje Mathisen

unread,
Nov 27, 2022, 11:30:22 AM11/27/22
to
MitchAlsup wrote:
> On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
>> Stephen Fuld wrote:
>>> On 11/22/2022 6:33 AM, John Dallman wrote:
>>>> In article <tlikno$3o18r$1...@newsreader4.netcologne.de>,
>>>> tko...@netcologne.de (Thomas Koenig) wrote:
>>>>> BGB <cr8...@gmail.com> schrieb:
>>>>>> User-defined table, now one needs to deal with saving/restoring
>>>>>> this table on context switches, along with a bunch of extra LUTs
>>>>>> to allow a mechanism to read/write the tables' constants (either
>>>>>> via MMIO or by adding special instructions or similar).
>>>>>
>>>>> Did we just re-invent the coefficient array? :-)
>>>>
>>>> Yup. Along with a new chunk of context that has to be passed to the
>>>> compiler, so that it knows what constants are in the table, and
>>>> a new way
>>>> of making shared libraries mutually incompatible (compiled with
>>>> different
>>>> tables).
>>>
>>> Let me be clear. I think that a loadable constant table is a
>>> loosing proposition. The benefits are far outweighed by the costs.
>> I am skeptical that a small table of constants would be
>> worthwhile.
> <
> I am entirely convinced that the table of constants which is not
> fixed forever (ROM) is an entirely losing position.

I agree: Such a table must be context-switched, increasing the state
and making it a net loss for most programs, while providing only a marginal
speedup for the few programs written and compiled to take advantage of it.

It is somewhat similar to user-loadable microcode, except far less general.

The SRAM needed to store the process constant table would almost
certainly be better used as regular cache: If the constant table is
accessed frequently enough to matter, then it will also be $L1 resident.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld

unread,
Nov 30, 2022, 1:36:35 PM11/30/22
to
Thank you. So it seems that the approach is a modest win at
insignificant cost. The question still remains about which constants are
popular in other software. A search showed a lot of open source
graphics software that could be used as input to what Thomas did for
GSL, but I can't evaluate which of them would be worth doing. But given
that a few of those might find a few more good values to use in place of
some of the modestly used ones Thomas found, it seems worthwhile.

Overall, I am still of the opinion that a variable table is a dead
loser, but a fixed (in ROM) table is beneficial.

MitchAlsup

unread,
Nov 30, 2022, 2:27:04 PM11/30/22
to
While having a ROM of several handfuls of FP constants carries insignificant
cost in PARSE and DECODE, my exposure to the breadth of FP constants in
code using FP arithmetic indicates that finding and isolating the "in crowd"
from the "out crowd" is fraught with peril. If BGB's data is even close to
accurate, I suspect that the 5-bit constant will probably never get over 40%
of the FP constants arriving at the code generator.
<
BUT (the big BUT), if we have a distribution of 40% 5-bit, 40% 32-bit, and 20%
64-bit; and that 75% of FP code is (double) with 25% (single), the math for
saving code space looks like:
<
No 5-bit constants, no 32-bit to 64-bit expansion: 40% of instructions are FP,
and 20% FP instructions have a constant:
40%×(20%×(25%×1word+75%×2words))
= 2%×1word+6%×2w
= .02+.12 = .14×words
So,
60% of instructions are 1 word, the other 40% are 1.14words
= 0.6+0.456 = 1.056 words per instruction
<
Now if we do all of the above "optimizations" the math becomes:
40%×(20%×(25%×(40%×0+60%×1word)+75%(40%×0+40%×1word+20%×2words)))
=2%×0.6word+6%×0.8words
=0.012+0.048 = 0.06×words
So,
60% of instructions are 1 word, the other 40% are 1.06words
= 0.6+0.424 = 1.024
<
spitting distance of a 5% gain in code density.

John Dallman

unread,
Nov 30, 2022, 3:39:33 PM11/30/22
to
In article <a436409e-2939-4647...@googlegroups.com>,
Mitch...@aol.com (MitchAlsup) wrote:

> my exposure to the breadth of FP constants in code using FP
> arithmetic indicates that finding and isolating the "in crowd"
> from the "out crowd" is fraught with peril.

I think you're right. Looking at BGB's data, I'm sure that the stuff I
work on would have a fairly different table, with pi further up the table
and -3.14E18 (which is used as a recognisable null value). 0.0, 0.5, 1.0,
and -1.0 are likely universal, but after that, it gets very variable.

John

MitchAlsup

unread,
Nov 30, 2022, 4:43:25 PM11/30/22
to
Comparing to ISAs without FP constants::
Same data as above, but adding "math" for ISAs without any FP constants
(other than 0.0).
<
So, here each FP constant (other than 0.0) is loaded from a constant pool
maintained and condensed by the assembler/linker portion of the tool
chain. I am not going to charge the code the overhead of having a pointer
into the constant pool, nor for the losses associated with consuming another
register to hold the constant; the only charge accounted for here is the LD
instruction itself. I further consider that this LD instruction is 1word in
size (uniformly); and I am going to charge the constant pool for the size
of the constants
<
No 5-bit constants, no 32-bit to 64-bit expansion: 40% of instructions are FP,
and 20% FP instructions have a constant:
40%×(20%×(25%×(1inst+1word)+75%×(1inst+2words)))
= 2%×2words+6%×3words
= 0.04+0.18
Adding in the consuming instruction we get::
= 1.22×words /FP inst
<
So, 1.024 words per FP instruction with constants is (I would say) usefully better
than 1.22 words per FP instruction without constants. This feature would have
"made the cut" even under the MIPS mantra of "only those things gaining 10%
make the cut".

BGB

unread,
Nov 30, 2022, 4:59:36 PM11/30/22
to
Yeah. I don't really know.

Thus far, the scope of my testing has mostly been GLQuake, as it is at
present the most FPU intensive program in my set of test programs.


I am getting some ambiguous results:
The 5-bit Imm5 stat seems to be getting ~ 39% ish for bare immediates
(globally);
Though, what ends up hitting for the actual instruction encodings seems
to be a little lower than this (around 31%, after adding a specific stat
for this path).


I had fiddled with it, and it seems the initial scheme (E3.F2) is still
back in first place.

The modified schemes, it turned out, were getting some "false positive"
hits due to a bug and were actually a little lower than the original scheme.


A fine-tuned table could maybe be better, but in any case, not getting
much better than this with the various "generic" schemes I have been
trying. A dynamically-modified table is no-go IMO, and defining the
table as cherry-picked values for a specific program seems like a bad
idea IMO.


The Fp10 path (S.E5.F4) seems to be doing a bit better, getting a
roughly 63% hit rate.


I am facing a minor issue at present where, for some as-of-yet
unresolved reason, trying to use the FpImmed encodings results in
mangled-looking alias models in GLQuake in the simulation (though not in
the emulator). This is most likely a Verilog bug, but I haven't found it
yet (nor been able to catch the issue by adding sanity checks).

Though, I have poked at a bunch of stuff in these areas recently, so
having broken something wouldn't exactly be a huge surprise.


> <
> BUT (the big BUT), if we have a distribution of 40% 5-bit, 40% 32-bit, and 20%
> 64-bit; and that 75% of FP code is (double) with 25% (single), the math for
> saving code space looks like:
> <
> No 5-bit constants, no 32-bit to 64-bit expansion: 40% of instructions are FP,
> and 20% FP instructions have a constant:
> 40%×(20%×(25%×1word+75%×2words))
> = 2%×1word+6%×2w
> = .02+.12 = .14×words
> So,
> 60% of instructions are 1 word, the other 40% are 1.14words
> = 0.6+0.456 = 1.056 words per instruction
> <
> Now if we do all of the above "optimizations" the math becomes:
> 40%×(20%×(25%×(40%×0+60%×1word)+75%(40%×0+40%×1word+20%×2words)))
> =2%×0.6word+6%×0.8words
> =0.012+0.048 = 0.06×words
> So,
> 60% of instructions are 1 word, the other 40% are 1.06words
> = 0.6+0.424 = 1.024
> <
> spitting distance of a 5% gain in code density.


Yeah, I do at least see some code density improvement.

For the cases that miss with the Imm5fp encoding, a significant majority
in this case still hit with the Fp16 encoding.

The 10-bit encoding used for 2RI ops (such as FCMPxx), seems to be doing
a little better.


In any case though, "FLDCH Imm16, Rn" (Binary16 constant load) seems to
still be seeing fairly heavy usage (in general, representing the bulk of
the floating point constant loads).
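
For reference, the work FLDCH has to do is essentially an IEEE binary16 ->
binary64 widening on the bit pattern (presumably widening the immediate to
the Binary64 register format). A C sketch of such a widening, handling
subnormals and Inf/NaN; this is generic, not the actual Verilog, and the
function name is made up:

#include <stdint.h>

/* Widen an IEEE-754 binary16 bit pattern to a binary64 bit pattern. */
static uint64_t fp16_to_fp64_bits(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t frac = h & 0x3FF;

    if (exp == 0x1F)                   /* Inf/NaN: keep the payload's top bits */
        return sign | (0x7FFULL << 52) | ((uint64_t)frac << 42);
    if (exp == 0) {
        if (frac == 0)                 /* +/- 0.0 */
            return sign;
        /* subnormal: normalize, adjusting the exponent as we go */
        int shift = 0;
        while (!(frac & 0x400)) { frac <<= 1; shift++; }
        frac &= 0x3FF;
        return sign | ((uint64_t)(1023 - 15 + 1 - shift) << 52)
                    | ((uint64_t)frac << 42);
    }
    /* normal: rebias exponent (15 -> 1023), left-align the 10-bit fraction */
    return sign | ((uint64_t)(exp - 15 + 1023) << 52) | ((uint64_t)frac << 42);
}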


A recent addition is that, if jumbo prefixes are used on the above
instruction, it turns into a 4x Binary16 to 4x Binary32 vector load.
This is also seeing some amount of usage (not "huge", but something).

Mostly matters for things like SIMD vector literals (used a bit in
TKRA-GL and similar).


TKRA-GL also sees a minor speed improvement from reducing 4x Binary32
SIMD ops from 10 cycles to 3 cycles.

This speedup is more obvious in BtMini3 than in GLQuake though (though,
in GLQuake, some cases that were previously 4 fps are now around 5 or 6
fps).

Though, in any case, GLQuake at mostly 6-10 fps still falls short of
being particularly playable (would be closer to around 15-20 fps at 100
MHz).



BtMini3 currently gets better framerates than GLQuake (roughly on par
with Hexen at present), but still has a very short draw-distance.

Then again, it is a "Minecraft-like" 3D engine running on a CPU at 50
MHz with software rasterized OpenGL, so probably can't set the standards
all that high in terms of draw distance.


If one had scenes that looked like the "Money for Nothing" music video
(with minimalist geometry and lots of flat shading and occasional
texture maps), it is probable I could run something like this at better
framerates.


Say, if the scene + models stays around 200..500 polygons per frame,
this is easier than a scene pushing 800..2000 polygons per frame (eg:
GLQuake).

Also, flat-shaded geometry can be drawn faster than texture-mapped and
color-modulated geometry (which is in turn faster than drawing a scene
using multiple passes doing blending operations for the lightmaps).

...

MitchAlsup

unread,
Nov 30, 2022, 5:58:02 PM11/30/22
to
This is an interesting point to discuss::
<
a) a FPCONST instruction which takes the std 16-bit immediate and
moves and expands it into a 32-bit or 64-bit register goes a long way
in satisfying FP constants (probably more than 50% with suitable
encodings).
but
b) it executes as 1-more-instruction
whereas
c) you can get a 32-bit constant in the same code footprint when you
have universal access to constants.
which is why
d) My 66000 did not have to "go there".
>

BGB

unread,
Nov 30, 2022, 6:02:40 PM11/30/22
to
The schemes I am dealing with were mostly using simple constants, as
these are what can be encoded via such a scheme...


But, yeah, this is more just sort of running stats, and seeing a very
obvious "bell curve" shape centered around 1.0, with constants tending
to have most of their entropy near the top of the mantissa.


There are constants like PI and similar showing up some in my case as
well, but currently I lack any viable way to handle them efficiently
(and these seem to be a lot less common in general).


Having special instructions for "Load PI and E and friends" is likely a
losing proposition in terms of resource cost (vs just sort of being like
"meh, whatever" and letting these cases fall back to generic
full-width 64-bit constant loads).


Can't shove PI into Binary16, as this gives either:
3.140625
Or:
3.142578125

Which is, not exactly good enough...

Granted, BGBCC does sort of allow, say:
f=3.14159sf;
Which will sort of mash it down into a Binary16 friendly form (the 'f'
suffix will also mash constants down to Binary32).


But, in general, the compiler will only use a narrower encoding if it
can exactly encode the constant in question.


Where, I can note:
short float sf;

Is understood by BGBCC to mean that the value is Binary16 (despite
scalar ops using Binary64 in registers). This does lead to an
ambiguity, as "short float" values used in the ABI (and in local
variables and similar) will have significantly larger range and
precision than the "actual" type (which is only likely to manifest if
one takes the address of the variable...).

Though, "dynamic range and precision of this variable isn't terrible
enough" isn't exactly a high priority issue IMO.


In-memory cases (struct members, arrays, and pointers) will use the
actual format though (as is also the case when taking the address of
the variable).


A recent tweak has opened up a new sub-issue though, creating more
incentive to add 32-bit encodings for "FADDA" and similar (since now
"float" ops could be done in 3 cycles rather than 6 via the
low-precision FPU, but the instruction encoding for doing so currently
needs 8 bytes, ...).

Well, and/or I set the rounding mode and use FADDG and similar (which
would then interact in very annoying ways with any code that tries to
use fenv_access).

The other option would be changing the ABI to not keep 'float' as
Binary64 in registers (and then reusing 2x Binary32 SIMD ops for scalar
'float' operations, but this is a much uglier solution).


> John

MitchAlsup

unread,
Nov 30, 2022, 6:45:03 PM11/30/22
to
When I use Pi or E in my transcendental calculations I use more than 64-bits
of fraction, each. Not easy to do in SW.
>
> Which is, not exactly good enough...
>
<
There is always someone who knows the exact value of Pi or E; and some
of these people want Pi-e or Pi+e (same for E) to get guaranteed rounding
bias in the middle of calculations.
<

Quadibloc

unread,
Dec 1, 2022, 8:12:45 PM12/1/22
to
On Saturday, November 26, 2022 at 12:24:57 PM UTC-7, MitchAlsup wrote:
> On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
> > Stephen Fuld wrote:

> > > Let me be clear. I think that a loadable constant table is a
> > > loosing proposition. The benefits are far outweighed by the costs.

> > I am skeptical that a small table of constants would be
> > worthwhile.

> I am entirely convinced that the table of constants which is not
> fixed forever (ROM) is an entirely losing position.

I certainly don't disagree *either* with this massive consensus.

However, what I'd like to ask is _why_.

Or, rather, I know _why_. If the table of constants is something that
can be rapidly accessed (otherwise, if it's just in ordinary RAM, it
serves no purpose) then it is a *limited resource*. And while that's
no problem on a computer that only has *one task running* at a
given time, today's desktop computers, like the timesharing mainframes
of old, have plenty of tasks running: even with a single user, there are
the background processes the operating system runs and whatever other
windows are open in the background on the screen.

So either you've got enough fast storage for a pile of constant
tables, which is occupying space and using transistors that could
be put to far better use, or you have more data to swap in and out
when going from one process to another.

So *here* is where I'm getting...

If one _could_ solve the problem of having a small loadable table
of constants, unique to each process... then, one would also have
solved some _much more important_ problems. Because this
solution could also be applied to make it practical for computers to
have larger register files than they do at present. Which could no
doubt be exploited to make them run faster.

John Savard

MitchAlsup

unread,
Dec 1, 2022, 10:42:09 PM12/1/22
to
In my case: the table of small constants is present to reduce code
footprint--because I have constants available in every instruction.
On the integer side of things, a 5-bit constant in the RS1 position
satisfies constants used in things like::
SLL R7,#1,R9
DIV R7,#10,R9
...
Those seldom-facilitated constant positions, when used, always save
an instruction. What we would like is to save as much code space
and as many instruction cycles as possible. Here the 5-bit field does
just that.
<
On the FP side, these 5-bit constants are "not all that useful"--although
0.0, 1.0, 2.0, -1.0, 10.0 occur often enough that converting this 5-bit
integer into a FP constant can be done while the rest of register read
and DECODE proceed. The conversion takes less than 20 gates.
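For reference, the semantics of that expansion amount to nothing more than
the following (a C sketch of turning a 5-bit unsigned immediate into a
binary64 bit pattern; the gate-level version is obviously not written this
way, and the helper name is made up):

#include <stdint.h>

/* Expand a 5-bit unsigned immediate (0..31) to an IEEE-754 binary64 bit
 * pattern.  Semantically just (double)imm5, spelled out field by field. */
static uint64_t imm5_to_fp64_bits(unsigned imm5)
{
    if (imm5 == 0)
        return 0;                         /* +0.0 */
    unsigned msb = 0;                     /* position of the leading 1 */
    for (unsigned t = imm5; t > 1; t >>= 1)
        msb++;
    uint64_t exp  = 1023u + msb;          /* biased exponent */
    uint64_t frac = ((uint64_t)imm5 << (52 - msb)) & ((1ULL << 52) - 1);
    return (exp << 52) | frac;
}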
<
But what I am looking for is for the 5-bit constant to satisfy a few more
operand necessities BEFORE calling forth the 32-bit of 64-bit constants
(which, necessarily, eat more code space). So the rejoining question is::
for a few more than those 20 gates, can I get something significantly
more useful ?!?
<
Over on the flip side--I don't want a table per thread--because I am shooting
at 10-cycles per context switch, and a 32-entry table would add 40%-80%
to the total time, plus the added memory overhead (arguably irrelevant),
and whatever system thread needed to service this seldom changing per
thread resource.
<
I had a 128 entry per process constant pool in the H.E.P, and found it more
difficult to use than profitable to have. Learning from this, and from the lack
of constants in M88K, SPARC, x86; led me to my current position.
>
> John Savard

MitchAlsup

unread,
Dec 2, 2022, 10:38:08 PM12/2/22
to
A spreadsheet, that is accurate, indicates this should be 1.140
<
> > <
> > Now if we do all of the above "optimizations" the math becomes:
> > 40%×(20%×(25%×(40%×0+60%×1word)+75%(40%×0+40%×1word+20%×2words)))
> > =2%×0.6word+6%×0.8words
> > =0.012+0.048 = 0.06×words
> > So,
> > 60% of instructions are 1 word, the other 40% are 1.06words
> > = 0.6+0.424 = 1.024
<
A spreadsheet, that is accurate, indicates this should be 1.060
<
> > <
> > spitting distance of a 5% gain in code density.
> <
> Same data as above, but adding "math" for ISAs without any FP constants
> (other than 0.0).
> <
> So, here each FP constant (other than 0.0) is loaded from a constant pool
> maintained and condensed by the assembler/linker portion of the tool
> chain. I am not going to charge the code the overhead of having a pointer
> into the constant pool, nor for the losses associated with consuming another
> register to hold the constant; the only charge accounted for here is the LD
> instruction itself. I further consider that this LD instruction is 1word in
> size (uniformly); and I am going to charge the constant pool for the size
> of the constants
> <
> No 5-bit constants, no 32-bit to 64-bit expansion: 40% of instructions are FP,
> and 20% FP instructions have a constant:
> 40%×(20%×(25%×(1inst+1word)+75%×(1inst+2words)))
> = 2%×2words+6%×3words
> = 0.04+0.18
> Adding in the consuming instruction we get::
> = 1.22×words /FP inst
<
A spreadsheet, that is accurate, indicates this should be 1.220
<
So,
No FP constants 1.22 words per FP instruction (RISC-V)
No Expansion..... 1.14 words per FP instruction (7% better)
Im32 Expansion. 1.09 words per FP instruction (11.7% better)
Im5 Expansion.. 1.06 words per FP instruction (15.1% better)
<
No Constants corresponds to the case where if you need a 32-bit FP constant
you LD a 32-bit Constant, and if you need a 64-bit Constant you LD a 64-bit
Constant. This costs the LD (1 word) plus the data memory footprint.
<
No expansion corresponds to the case where if you need a 64-bit FP constant
the instruction stream uses a 64-bit constant. This gets rid of the LDs.
<
im32 Expansion corresponds to the case where if the constant fits in 32-bits
you use a 32-bit constant. If the Constant is consumed with a 64-bit FP
calculation, the constant is expanded to 64-bits "on the fly".
<
Im5 Expansion corresponds to the case where if the constant fits in 5-bits
you use a 5-bit constant, if the constant fits in 32-bits you use a 32-bit
constant. Both 5-bit and 32-bit Constants are expanded to 64-bits when
consumed by a 64-bit FP calculation.
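For anyone wanting to replay the arithmetic, a small C sketch of the
accounting (the percentages are the assumptions stated up-thread; a word is
32 bits; purely bookkeeping, not measured data):

#include <stdio.h>

int main(void)
{
    /* 40% of instructions are FP, 20% of those carry a constant,
     * constants are 25% single / 75% double, and the constant values
     * split 40% 5-bit, 40% 32-bit, 20% 64-bit. */
    const double fp = 0.40, has_k = 0.20, sgl = 0.25, dbl = 0.75;
    const double k5 = 0.40, k32 = 0.40, k64 = 0.20;

    /* extra words, averaged over the whole instruction stream */
    double ld_pool  = fp*has_k*(sgl*(1+1) + dbl*(1+2));     /* LD + pool data */
    double no_exp   = fp*has_k*(sgl*1 + dbl*2);             /* inline, no expansion */
    double im32_exp = fp*has_k*(sgl*1 + dbl*((k5+k32)*1 + k64*2));
    double im5_exp  = fp*has_k*(sgl*((1-k5)*1) + dbl*(k32*1 + k64*2));

    printf("No FP constants : %.3f words/inst\n", 1.0 + ld_pool);   /* 1.220 */
    printf("No Expansion    : %.3f words/inst\n", 1.0 + no_exp);    /* 1.140 */
    printf("Im32 Expansion  : %.3f words/inst\n", 1.0 + im32_exp);  /* 1.092 */
    printf("Im5 Expansion   : %.3f words/inst\n", 1.0 + im5_exp);   /* 1.060 */
    return 0;
}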
<
A use case that might not be imagined immediately::
<
CVTSD R4,#3 // passing a small integer constant as (double)
<
You would never do something like this using the constant pool and LDs.
<
CVTFD R4,#1.0625
<
Saves a word of code space at the cost of executing the convert.
<
Now I have to go back and account of FP=0.0 constants.

Thomas Koenig

unread,
Dec 3, 2022, 4:40:20 AM12/3/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

> On the FP side, these 5-bit constants are "not all that useful"--although
> 0.0, 1.0, 2.0, -1.0, 10.0 occur often enough that converting this 5-bit
> integer into a FP constant can be done while the rest of register read
> and DECODE proceed. The conversion takes less than 20 gates.

Makes sense.

> But what I am looking for is for the 5-bit constant to satisfy a few more
> operand necessities BEFORE calling forth the 32-bit or 64-bit constants
> (which, necessarily, eat more code space). So the rejoining question is::
> for a few more than those 20 gates, can I get something significantly
> more useful ?!?

I would probably go for an offset and multiplying by 1/2. If you
encode unsigned n in the five-bit field, then the value of the floating
point constant could be (n-5)/2.

This would give you constants from -2.5, -2,-1.5,...,13,
which would give you most constants in the GSL table, at least.
The only halfway important rational constant missing would be 0.25.
This scheme would give you a bit more than 56% of floating point
constants in the GSL table, not bad (especially considering that
GSL contains approximations for special functions which use a lot
of constants).

(n-offset)/4, to catch 0.25, would lose in the higher integer
numbers. Another possibility would be (n-4)/2, but 13.5 isn't
really needed often, or (n-8)/2 could also be done, but
-4 isn't used that often, probably less than 12 (but the
differences would be minute).
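
A trivial sketch of the proposed mapping, just to make the resulting table
explicit (the helper name is only for illustration):

#include <stdio.h>

/* Proposed mapping: unsigned 5-bit field n -> (n - 5) / 2,
 * i.e. -2.5, -2.0, -1.5, ..., 12.5, 13.0 in steps of 0.5. */
static double imm5_offset_half(unsigned n)
{
    return ((double)n - 5.0) * 0.5;
}

int main(void)
{
    for (unsigned n = 0; n < 32; n++)
        printf("n=%2u -> %5.2f\n", n, imm5_offset_half(n));
    return 0;
}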

Quadibloc

unread,
Dec 3, 2022, 10:12:46 AM12/3/22
to
On Thursday, December 1, 2022 at 8:42:09 PM UTC-7, MitchAlsup wrote:

> Over on the flip side--I don't want a table per thread--because I am shooting
> at 10-cycles per context switch, and a 32-entry table would add 40%-80%
> to the total time, plus the added memory overhead (arguably irrelevant),
> and whatever system thread needed to service this seldom changing per
> thread resource.

And reducing the number of cycles per context switch is a worthy
goal, since context switches happen quite often.

> I had a 128 entry per process constant pool in the H.E.P, and found it more
> difficult to use than profitable to have. Learning from this, and from the lack
> of constants in M88K, SPARC, x86; led me to my current position.

Your position is a sound one, so I wasn't asking you to justify it...

instead, my post was aimed in the opposite direction.

Could one find some amazing new architectural technique which could
allow applications to be burdened with enormous register files, and _yet_
spend only tiny amounts of time on context switching?

Say a chip which has 128 register files on a circular die, which a MEMS
device rotates so that one of those files is connected to the CPU at a
given time!

Of course, _that's_ a nonsensical idea in current technology, although no
doubt the equivalent could have been done back in the discrete transistor
era.

If *wire delays* weren't the big problem that they are at current high
densities, one thing that could be suggested would be something
similar to an SCR that behaves like a relay - a switchable gate that may
take time to switch, but offers no gate delays to the inputs it switches
between (almost like a switch on a railroad track).

Multiple big register files? Put them in a small fast memory - say 64K
in size, that's small enough to make it way faster than DRAM!

And (other useless and impractical) ideas like that. If it _could_ actually
be done, it would be very useful. Which is why the fact that it isn't being
done is already evidence that it can't, at least for now.

John Savard

John Dallman

unread,
Dec 3, 2022, 11:01:44 AM12/3/22
to
In article <3758c429-8a3c-4f00...@googlegroups.com>,
jsa...@ecn.ab.ca (Quadibloc) wrote:

> Say a chip which has 128 register files on a circular die, which a
> MEMS device rotates so that one of those files is connected to the
> CPU at a given time!

Switching between register files is how GPUs implement their hundreds of
threads. It was implemented with a master register used for indirection
when I learned about this, 15 or so years ago, so you probably need to
look into how it is done now.

John

Stephen Fuld

unread,
Dec 3, 2022, 11:29:28 AM12/3/22
to
On 12/3/2022 1:40 AM, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
>> On the FP side, these 5-bit constants are "not all that useful"--although
>> 0.0, 1.0, 2.0, -1.0, 10.0 occur often enough that converting this 5-bit
>> integer into a FP constant can be done while the rest of register read
>> and DECODE proceed. The conversion takes less than 20 gates.
>
> Make sense.
>
>> But what I am looking for is for the 5-bit constant to satisfy a few more
>> operand necessities BEFORE calling forth the 32-bit or 64-bit constants
>> (which, necessarily, eat more code space). So the rejoining question is::
>> for a few more than those 20 gates, can I get something significantly
>> more useful ?!?
>
> I would probably go for an offset and multiplying by 1/2. If you
> encode unsigned n in the five-bit field, then the value of the floating
> point constant could be (n-5)/2.
>
> This would give you constants from -2.5, -2,-1.5,...,13,
> which would give you most constants in the GSL table, at least.
> The only halfway important rational constant missing would be 0.25.
> This scheme would give you a bit more than 56% of floating point
> constants in the GSL table, not bad (especially considering that
> GSL contains approximations for special functions which use a lot
> of constants.

Given that Mitch said the cost of the 32 entry ROM is insignificant, why
are you (and BGB - though I know his use of FPGA has different
constraints) hung up on an "algorithmic" transformation? By your own
admission, your method would miss .25, and it misses things like pi (#11
on your list) in return for getting values like -11, which doesn't
appear in the table at all? The ROM gets over 71% - a 15% improvement
over your algorithmic method.

BTW, does anyone have any idea why ~2.22 e-16 and two times that appear
so high? Are they related to some special number?

Stephen Fuld

unread,
Dec 3, 2022, 11:42:05 AM12/3/22
to
On 11/30/2022 12:38 PM, John Dallman wrote:
> In article <a436409e-2939-4647...@googlegroups.com>,
> Mitch...@aol.com (MitchAlsup) wrote:
>
>> my exposure to the breadth of FP constants in code using FP
>> arithmetic indicates that finding and isolating the "in crowd"
>> from the "out crowd" is fraught with peril.
>
> I think you're right. Looking at BGB's data, I'm sure that the stuff I
> work on would have a fairly different table, with pi further up the table
> and -3.14E18 (which is used as a recognisable null value).

Is that value as a recognizable null unique to your code, or is it more
commonly used?

BTW, while getting the globally optimal set of values is essentially
impossible, that isn't necessary to make the idea worthwhile. Given the
constraints (a five-bit field, essentially free reasonable amounts of
ROM, etc.), using the ROM is worthwhile if you can do better on the
global corpus than 0-31 (or some algorithmic transformation of that).
That is, if e.g. pi occurs more often than, say, 17 (or whatever the
least-used value after the algorithmic transformation), then it is an
improvement. And only if there is a big increase in the use of 17 would
it be a loser. I guess I think that is unlikely.


> 0.0, 0.5, 1.0,
> and -1.0 are likely universal, but after that, it gets very variable.

Yes, but don't let the perfect be the enemy of the good. The test is
not whether you got the global optimum, but whether the values you pick
are better than 0-31 or a transformation of that.

John Dallman

unread,
Dec 3, 2022, 12:32:09 PM12/3/22
to
In article <tmfu8q$3cpmj$1...@dont-email.me>, sf...@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> > and -3.14E18 (which is used as a recognisable null value).
> Is that value as a recognizable null unique to your code or it is
> more commonly used?

As far as I know, unique to us. There are particular constraints in our
field of mathematical modelling that mean it won't ever be a "sensible"
value. Other modellers in the same field could use it, but I've never
heard that they do so.

> BTW, while getting the global optimal set of values is essentially
> impossible, that isn't necessary to make the idea worthwhile.
> Given the constraints, a five bit field, essentially free
> reasonable amounts of ROM, etc., using the ROM is worthwhile if you
> can do better on the global corpus than 0-31 (or some algorithmic
> transformation of that).

Indeed, if the ROM is used enough to be a worthwhile use of a small
amount of die space, and the interconnections to make it accessible
quickly.

John

John Dallman

unread,
Dec 3, 2022, 12:32:09 PM12/3/22
to
In article <tmfth4$3cnep$1...@dont-email.me>, sf...@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> BTW, does anyone have any idea why ~2.22 e-16 and two times that
> appear so high? Are they related to some special number?

I suspect they might be being used in a test for real numbers being
"close enough to be the same", although that doesn't explain why they're
negative.
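
(The usual suspect is a relative-tolerance comparison along these lines;
just a sketch of the idiom, not taken from any particular surveyed code:)

#include <float.h>
#include <math.h>

/* The kind of "close enough" test that bakes DBL_EPSILON (~2.22e-16)
 * into a program as a literal constant. */
static int nearly_equal(double a, double b)
{
    return fabs(a - b) <= 4.0 * DBL_EPSILON * fmax(fabs(a), fabs(b));
}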

I'm wondering why 1.844674407370955e+19 is so high in the list: that
might be a test for "number too big" in some way.

John

MitchAlsup

unread,
Dec 3, 2022, 12:34:54 PM12/3/22
to
2.2E-16 is 2^-52, the machine 'epsilon': the spacing of adjacent doubles between 1.0 and 2.0.

MitchAlsup

unread,
Dec 3, 2022, 12:36:50 PM12/3/22
to
This is where one switches from Cody & Waite argument reduction to
Payne & Hanek argument reduction--although there is no particular reason
to have that many bits of precision.
>
> John

Scott Lurndal

unread,
Dec 3, 2022, 12:44:21 PM12/3/22
to
And SPARC tried 'register windows' to attempt to ameliorate the cost
of saving and restoring registers on function entrance/exit. Turned
out to be more trouble than it was worth.

Thomas Koenig

unread,
Dec 3, 2022, 12:59:53 PM12/3/22
to
Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
Because Mitch wanted "for a few more than those 20 gates" above :-)

> By your own
> admission, your method would miss .25, and it misses things like pi (#11
> on your list) in return for getting values like -11,

That's not in there, the range is from -2.5 to 13, or somesuch.

> which doesn't
> appear in the table at all? The ROM gets over 71% - a 15% improvement
> over your algorithmic method.

> BTW, does anyone have any idea why ~2.22 e-16 and two times that appear
> so high? Are they related to some special number?

$ cat epsilon.f90
program main
  integer(kind=8) :: ieps
  real(kind=8) :: eps
  eps = epsilon(eps)
  ieps = transfer(epsilon(1.d0), ieps)
  print *,eps
  print '(Z16)', ieps
  print *,fraction(eps), '*', radix(eps), '**', exponent(eps)
end program main
$ gfortran epsilon.f90 && ./a.out
2.2204460492503131E-016
3CB0000000000000
0.50000000000000000 * 2 ** -51

I hope this answers that question :-)

Anton Ertl

unread,
Dec 3, 2022, 1:26:47 PM12/3/22
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>Say a chip which has 128 register files on a circular die, which a MEMS
>device rotates so that one of those files is connected to the CPU at a
>given time!

Sounds like the Tera MTA (128 register sets), or the Parallax
Propeller (8 register sets).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Michael S

unread,
Dec 3, 2022, 1:41:49 PM12/3/22
to
2**64.
Actually, I would expect 2**63 to be more common due to its use in
y = (unsigned long long)x where x is 'double'.
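
For example, a saturating double-to-uint64 conversion (or the sequence a
compiler emits on targets lacking an unsigned convert instruction) ends up
with 2^64 and/or 2^63 as literal constants. A sketch, with a made-up helper
name:

#include <stdint.h>

uint64_t dtou64_sat(double x)
{
    if (!(x > 0.0))                      /* NaN and non-positive clamp to 0 */
        return 0;
    if (x >= 18446744073709551616.0)     /* 2^64: out of range, saturate */
        return UINT64_MAX;
    if (x >= 9223372036854775808.0)      /* 2^63: rebias to avoid signed overflow */
        return (uint64_t)(x - 9223372036854775808.0) | (1ULL << 63);
    return (uint64_t)x;
}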

Stephen Fuld

unread,
Dec 3, 2022, 2:15:50 PM12/3/22
to
OK. I took his statement that the cost of the ROM was not significant
as the constraint. Yours is probably tighter.

>
>> By your own
>> admission, your method would miss .25, and it misses things like pi (#11
>> on your list) in return for getting values like -11,
>
> That's not in there, the range is from -2.5 to 13, or somesuch.

OK, pick whatever is the least used value in your preferred
transformation. :-)


>
>> which doesn't
>> appear in the table at all? The ROM gets over 71% - a 15% improvement
>> over your algorithmic method.
>
>> BTW, does anyone have any idea why ~2.22 e-16 and two times that appear
>> so high? Are they related to some special number?
>
> $ cat epsilon.f90
> program main
> integer(kind=8) :: ieps
> real(kind=8) :: eps
> eps = epsilon(eps)
> ieps = transfer(epsilon(1.d0), ieps)
> print *,eps
> print '(Z16)', ieps
> print *,fraction(eps), '*', radix(eps), '**', exponent(eps)
> end program main
> $ gfortran epsilon.f90 && ./a.out
> 2.2204460492503131E-016
> 3CB0000000000000
> 0.50000000000000000 * 2 ** -51
>
> I hope this answers that question :-)

Thanks. Another high use value not easily obtainable by reasonable
algorithmic transformation of 0-31. :-)

Thomas Koenig

unread,
Dec 3, 2022, 3:01:16 PM12/3/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

> A use cases that might not be imagined immediately::
><
> CVTSD R4,#3 // passing a small integer constant as (double)

I saw that used for

double foo(double x)
{
return (x*3.14159265358979323846 - 1) + 1;
}

which was compiled into

cvtsd r2,#-1
fmac r1,r1,#0x400921FB54442D18,r2
fadd r1,r1,#1

which confused me a bit at first.

I assume that cvtsd with a five-bit constant could already be
decoded into the equivalent of

mov r2, -1.0

without actually loading the value and converting, correct?

MitchAlsup

unread,
Dec 3, 2022, 6:03:01 PM12/3/22
to
On Saturday, December 3, 2022 at 2:01:16 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > A use cases that might not be imagined immediately::
> ><
> > CVTSD R4,#3 // passing a small integer constant as (double)
> I saw that used for
>
> double foo(double x)
> {
> return (x*3.14159265358979323846 - 1) + 1;
> }
>
> which was compiled into
>
> cvtsd r2,#-1
> fmac r1,r1,#0x400921FB54442D18,r2
> fadd r1,r1,#1
<
two 1 word instructions and one 3 word instruction.
>
> which confused me a bit at first.
>
> I assume that cvtsd with a five-bit constant could already be
> decoded into the equivalent of
>
> mov r2, #-1.0 // # added for clarity.
>
> without actually loading the value and converting, correct?
<
Move just transfers the value, whereas CVT mangles the value
while moving it from place to place.

Thomas Koenig

unread,
Dec 4, 2022, 4:13:50 AM12/4/22
to
MitchAlsup <Mitch...@aol.com> schrieb:
I think I understand the semantics.

With a constant, would this conversion be done during decode
(or parse), or during execution? I would assume it is an
obvious candidate for the former, to save latency.

MitchAlsup

unread,
Dec 4, 2022, 2:11:33 PM12/4/22
to
CVTxx performs the conversion during Execute; mostly for the
placement of FP constants into ABI registers; and occasionally
to use a specific FP rounding mode.
<
#im5->FP32 or #im5->FP64 gets done during Decode.

Stephen Fuld

unread,
Dec 4, 2022, 2:25:58 PM12/4/22
to
Right. For Thomas's example above, I don't think you could do the
conversion at parse/decode time, since the instruction is a MOV, and the
CPU doesn't know whether the value needs to be converted or used as a
fixed point value.

I suppose that, in the case where the source operand for the CVTSD
instruction was an immediate, you could do the conversion at decode
time, but I am not sure it is worth the effort.

BGB

unread,
Dec 4, 2022, 2:27:26 PM12/4/22
to
Just did a mock-up for (0..31)/2:
E3.F2 : ~ 39.2%
(0..31)/2: ~ 37.8%


Have noted that it seems that trying to include negative values
actually reduces hit-rate for the Imm5fp cases, if one has both FADD and
FSUB (FMUL constants seem to be nearly universally positive in this case).

Granted, negative FMUL constants might happen more if, say:
y=-(x*3);
were transformed into:
y=x*(-3);
But, BGBCC doesn't really do this.


> Given that Mitch said the cost of the 32 entry ROM is insignificant, why
> are you (and BGB - though I know his use of FPGA has different
> constraints) hung up on an "algorithmic" transformation?  By your own
> admission, your method would miss .25, and it misses things like pi (#11
> on your list) in return for getting values like -11, which doesn't
> appear in the table at all?  The ROM gets over 71% - a 15% improvement
> over your algorithmic method.
>

If one uses a ROM of cherry-picked values, one is likely to get a table
that does pretty well for one program but poorly for another (whereas a
simple transform is more likely to map similarly well from one program
to another).

Likely, getting a "good" table here would require a reasonably sized
corpus of programs which make extensive use of floating-point math,
from which to gather stats. I don't really have this.


The major constraint in my case is that the intermediate value is able
to be exactly represented in Binary16. This mostly excludes constants
like PI and E, which can't be represented as such.

Could do a hack, like say defining that for the ImmFp16 case, certain
values in the NaN range or similar expand to a range of "magic
constants", but this "isn't likely worth it".

This possibility was previously rejected because ImmFp16 is routed through
the general-case logic for Fp16->Fp64, which is constrained so that
Fp16->Fp64->Fp16 and Fp32->Fp64->Fp32 conversions are "round trip
consistent" (the narrowing conversion gets back a bit-identical value
to what was given to the widening conversion).

Similarly, had defined the rules such that Fp16->Fp64->Fp32 and
Fp16->Fp32 should also give bit-identical results.

Ended up with some "cheap but slightly funky" rules here.


Still also have the minor annoyance that my mechanism for the
Packed-Integer <-> Packed-FP SIMD conversion is also kinda wonky (but,
"better" converters would have cost more).

"Have dedicated special-case logic to fudge inputs to FADD so that it
does most of the conversion work?", "Nah dawg, twiddle the bits so that
the integer maps to a floating-point value between 2.0 and 4.0, add -3.0
and scale the result as needed."

Mostly works for the typical use-cases though...
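
The trick as described works out to roughly the following for one element
(a sketch with assumed field widths, here a 16-bit unsigned element mapped
to [-1.0, 1.0); not the actual Verilog, and the function name is made up):

#include <stdint.h>
#include <string.h>

/* Stuff the integer into the mantissa so the bit pattern reads as a float
 * in [2.0, 4.0), then let FADD do the work: adding -3.0 yields m/32768 - 1.0. */
static float u16_to_unit_float(uint16_t m)
{
    uint32_t bits = 0x40000000u | ((uint32_t)m << 7);   /* 2.0 + m/32768 */
    float f;
    memcpy(&f, &bits, sizeof f);                        /* bit cast */
    return f + (-3.0f);
}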


> BTW, does anyone have any idea why ~2.22 e-16 and two times that appear
> so high?  Are they related to some special number?
>
>

No idea on this one.

BGB

unread,
Dec 4, 2022, 2:36:50 PM12/4/22
to
My case, it expands to Fp16 during decode, with some special-case
handling to deal with Fp16 immediate values handled during the ID2 /
Register-Fetch stage.


Stephen Fuld

unread,
Dec 4, 2022, 2:38:06 PM12/4/22
to
What percentage do you get if you just used the values in Thomas' table?

If it isn't high, can you run something like Thomas did on whatever
programs you are using to get the list of FP constants used, sorted by
use? At least you will add to the corpus of programs to generate the
"hit list".

MitchAlsup

unread,
Dec 4, 2022, 3:04:38 PM12/4/22
to
My 66000 ISA allows for encoding as::
<
MUL Ry,-Rx,#3

Paul A. Clayton

unread,
Dec 7, 2022, 1:13:34 PM12/7/22
to
MitchAlsup wrote:
> On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
[snip]
>> I am skeptical that a small table of constants would be
>> worthwhile.
>
> I am entirely convinced that the table of constants which is not
> fixed forever (ROM) is an entirely losing position.

I *feel* that there may be benefits to facilitating hardware
recognizing that values are pseudo-constants (possibly extended to
values almost exclusively updated with constants). It *seems* that
the sourcing of such values could be more like immediates. (This
could be similar to patching an executable with different
immediate values — except avoiding permission to write to code
more generally.)

While such seems to provide a potential advantage, the advantage
might not be worth the intrinsic hassle or the hassle when
considering legacy software/system design. However, I do not know
of any study of such. (Negative results seem to rarely be
published, minor improvements are less publishable, and the
concept seems a little odd [likely requiring substantial compiler
tinkering to get a baseline estimate of utility and even more
software work to approach best case — and this would have to be
done in collaboration with hardware design]. Even if the concept
may be intellectually interesting, it seems unlikely to be
academically interesting much less commercially interesting.)

This would seem to be a form of intermediate persistence, possibly
broader and more durable than trace cache optimizations but
narrower and less durable than traditional code generation.

With respect to context switch overhead, if the latency
criticality is not for first use (registers are typically
presented as always consistent low latency — though a POWER
implementation (8?) did use register caching with a miss latency
penalty) such might still be useful. Since the design space seems
very large, a researcher would presumably need to concentrate on
specific use cases.

Quadibloc

unread,
Dec 8, 2022, 11:34:05 AM12/8/22
to
On Thursday, December 1, 2022 at 6:12:45 PM UTC-7, Quadibloc wrote:
> On Saturday, November 26, 2022 at 12:24:57 PM UTC-7, MitchAlsup wrote:
> > On Saturday, November 26, 2022 at 7:56:30 AM UTC-6, Paul A. Clayton wrote:
> > > Stephen Fuld wrote:
>
> > > > Let me be clear. I think that a loadable constant table is a
> > > > loosing proposition. The benefits are far outweighed by the costs.
>
> > > I am skeptical that a small table of constants would be
> > > worthwhile.
>
> > I am entirely convinced that the table of constants which is not
> > fixed forever (ROM) is an entirely losing position.
> I certainly don't disagree *either* with this massive consensus.
>
> However, what I'd like to ask is _why_.
>
> Or, rather, I know _why_. If the table of constants is something that
> can be rapidly accessed (otherwise, if it's just in ordinary RAM, it
> serves no purpose) then it is a *limited resource*. And while that's
> no problem on a computer that only has *one task running* at a
> given time, although a desktop computer has a single user, between
> the background processes the operating system has running, and
> any other open windows in the background on the screen, today's
> desktop computers, like the timesharing mainframes of old, have
> plenty of tasks running.
>
> So either you've got enough fast storage for a pile of constant
> tables, which is occupying space and using transistors that could
> be put to far better use, or you have more data to swap in and out
> when going from one process to another.
>
> So *here* is where I'm getting...
>
> If one _could_ solve the problem of having a small loadable table
> of constants, unique to each process... then, one would also have
> solved some _much more important_ problems. Because this
> solution could also be applied to make it practical for computers to
> have larger register files than they do at present. Which could no
> doubt be exploited to make them run faster.

And here's the sort of "solution" I've come up with for this problem.

No doubt it can be simply explained why no one has ever tried it.

A computer, in addition to having the normal complement of small,
conventional register files -

8, 16, or 32 integer registers, 8, 16, or 32 floating-point registers,
8, 16, or 32 SIMD registers that are 128, 256, or 512 bits wide...

and it has N copies of those register files if it has N-way SMT,

_also_ has eight "fast memories" that are 64K bytes in size.

Normally, they're allocated as follows:
One to the kernel.
One to the unprivileged portion of the operating system.
One each to up to six programs that can run efficiently at the same time.

The fast memory is where large register files are put. The large register
files include such things as the short table of constants, the Cray-style
vector registers, and so on. Those tables are pointed to by "workspace
pointers", similar to the one on the Texas Instruments 9900.

So if an application, or a level of the operating system, switches between
processes that use the large register files, it can swiftly switch between
copies of those files.

Presumably, there's no issue in making 512K bytes of storage available to
the CPU with a fast enough speed so that these large register files are still
fast enough to be considered registers, being much faster than external
DRAM.

Note also that when a computer runs at GHz speeds, and the real-time
clock that switches between compute-bound processes to give every
process a chance _still_ only ticks 60 times a second, the way it did back
when computers ran at MHz speeds, the overhead imposed by saving and
restoring registers at each of the clock ticks... would stay the same, if there
were hundreds of times as many registers to worry about. (i.e. compare 16 MHz
to 2 GHz, there could be 125x the amount of storage required for registers)

So there seem to be two avenues for addressing this major architectural issue.

1) Make use of the fact that we have so many transistors, computers have
room for huge cache memories, to also provide duplicate register files in a
convenient form, and

2) Make use of the fact that while computers have become faster, people
haven't.

John Savard

MitchAlsup

unread,
Dec 8, 2022, 4:00:03 PM12/8/22
to
On Thursday, December 8, 2022 at 10:34:05 AM UTC-6, Quadibloc wrote:
> On Thursday, December 1, 2022 at 6:12:45 PM UTC-7, Quadibloc wrote:
> > On Saturday, November 26, 2022 at 12:24:57 PM UTC-7, MitchAlsup wrote:
>
> > If one _could_ solve the problem of having a small loadable table
> > of constants, unique to each process... then, one would also have
> > solved some _much more important_ problems. Because this
> > solution could also be applied to make it practical for computers to
> > have larger register files than they do at present. Which could no
> > doubt be exploited to make them run faster.
>
> And here's the sort of "solution" I've come up with for this problem.
>
> No doubt it can be simply explained why no one has ever tried it.
>
> A computer, in addition to having the normal complement of small,
> conventional register files -
>
> 8, 16, or 32 integer registers, 8, 16, or 32 floating-point registers,
> 8, 16, or 32 SIMD registers that are 128, 256, or 512 bits wide...
>
> and it has N copies of those register files if it has N-way SMT,
<
Given K registers in a file, 2K registers in that file are 2 gates slower
(1 gate in the decoder, one gate in the readout logic.) In 1985 we
could get a 2R1W 32-entry register file in ½ clock cycle. As the size of
the register file grows, so does its delay. Should access time grow
beyond ½ cycle, the pipeline has to be altered to accommodate it,
generally by whole cycles: 1 at the front and 1 at the rear.
>
> _also_ has eight "fast memories" that are 64K bytes in size.
>
> Normally, they're allocated as follows:
> One to the kernel.
> One to the unprivileged portion of the operating system.
> One each to up to six programs that can run efficiently at the same time.
>
> The fast memory is where large register files are put. The large register
> files include such things as the short table of constants, the Cray-style
> vector registers, and so on. Those tables are pointed to by "workspace
> pointers", similar to the one on the Texas Instruments 9900.
>
> So if an application, or a level of the operating system, switches between
> processes that use the large register files, it can swiftly switch betwen
> copies of those files.
>
> Presumably, there's no issue in making 512K bytes of storage available to
> the CPU with a fast enough speed so that these large register files are still
> fast enough to be considered registers, being much faster than external
> DRAM.
>
> Note also that when a computer runs at GHz speeds, and the real-time
> clock that switches between compute-bound processes to give every
> process a chance _still_ only ticks 60 times a second, the way it did back
> when computers ran at MHz speeds, the overhead imposed by saving and
> restoring registers at each of the clock ticks... would stay the same, if there
> were hundreds of times as many registers to worry about. (i.e. compare 16 MHz
> to 2 GHz, there could be 125x the amount of storage required for registers)
<
This is true for Linux-like applications:: it may not be true for more real time stuff.
<
The paucity of cycles at context switch argument holds even if the ticks per
second is 1000 (instead of 60); but starts failing when 15K events per second
start happening.
>
> So there seem to be two avenues for addressing this major architectural issue.
>
> 1) Make use of the fact that we have so many transistors, computers have
> room for huge cache memories, to also provide duplicate register files in a
> convenient form, and
>
> 2) Make use of the fact that while computers have become faster, people
> haven't.
<
3) Make a chunk of HW that sequences entire register files into and out of memory
using as much cache technology as possible. One might want to put a kind of
transaction on the interconnect bus that carries an entire register file in a single
interconnect-message (to gain ATOMICITY). And the ultimate form of this has
this message sourced from something other than a core !!! {It is a push not a pull}
>
> John Savard