
Why My 66000 is and is not RISC


MitchAlsup

Jun 22, 2022, 9:03:25 PM

I could not find the question asking me to make a list of why My 66000
instruction set architecture is like and unlike the tenets of the original
RISC. So I spent some time looking up what the internet is currently saying
about RISCs. There is a short list, but I will start with a few statements
from Hennessey and Patterson::

Hennessey:: The goal of any instruction format should be: 1. simple decode,
2. simple decode, and 3. simple decode. Any attempts at improved code
density at the expense of CPU performance should be ridiculed at every
opportunity.

Patterson:: more is not better -- microcode is bad
Subroutines need low overhead

RISC axioms:
a) the ISA is primarily designed to make the pipeline simple.
b) the ISA is primarily designed as a target for compilers.
c) instructions only exist if they add performance.
d) frequently accessed data is kept in registers.

RISC tenets:
a) 1 word == 1 instruction
b) 1 instruction flows down the pipeline in 1 cycle
c) 1 instruction can cause 0 or 1 exception
d) instruction encoding uses few patterns
e) there is a large uniformly addressable register space

So where does My 66000 ISA stand with respect to these axioms and
tenets::

RISC axioms: My 66000 ISA embodies all of the RISC axioms
RISC tenets: My 66000 ISA rejects ½ of RISC tenets

With minor exceptions to both::

My 66000 contains 32×64-bit general purpose registers. Some might
think this is too few and that an FP register file should be added. Looking
at code such as BLAS, the Livermore Loops, and Linpack indicates otherwise
-- as long as one assumes some hints of OoO pipelining. Looking at
various C libraries, this seems perfectly sufficient.

My 66000 ISA contains 6 decoding patterns; 1 for each of
{instructions with 16-bit immediates, instructions with 12-bit
immediates, scaled memory reference, 2-operand reg-reg,
1-operand reg-reg, 3-operand reg-reg }

The 12-bit immediate format is used for shift instructions and
for Predicate instructions, and is positioned such that predicate
instructions are only 1 bit different from their corresponding
branch instructions. This saves 6×16-bit immediate encodings.

Scaled memory reference, 1-operand, 2-operand, 3-operand
all have access to 32-bit or 64-bit immediates/displacements
in substitution for a register. This eliminates any need to use
instructions or waste registers pasting constants together.

1-operand, 2-operand, and 3-operand instructions all have sign control
over their operands. There is no SUB instruction; My 66000 uses
ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
instructions from execution. The 2-operand group allows the
5-bit register specifier to be used as a 6-bit sign-extended
immediate, making ADD Rd,#1,-Rs2 easily encoded.

There are Compare instructions that return a bit-vector of everything
the compare circuitry can determine, including range checks like
0 < Rs1 <= Rs2 and classifications {-infinity, -normal, -denormal, -zero,
+zero, +denormal, +normal, +infinity, SNaN, QNaN}. I remain tempted
to add "any byte equal", "any halfword equal", and "any word equal".

There are 2 kinds of conditional flow: branching and predication, and
each has 2 principal kinds of instructions:: the condition is determined
from a single bit in a register, or the condition is determined by comparing
a register with 0. In addition there are unconditional branches, jumps,
and a special addition supporting PIC for method calls and switches.
Compare-to-zero-and-branch can access certain HW-known information
that cannot be stored in an ISA register--this includes things
like a query to the Memory Unit asking if it has seen any interference
between the start of an ATOMIC sequence and "now". The exception,
interrupt, and std. return instructions are also encoded here.

Memory reference instructions enable building of ATOMIC primitives
that can touch as many as 8 cache lines of data in a single ATOMIC
event. This is equivalent to the MIPS LL and SC except it operates
over much larger chunks of data. This is sufficient to move an entry
of a shared data structure from one place to another place in a single
event. This minimizes the number of ATOMIC events that are needed,
and comes with guarantees of forward progress.
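
To make "move an entry of a shared data structure" concrete: unlinking a
node from one doubly-linked list and inserting it into another touches
about six memory locations spread over several cache lines. The C sketch
below shows the accesses involved (illustration only, assuming circular
lists with sentinel headers -- it shows no synchronization; the point of
the primitive above is that these accesses can be wrapped in one ATOMIC
event).

struct node { struct node *prev, *next; };

/* Move node n from wherever it is onto the front of the list headed by dst. */
static void move_node(struct node *n, struct node *dst) {
    n->prev->next = n->next;      /* unlink from the source list           */
    n->next->prev = n->prev;
    n->next = dst->next;          /* link in behind the destination header */
    n->prev = dst;
    dst->next->prev = n;
    dst->next = n;
}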

The ST instruction can store a constant in either 5-bit sign extended
form, or in 32-bit or 64-bit forms. No need to put a constant into a
register in order to ST it to memory. This is along with the ability
to use 32-bit or 64-bit displacement constants.

There are 5 "special" memory reference instructions:: ENTER is used
to setup a new stack, and save registers, EXIT is used to tear down the
stack and restore registers, LDM loads multiple registers, STM stores
multiple registers, and MM moves data from memory to memory.
MM has the property that both cached and uncached memory smaller
than a page is moved as a single ATOMIC transfer. {PCIe can do this,
so should CPUs attached to PCIe peripherals.} There is expected to
be a sequencer in the memory unit that performs these out of the
data-path.

The Floating Point group includes Transcendental instructions.
Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
that are only 1 constant different in the calculations. Ln2 takes
only 14 cycles, sin takes 19 cycles. These are included because
they actually do improve performance.

Conversions between FP and FP or FP and INT are provided by
1 instruction (CVT) which has 49 variants to deal with 5 specified
rounding modes and 1 implied rounding mode (current) any time
a rounding could transpire. This falls into the category of "once
you have the HW to do <say> ANINT (of FORTRAN) you have
95% of the logic to do them all".
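
For reference, the C99 rounding family covers the same ground from the
software side; per the description above, each of these would be a
different CVT variant (the mapping is an illustration, not a statement of
the actual encoding):

#include <math.h>
void round_family(double x, double r[5]) {
    r[0] = trunc(x);      /* round toward zero                     */
    r[1] = floor(x);      /* round toward -infinity                */
    r[2] = ceil(x);       /* round toward +infinity                */
    r[3] = round(x);      /* round to nearest, ties away from zero */
    r[4] = nearbyint(x);  /* current (implied) rounding mode       */
}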

The exception model is based on message passing (as are SVCs),
rather than wandering through the high-level OS exception
dispatcher. This model supports threads (processes or tasks)
that are paranoid of the OS looking at their data (such as banking
applications running on a home PC), and can indeed restrict the
OS from looking at the address space.

I/O devices are virtualized, and operate in the virtual address
space of the originating requestor. So while the I/O device can DMA
directly into a paranoid application's address space, and while the OS
can verify that the given space and bounds are acceptable, the OS cannot
look into that address space. This gets rid of the need for a secure
mode of operation.

Deferred procedure calls are handled as messages (argument
setup + 1 instruction) with continuation. The messaging sub-system
operates over both HyperVisor and GuestOS domains simultaneously.
Anyone with a "method" can call that method and get a response
even if that method is running under a different GuestOS.

There is a 66-bit remapped address space--any thread can access
64 bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
and ROM}. The address space is configured to efficiently transport
requests over a significant network (a la HyperTransport and the Intel
equivalent). DRAM is cache coherent, configuration is strongly ordered,
MMIO is sequentially consistent, and ROM is "lax".

The system repeater transports requests from chip to chip, and
amalgamates coherence requests so that the originator counts
responses from cores on his chip, and the number of chips in
the system (rather than counting from every core).

Memory management cannot be turned off--My 66000 implementations
come out of reset with the MMUs turned on. HostBridge
is configured with an MMU/TLB that uses exactly the same tables as
CPUs and can share tables as applicable. Levels in the virtual
address space translations can be skipped! So an application as
simple as 'cat' can be managed with a single page of translation
overhead.

Memory management is inherently HyperVisor/GuestOS. Privilege
is determined by the assortment of root pointers in use on a per
invocation basis.

GuestOS can activate a thread (taking it from a waiting state to
running in a core) in a single instruction and remotely. So can
HyperVisor.

Finally, there is no notion of one thread morphing into a different
thread over a series of instructions manipulating control registers
one by one. For example: an ISR cleanup handler takes a thread
off a wait state queue, places it on a run state queue, and signals
GuestOS to see what threads should be running "right now". This
is all 1 instruction and 1 cycle as far as the core performing the
instruction sees.

My 66000 is not just another ISA, it is a rethink of most of the components
that make up a system. A context switch from one thread to another
within a single GuestOS is 10 cycles. A context switch from one thread
to a thread under a different GuestOS remains 10 cycles. The typical
current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
across GuestOSs.

OH, and BTW, The FP transcendentals are patented.

Terje Mathisen

Jun 23, 2022, 2:00:14 AM

MitchAlsup wrote:
> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".

If you can add the in-reg SIMD compare ops without slowing stuff down,
please do so!

Even having VMM, if you can identify the final \0 byte anywhere in a
64-bit reg, then that's a win for lots of code.
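
For comparison, the usual portable fallback without such an op is the SWAR
test below (a minimal C sketch; finding *which* byte is zero still takes a
count-leading/trailing-zeros on the resulting mask):

#include <stdint.h>
/* Nonzero iff the 64-bit word x contains at least one zero byte. */
static inline uint64_t has_zero_byte(uint64_t x) {
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}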

That said, just having your current VMM setup would obviate the need for
SIMD style ops in almost all programs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Marcus

Jun 23, 2022, 3:01:43 AM

On 2022-06-23, Terje Mathisen wrote:
> MitchAlsup wrote:
>> There are Compare instructions that return a bit-vector of everything
>> the compare circuitry can determine, including range checks like:
>> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
>> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
>> to add "any byte equal", "any halfword equal", "any word equal".
>
> If you can add the in-reg SIMD compare ops without slowing stuff down,
> please do so!

In-reg SIMD can be useful. In MRISC32 you can do:

seq.b r2, r1, z ; Byte-wise compare r1 to zero, "Set if EQual"
bz r2, foo1 ; Branch if no byte equal (mask zero)
bnz r2, foo2 ; Branch if any byte equal (mask not zero)
bs r2, foo3 ; Branch if all bytes equal (mask set)
bns r2, foo4 ; Branch if any byte not equal (mask not set)

...and similar with seq.h for half-words. There are also inequality and
gt/lt comparisons, for instance.

Note: "Set" means all bits of the byte/half-word/word are 1. The
opposite (false) outcome of the set instructions is that all bits of the
byte/half-word/word are 0.

The cute part is that I did not have to add special "SIMD" branch
instructions, since the same instructions make sense for both packed and
unpacked comparison results.

/Marcus

MitchAlsup

Jun 23, 2022, 3:30:44 PM

In today's installment I touch on things about My 66000 not covered above.

My 66000 ISA requires an instruction buffer and a 2-stage instruction
processing pipeline I call PARSE and DECODE. Hennessey would be booing
at this point. However, using this, I get branch overhead down to 0.03 cycles
per taken branch without having any delay slot. {This also makes a unified
L1 cache feasible. But since Fetch and MemRef are so far apart on the die,
my implementations have chosen not to utilize this capability.}

PARSE finds the instruction boundaries (main job), scans ahead for branches,
determines which function units are needed, and looks for CoIssue opportunities.
The scanned-ahead branches are processed in parallel by DECODE to fetch branch
targets even before the branch instruction is executed. So if a taken prediction
is made, the instructions on the taken path are already ready to enter execution.
PARSE identifies immediates and displacements and cancels register port requests,
providing opportunities for ST to read the register file...

DECODE processes the instructions from PARSE, accesses the register file,
computes forwarding, and starts instructions into the execution pipeline.
DECODE routes immediates and displacements to the required instructions.
ST instructions pass through DECODE twice: the 1st time is for AGEN, the
2nd time is for ST.data when a register file port is available.

---------------------------instruction stuff-----------------------------------------------------------

The shift instructions have 2×6-bit fields dealing with the shift amount and
the width of the data being shifted. These are used to access odd-sized data
(a la EXTRACT) and to SMASH data calculated at "machine" size back down
into containers of "language" size, so a container cannot hold a value
outside of its range. When the width field is 0 it is considered
to be 64 bits. When encoded as an immediate, the 2 fields are back-to-back;
when found in a register there are 26 bits separating the 2 fields. In data<38:32>,
both 1000000 and 0000000 are considered to be 64 bits, while 1xxxxxx
with any of the x's non-zero raises an Operand exception.
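
In C terms, the extract behaviour described above is roughly the following
(a sketch of the semantics only, not of the encoding):

#include <stdint.h>
/* Unsigned extract: 'offset' and 'width' are the two 6-bit fields;
   width == 0 is treated as a full 64-bit width. */
static inline uint64_t extract_u(uint64_t x, unsigned offset, unsigned width) {
    x >>= (offset & 63);
    if ((width & 63) == 0)
        return x;
    return x & ((1ULL << (width & 63)) - 1);
}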

The Multiplex Instruction MPX: MPX basically allows selecting bits from
a pair of registers based on another register:: ( a & b ) | ( ~a & c ). However,
it has other flavors to provide ( !!a & b ) | ( !a & c ), which is CMOV, and by
using the immediate encodings in My 66000 it provides MOV Rd,#IMM32 and
MOV Rd,#IMM64 along with MOV Rd,Rs1 and MOV Rd,Rs2. These fall out
for free, saving MOV opcodes elsewhere.
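
In C, the two flavors amount to the following (a sketch of the semantics only):

#include <stdint.h>
/* Bit-select flavor: bits of b where mask a is 1, bits of c where a is 0. */
static inline uint64_t mpx_bits(uint64_t a, uint64_t b, uint64_t c) {
    return (a & b) | (~a & c);
}
/* CMOV flavor: a is treated as a whole-register boolean. */
static inline uint64_t mpx_cmov(uint64_t a, uint64_t b, uint64_t c) {
    return a ? b : c;
}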

Vectorization: My 66000 ISA contains loop vectorization. This allows
vectorized loops to run at several iterations per cycle, so even
1-wide machines can perform at 32+ instructions per cycle in such
loops. My main (as yet unproven) hope is that this takes the pressure
off of the design width. The basic argument is as follows:
a) 1-wide machines operate at 0.7 IPC
b) 2-wide SuperScalar machines operate at 1.0 IPC
c) GBOoO machines operate at 2.0 IPC
d) programs spend more than ½ their time in loops.
So, if one can get a 2× performance advantage on the 1-wide machine,
this puts it within spitting distance of the GBOoO machine, which in turn
means the Medium OoO machine can be competitive with the GBOoO
machine at significantly lower {cost, design time, area, power}.
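
The loops in question are just ordinary scalar code; a minimal C sketch of
the sort of loop this applies to (nothing here is ISA-specific -- the
overlapping of iterations is the hardware's job):

void daxpy(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}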

AND while investigating loop vectorization, I discovered that a RISC
pipeline with a 3R-1W register file can perform 1.3 IPC. Branch
instructions (20%) do not use the result register, ST instructions
(10%) can borrow the write port AFTER cache tag and translation
validations, AND in the general code I have seen there is significant
opportunity to perform write-elision in the data path, freeing up even
more ports. This, again, takes pressure off the width of the design.
So, with vectorization, a 3 (or 4)-wide machine is competitive with
a 6-wide machine.

None of this prevents wide GBOoO or makes it more difficult.

----------------------instruction modifiers------------------------------------------

CARRY is the first of the Instruction-Modifiers. An instruction-modifier
supplies "bits" for several future instructions so that one does not need
the cartesian product of a given subset encoded in the ISA. Thus, there
are shift instructions, and when used with CARRY these perform shifts
as wide as you like -- 128, 256, 512, ... bits -- with no need to clog up
the encoding space for lightly used but necessary functionality. Even in
the FP arena, CARRY provides access to exact FP arithmetic.

CARRY provides access to multiprecision arithmetic both integer and FP.
CARRY provides a register which can be used as either/both Input and Output
to a set of instructions. This provides a link from one instruction to another
where data is transmitted but not encoded in the instruction itself.
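
In C, the dataflow a CARRY-modified ADD pair expresses looks roughly like
this (a sketch of the semantics; the point above is that the carry link is
supplied by the modifier rather than by extra instructions):

#include <stdint.h>
typedef struct { uint64_t lo, hi; } u128;

/* 128-bit add built from two 64-bit adds with an explicit carry between them. */
static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
    return r;
}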

Since we are in the realm of power-limited designs, My 66000 ISA has an ABS
instruction. On the integer side, this instruction can be performed
by co-opting the sign control built into the data path and be "executed"
without taking any pipeline delay (it executes in zero cycles). On the
FP side it never adds any latency (it executes in zero cycles). ABS always
takes less power than performing the operation in any other way.

DBLE is an instruction modifier that supplies register encodings and
adds 64 bits to the calculation width of the modified instruction. Applied
to an FP instruction: DBLE Rd1,Rs11,Rs21,Rs31 ; FMAC Rd2,Rs12,Rs22,Rs32
we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
and presto: we get FP128 by adding exactly 1 instruction. The compiler
can pick any 8 registers it desires, alleviating register allocation concerns.
DBLE is a "get by" kind of addition, frowned upon by Hennessey.

I can envision a SIMD instruction modifier that defines the SIMD parameters
of several subsequent instructions and allows 64-bit SIMD to transpire.
I am still thinking about these. What I cannot envision is a wide SIMD
register file--this is what VVM already provides.

These instruction-modifiers, it seems to me, are vastly more efficient
than throwing hundreds to thousands of unique instructions into ISA.
Especially if those unique instructions <on average> are not used
"that much".

-----------------------------Safe Stack--------------------------------------------------------

Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}
When Safe Stack is in use, the return address goes directly to the Safe
Stack, and return address comes directly off safe stack. Preserved
registers are placed on Safe Stack {ENTER} and their register values
(conceptually) set to 0. Safe Stack is in normal thread memory but
the PTEs are marked RWE = 000 so any access causes page faults.
EXIT reloads the preserved registers from Safe Stack and transfers
control directly back to caller. When Safe Stack is not in use, R0
is used to hold the return address. Proper compiled code runs the
same when safe stack is on or off, so one can share dynamic libraries
between modes.

Safe Stack monitors the value in SP and KILLs lines that no longer
need to reach out into the cache hierarchy, and Safe Stack can efficiently
use Allocate memory semantics. Much/most of the time, nothing
in the Safe Stack leaves the cache hierarchy.

Buffer overflows on the "stack" do not corrupt the call/return flow of
control. ROP cannot happen, as the application has no access to the Return
Address. The application cannot see the values in the preserved registers,
augmenting safety, and certainly cannot modify them.

-------------------------------ABI----------------------------------------------------------------------

Subroutine Calling Convention {A.K.A. ABI}:
Registers R1..R8 contain the first 8 arguments to the subroutine.
SP points at argument[9]
R9..R15 are considered temporary registers
R16..R29 are preserved registers
R30=FP is a preserved register but used as a Frame Pointer when
..............language semantics need one.
R31=SP is a preserved register and used as a Stack Pointer. SP must
..............remain doubleword aligned at all times.

ABI is very RISC

So, let's say we want to call a subroutine that wants to allocate 1024
bytes on the stack for its own local data, is long running and needs
to preserve all 14 preserved registers, and is using an FP along with the
SP. Let us further complicate the matter by stating this subroutine
takes a variable number of arguments. Entry Prologue:

ENTRY subroutine_name
subroutine_name:
ENTER R16,R8,#(1024 | 2)

At this point the register-passed arguments have been saved with the
memory-passed arguments, FP is pointing at the "other" end of the local
data on the stack, 1024 bytes have been allocated off the SP after pushing
the registers, and the old FP has been saved and the new FP set up.
{This works both with and without Safe Stack}

Your typical RISC-only ISA would require at least 29 instructions to
do this amount of work getting into the subroutine, and another 17
getting out. If the ISA has both INT and FP register files 29 becomes 37.

The same happens in the Epilogue: 1 instruction.

While the ABI is very RISC, the Prologue and Epilogue instructions are not.

As a side note: My 66000 is achieving similar code density as x86-64.

A few other interesting side bits:------------------------------------------------------------

LDM and STM to unCacheable addresses are performed as if ATOMIC::
that is:: as a single bus transaction. All interested 3rd parties see the
memory either before any writes have been performed or after all writes
have been performed. A device driver can read several MMIO device
control registers and know that nobody else in the system has accessed
the device control registers in a way that could cause interference. A
device driver can likewise store to multiple control register locations
without interference.

There is a page ¿in ROM? known to contain zeros. A Memory Move
instruction can cause a page accessing this ¿ROM? data to be zeroed
without even bothering to access ¿ROM?--and the entire page is zeroed
at the target. Thus, pages being reclaimed to the free pool are but 1
instruction away from being in the already zeroed page pool. Zeroing
pages is performed at the DRAM end of the system (coherently). And
no <deleterious> bus activity is utilized.


Brett

Jun 23, 2022, 7:28:53 PM

X86-64 has crap code density; your one-instruction stack save/restore alone
should make you significantly better, unless perhaps you have gone 32+32.

Add some accumulator ops and most instructions will fit in 16-bit ops with
ease, and you have the extra decode stage to do it anyway.

I would argue that 8-bit opcodes are best when you have an accumulator in
your 32-register RISC design, but that is a bridge too far for most.

How big is the code store needed for an IOT (Internet Of Things smart
toaster) code stack? And what is the savings for the next size down?

BGB

Jun 23, 2022, 9:08:33 PM

On 6/22/2022 8:03 PM, MitchAlsup wrote:
> I could not find the question asking me to make a list of why My 66000
> instruction set architecture is like and unlike the tenets of the original
> RISC. So I spent some time looking up what the internet is currently saying
> about RISCs. There is a short list, but I will start with a few statements
> from Hennessey and Paterson::
>
> Hennessey:: The goal of any instruction format should be: 1. simple decode,
> 2. simple decode, and 3. simple decode. Any attempts at improved code
> density at the expense of CPU performance should be ridiculed at every
> opportunity.
>
> Patterson:: more is not better -- microcode is bad
> Subroutines need low overhead
>
> RISC axioms:
> a) the ISA is primarily designed to make the pipeline simple.
> b) the ISA is primarily designed as a target for compilers.
> c) instructions only exist if they add performance.
> d) frequently accessed data is kept in registers.
>

BJX2 generally upholds the above.

While some instructions are pretty niche, most still tend to have
use-cases, and I am mostly trying to avoid adding stuff that is
(completely) useless.


> RISC tenets:
> a) 1 word == 1 instruction
> b) 1 instructions flows down the pipeline in 1 cycle
> c) 1 instruction can cause 0 or 1 exception
> d) instruction encoding uses few patterns
> e) there is a large uniformly addressable register space
>

My case, 3 out of 5.

a, 16/32 and bundle-encodings break this one.

d, Some extra complexity exists due to the lack of an architectural Zero
Register and similar, and some instructions (early on) which ended up
with both 2R and 3R encodings.

Early on, I wasn't confident, for example, that "ADD R4, R5" and "ADD
R5, R4, R5" would have been semantically equivalent in all cases.


There were some other cases (Mostly 32-bit 2R Load/Store variants) which
were dropped due to being entirely redundant with the 3R encodings (or
which became redundant once predication was added).

Some other parts of the ISA also ended up being dropped and then later
re-added a few times before becoming more-or-less permanent (and some
other features are in limbo due to not really adding enough to
justify their existence).


> So where does My 66000 ISA stand with respect to these axioms and
> tenets::
>
> RISC axioms: My 66000 ISA embodies all of the RISC axioms
> RISC tenets: My 66000 ISA rejects ½ of RISC tenets
>
> With minor exceptions to both::
>
> My 66000 contains 32×64-bit general purpose registers. Some might
> think this is too few and a FP register file should be added. Looking
> at code such as BLASS, Livermore Loops, Linpack indicates otherwise
> -- as long as one assumes some hints of OoO pipelining. Looking at
> various C libraries this seems perfectly sufficient.
>

My case: 32|64 x 64-bit.

I am still on the fence as to whether 32 GPRs is "fully sufficient", or
whether 64 GPRs can offer enough gain (in certain use-cases) to justify
its existence. It "kinda helps" for TKRA-GL but is seemingly kinda moot
for pretty much everything else.


The way the encodings for the 64 GPR case are handled is a bit hacky,
but it was a tradeoff (I came up with something which could be done
without breaking binary compatibility or requiring a separate operating
mode). Ironically, everything still works OK so long as "most of the
code" sticks to only using the low 32 GPRs (otherwise, some of the seams
might start to show).


> My 66000 ISA contains 6 decoding patterns; 1 for each of
> {instructions with 16-bit immediates, instructions with 12-bit
> immediates, scaled memory reference, 2-operand reg-reg,
> 1-operand reg-reg, 3-operand reg-reg }
>

Hmm (8 major for 32-bit):
FZnm_ZeoZ //3R "Rm, Ro, Rn"
FZnm_ZeZZ //2R "Rm, Rn"
FZnm_Zeii //3RI (Imm9/Disp9), "Rm, Imm9, Rn" / "(Rm, Disp9), Rn"
FZnZ_Zeii //2RI (Imm10), "Imm10, Rn"
FZZZ_ZeoZ //1R (Ro treated as Rn for these)
FZZn_iiii //2RI (Imm16), "Imm16, Rn"
FZdd_Zddd //Disp20 (Branch)
FZii_iiii //"LDIz Imm24, R0"

Add a few more if one counts the 16-bit ops:
ZZnm //2R
ZZni //2RI (Imm4)
ZZnZ //1R
Znii //2RI (Imm8)
ZZdd //Disp8 (Branch)

The Jumbo and Op64 encodings may or may not be considered new forms,
however they don't actually add "new" instruction-forms per-se, but
rather modify the existing encodings in predefined ways (and reuse the
existing 32-bit decoder; just with more bits "glued on" to the instruction).

One could potentially also interpret the 32-bit encodings as zero-padded
versions of a longer internal encoding space:
FEii_iiii_FZnm_Zeii //3RI, "Rm, Imm33, Rn"
...

With a few special cases, eg:
FEii_iiii_FAii_iiii //"LDIZ Imm48, R0"
FFii_iiii_FAii_iiii //"BRA Abs48"


There are more forms if one considers "minor" patterns, but these don't
really effect instruction encoding, but more how the various parts are
interpreted and mapped to the internal pipeline:
Logically, each instruction is decoded as if it had:
3 read ports, 1 write port;
A 33-bit immediate/displacement field;
Op / Sub-Op;
...

This then combines with an outer stage that deals with the bundle as a
whole, mapping SIMD ops to two lanes, along with Abs48 and Imm64
encodings (where the immediate can't fit into a single pipeline lane).

The output of this then being the configuration for the entire pipeline.


> The 12-bit immediate format is used for shift instructions and
> for Predicate instructions and positioned such that predicate
> instructions are only 1-bit different than their corresponding
> branch instruction. This saves 6×16-bit immediate encodings.
>

Differs in my case:
Shifts and friends use Imm9 forms;
However, because one doesn't need all 9 bits for a typical shift, had
also kinda shoe-horned SIMD shuffle instructions into the mix as well.

Or, in effect, shuffle can be imagined sort of like a conjoined twin
stuck onto the shift instruction (and a variable shift imagined as
masking-off the bit that causes it to be able to behave like a shuffle).

Or, one can also imagine that there could have been an alternate
universe where passing a sufficiently out-of-range value to the shift
instruction caused it to shuffle the value instead...


Predicate instructions work very differently in my case, having their
own copy of the 32-bit encoding space which mirrors the format of the
normal opcode space (just replacing the WEX bit with a True/False bit),
and the encoding spots that would have normally encoded Imm24 and Jumbo
being repurposed as Predication+WEX / "PrWEX" (but only applying to a
subset of the ISA).


> Scaled memory reference, 1-operand, 2-operand, 3-operand
> all have access to 32-bit or 64-bit immediates/displacements
> in substitution for a register. This eliminates any need to use
> instructions or waste registers pasting constants together.
>

Via Jumbo, these can all expand to 33 bits.

The 64-bit cases are a bit more limited, but not usually a huge issue.

There are also some Imm56 encodings "on paper" (these are in a similar
limbo as the 48-bit instruction encodings).

Seemingly the vast majority of what one needs a larger immediate for can
be handled via Imm33, where, say, only about 4% of the constants
actually go outside of this limit (the vast majority of these being
either MMIO pointers or irrational floating-point constants).

The Imm56 cases looking like they would be too rare to really be worth
bothering with at present.


> 1-operand, 2-operand, 3-operand instructions all have sign control
> over their operands. There is no SUB instruction My 66000 uses
> ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
> instructions from execution. The 2-operand group allows the
> 5-bit register specifier to be used as a 6-bit sign extended
> immediate, making ADD Rd,#1,-Rs2 easily encoded.
>

No equivalent in my case.


> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".
>

No equivalent.

I did the same thing as SuperH here:
CMPxx instructions twiddle the SR.T bit;
Branches / Predication / ... all operate off the SR.T bit.

Ironically, because of the way Verilog works, so much stuff hanging off
a single bit causes it to get something like 1000x more expensive.


> There are 2 kinds of conditional flow: branching and predication and
> each has 2 principle kinds of instructions:: condition is determined
> from a single bit in a register, or condition is determined by comparing
> a register with 0. In addition there are conditionless branches, jumps,
> and a special addition supporting PIC for method calls and switches.
> Compare-to-zero and branch can access certain HW know information
> that is not capable of being stored in a ISA register--this includes things
> like a query to the Memory Unit asking if it has seen any interference
> between the start of an ATOMIC sequence and "now". The exception
> and interrupts and std. return are also encoded here.
>

Internally, conditional branches are equivalent to predicated branches
in my case.

The partial extension is RISC-V style compare-and-branch instructions,
which have sort of a "shortcut" between the ALU and branch-instruction
logic (the ALU behaves as if it were performing a CMPxx, and the EX1
unit like it were doing a Branch, with the short-cut signal taking the
place of the SR.T bit).


These are effectively mandatory for RISC-V decoder support, but still
regarded as optional in BJX2 (and pretty rarely see much use in
practice, if compared with CMPxx+Bx).

It is possible that a cheaper alternative might have been to rig up
something in the decoder to allow the Bxx instruction to have been split
into a CMPxx+Bx pair in the pipeline (but, I lack a mechanism for this
at present).


> Memory reference instructions enable building of ATOMIC primitives
> that can touch as many as 8 cache lines of data in a single ATOMIC
> event. This is equivalent to the MIPS LL and SC except it operates
> over much larger chunks of data. This is sufficient to move an entry
> of a shared data structure from one place to another place in a single
> event. This minimizes the number of ATOMIC events that are needed,
> and comes with guarantees of forward progress.
>

No equivalent.


> The ST instruction can store a constant in either 5-bit sign extended
> form, or in 32-bit or 64-bit forms. No need to put a constant into a
> register in order to ST it to memory. This is along with the ability
> to use 32-bit or 64-bit displacement constants.
>

No equivalent.
Multi-op sequence required.


> There are 5 "special" memory reference instructions:: ENTER is used
> to setup a new stack, and save registers, EXIT is used to tear down the
> stack and restore registers, LDM loads multiple registers, STM stores
> multiple registers, and MM moves data from memory to memory.
> MM has the property that both cached and uncached memory smaller
> than a page is moved as single ATOMIC transfer. {PCIe can do this,
> so should CPUs attached to PCIe peripherals.} There is expected to
> be a sequencer in the memory unit that performs these out of the
> data-path.
>

No equivalent.

All this stuff needs to be handled manually via Load/Store sequences.
However, there is a 128-bit Load/Store that is also used (primarily) to
load/store pairs of registers in prolog/epilog sequences. This was
actually its primary purpose, but 128-bit SIMD was a secondary use-case,
along with it able to be faster for 'memcpy()' and similar than using
64-bit load/store instructions.


> The Floating Point group includes Transcendental instructions.
> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> that are only 1 constant different in the calculations. Ln2 takes
> only 14 cycles, sin takes 19 cycles. These are included because
> they actually do improve performance.
>

No equivalent, nearly all math functions done in software in my case.

Originally, there were no FDIV or FSQRT instructions either, but these
exist now.

Current timings are:
FDIV: 130 cycles
FSQRT: 384 cycles


The trig functions generally run from around 500 to 1000 cycles or so
(via unrolled Taylor expansion).

There were some older versions (that came with the C library) which
didn't unroll stuff (naive algorithm); these versions seem to weigh in
at closer to around 50000-200000 clock cycles or so (in cases where they
didn't get stuck in an infinite loop).

I replaced these versions with the unrolled versions because they were
basically unusable.


> Conversions between FP and FP or FP and INT are provided by
> 1 instruction (CVT) which has 49 variants to deal with 5 specified
> rounding modes and 1 implied rounding mode (current) any time
> a rounding could transpire. This falls into the category of "once
> you have the HW to do <say> ANINT (of FORTRAN) you have the
> 95% of the logic to do them all".
>

Separate instructions in my case.

Most FP<->FP cases are dedicated converters.

The INT<->FP cases are routed through FADD (noting as how one can do
some exponent tricks and get the FADD logic to do most of the heavy
lifting).

Rounding for the converters is generally hard-wired, usually either
Approx-RNE (scalar path), or Truncate (SIMD paths).


> The exception model is based on message passing (as is SVCs),
> rather than wandering through the high-level OS exception
> dispatcher. This model supports threads (processes or tasks)
> that are paranoid of the OS looking at their data (such as banking
> applications running on a home PC), and can indeed restrict the
> OS from looking at the address space.
>

My case:
Exception Code: { Address, Tag }
Tag is a 16-bit number
(15:12): Category
(11: 8): Target Core
( 7: 0): Magic Number

The category is used to figure out which offset relative to VBR to
branch to:
VBR basically points somewhere in kernel memory;
Holds a small table of branch instructions;
Exception dispatch branches to one of these branch instructions.

When this happens, the CPU also basically saves PC and SR to special
registers, swaps the stack pointers, and switches to a special operating
mode (which blocks further interrupts and disables the MMU and similar).

Interrupts can be routed along the Ring-Bus, as with memory requests,
intended so that CPU cores can throw interrupts at each other.

The 'RTE' instruction is actually implemented as another type of
"special" exception, which basically tries to pass control back to where
the exception came from (or, at least, where the relevant registers say
the exception came from).


> I/O devices are virtualized, and operate on the virtual address
> space of originating requestor. So while the I/O device can DMA
> directly into paranoid application address space, and while OS
> can verify the given space and bounds are acceptable, OS cannot
> look into that address space. This gets rid of the need of a secured
> mode of operation.
>
> Deferred procedure calls are handled as messages (argument
> setup + 1 instruction) with continuation. The messaging sub-system
> operates over both HyperVisor and GuestOS domains simultaneously.
> Anyone with a "method" can call that method and get a response
> even if that method is running under a different GuestOS.
>

No equivalent.

Basic IO range mapping in the physical address space.


No explicit hypervisor support, but shouldn't be too hard in-theory to
fake this part in software.

Would likely be easier than it would be for x86, given the use of a
software managed TLB.


> There is a 66-bit remapped address space--any thread can access
> 64-bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
> and ROM} The address space is configured to efficiently transport
> requests over a significant network (ala HyperTransport and Intel
> equivalent). DRAM is cache coherent, configuration is strongly ordered,
> MMIO is sequentially consistent, ROM is "lax").
>

My case: 48 or 96 bit virtual, 48 bit physical.

MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
and not allow another request to pass until the former request has
completed.

All MMIO accesses are fully synchronous from the L1 cache down to the
target device (unlike normal memory), though this does mean that
accessing MMIO carries a fairly steep performance penalty relative to
normal memory accesses.


It is possible that there could be a sub-mode to mimic full 64-bit on
top of 96-bit, by treating the addressing as (32,64) rather than
(48,48). This mode would not allow shoving tag bits into the high-order
parts of base-registers though (since these bits would be interpreted as
part of the virtual address).

The Full-64 mode would effectively use 64-bit pointers, unlike 96-bit
requiring the use of 128-bit pointers. The high bits (from the quadrant
address) would effectively still function as an address-range selector
(as when using 48-bit addresses in 96-bit mode; the local 48-bit space
being a sort of isolated island within the much bigger 96-bit space).


Well, and B-Trees for address translation, because the memory overheads
of an 8-level page table would suck.


> The system repeater transports requests from chip to chip, and
> amalgamates coherence requests so that the originator counts
> responses from cores on his chip, and the number of chips in
> the system (rather than counting from every core).
>
> Memory management cannot be turned off--My 66000 imple-
> mentations come out of reset with the MMUs turned on. HostBridge
> is configured with a MMU/TLB that uses exactly the same tables as
> CPUs and can share tables as applicable. Levels in the virtual
> address space translations can be skipped! So an application as
> simple as 'cat' can be managed with a single page of translation
> overhead.
>
> Memory management is inherently HyperVisor/GuestOS. Privilege
> is determined by the assortment of root pointers in use on a per
> invocation basis.
>
> GuestOS can activate a thread (taking it from a waiting state to
> running in a core) in a single instruction and remotely. So can
> HyperVisor.
>

Very different in my case.

I can use B-Trees as page-tables, which saves memory, but this is mostly
due to this being all software, and the CPU itself doesn't really care
what the page-table looks like.

Could almost mimic x86 segmented addressing, except that pages need to
be page-aligned. I guess if one had non-page-aligned address
translation, they could rig up the MMU to pretend to be an 8086 or
something, but this would make the TLB's job a little harder (and
require more bits in the TLBE).

Most obvious would be that such a mode would come at the expense of the
ASID, reinterpreting the ASIDs bits as a displacement to add to the
translated address. Would also require physical memory pages to be
contiguous, and probably some extra hackery to deal with A20 wrap-around
behavior (I guess one would have to look into how much MS-DOS software
depends on A20 wraparound working).


> Finally, there is no notion of one thread morphing into a different
> thread over a series of instructions manipulating control registers
> one by one. For example: an ISR cleanup handler takes a thread
> off a wait state queue, places it on a run state queue, and signals
> GuestOS to see what threads should be running "right now". This
> is all 1 instruction and 1 cycle as far as the core performing the
> instruction sees.
>

Yeah, no equivalent.


> My 66000 is not just another ISA, it is a rethink of most of the components
> that make up a system. A context switch from one thread to another
> within a single GuestOS is 10 cycles. A context switch from one thread
> to a thread under a different GuestOS remains 10 cycles. The typical
> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> across GuestOSs.
>
> OH, and BTW, The FP transcendentals are patented.

I would assume you mean FP transcendentals in hardware (in whatever way
they are implemented), as opposed to in-general.

Their existence in things like "math.h" and so on would likely preclude
any sort of patent protection in the "in general" sense.


Very different, I have doubts about how well a lot of this could be
pulled off in a low-cost implementation. Best I can come up with at the
moment would effectively amount to faking it using lots of microcode or
a software-based emulation layer.


I also consider my ISA to be "fairly unique", albeit in different ways
(and a little more conservative in terms of implementation concerns).

MitchAlsup

Jun 23, 2022, 9:17:25 PM

It [the one-instruction stack save/restore] is a major contributor to getting as small as it got.
>
> Add some accumulator ops and most instructions will fit in 16 bits ops with
> ease, and you have the extra decode stage to do it anyway.
<
I looked at this a few years ago and the damage to long-term ISA growth
was catastrophic. As it is, I have nearly ½ of the OpCode space in each
OpCode group left for the future, and can PARSE instructions in 31 gates
with only 4 gates of delay. All that goes out the window with a meaningful
16-bit "extension". I pass.
>
> I would argue that 8 bit opcodes are best when you have an accumulator in
> your 32 register RISC design, but that is a bridge too far for most.
<
My 66000 only has 59 total instructions. What makes you think you need 256 ?
>
> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?
<
I have absolutely no interest in things that small. IoT devices don't need
a HyperVisor, or even that much of a supervisor. I have no interest in
register sizes smaller than 64-bits. And quite frankly, say you did get a
design that small and into production, you have to sell billions (maybe
trillions) of them at $0.05 to pay for the design team and recurring
engineering expenses.
<
If you do, more power to you.

MitchAlsup

Jun 23, 2022, 9:38:46 PM

On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
> On 6/22/2022 8:03 PM, MitchAlsup wrote:
<snip>
> > The Floating Point group includes Transcendental instructions.
> > Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> > that are only 1 constant different in the calculations. Ln2 takes
> > only 14 cycles, sin takes 19 cycles. These are included because
> > they actually do improve performance.
> >
> No equivalent, nearly all math functions done in software in my case.
>
> Originally, there were no FDIV or FSQRT instructions either, but these
> exist now.
>
> Current timings are:
> FDIV: 130 cycles
> FSQRT: 384 cycles
>
Mc 88100 did these in: divide = 56 cycles and SQRT in ~66
Mc 88120 did these in: divide = 17 cycles and SQRT in 22
>
> The trig functions generally run from around 500 to 1000 cycles or so
> (via unrolled Taylor expansion).
<
You need to use Chebyshev coefficients--more accurate, sometimes fewer
terms, always better error bounds.
>
<<snip>
> My case: 48 or 96 bit virtual, 48 bit physical.
>
> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
> and not allow another request to pass until the former request has
> completed.
<
What are you going to do when there are 24 CPUs in a system and
everybody wants to write to the same MMI/O page ?
>
> All MMIO accesses are fully synchronous from the L1 cache down to the
> target device (unlike normal memory), though this does mean that
> accessing MMIO carries a fairly steep performance penalty relative to
> normal memory accesses.
>
The penalty is inherent in the requirements. However, My 66000 can ameliorate
the latency by grouping multiple writes to neighboring MMI/O control registers
into a single bus transaction. In theory, one can write all the necessary stuff
into the control registers to cause a disk drive to DMA a disk sector to
wherever it is wanted, in a single write transaction to MMI/O and a single
DMA write transaction when the data returns.
>
<
> > My 66000 is not just another ISA, it is a rethink of most of the components
> > that make up a system. A context switch from one thread to another
> > within a single GuestOS is 10 cycles. A context switch from one thread
> > to a thread under a different GuestOS remains 10 cycles. The typical
> > current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> > across GuestOSs.
> >
> > OH, and BTW, The FP transcendentals are patented.
> I would assume you mean FP transcendentals in hardware (in whatever way
> they are implemented), as opposed to in-general.
<
You might be surprised at what was allowed in the claims.
>
> Their existence in things like "math.h" and so on would likely preclude
> any sort of patent protection in the "in general" sense.
>
Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
different because of what one can do inside a HW function unit compared
to what one can do using only instructions) with several unique features.
They even bother to get the inexact bit set correctly.
>
> Very different, I have doubts about how well a lot of this could be
> pulled off in a low-cost implementation. Best I can come up with at the
> moment would effectively amount to faking it using lots of microcode or
> a software-based emulation layer.
>
Microcode generally refers to a control machine interpreting instructions.
Is a function unit run by ROM sequencer microcode ? What if the ROM got
turned into equivalent gates: Is it still microcode, or just a sequencer ?
In any event there are only 3 different sequences used (reminiscent of
Goldschmidt DIV and SQRT sequences).
>
> I also consider my ISA to be "fairly unique", albeit in different ways
> (and a little more conservative in terms of implementation concerns).
<
And hard to read..............

BGB

Jun 24, 2022, 1:15:18 AM

On 6/23/2022 8:38 PM, MitchAlsup wrote:
> On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
>> On 6/22/2022 8:03 PM, MitchAlsup wrote:
> <snip>
>>> The Floating Point group includes Transcendental instructions.
>>> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
>>> that are only 1 constant different in the calculations. Ln2 takes
>>> only 14 cycles, sin takes 19 cycles. These are included because
>>> they actually do improve performance.
>>>
>> No equivalent, nearly all math functions done in software in my case.
>>
>> Originally, there were no FDIV or FSQRT instructions either, but these
>> exist now.
>>
>> Current timings are:
>> FDIV: 130 cycles
>> FSQRT: 384 cycles
>>
> Mc 88100 did these in / = 56 and SQRT in ~66
> Mc 88120 did these in / = 17 and Sqrt in 22

It is partly based on the strategy used:
Rig the FMUL unit into a feedback loop;
Wait N cycles for answer to converge;
Assume it has converged on the answer.

Generally seems to take roughly this long for the algo to converge on
the answer.

When I first re-added FDIV, it was using the same basic algo (just with
slightly different inputs), and took a similar number of clock-cycles.

Then I had the idea that I could tweak a few things in the Shift-Add
integer divider, and get it to also do FDIV. Though, the way it was
rigged up still needs ~ 130 cycles, but 130 is still less than 384.
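
The "feedback loop" scheme is essentially Newton-Raphson refinement of a
reciprocal; a minimal C sketch of the dataflow (the seed and pass count are
placeholders -- real hardware would use a small lookup table for the seed,
and this assumes d is a positive, normal value):

#include <math.h>
/* y converges toward 1/d; the relative error roughly squares on every pass. */
static double recip_nr(double d) {
    int e;
    frexp(d, &e);                    /* d = m * 2^e, with 0.5 <= m < 1       */
    double y = ldexp(1.0, -e);       /* crude seed, within a factor 2 of 1/d */
    for (int i = 0; i < 6; i++)      /* enough passes for double precision   */
        y = y * (2.0 - d * y);       /* one FMUL/FMA trip around the loop    */
    return y;                        /* a/b is then computed as a * (1/b)    */
}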


>>
>> The trig functions generally run from around 500 to 1000 cycles or so
>> (via unrolled Taylor expansion).
> <
> You need to use Chebyshev coefficients--more accurate sometimes fewer
> terms, always better error bounds..


Possible.

There are also a few faster algos, such as "lookup and interpolate",
but, while faster, these don't give sufficient precision to really be a
good option for the "math.h" functions (assumed to be accurate, even if
not the fastest possible).

There is also CORDIC, but I haven't really messed with it.


In any case, unrolled Taylor expansion is a few orders of magnitude
faster than calculating an exponential and factorial and performing a
floating-point divide and similar every time around the loop...
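
For reference, the shape of such an unrolled/Horner evaluation (a sketch;
the coefficients shown are the truncated Taylor ones for illustration --
the Chebyshev/minimax fit Mitch mentions has slightly different values for
the same degree and a smaller worst-case error):

/* Odd polynomial for sin(x) on a reduced range, evaluated with Horner's rule:
   no factorials, divides, or pow() anywhere in the loop. */
static double sin_poly(double x) {
    static const double c[5] = {
         1.0,
        -1.0 / 6.0,        /* -1/3! */
         1.0 / 120.0,      /*  1/5! */
        -1.0 / 5040.0,     /* -1/7! */
         1.0 / 362880.0,   /*  1/9! */
    };
    double x2 = x * x;
    double p = c[4];
    for (int i = 3; i >= 0; i--)
        p = p * x2 + c[i];
    return x * p;
}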

I was not the person who wrote that code originally, not sure why they
originally wrote it this way.


>>
> <<snip>
>> My case: 48 or 96 bit virtual, 48 bit physical.
>>
>> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
>> and not allow another request to pass until the former request has
>> completed.
> <
> What are you going to do when there are 24 CPUs in a system and
> everybody wants to write to the same MMI/O page ?

It all gets serialized to them accessing it one at a time.

Though, ideally, only device drivers and similar should be accessing
MMIO, so this isn't likely to be a huge issue.

By the time I get to 24 cores, will have probably came up with a
different solution.


There is also the option of putting device stuff on the ringbus. I had
partly already started going this way for VRAM (it is faster to write to
the framebuffer by going through the RAM interface than by going through
the MMIO interface).

However, for accessing hardware devices, in general, one kind of wants
"slower but strictly synchronous" IO over "faster but chaotic" IO.

For VRAM, it is a little different, because generally one is trying to
push several MB/sec out to the screen and don't really care if things
are strictly in-order (if things arrive in the framebuffer in a slightly
different order than they were stored into the L1 cache, who cares?...).


>>
>> All MMIO accesses are fully synchronous from the L1 cache down to the
>> target device (unlike normal memory), though this does mean that
>> accessing MMIO carries a fairly steep performance penalty relative to
>> normal memory accesses.
>>
> The penalty is inherent in the requirements. However, My 66000 can ameliorate
> the latency by grouping multiple writes to neighboring MMI/O control registers
> into a single bus transaction. In theory, one can write all the necessary stuff
> into the control registers to cause a disk drive to DMA a disk sector wherever
> in a single write transaction to MMI/O and a single DMA write transaction
> when data returns.

I was generally accessing MMIO 32 or 64 bits at a time (depending on the
device).

No DMA at present, pretty much everything is still polling IO and similar.


So, for example, for SDcard:
Store a byte to Data register.
Load Control register.
OR a bit in loaded value.
Store modified value to Control register.
Dummy load from Status register (1)
Loop:
Load from Status register.
If BUSY, Continue.
Load byte from Data register.
Repeat until bytes have been moved.
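
In C, that sequence is roughly the following (a sketch only: the register
addresses and bit positions are made up for illustration, and only the
original byte-at-a-time form is shown):

#include <stdint.h>
#define SD_DATA   (*(volatile uint8_t  *)0xF0001000u)  /* Data register (address assumed)  */
#define SD_CTRL   (*(volatile uint32_t *)0xF0001004u)  /* Control register (assumed)       */
#define SD_STAT   (*(volatile uint32_t *)0xF0001008u)  /* Status register (assumed)        */
#define CTRL_XFER (1u << 0)                            /* "start transfer" bit (assumed)   */
#define STAT_BUSY (1u << 0)                            /* BUSY bit (assumed)               */

static uint8_t spi_xfer_byte(uint8_t out) {
    SD_DATA = out;                /* store byte to Data register              */
    SD_CTRL |= CTRL_XFER;         /* load Control, OR in the bit, store back  */
    (void)SD_STAT;                /* dummy load from Status register, see (1) */
    while (SD_STAT & STAT_BUSY)   /* poll Status until the transfer completes */
        ;
    return SD_DATA;               /* load result byte from Data register      */
}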

As noted, in the original form, this hit a wall at around 600 K/s.


The modified interface adds a QDATA register (64-bit), and a different
control register bit for "Transfer 8 bytes".

This QDATA version instead hits a wall at around 5 MB/s.

This interface is sufficient for SPI, but if I went to a faster mode,
pretty much as soon as I made the switch, I would be at the bandwidth
limit of this interface (and would then need to come up with something
different).

*1: When operating at "hitting the wall" speeds, the first Status load
will almost invariably be BUSY, but the second load will typically be
"not BUSY", since the SPI transfer would have completed by the time it
has taken for the request to transfer all the way around the ring and
back again. So, a dummy load can make it faster.


Say, 13 MHz SPI gives 1.5 MB/s, but 13 MHz in UHS-I mode would boost
this up to 13 MB/s (basically, pushing 4 bits per clock-edge).

Much faster than this, and I almost may as well consider going "full
hardware" and memory mapping the SDcard...


>>
> <
>>> My 66000 is not just another ISA, it is a rethink of most of the components
>>> that make up a system. A context switch from one thread to another
>>> within a single GuestOS is 10 cycles. A context switch from one thread
>>> to a thread under a different GuestOS remains 10 cycles. The typical
>>> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
>>> across GuestOSs.
>>>
>>> OH, and BTW, The FP transcendentals are patented.
>> I would assume you mean FP transcendentals in hardware (in whatever way
>> they are implemented), as opposed to in-general.
> <
> You might be surprised at what was allowed in the claims.

OK.

>>
>> Their existence in things like "math.h" and so on would likely preclude
>> any sort of patent protection in the "in general" sense.
>>
> Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
> different because of what one can do inside a HW function unit compared
> to what one can do using only instructions) with several unique features.
> They even bother to get the inexact bit set correctly.

OK.

In my case, they don't generally get used that heavily IME, so software
is OK so long as it is not unreasonably slow.


In cases where they would have gotten used more heavily, such as sin/cos
being used for the water-warping effects in Quake, lookup tables had
been used instead to good effect.

It is also possible to reduce these lookup tables to half float
precision, since the water warp effect doesn't seem to mind all that much.


>>
>> Very different, I have doubts about how well a lot of this could be
>> pulled off in a low-cost implementation. Best I can come up with at the
>> moment would effectively amount to faking it using lots of microcode or
>> a software-based emulation layer.
>>
> Microcode generally refers to a control machine interpreting instructions.
> Is a function unit run by ROM sequencer microcode ? What if the ROM got
> turned into equivalent gates: Is it still microcode, or just a sequencer ?
> In any event there are only 3 different sequences used (reminiscent of
> Goldschmidt DIV and SQRT sequences,)


Dunno. I was just sort of imagining doing it as a big ROM on top of a
RISC-style core, with chunks of the ISA being effectively treated like
special function calls into this ROM.

It is likely that parts of the Verilog would be procedurally generated,
such as the entry points into the various functions within this ROM.


I had considered something like this a few times in my case, but
generally ended up taking a different approach:
If I can't do it directly in hardware, I won't do it at all.

Only reason I ended up with the functionality of the RISC-V 'M'
extension was because I had thought up a way to implement it affordably.


Even then, it wasn't until earlier today that I got around to adding
"proper" support for 32-bit integer divide (reducing its latency from 68
to 36 cycles). Mostly because in some cases it was being used often
enough to become significant.

But, then I ended up needing to burn a few more 3R encoding spots to add
DIVx.L and MODx.L instructions.

Would have almost just assumed making them be Op64 encodings, but then
it seemed stupid to have DIVS.Q be a 32-bit encoding but DIVS.L be
64-bit, when DIVS.L would be somewhat more common.



>>
>> I also consider my ISA to be "fairly unique", albeit in different ways
>> (and a little more conservative in terms of implementation concerns).
> <
> And hard to read..............


I would suspect a bit less so than IA-64 though.


Granted, writing ASM effectively in this case does generally require
mentally keeping track of the pipeline state and which combinations of
instructions may or may not be bundled (vs x86 which will basically just
accept whatever one throws at it, running at more-or-less full speed).

Though, as an assistance feature, the emulator will generally lint this
stuff and turn illegal combinations into a breakpoint.


Syntax mostly developed out of the SuperH syntax, with a few bits of
TMS320 and similar thrown in, shouldn't be that difficult to figure out.

...

Timothy McCaffrey

unread,
Jun 24, 2022, 10:27:53 AM6/24/22
to
On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:

> X86-64 has crap code density, your one instruction stack save restore alone
> should make you significantly better, unless perhaps you have gone 32+32.
>
The X86-64 was left with a lot of baggage because of a bad design decision to
try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
great if the instruction encoding had been refactored, and some other cruft
removed (e.g. only being able to use CL for a dynamic shift count).

It would have also been a great time to be able set up the encodings so that
the instruction parser could figure out the instruction length from the first chunk
(whatever size that was, I suspect 16 bit chunks make sense).

- Tim

John Dallman

unread,
Jun 24, 2022, 11:01:49 AM6/24/22
to
In article <2f5c8378-de57-4ef2...@googlegroups.com>,
timca...@aol.com (Timothy McCaffrey) wrote:

> The X86-64 was left with a lot of baggage because of a bad design
> decision to try and reuse the X86 decoder. Most of the remaining
> 1 byte opcodes are either barely used (STC, CLC) or deprecated
> (PUSH/POP).

Remember that the design was done by AMD, who have to tread carefully to
avoid giving Intel an excuse to claim they're breaching their X86 license
in some way.

At the time, Intel were still under the impression that Itanium was going
to conquer the world. When they realised better, AMD had Opterons on the
market. Intel wanted to build an AMD-incompatible 64-bit x86 to drive AMD
out of the market. They were restrained by Microsoft, who weren't
interested in supporting two different extended x86 ISAs.

Given how we got here, things could be a lot worse.

John

Thomas Koenig

unread,
Jun 24, 2022, 11:19:55 AM6/24/22
to
Brett <gg...@yahoo.com> schrieb:

> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?

It will be hard to beat the ARM Cortex-M based microcontrollers, which
are firmly embedded in the market, for which a lot of software has been
written, and which cost a bit more than four dollars per unit.

And if that's too expensive and you do not need the performance,
you can always use a MSP430-based one for considerably less,
less than a dollar at quantity.

The ROM on the latter is somewhere between 1KB and whatever you're
willing to pay for, and the RAM 256 bytes or more. But of course
you're still getting some analog hardware thrown in, such as an
ADC or a comparator.

Not a lot of savings, I'd say.

MitchAlsup

unread,
Jun 24, 2022, 12:15:58 PM6/24/22
to
Yes, it was MS that made intel do x86-64. Intel had a model that was within
spitting distance and MS told them the x86-64 port was already done. So,
for the first time in its life, Intel complied.
<
But look at how they have diverged after 2 decades of being almost
identical !!

BGB

unread,
Jun 24, 2022, 12:49:09 PM6/24/22
to
On 6/24/2022 10:19 AM, Thomas Koenig wrote:
> Brett <gg...@yahoo.com> schrieb:
>
>> How big is the code store needed for an IOT (Internet Of Things smart
>> toaster) code stack? And what is the savings for the next size down?
>
> It will be hard to beat the ARM Cortex-M based microcontrollers, which
> are firmly embedded in the market, for which a lot of software has been
> written, and which cost a bit more than four dollars per unit.
>

Another ISA which could potentially compete with Cortex-M might be
RISC-V RV32IMC or similar.

Pros/cons with the 'C' extension though: it is 'dog chewed' to the point
(somewhat more so than Thumb) where I wonder about the decoding cost.


Something like RV32GC would likely be a bit more expensive, as the A/F/D
extensions do a lot of stuff that I have doubts about being able to pull
off cheaply.

A more cost-effective option might be:
RV32IMZfinxZdinxC
But, not a lot of code is built for this.


> And if that's too expensive and you do not need the performance,
> you can always use a MSP430-based one for considerably less,
> less than a dollar at quantity.
>

For hobbyist use, the ones in DIP packaging (MSP430Gxxxx) were typically
being sold at several $ per chip last I bought any, but dunno about now.

QFP variants were cheaper per-chip, but QFP is much less usable (can't
use it with perfboard or DIP sockets).


They were generally cheaper than AVR8 chips, though the AVR8's typically
had more RAM and ROM space.

Performance per clock seemed to be better on MSP430 than AVR8, IME.
Though, in either case, one is not usually going to be using them for
performance-intensive tasks.

IIRC:
MSP430, 16 registers, each 16 bit, Mem/Mem addressing, Von Neumann
AVR8: 32x8b or 16x16b, Load/Store, Harvard (Split code/data spaces)


> The ROM on the latter is somewhere between 1KB and whatever you're
> willing to pay for, and the RAM 256 bytes or more. But of course
> you're still getting some analog hardware thrown in, such as an
> ADC or a comparator.
>
> Not a lot of savings, I'd say.


From what I remember, for 'G' style MSP430 chips:
ROM: ~ 4K to 32K
RAM: ~ 256B to 2K
Address space, something like:
MMIO 0000..01FF
RAM 0200..09FF (Say, 0200..02FF for 256B)
(More MMIO and/or RAM, depending on device)
ROM 8000..FFFF
For smaller ROM sizes, the lower bound moves upward
FFF0..FFFF is reset/interrupt vectors.

The G chips were typically available in DIP16/20/24 packaging IIRC.

Multi-channel ADC/DAC/... are common.
IO pins are typically capable of both In/Out in digital mode;
ADC/DAC is typically limited to certain pins;
...


The 'X' chips have a larger address space, and may have considerably
more RAM and ROM space (within a 20-bit address space). But, typically
only available in QFP packaging or similar.


One can do bit-banged SPI on the MSP430, but practically one would be
limited to fairly slow IO speeds (kHz territory). Low-speed serial is
also possible.
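A minimal sketch of what bit-banged SPI output looks like in C, mode 0,
MSB first (the pin assignments are assumptions; P1OUT and the BITx macros
are the usual msp430.h definitions):

#include <msp430.h>
#include <stdint.h>

#define PIN_SCLK BIT4   /* hypothetical pin choices */
#define PIN_MOSI BIT5

/* Each bit costs several instructions, which is what keeps the effective
   clock rate low (well below what a hardware SPI block could do). */
static void spi_bitbang_write(uint8_t b)
{
    for (int i = 0; i < 8; i++) {
        if (b & 0x80) P1OUT |= PIN_MOSI;    /* present data bit */
        else          P1OUT &= ~PIN_MOSI;
        P1OUT |= PIN_SCLK;                  /* clock high: slave samples */
        P1OUT &= ~PIN_SCLK;                 /* clock low */
        b <<= 1;
    }
}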

...

BGB

unread,
Jun 24, 2022, 1:51:56 PM6/24/22
to
Though, more extensive redesign would have made it effectively an
entirely new ISA, just with an x86 backward-compatibility mode.

But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
the baseline case (16/32), though one is mostly limited to 32-bit
encodings for WEX bundles.


As can be noted, x86-64 code density ranges from "kinda meh" to
"spectacularly bad", depending mostly on the compiler.

That said, i386 and Thumb2 are both a bit more competitive, kinda harder
to beat them on the code-density front.


I am not entirely sure what exactly is going on here (for x86-64) to
make the code density so bad (it is pretty bad even in size-optimized
modes). The difference is often somewhat outside of what could easily be
explained just by the REX prefix and similar.


Say, for example:
x86 does an Abs32 load, 6 bytes;
x86-64 does a RIP+Disp32 Load, 7 bytes.
Delta: 17% bigger.

Or:
x86 does an 2R-ADD, 2B
x86-64 does a 2R-ADD (w/ REX), 3B
Delta: 50% bigger.

Then again, things like REX prefix and tending to save/restore more
stack variables and similar could be a factor.

Possibly also an increase in 64 bit constant loads, ...

But, often times, the expansion is significantly larger than the
theoretically expected 20-50% or so.

Timothy McCaffrey

unread,
Jun 24, 2022, 3:31:52 PM6/24/22
to
You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
you have to waste a register loading the constant and then use it.
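For example (trivial C, with a made-up constant), most compilers will
materialize the value with a separate MOV into a scratch register and then
do the operation, since the ALU forms only take a sign-extended 32-bit
immediate:

#include <stdint.h>

/* x86-64 ADD cannot encode a full 64-bit immediate, so this typically
   compiles to something like
       movabs rax, 0x123456789ABCDEF0   ; 10-byte instruction
       add    rax, rdi                  ; plus the register it burned
   whereas a constant that fits in a sign-extended 32 bits would fold
   straight into the ADD. */
uint64_t add_big_const(uint64_t x)
{
    return x + 0x123456789ABCDEF0ull;
}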

The calling ABI is much different than the 386, where you just usually pushed stuff
on the stack. Now you have some stuff in registers (which registers depends on whether
you are running Windows or Linux) and some stuff on the stack.

Some registers you are required to save before the call (caller save)
and others after the call (callee save).

Since you are not saving values with a simple push (usually) in the subroutine, you go
from a 1 byte PUSH to a 5 or 6 byte MOV to stack.

Simple INC/DEC doubled in size, so probably most code now uses ADD instead (which
is probably faster because you don't have a partial CC update).

Due to stack and structure storage expanding for 8 byte values (e.g. pointers),
you can only store so many of them there using an 8 bit offset. Unfortunately,
the next step up is a 32 bit offset.

And probably a bunch of other stuff I've forgotten....

- Tim

MitchAlsup

unread,
Jun 24, 2022, 3:56:28 PM6/24/22
to
I worked on some x86 decode mechanisms while at AMD, and learned a lot about
x86 encoding {which I still consider BETTER than SPARC-Vis}
<
My 66000 ISA format and encoding is a direct result of this, and indeed, follows
your tenet of having everything needed to determine size in the first word.
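As a software analogue, determining length from the first word reduces to a
small combinational function of a few bits. A toy sketch in C (the field
layout and major-opcode values here are invented for the example, not the
actual My 66000 encoding):

#include <stdint.h>

/* Returns the instruction length in 32-bit words, looking only at the
   first word.  Because only a handful of bits feed the decision, the
   same function maps to a few gate delays in hardware. */
static unsigned insn_len_words(uint32_t first_word)
{
    unsigned major = (first_word >> 26) & 0x3F;  /* hypothetical field */
    if (major == 0x3E) return 2;   /* word + 32-bit immediate/displacement */
    if (major == 0x3F) return 3;   /* word + 64-bit immediate/displacement */
    return 1;                      /* plain 32-bit instruction */
}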
>
> - Tim

MitchAlsup

unread,
Jun 24, 2022, 4:02:53 PM6/24/22
to
On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
> >
> > Possibly also an increase in 64 bit constant loads, ...
> >
> > But, often times, the expansion is significantly larger than the
> > theoretically expected 20-50% or so.
<
> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
> you have to waste a register loading the constant and then use it.
<
My 66000 does not have this problem. AND while BGB may be able to get by with
this restriction now, you won't in 10 years hence.
>
> The calling ABI is much different than the 386, where you just usually pushed stuff
> on the stack. Now you have some stuff in registers (which registers depends on whether
> you are running Windows or Linux) and some stuff on the stack.
>
> Some registers you are required to save before the call (caller save)
> and others after the call (callee save).
<
With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
like R9-R15 are simply temps used whenever and forgotten.
>
> Since you are not saving values with a simple push (usually) in the subroutine, you go
> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
<
I go to a single instruction that pushes as much stuff as desired (by compiler)
and then allocates a stack frame for the local-variables.

Thomas Koenig

unread,
Jun 24, 2022, 4:40:20 PM6/24/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

First, thanks again for the good explanations.

> DBLE is an instruction modifier that supplies register encodings and
> adds 64-bits to the calculation width of the modified instruction. Applied
> to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> and presto: we get FP128 by adding exactly 1 instruction,

This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
introduced this; POWER still has it as the only option up to POWER8.
POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
from double double to IEEE FP for POWER 9+ (I helped a bit in that
transition, for gfortran).

> the compiler
> can pick any 8 registers it desires alleviating register allocation concerns.

Eight registers is a lot if there are only 32 to go around...

> DBLE is a "get by" kind of addition, frowned upon by Hennessey.
>
> I can envision a SIMD instruction modifier that defines the SIMD parameters
> of several subsequent instructions and allows 64-bit SIMD to transpire.
> I am still thinking about these. What I cannot envision is a wide SIMD
> register file--this is what VVM already provides.

I think a lot of the use cases could also be covered if the
processor were able to process int8 ... int64 and fp16..fp64
(with fp128 being an exception) at the width of an SIMD unit,
so something like

MOV R4,#0
VEC {R5}
LDUH R6,[R10+R4] ! Load half float into R6
LDUH R7,[R11+R4] ! Second one
FADD.F2 R7,R7,R6
STH R7,[R12+R4]
ADD R4,R4,#2
LOOP (something)

could be executed at full SIMD width. Is this feasible? Or
would it be better to do this kind of thing via SIMD?
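For reference, the scalar meaning of the loop above is just an elementwise
half-precision add; a C rendering (assuming a compiler with the _Float16
extension; the names are mine):

#include <stddef.h>

/* What the VEC/LOOP body computes, element by element.  Whether it runs
   one lane at a time or at the full SIMD width of the machine is then
   purely an implementation question. */
void vec_fadd_f16(_Float16 *dst, const _Float16 *a,
                  const _Float16 *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}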

Ivan Godard

unread,
Jun 24, 2022, 5:02:17 PM6/24/22
to
Binary compatibility is a real bear. Either leave a *lot* of free
entropy (my66), or push the problem to the software and build machinery
(Mill), or suffer bloat and slow decode (x86, RISCV).

Brett

unread,
Jun 24, 2022, 5:54:05 PM6/24/22
to
You are missing the I in internet, no wifi I can find in that chip.

Talking about a network stack to talk to your phone. Smart color changing
lightbulbs and soon all the appliances in your home, washer, dryer, stove,
microwave, thermostat, security cameras, just everything.

Plus your home router, which uses a much more powerful wifi block and CPU.

There are markets here that will pay for better code density, assuming a
network stack is significant?

MitchAlsup

unread,
Jun 24, 2022, 5:55:11 PM6/24/22
to
On Friday, June 24, 2022 at 3:40:20 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> First, thanks again for the good explanations.
> > DBLE is an instruction modifier that supplies register encodings and
> > adds 64-bits to the calculation width of the modified instruction. Applied
> > to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> > we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> > and presto: we get FP128 by adding exactly 1 instruction,
<
> This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
<
No this means FP with 14-bit exponent and 113-bit fraction (if my math is
right) pairs of double are available using exact FP arithmetics via CARRY
not DBLE.
<
> introduced this; POWER still has it as the only option up to POWER8.
> POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
> from double double to IEEE FP for POWER 9+ (I helped a bit in that
> transition, for gfortran).
> > the compiler
> > can pick any 8 registers it desires alleviating register allocation concerns.
<
> Eight registers is a lot if there are only 32 to go around...
<
It is not a machine designed to crunch FP128 all the time.
It is a machine designed so the occasional use is satisfactory.

MitchAlsup

unread,
Jun 24, 2022, 5:57:42 PM6/24/22
to
I don't see it:: a 10G or 100G network interface already has a memory footprint
(for its own buffering concerns) large enough that skimping on the CPU
and ROM seems a waste.

BGB

unread,
Jun 24, 2022, 8:55:52 PM6/24/22
to
On 6/24/2022 3:02 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
>> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
>>>
>>> Possibly also an increase in 64 bit constant loads, ...
>>>
>>> But, often times, the expansion is significantly larger than the
>>> theoretically expected 20-50% or so.
> <
>> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
>> you have to waste a register loading the constant and then use it.
> <
> My 66000 does not have this problem. AND while BGB may be able to get buy with
> this restriction now, you won't in 10 years hence.

Still better in my case than it is in RISC-V where this case would
require a memory load...


As-is (in BJX2), encodings have been defined ("on paper"), eg, for Imm56
encodings for some instructions; just they haven't been put into use yet.

Partly it is a combination of:
Non-zero decoding cost;
It is pretty rare to exceed the existing 33-bit limit for 3RI ops.

Spending 1 extra cycle to load a constant into a register isn't usually
a huge issue.


IME, the vast majority of 64-bit constant loads thus far tend to be
Binary64 constants; usually irrational or repeating 'double' constants
or similar (most other constants will be compacted down to a smaller
format).


Some constant-load stats (from my GLQuake port):
Imm8: 11% (Byte range)
Imm16: 70% (Int16 or UInt16)
Binary16: 12% (Double encoded as Half-Float)
Imm32/33: 4.4%
Imm33s: 2.2% (Int32 or UInt32, Zero/Sign Extend)
Imm32Hi: 1.1% (Int32 into high-order 32 bits, low 32 are 0)
Binary32: 0.8% (Double as a pair of Binary32)
2xBinary16: 0.3% (2xBinary32 as 2xBinary16)
Imm64: 2.5% (Fallback Case)


Some "rarely used" types:
Load value into the high 10 bits of target (rare);
Say: zzz0000000000000
Load bits into the middle of a 64-bit value (rare);
Say: 0000zzzzzzzz0000
Load 4xFP16 encoded as 4xFP8;
...
These cases seem to be rare enough to be mostly ignored.


Grouping constant loads by instruction length:
16-bit: 11%
32-bit: 82%
64-bit: 4.4%
96-bit: 2.5%

Note that this is only for discrete constant loads, and does not count
immediate values or displacements.

Constant loads reflect ~ 7.4% of the total Jumbo prefixes, with the rest
going into immediate fields.

Calculating stats:
85% Imm9/Disp9
15% Imm33/Disp33

Rough estimate of upper-bound of overflowed immed cases:
Less than 3% (Excluding Load/Store ops)
Less than 0.4% (Including Load/Store ops)

Where the Imm/Disp balance is roughly:
89% Disp (Load/Store Displacements)
11% Imm (Immediates for ALU instructions and similar).


However, given that most of the 64-bit constants (dumped into a log) are
fairly obviously either MMIO addresses or floating-point constants, the
actual bound for overflowing the 33-bit immediate range is likely much
smaller.


I don't have a stat for the relative use of Jumbo between Imm and Disp
encodings, however (based on what I typically see in disassembly dumps),
I would estimate Disp to be the dominant case.


It can be noted that a significant chunk of the cases which are being
encoded as Imm33/Disp33 could also be handled by Imm17/Disp17 encodings
(my compiler doesn't typically use these unless the instruction is
*also* using XGPR).



While arguably code "could change" here, such as due to ever-expanding
memory usage, I suspect this is less likely to be an issue in a
statistical sense.


The main thing that would be the "likely existential risk" for this,
would be programs exceeding 4GB in the ".bss" section, which would
require a bigger displacement.

For x86-64, one would run into a similar problem if text+data+bss
exceeds 2GB (thus breaking one's ability to use RIP-relative addressing).



>>
>> The calling ABI is much different than the 386, where you just usually pushed stuff
>> on the stack. Now you have some stuff in registers (which registers depends on whether
>> you are running Windows or Linux) and some stuff on the stack.
>>
>> Some registers you are required to save before the call (caller save)
>> and others after the call (callee save).
> <
> With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> like R9-R15 are simply temps used whenever and forgotten.

That is presumably how it is supposed to be...


In my case, it is roughly a 50/50 split between caller save (scratch)
and callee save (preserved) registers.

For leaf functions, one wants a lot of scratch registers, and for
non-leaf functions, a lot of callee-save registers.

But, sadly, no party can be entirely happy:
Leaf functions wishing they could have more registers to play with,
without needing to save them first;
Non-leaf functions wishing they could have more registers for variables
which wont get stomped on the next call;
...


Can note that, IIRC:
Win64 gave a bigger part of this pie to callee-save;
SysV/AMD64 gave a bigger part of the pie to caller-save.

A roughly even split seemed like an easy answer, lacking any good way to
find a general/optimal balance across a range of programs.


Conceivably, it could also be possible to have a certain number of
"flexible" registers which a compiler could use to "fine tune" the
balance in the ABI, but these would be annoying at DLL/SO edges, as it
would require "worst case" handling (treating them like caller-save when
calling an import, and like callee-save for DLL exports).

In such an ABI, likely:
2/3: Nominally Callee Save
1/3: Caller Save / Scratch
With 1/3 of the register space able to be re-balanced from callee-save
to caller save by the compiler.


>>
>> Since you are not saving values with a simple push (usually) in the subroutine, you go
>> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
> <
> I go to a single instruction that pushes as much stuff as desired (by compiler)
> and then allocates a stack frame for the local-variables.
>


I once had PUSH/POP in BJX2, but then I dropped them (mostly for
cost-saving reasons; after noting that adjusting the stack-pointer and
then using a series of stores, or performing a series of loads and then
adjusting the stack pointer, could be similarly effective).

So, it is basically using Load/Store instructions...

However, in most cases:
MOV.X Rn, (SP, Disp4*8)
Can also be encoded in a 16-bit instruction format...


x86-64 would need 2x as many instructions here, and each instruction
would also need 5-bytes to encode, ...

So, roughly a 500% encoding-cost delta in this case for x86-64 vs BJX2
(2 instructions of 5 bytes each, versus a single 2-byte MOV.X).


Then again, can also note that I am often seeing around a 300% delta
between BJX2 and x86-64 in terms of ".text" sizes and similar.

Though, I still tend to fall a bit short of being able to match Thumb2
or similar at this game...

MitchAlsup

unread,
Jun 24, 2022, 9:36:04 PM6/24/22
to
On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
> On 6/24/2022 3:02 PM, MitchAlsup wrote:

> > With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
> > I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> > like R9-R15 are simply temps used whenever and forgotten.
> That is presumably how it is supposed to be...
>
>
> In my case, it is roughly a 50/50 split between caller save (scratch)
> and callee save (preserved) registers.
<
I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
can be used to carry arguments and results covering the 98%-ile.
>
> For leaf functions, one wants a lot of scratch registers, and for
> non-leaf functions, a lot of callee-save registers.
>
> But, sadly, no party can be entirely happy:
> Leaf functions wishing they could have more registers to play with,
> without needing to save them first;
> Non-leaf functions wishing they could have more registers for variables
> which wont get stomped on the next call;
> ...
>
>
> Can note that, IIRC:
> Win64 gave a bigger part of this pie to callee-save;
> SysV/AMD64 gave a bigger part of the pie to caller-save.
<
CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
IBM 360 had only preserved registers.
VAX had only preserved registers--both had 16 registers.
>
> A roughly even split seemed like an easy answer, lacking any good way to
> find a general/optimal balance across a range of programs.
>
The choice is a lot easier 50%/50% when you have 32 registers.
>
<snip>
> >
> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
> cost-saving reasons; after noting that adjusting the stack-pointer and
> then using a series of stores, or performing a series of loads and then
> adjusting the stack pointer, could be similarly effective).
<
Push instructions make::
PUSH R1
PUSH R2
PUSH R3
more expensive than:
SUB SP,SP,#12
ST R1,[SP+8]
> ST R2,[SP+4]
> ST R3,[SP]
due to the serial dependency.
<
> The peephole HW optimizer in K9 would perform this transformation.
{Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Ivan Godard

unread,
Jun 24, 2022, 10:30:17 PM6/24/22
to
And Mill has only preserved (from the view of the caller) but you don't
have to preserve them. And only has temps (from the viewpoint of the
callee) but you don't have to clear them.

David Brown

unread,
Jun 25, 2022, 4:15:02 AM6/25/22
to
Only a /very/ tiny proportion of network interfaces are 10 G or above.
Microcontrollers rarely have more than 100 Mbit Ethernet. The next big
thing in wired networking in the embedded world is two-wire Ethernet, to
get the convenience of Ethernet networking at low cost. It comes in 10
Mb and 100 Mb varieties (a 1 Gb variety may come eventually).

For wireless communication, speeds are usually even lower. Modern NBIOT
cellular systems are designed to be extremely low power, cheap, have
longer range (20 km more than 3G and the rest). You send packets of up
to about 200 bytes of data, perhaps once a day, with a delivery time of
several seconds. Perfect for environmental monitoring, finding your
sheep, and many other tasks.

For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
bandwidth is also fine. You can get away with a few hundred bytes ram
and still have enough to control a lightbulb, thermostat, etc.

The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
MSP430's, and other small devices. Code density matters for many of them.

(Of course it's a different matter for wireless cameras and all the
other devices that need high bandwidth.)

BGB

unread,
Jun 25, 2022, 11:45:17 AM6/25/22
to
Seems like it would also be fairly trivial to chop 10/100 Ethernet down
to a 4-wire variant as well, probably using RJ11 plugs or similar.
Advantage of 4-wire is that this could allow for PoE (and 4-wire
phone-wire could be cheaper than CAT5E or similar).

Could also be electrically compatible with existing hubs and switches
via an RJ11 to RJ45 adapter.


> For wireless communication, speeds are usually even lower.  Modern NBIOT
> cellular systems are designed to be extremely low power, cheap, have
> longer range (20 km more than 3G and the rest).  You send packets of up
> to about 200 bytes of data, perhaps once a day, with a delivery time of
> several seconds.  Perfect for environmental monitoring, finding your
> sheep, and many other tasks.
>
> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
> bandwidth is also fine.  You can get away with a few hundred bytes ram
> and still have enough to control a lightbulb, thermostat, etc.
>
> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
> MSP430's, and other small devices.  Code density matters for many of them.
>
> (Of course it's a different matter for wireless cameras and all the
> other devices that need high bandwidth.)
>

I would have figured a network stack would have been a bit much for this
class of device...

David Brown

unread,
Jun 25, 2022, 12:31:00 PM6/25/22
to
10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
each direction. Passing (non-isolated) DC power over these wires is
extremely simple, and requires nothing more than a few diodes and an LC
filter. Unfortunately, the PoE standards were developed by a committee
of morons that produced a ridiculously over-engineered system that is
too bulky and expensive to have caught on outside a few specific use-cases.

Even easier, however, is simply to pass the power over the spare pairs
in a standard 4-pair Ethernet cable.

The two-wire Ethernet standards already include support for simpler and
cheaper PoE solutions.

> Could also be electrically compatible with existing hubs and switches
> via an RJ11 to RJ45 adapter.
>
>
>> For wireless communication, speeds are usually even lower.  Modern
>> NBIOT cellular systems are designed to be extremely low power, cheap,
>> have longer range (20 km more than 3G and the rest).  You send packets
>> of up to about 200 bytes of data, perhaps once a day, with a delivery
>> time of several seconds.  Perfect for environmental monitoring,
>> finding your sheep, and many other tasks.
>>
>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>> bandwidth is also fine.  You can get away with a few hundred bytes ram
>> and still have enough to control a lightbulb, thermostat, etc.
>>
>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>> MSP430's, and other small devices.  Code density matters for many of
>> them.
>>
>> (Of course it's a different matter for wireless cameras and all the
>> other devices that need high bandwidth.)
>>
>
> I would have figured a network stack would have been a bit much for this
> class of device...

I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
microcontroller.

However, the network stack needed for small Wifi or NB-IOT systems is
vastly smaller than you need for a full IP and TCP/IP stack.

BGB

unread,
Jun 25, 2022, 3:01:30 PM6/25/22
to
That is kinda the point of how it would be electrically compatible:
Use the pairs that are in-use in 10/100;
Skip the other wires;
Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.

Normal twisted-pair telephone wire would probably have sufficient
electrical properties for 10/100 in many cases.

Only thing is that it would require an adapter to plug RJ11 into RJ45,
though other options:
Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
Cable which has RJ11 on one end but RJ45 on the other.
(Side-stepping the need for an adapter at the switch).


Probably put the pins in the plug in such a way that it doesn't have
adverse effects if someone tries to plug a telephone into it.

Say:
NC, A+, B-, B+, A-, NC

Normal phone only connecting to the B pair (vs across the A/B pairs).

With the POE system I am imagining, if one did connect across the A/B
pairs, plugging a phone into it would result in it ringing continuously,
whereas if only the B pair is connected (probably the TX pair from the
phone's end), it would be silent and there would be zero net voltage
from the phone's end.


> Even easier, however, is simply to pass the power over the spare pairs
> in a standard 4-pair Ethernet cable.
>

I had assumed doing a thing of running 48 VAC or similar between the two
differential pairs.

This should work OK, but needs at least 4 wires (2 for each pair).
On the device side, there would probably be a bridge rectifier connected
to the center-taps of an isolation transformer.


For PoE with this system, an RJ11<->RJ45 adapter could also function as
the AC injector, say with a pair of isolation transformers (to let the
data through), with the center taps connected (via another transformer)
up to the mains power.

Could make sense in the PoE case to have it as a multi-port block
though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.


> The two-wire Ethernet standards already include support for simpler and
> cheaper PoE solutions.
>

OK, would need to look into it.


But, would assume that a two-wire interface is not likely to be
electrically compatible with traditional Ethernet, at least not without
some additional trickery (additional isolation transformers and probably
a ground wire).


Signaling and power would maybe be done in a similar way to a 2-wire
telephone, but this wouldn't be able to be (passively) connected up to
existing hubs or switches.


>> Could also be electrically compatible with existing hubs and switches
>> via an RJ11 to RJ45 adapter.
>>
>>
>>> For wireless communication, speeds are usually even lower.  Modern
>>> NBIOT cellular systems are designed to be extremely low power, cheap,
>>> have longer range (20 km more than 3G and the rest).  You send
>>> packets of up to about 200 bytes of data, perhaps once a day, with a
>>> delivery time of several seconds.  Perfect for environmental
>>> monitoring, finding your sheep, and many other tasks.
>>>
>>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>>> bandwidth is also fine.  You can get away with a few hundred bytes
>>> ram and still have enough to control a lightbulb, thermostat, etc.
>>>
>>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>>> MSP430's, and other small devices.  Code density matters for many of
>>> them.
>>>
>>> (Of course it's a different matter for wireless cameras and all the
>>> other devices that need high bandwidth.)
>>>
>>
>> I would have figured a network stack would have been a bit much for
>> this class of device...
>
> I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
> microcontroller.
>
> However, the network stack needed for small Wifi or NB-IOT systems is
> vastly smaller than you need for a full IP and TCP/IP stack.

OK.


Once (when I was much younger) I implemented a TCP/IP stack and Ethernet
card driver in a hobby OS project.

Lots of little lesser-known protocols in this mix, like ICMP and ARP and
similar, ...
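As one example of the small building blocks such a stack is made of, the
ones-complement checksum used by IPv4 and ICMP headers looks roughly like
this (a generic RFC 1071 style sketch, not the actual code from that old
hobby OS):

#include <stddef.h>
#include <stdint.h>

/* Sum the data as 16-bit big-endian words with end-around carry, then
   return the ones-complement of the sum. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                         /* odd trailing byte */
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)                /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}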



In my current projects, I haven't gotten back around to this part yet.
Partly, it looks like to do it from an FPGA, one is basically driving
out the bits themselves, and has to write their own logic for
transmitting and receiving Ethernet frames at the level of bits going
over the wires (IIRC).


With the old card I was targeting, IIRC it was at the level of
abstraction of Ethernet frames getting transmitted and received via a
pair of ring buffers.

The card I am using does have an Ethernet port and similar at least, so
could maybe get to this eventually.

BGB

unread,
Jun 25, 2022, 4:52:19 PM6/25/22
to
On 6/24/2022 8:36 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
>> On 6/24/2022 3:02 PM, MitchAlsup wrote:
>
>>> With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
>>> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
>>> like R9-R15 are simply temps used whenever and forgotten.
>> That is presumably how it is supposed to be...
>>
>>
>> In my case, it is roughly a 50/50 split between caller save (scratch)
>> and callee save (preserved) registers.
> <
> I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
> R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
> can be used to carry arguments and results covering the 98%-ile.

Yeah:
R0/R1: Special
R2..R7: Scratch
R8..R14: Preserved
R15: SP
R16..R23: Scratch
R24..R31: Preserved

So: 14 scratch, 15 Preserved.

ABI:
R2/R3 Return Value
R2: Struct Pointer (Struct Return)
R3: 'this'
R4..R7, R20..R23: Arguments

If XGPR:
R32..R39, R48..R55: Scratch
R40..R47, R56..R63: Preserved


If the 128-bit ABI:
R36..R39, R52..R55: Arguments
Some other registers in the ABI are moved around.

SP is at R15 mostly for historical reasons, does result in some cruft
though.


>>
>> For leaf functions, one wants a lot of scratch registers, and for
>> non-leaf functions, a lot of callee-save registers.
>>
>> But, sadly, no party can be entirely happy:
>> Leaf functions wishing they could have more registers to play with,
>> without needing to save them first;
>> Non-leaf functions wishing they could have more registers for variables
>> which wont get stomped on the next call;
>> ...
>>
>>
>> Can note that, IIRC:
>> Win64 gave a bigger part of this pie to callee-save;
>> SysV/AMD64 gave a bigger part of the pie to caller-save.
> <
> CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
> IBM 360 had only preserved registers.
> VAX had only preserved registers--both had 16 registers.

OK.

IIRC, x86:
EAX: Scratch
ECX: Scratch
EDX: Scratch
EBX: "It Depends" (1)
ESP: Stack
EBP: Base-Pointer / Preserved
ESI: Preserved
EDI: Preserved

1: Compilers didn't really seem to entirely agree on whether EBX was
Scratch or Preserved, but Preserved seemed to be more common.


>>
>> A roughly even split seemed like an easy answer, lacking any good way to
>> find a general/optimal balance across a range of programs.
>>
> The choice is a lot easier 50%/50% when you have 32 registers.

Yeah.


>>
> <snip>
>>>
>> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
>> cost-saving reasons; after noting that adjusting the stack-pointer and
>> then using a series of stores, or performing a series of loads and then
>> adjusting the stack pointer, could be similarly effective).
> <
> Push instructions make::
> PUSH R1
> PUSH R2
> PUSH R3
> more expensive than:
> SUB SP,SP,#12
> ST R1,[SP+8]
> ST R1,[SP+4]
> ST R1,[SP]
> due to the serial dependency.
> <
> The peep hole HW optimizer in K9 would perform this transformation.
> {Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Partial issue was that, once I got around to pipelining Load/Store
operations, the Push/Pop would have been generally slower as well as
they require an extra interlock stage to deal with the SP updates.

No hardware level optimizers in my case.

It was cheaper and easier simply to drop them from the ISA.

Typically, the:
ADD Imm8s, SP
Or:
ADD Imm16s, SP
Instructions are used for stack-pointer adjustments.


Had ended up treating the "LDISH16 Imm16, SP" encoding as a special type
of breakpoint, mostly one with a magic number, intended mostly to help
with debugging (if a "__debugbreak()" is hit, I can use the magic number
to figure out which debugbreak was hit, where otherwise I might not know
where exactly the loader has put the loaded program, meaning PC by
itself is not sufficient to identify the offending breakpoint).

But, this does seem like a bit of a hack.

I guess, can also note that 0000 is also interpreted as a breakpoint (so
it will trigger a breakpoint if branching into zeroed memory), but this
is not the official BREAK instruction (3030 or F000_3030).

Technically, it is actually:
MOV.B DLR, (DLR, DLR)
But, like, this encoding is pointless enough, that there is no issue
with interpreting it instead as a breakpoint.

...

Brett

unread,
Jun 25, 2022, 5:17:18 PM6/25/22
to
I don’t get why you think you need to reserve half your opcode space for
future extensions, I would have thought we are at the end of history for
opcode extensions.

What is the cost of reserving 3 bits of one pattern and the same pattern at
the 16 bit border, so that you can add 16 bit opcodes in the future?

Besides the cost of rearranging bits the cost is nothing with half the
opcode space reserved. If you are clever and spread the three bits to fit
your open slots.

This gives your customers choices, customers like choices. You could upsell
this as a future feature like RISCV does and just never deliver, like
RISCV. ;)


You have a two-wide design; even once you implement 16-bit opcodes, handling
the unaligned case of long instructions is not that hard. An extra gate
delay? Two? Depending on which of a dozen approaches you pick for size.

The resulting 13-bit opcode is plenty for one register, a source ACC and a
dest ACC, and a full set of integer operations. And some short load/store
offsets to/from ACC using one address register.

MitchAlsup

unread,
Jun 25, 2022, 6:29:37 PM6/25/22
to
On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
<snip>
> > I looked at this a few years ago and the damage to long term ISA growth
> > was catastrophic. As it is I have nearly ½ of the OpCode space in each
> > OpCode group left for the future. and can PARSE instructions in 31 gates
> > with only 4 gates of delay. All that goes out the window with a meaningful
> > 16-bit "extension". I pass.
<
> I don’t get why you think you need to reserve half your opcode space for
> future extensions, I would have thought we are at the end of history for
> opcode extensions.
<
a) because I have watched what happens to processors over 50 years and
how every time you turn around they have more OpCodes--mainly to address
stuff forgotten earlier.
<
b) remember I compacted everything into only 59 actual instructions.
>
> What is the cost of reserving 3 bits of one pattern and the same pattern at
> the 16 bit border, so that you can add 16 bit opcodes in the future?
<
The cost of reserving space for 16-bit is that it over-constrains the 32-bit
OpCode space. For example: I could not give the 16-bit OpCodes a typical
subgroup (6-bit Major OpCode) because the first instruction would only
have 10-bits left !! (16-6=10)
<
Also note: Where there are instructions in several formats (like ADD with 16-bit
immediate and ADD of 2 registers), in all cases, the bit pattern used to recognize
ADD remains identical.
<
There are several OpCode groups reserved in perpetuity, these were chosen such
that if one branches into data there is very little possibility of finding anything other
than INVALID instruction decoding sitting there. From the My 66000 ISA document::
<----------------------------------------------------------------------------------------------------------------------------
A number of the Major OpCode specifiers are reserved in perpetuity as illegal
OpCodes. Transferring control into 32-bit or 64-bit integer data, or into 32-bit or
64-bit floating point data will very likely result in the decoding of an illegal
instruction and raise the OPERATION exception. In particular, small 32-bit positive
integers and small 32-bit negative integers are illegal instructions. 32-bit Floating
point values in the range ±[1/128..32) are also illegal OpCodes. Should control be
transferred into typical integer or floating point data, there is little likelihood of
executing for a long time before running into an illegal instruction. Executing data
can also be prevented in the MMU when desired (almost always.)
<----------------------------------------------------------------------------------------------------------------------------
This catches programming errors. When I looked, you can't do this with 16-bit inst-
ructions; as basically every encoding has to be used.
<
That is: some architects think of things other than code density--we think of
code safety--even when the GuestOS (or JavaScript, ...) fights us with
programming attack modes--and design architectures that are substantially
more robust than current ones.
>
In particular, My 66000 is immune to the current laundry list of attack strategies
{Meltdown, Spectré, RowHammer, RoP, ForeShadow, ...}
<
> Besides the cost of rearranging bits the cost is nothing with half the
> opcode space reserved. If you are clever and spread the three bits to fit
> your open slots.
<
>
> This gives your customers choices, customers like choices. You could upsell
> this as a future feature like RISCV does and just never deliver, like
> RISCV. ;)
>
I give my customers a choice: Do you want a machine that is secure or not ?
>
> You have a two wide design,
<
Technically, I have a 1-wide design that can CoIssue some pairs of instructions.
It is 1-wide because there is a 3R1W register file. It can CoIssue because some
pairs of instructions do not consume certain register ports. It is not at all
SuperScalar !
<
I also have a 6-wide design. Nothing in ISA or the rest of the architecture makes
either design necessarily harder than it needs to be. For many of the control
logic calculations My 66000 ISA requires, I went to the trouble of implementing
the gate circuitry to perform said duties to verify that the choices were good.
For example, I can look at a 32-bit word (which takes 320 gates to hold in flip-
flops) and determine the instruction length in 31 gates (4-gates of delay). Thereby
Instruction caches for machines less than 16-wide do not need predecoding bits.
FCMP is performed in the same function unit as Integer CMP. The integer part
requires 55 gates (4 gates of delay) the FP addition is 17 gates (remains 4 gates
of delay) So adding FCMP to CMP is a small price {Oh and BTW, I restrict myself
to 4-in NAND gates and 3-in NOR gates}
<
> even once you implement 16 bit opcodes handling
> the unaligned case of long instructions is not that hard, an extra gate
> delay? Two? Depending on which of a dozen approaches you pick for size.
<
I can assure you that if I had set out to design a good 16-bit OpCode ISA
I would have succeeded. But over my 40 year career, I have found some
things more valuable, and I pursued those interests. You are free to do
differently.
>
> The resulting 13 bit opcode is plenty for one register a source ACC and a
> dest ACC, and a full set of integer operations. And some short loads/store
> offsets to/from ACC using one address register.
<
Then you end up with a cartesian product of various ways to do the same thing.
This makes the compiler have to figure out if::
<
INC Rd ; NoOp
is better or worse than:
ADD Rd,Rd,#1
and what set of rules (or heuristics) govern the selection. I have worked inside
compilers trying to figure this kind of stuff out. It ain't straightforward.
<
I submit that compilers are best when they only need to figure how to do
something exactly one way.
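To make that concrete, even the trivial INC-vs-ADD case forces the backend
to carry a selection rule of roughly this shape (a toy C sketch; the emit
functions and the cost rule are invented):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical emitters for a two-form ISA: a short INC plus the one
   canonical ADD encoding. */
static void emit_inc(int rd)     { printf("INC  R%d\n", rd); }
static void emit_add_imm(int rd, int rs, long imm)
                                 { printf("ADD  R%d,R%d,#%ld\n", rd, rs, imm); }

/* The rule the compiler now has to carry around: use the short form only
   when it is a strict win, otherwise fall back to the canonical one.
   Multiply this by every redundant short encoding in the ISA. */
static void emit_increment(int rd, int rs, long imm, bool optimize_size)
{
    if (optimize_size && rd == rs && imm == 1)
        emit_inc(rd);
    else
        emit_add_imm(rd, rs, imm);
}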

David Brown

unread,
Jun 26, 2022, 5:27:01 AM6/26/22
to
On 25/06/2022 21:01, BGB wrote:
> On 6/25/2022 11:30 AM, David Brown wrote:
>> On 25/06/2022 17:45, BGB wrote:

>>> Seems like it would also be fairly trivial to chop 10/100 Ethernet
>>> down to a 4-wire variant as well, probably using RJ11 plugs or
>>> similar. Advantage of 4-wire as that this could allow for POE (and
>>> 4-wire phone-wire could be cheaper than CAT5E or similar).
>>
>> 10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
>> each direction.  Passing (non-isolated) DC power over these wires is
>> extremely simple, and requires nothing more than a few diodes and an
>> LC filter.  Unfortunately, the PoE standards were developed by a
>> committee of morons that produced a ridiculously over-engineered
>> system that is too bulky and expensive to have caught on outside a few
>> specific use-cases.
>>
>
> That is kinda the point of how it would be electrically compatible:
>   Use the pairs that are in-use in 10/100;
>   Skip the other wires;
>   Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.
>
> Normal twisted-pair telephone wire would probably have sufficient
> electrical properties for 10/100 in many cases.

RJ11 connectors are not going to be much cheaper than RJ45, if anything.
And telephone wire is not going to be good enough for anything here.
In particular, there are no real standards or quality control (except
for long-distance lines that cost more than good Ethernet cables because
no one installs new ones any more). So your cheapo bit of telephone
wire might work in one system, but a different cheapo wire won't. It
might work fine until your motors start, then the ESD interference
disrupts it. Having something that /might/ work or /usually/ works is
not very useful.

>
> Only thing is that it would require an adapter to plug RJ11 into RJ45,
> though other options:
>   Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>   Cable which has RJ11 on one end but RJ45 on the other.
>     (Side-stepping the need for an adapter at the switch).
>
>
> Probably put the pins in the plug in such a way that it doesn't have
> adverse effects if someone tries to plug a telephone into it.
>
> Say:
>   NC, A+, B-, B+, A-, NC
>
> Normal phone only connecting to the B pair (vs across the A/B pairs).
>
> With the POE system I am imagining, if one did connect across the A/B
> pairs, plugging a phone into it would result in it ringing continuously,
> whereas if only the B pair is connected (probably the TX pair from the
> phone's end), it would be silent and there would be zero net voltage
> from the phone's end.
>
>
>> Even easier, however, is simply to pass the power over the spare pairs
>> in a standard 4-pair Ethernet cable.
>>
>
> I had assumed doing a thing of running 48 VAC or similar between the two
> differential pairs.
>

AC is expensive in embedded systems. DC is cheap.

> This should work OK, but needs at least 4 wires (2 for each pair).
> On the device side, there would probably be a bridge rectifier connected
> to the center-taps of an isolation transformer.
>
>
> For PoE with this system, an RJ11<->RJ45 adapter could also function as
> the AC injector, say with a pair of isolation transformers (to let the
> data through), with the center taps connected (via another transformer)
> up to the mains power.
>
> Could make sense in the PoE case to have it as a multi-port block
> though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.
>
>
>> The two-wire Ethernet standards already include support for simpler
>> and cheaper PoE solutions.
>>
>
> OK, would need to look into it.
>
>
> But, would assume that a two-wire interface is not likely to be
> electrically compatible with traditional Ethernet, at least not without
> some additional trickery (additional isolation transformers and probably
> a ground wire).
>

It is not electrically compatible, even with such trickery. The
trickery involved to get full duplex signalling on a single twisted pair
involves a good deal more than a transformer!

Currently, two-wire PHY's are rare and expensive. But that's due to
their novelty - if they become popular and quantities go up, prices will
drop. Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
bus, and does not need a switch. (I'm not sure if it supports PoE.)

BGB

unread,
Jun 26, 2022, 2:43:21 PM6/26/22
to
For telephone wire, I was mostly thinking of 2-pair CAT3 (as opposed to
4 pair CAT3).

For 10/100, it should be OK, since usually the other two pairs are just
sitting around mostly doing nothing.

Apparently, 2-pair CAT3 (still with RJ45) was actually a thing at one
point for some LANs.



However, I have noted that for some short/cheap Ethernet cables, have
seen the RJ45 connectors crimped onto pieces of flat ribbon cable,
implying that one "can" probably get by with cheaper (non-twisted /
CAT1) wire in some cases (I would guess probably for runs of say 1 or 2
meters or similar, would likely need some testing).


Though, I guess one difference between RJ11 and RJ45 is that a lot of
the through-hole RJ45 plugs have built-in isolation transformers,
whereas a board built for RJ11 plugs might need to supply these itself.


>>
>> Only thing is that it would require an adapter to plug RJ11 into RJ45,
>> though other options:
>>    Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>>    Cable which has RJ11 on one end but RJ45 on the other.
>>      (Side-stepping the need for an adapter at the switch).
>>
>>
>> Probably put the pins in the plug in such a way that it doesn't have
>> adverse effects if someone tries to plug a telephone into it.
>>
>> Say:
>>    NC, A+, B-, B+, A-, NC
>>
>> Normal phone only connecting to the B pair (vs across the A/B pairs).
>>
>> With the POE system I am imagining, if one did connect across the A/B
>> pairs, plugging a phone into it would result in it ringing
>> continuously, whereas if only the B pair is connected (probably the TX
>> pair from the phone's end), it would be silent and there would be zero
>> net voltage from the phone's end.
>>
>>
>>> Even easier, however, is simply to pass the power over the spare
>>> pairs in a standard 4-pair Ethernet cable.
>>>
>>
>> I had assumed doing a thing of running 48 VAC or similar between the
>> two differential pairs.
>>
>
> AC is expensive in embedded systems.  DC is cheap.
>

Small transformer and 4 diodes, should be manageable.

With 48VDC, one is going to need a buck converter, which requires an
inductor and a sense and control circuits.

One can do AC->DC with 4 diodes and a capacitor, which is not a huge
cost in any sense, and AC allows using a linear transformer to step down
to 5V or similar.


It would likely be simpler and cheaper than traditional DC PoE:
Doesn't care which pair is which;
Doesn't need any logic to detect what is on the other end.

Traditional PoE involves a pointlessly complicated "negotiation" step.


Eg: For this, just sorta stick 48VAC on the line and assume it is good.
Though, this would not be compatible with devices assuming DC PoE.

But, in a way, this would be a potential advantage to using RJ11, so one
doesn't accidentally mix them up (not sure how well the DC devices would
tolerate being plugged into AC).


>> This should work OK, but needs at least 4 wires (2 for each pair).
>> On the device side, there would probably be a bridge rectifier
>> connected to the center-taps of an isolation transformer.
>>
>>
>> For PoE with this system, an RJ11<->RJ45 adapter could also function
>> as the AC injector, say with a pair of isolation transformers (to let
>> the data through), with the center taps connected (via another
>> transformer) up to the mains power.
>>
>> Could make sense in the PoE case to have it as a multi-port block
>> though, say, 4-8 simultaneous connections, rather than 1 adapter per
>> cable.
>>
>>
>>> The two-wire Ethernet standards already include support for simpler
>>> and cheaper PoE solutions.
>>>
>>
>> OK, would need to look into it.
>>
>>
>> But, would assume that a two-wire interface is not likely to be
>> electrically compatible with traditional Ethernet, at least not
>> without some additional trickery (additional isolation transformers
>> and probably a ground wire).
>>
>
> It is not electrically compatible, even with such trickery.  The
> trickery involved to get full duplex signalling on a single twisted pair
> involves a good deal more than a transformer!
>
> Currently, two-wire PHY's are rare and expensive.  But that's due to
> their novelty - if they become popular and quantities go up, prices will
> drop.  Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
> bus, and does not need a switch.  (I'm not sure if it supports PoE.)
>

OK.


Not sure how it works, haven't found much information thus far.


I guess if I were to consider designing something, a few possibilities:
Simple DC signaling, say:
An RX and TX line, possibly at 3.3v or similar;
Using 8b/10b or similar;
Would need a common ground;
Not likely suitable for longer-distance signaling:
Signal integrity and possible ground-loop issues.
Likely point-to-point only (couldn't do a passive bus).
Something sorta like a CAN bus:
Maybe still using 8b/10b or similar;
Basically, one has a wire that is normally pulled high;
Pull low to send bits, half-duplex.
Likely still subject to ground-loop and propagation delays
Differential signaling, likely tri-state (+/0/-):
Would at least avoid ground loop issues and similar;
Back to requiring isolation transformers and similar.
Wired in a star, there might also be an issue with echoes (1)



1: Potentially echoes would be worse with the isolation transformers, as
induction is likely to reflect an inverted version of the signal back
down the wire, which if it hits the other transformers might flip back
into a positive image, ... Cases where constructive interference occurs
could potentially raise the "noise floor" enough to interfere with
transmission (such as corrupting transmitted frames).


Echo rate at 10 Mbps would be around 1 bit every 30 meters, probably
manageable, and the reflection should mostly "self annihilate" at the
point it is received. For a "sane" network size, the echo is likely to
almost entirely dissipate within a few bits.


At 100 Mbps, it would be 1 bit every 3 meters, potentially a bigger
issue, as echoes would propagate for comparably longer. A lot would also
depend on the inductive properties of the isolation transformers.

(It might take several hundred bits of "silence" for the echoes to die
down following a transmitted message with, say, a 100 meter network).

This issue would likely be significantly reduced if doing point-to-point
signaling (constructive interference could not occur). Would depend
mostly on the "sensitivity" at each end, would want to have sensitivity
low enough that it doesn't detect reflected bits, but high enough that
it does detect transmitted bits.

...

Brett

unread,
Jun 26, 2022, 4:31:25 PM6/26/22
to
An optimizer pass to convert all loads that are only used once to ACC ops
is pretty trivial, as a software guy I am not asking for Itanic compiler
changes.

All the other 16 bit variants use restricted register sets and have deep
effects on register coloring, etc. I tried a dozen of these on paper and
despite my propaganda of the time they all sucked. ;(

ACC ops follow the KISS principle. Simple one for one substitution of 16
bit opcodes for the longer ones where the operation tree makes it possible.

To make best use you need a bunch of 32 bit ops that use ACC, or better yet
just use a register like R1, and remove that register from normal use, a
reduction of one register for the other compiler passes.

MitchAlsup

unread,
Jun 26, 2022, 5:56:38 PM6/26/22
to
While I can, in general, agree that you are not asking for Titanic additions
to the compiler, you are asking for Titanic alterations of the axioms and
tenets underlying the encoding philosophy of My 66000 ISA. Basically,
you are asking for a complete reset. I am not interested in starting over.
However, you are free to design whatever 16-bit instructions you want for
your architecture.
>
> All the other 16 bit variants use restricted register sets and has deep
> effects on register coloring, etc. I tried a dozen of these on paper and
> despite my propaganda of the time they all sucked. ;(
<
You and Quadriblock should get together and compare notes........maybe
you could teach him about "what to leave out" as part of your comp.arch
meeting.
>
> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
<
I am well aware of how accumulator machines perform. Rather well on the
60%+ code sequences, and less well on the 40%- code sequences. Whereas
a 32-bit only ISA has but one way of expressing arithmetic and does not so
suffer.
<
Remember I have an <essentially> pure RISC ISA that is achieving x86-64
code density--significantly better than <¿almost?> all other pure RISC ISAs.
I got here by eliminating instruction functionality that could be embodied
elsewhere in the data-path and uniformly encoded in ISA. This makes each
instruction more powerful without adding delay to its execution and makes
the compilers job a bit easier in expressing the necessary semantic.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
<
Yech.

BGB

unread,
Jun 26, 2022, 6:15:25 PM6/26/22
to
A few thoughts:
The restricted set should not be smaller than 1/2 the baseline
"full-sized" set.

So, for 32 GPRs, 4b (16 regs) is OK, 3b (8 regs) a bit less so.

I skipped 3R encodings in the 16-bit space, as they would be
"essentially useless".

For size-optimized code, one does have to bias the compiler towards
using a smaller set of registers (the set usable by 16-bit encodings),
which does typically come at a performance cost (in my case, this is
mixed with the other drawback that one can't predicate or bundle the
16-bit encodings).


So, the general result is that the program is roughly 50% bigger if
built in speed-optimized modes.

Size optimized mode: around 60% 16-bit, 40% 32-bit;
Speed optimized mode: around 20% 16-bit, 80% 32-bit.

Speed-optimized 16/32 is still generally smaller than a fixed-length
32-bit subset though (well, and more so if one disallows Jumbo
encodings, which adds an additional size penalty).



> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
>

Side note:
Why are R0 and R1 "special" in my case?...

Mostly because early on, I removed them from normal use to have
registers which the ASM stage could stomp without warning.


R0 was used typically to load temporary values into if the Immed field
was insufficient.

Say, without Jumbo:
ADD R4, 123, R5 //OK, can use an immediate form
ADD R6, 123456, R7 //Not OK, doesn't fit.

So, the assembler would quietly emit it as:
LDIZ 123456, R0 //OK
ADD R6, R0, R7 //Also OK
And, pretend that a larger immediate existed.

This simplified higher-level code-generation, which could then mostly
pretend that the immediate field was always large enough.

This is also why "LDIz Imm24, R0" exists, even despite eating a
relatively huge amount of encoding space.

With Jumbo, these cases can usually be encoded directly (usually saving
1 or 2 clock cycles).
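
Roughly, the fallback amounts to something like this (C sketch; the
emitter helpers and the Imm9 cutoff are made-up placeholders rather than
BGBCC's actual code):

#include <stdio.h>

/* Hypothetical emitter stubs -- a real assembler would append encoded
   instruction words to the output section instead of printing text. */
static void emit_ldiz(long imm)                   { printf("LDIZ %ld, R0\n", imm); }
static void emit_add_rrr(int rs, int rt, int rd)  { printf("ADD R%d, R%d, R%d\n", rs, rt, rd); }
static void emit_add_ri(int rs, long imm, int rd) { printf("ADD R%d, %ld, R%d\n", rs, imm, rd); }

/* If the value fits the instruction's own immediate field, use it directly;
   otherwise quietly load it into the scratch register (R0) and use the
   register form.  The 9-bit range here is an assumption for illustration. */
static void emit_add_imm(int rs, long imm, int rd)
{
    if (imm >= -256 && imm <= 255)
        emit_add_ri(rs, imm, rd);
    else {
        emit_ldiz(imm);           /* LDIZ Imm24u, R0 (assumes imm fits 24 bits) */
        emit_add_rrr(rs, 0, rd);  /* ADD Rs, R0, Rd                             */
    }
}

int main(void)
{
    emit_add_imm(4, 123, 5);      /* OK, single instruction             */
    emit_add_imm(6, 123456, 7);   /* too big, falls back to LDIZ + ADD  */
    return 0;
}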


Though, there are still a lot of edge cases where LDIZ ends up being used:
Because I have reloc types for LDIZ but not for the Jumbo forms;
Cases where the instruction-encoding logic lacks the case to deal with
the jumbo-form encoding;
...

Sadly, some of this part of my compiler has gathered a fair bit of cruft.

Partly, it is a case where my initial strategy:
  Emitter function for each instruction type;
  Big "switch()" blocks to fill out instruction words for each mnemonic;
  ...
Didn't scale as well as I had hoped...


At this stage, I might have been better off with a different approach:
  Use a big listing (grouped by mnemonic);
  Pattern match against the listing;
  Emit the first-found instruction which matches the pattern.

I had used this approach in my x86 assemblers, which is admittedly a
little more scalable than "giant switch() blocks".

However, "switch()" blocks (with logic paths for every possible
instruction encoding) are initially simpler if one assumes a reasonably
small ISA.
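
A rough sketch of what the listing-based approach looks like (the opcode
values and immediate widths in the table are made up; the point is just
"first listed form that fits, wins"):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One row per encodable form; rows for a given mnemonic are grouped and
   tried in order. */
typedef struct {
    const char *mnemonic;
    int         imm_bits;     /* 0 = register form, else immediate width */
    uint32_t    base_opcode;  /* fixed bits of the 32-bit encoding       */
} InsnForm;

static const InsnForm forms[] = {
    { "ADD",  0, 0x00000000u },   /* ADD Rs, Rt, Rn     */
    { "ADD",  9, 0x10000000u },   /* ADD Rs, Imm9,  Rn  */
    { "ADD", 16, 0x20000000u },   /* ADD Rs, Imm16s, Rn */
};

/* Return the first (smallest) listed form that can hold the operands,
   or NULL if none fits and a fallback (LDIZ / Jumbo) is needed. */
static const InsnForm *match_form(const char *mn, int has_imm, long imm)
{
    for (size_t i = 0; i < sizeof(forms) / sizeof(forms[0]); i++) {
        const InsnForm *f = &forms[i];
        if (strcmp(f->mnemonic, mn) != 0)
            continue;
        if (!has_imm) {
            if (f->imm_bits == 0)
                return f;
        } else if (f->imm_bits > 0) {
            long lim = 1L << (f->imm_bits - 1);
            if (imm >= -lim && imm < lim)
                return f;
        }
    }
    return NULL;
}

int main(void)
{
    const InsnForm *f = match_form("ADD", 1, 12345);
    printf("picked form with imm_bits=%d\n", f ? f->imm_bits : -1);  /* 16 */
    return 0;
}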



Then of course, some parts of my compiler, such as the WEXifier, do
their work mostly on already-emitted machine code, which is kind of a
stupid approach, but was the easiest approach at the time.

Well, and the way it works is kind of limited:
  Look at 3 instructions (A,B,C);
  Is A or B fixed/immovable?
    If so, Advance 1 and Continue.
  Can A, B, and C be bundled? If yes:
    Bundle A, B, and C;
    Advance 3;
    Continue.
  Can A and B be Bundled? If yes:
    Bundle A and B;
    Advance 2;
    Continue.
  Can B and C be Swapped?
    Will swapping B and C allow bundling A and C? If yes:
      Swap B and C;
      Continue.
  Can A and B be Swapped?
    Will swapping A and B allow bundling A and C? If yes:
      Swap A and B;
      Continue.
  Advance 1 and Continue.
    (Couldn't do anything here, so leave instruction as-is).

There are a few more cases than this, but this is the general algorithm.
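
As a rough C sketch of the above window pass (the Insn fields and the
predicate bodies are stubs standing in for the real dependency and
encoding checks, not the actual WEXifier code):

typedef struct {
    int fixed;     /* reloc, jumbo form, PC-relative, already bundled, ... */
    int bundled;   /* set once the op has been merged into a bundle        */
    /* ... opcode, registers read/written, etc. would live here ...        */
} Insn;

static int is_fixed(const Insn *a)                   { return a->fixed || a->bundled; }
static int can_swap(const Insn *a, const Insn *b)    { (void)a; (void)b; return 0; /* stub */ }
static int can_bundle2(const Insn *a, const Insn *b) { (void)a; (void)b; return 0; /* stub */ }
static int can_bundle3(const Insn *a, const Insn *b, const Insn *c)
                                       { (void)a; (void)b; (void)c; return 0; /* stub */ }
static void bundle2(Insn *a, Insn *b)          { a->bundled = b->bundled = 1; }
static void bundle3(Insn *a, Insn *b, Insn *c) { a->bundled = b->bundled = c->bundled = 1; }
static void swap_insn(Insn *a, Insn *b)        { Insn t = *a; *a = *b; *b = t; }

void wexify(Insn *ins, int n)
{
    int i = 0;
    while (i + 1 < n) {
        Insn *a = &ins[i], *b = &ins[i + 1];
        Insn *c = (i + 2 < n) ? &ins[i + 2] : 0;

        if (is_fixed(a) || is_fixed(b)) { i += 1; continue; }

        if (c && can_bundle3(a, b, c))  { bundle3(a, b, c); i += 3; continue; }
        if (can_bundle2(a, b))          { bundle2(a, b);    i += 2; continue; }

        /* try to move an instruction out of the way, then re-examine the window */
        if (c && can_swap(b, c) && can_bundle2(a, c)) { swap_insn(b, c); continue; }
        if (c && can_swap(a, b) && can_bundle2(a, c)) { swap_insn(a, b); continue; }

        i += 1;   /* couldn't do anything here, leave the instruction as-is */
    }
}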


The logic for "can A and B be swapped" mostly checks for things like
register dependencies, whether they are memory load/store ops, whether
they depend on or modify SR.T, etc...

Immovable cases mostly involve cases which are no longer simple 32-bit
instructions:
Instructions which have a reloc;
The LDIz Imm24 op (generally assumed to have a reloc).
Load/Store ops with a PC-relative encoding;
Jumbo encodings;
Stuff that is already bundled;
...


There is another step that runs before this, similar algo, just tries to
swap instructions if doing so will reduce the number of pipeline interlocks.

As can be noted, these aren't particularly good at extracting all that
much ILP. They will also not perform swaps if doing so would require an
intermediate swap which is not allowed, ...

(For example, manually optimizing stuff in ASM frequently involves
non-local instruction movements, which aren't really allowed here; since
validating a movement much beyond 2-instruction swaps becomes
exponentially harder).


It is likely that a better approach would be a more general process, say:
  Generate a permutation index (0..N);
  Look for a permutation which minimizes the number of interlocks and
    maximizes the number of bundled encodings;
  Swap the instructions into the found ordering.

Then say, use either a gradient-descent approach:
  Scan instructions;
  Check if A can be swapped with B;
  Evaluate if doing so would improve the metric (globally).
    If so, Swap.
  Continue until no more swaps occur.

Or, maybe a genetic-algorithm style approach:
  Pick instructions at random to try to swap to breed more patterns;
  Rank each by the metric;
  Continue for 4*N generations;
  Use the best-found as the final ordering.


But, these would be a bit of work to add, not sure how much they would
gain over the current approach (which only operates on a window of 3
instructions).
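
A sketch of the "swap while the global metric improves" variant
(can_swap() and metric() are stubs here; a real metric would count
interlocks and missed bundling opportunities over the whole block):

typedef struct { int opcode; /* ... operands, flags ... */ } Insn;

static int  can_swap(const Insn *a, const Insn *b) { (void)a; (void)b; return 1; /* stub */ }
static long metric(const Insn *ins, int n)         { (void)ins; (void)n; return 0; /* stub */ }

void schedule_by_descent(Insn *ins, int n)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i + 1 < n; i++) {
            if (!can_swap(&ins[i], &ins[i + 1]))
                continue;
            long before = metric(ins, n);
            Insn t = ins[i]; ins[i] = ins[i + 1]; ins[i + 1] = t;  /* try the swap */
            if (metric(ins, n) < before) {
                changed = 1;                                       /* keep it      */
            } else {
                t = ins[i]; ins[i] = ins[i + 1]; ins[i + 1] = t;   /* undo it      */
            }
        }
    }
}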


Note that before the WEXifier runs, the compiler behaves as if it is
generating code for a plain RISC-style ISA (the compiler has no
understanding of modulo scheduling, which is at present mostly limited
to hand written ASM).

...


In other news, after tweaking a few things in my compiler, to prioritize
the "LDIz Imm10, Rn" and "ADD Imm10, Rn" encodings over their Imm16
equivalents, the Imm10 encodings now dominate over the Imm16 encodings
in terms of instruction count (roughly a 90% / 10% split).

This implies that most of the constant loads are still fairly small (can
fit easily into an Imm10).

The reason for the change was mostly that Imm10 allows a few encoding
cases (such as PrWEX) which the Imm16 versions do not (without this
change, the compiler was missing a lot of cases where it could WEXify
the predicated ops, which happened to include a lot of constant loads
and similar).

This modifies the prior stats some, eg:
Imm8: 11%
Imm10: 63%
Imm16: 7%
Binary16: 12% (These are now the majority of the F8 block)
Imm33: 4.4%
Imm64: 2.5%

...

BGB

unread,
Jun 26, 2022, 8:28:11 PM6/26/22
to
On 6/26/2022 4:56 PM, MitchAlsup wrote:
> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>
>>> Then you end up with a cartesian product of various ways to do the same thing.
>>> This makes the compiler have to figure out if::
>>> <
>>> INC Rd ; NoOp
>>> is better or worse than:
>>> ADD Rd,Rd,#1
>>> and what set of rules (or heuristics) govern the selection. I have worked inside
>>> compilers trying to figure this kind of stuff out. It ain't straightforward.
>>> <
>>> I submit that compilers are best when they only need to figure how to do
>>> something exactly one way.
> <
>> An optimizer pass to convert all loads that are only used once to ACC ops
>> is pretty trivial, as a software guy I am not asking for Itanic compiler
>> changes.
> <
> While I can, in general, agree that you are not asking for Titanic additions
> to the compiler, you are asking for Titanic alterations of the axioms and
> tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> you are asking for a complete reset. I am not interested in starting over.
> However, You are free to design whatever 16-bit instructions you want for
> your architecture.

Yeah, there are only so many possible combinations.


>>
>> All the other 16 bit variants use restricted register sets and has deep
>> effects on register coloring, etc. I tried a dozen of these on paper and
>> despite my propaganda of the time they all sucked. ;(
> <
> You and Quadriblock should get together and compare notes........maybe
> you could teach him about "what to leave out" as part of your comp.arch
> meeting.

Yeah, the funky obsession with non-power-of-2 data sizes and similar
would probably be high on my list.


>>
>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>> bit opcodes for the longer ones where the operation tree makes it possible.
> <
> I am well aware of how accumulator machines perform. Rather well on the
> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> a 32-bit only ISA has but one way of expressing arithmetic and does not so
> suffer.
> <
> Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
> code density--significantly better than <¿almost?> all other pure RISC ISAs.
> I got here by eliminating instruction functionality that could be embodied
> elsewhere in the data-path and uniformly encoded in ISA. This makes each
> instruction more powerful without adding delay to its execution and makes
> the compilers job a bit easier in expressing the necessary semantic.

I will assume in this you *don't* mean the apparent common variation of
x86-64 which takes upwards of 1.5MB to build Doom...



>>
>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>> just use a register like R1, and remove that register from normal use, a
>> reduction of one register for the other compiler passes.
> <
> Yech.

Agreed...

There isn't really a good reason to have hard-coded registers "in
general" with 32-bit instruction encodings, particularly not for an
accumulator.


I will make a partial exception for loading a big constant to a fixed
register, where having an instruction for a larger constant load could
partly offset the drawback of having "not particularly large" immediate
fields in other contexts.


Also at the time, 24 bits would be sufficient in general for things like
data/bss loads and stores, whereas something more modest (such as 16 or
20 bit) would not have been sufficient (several of these programs
effectively have several MB of '.bss').

...

Ivan Godard

unread,
Jun 26, 2022, 9:17:30 PM6/26/22
to
There are advantages to dedicated base registers, set as a side effect
of other operations. You can keep them where they are used, instead of
in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
saving fan-in. If you restrict addressable regions to not cross a 4Gb
(or less) boundary then you can have a narrower address adder (Mill: 32
bits vs 64). You avoid the loads with fat constants. You lower pressure
on the genregs. The address-using instructions need fewer bits to encode
the base (Mill: 3 vs 5). The base can be used to select a particular WKR
for range checking, avoiding the trip to the PLB. The base can be used
as the lwb in its WKR, reducing state for task switch or call.

And so on.

MitchAlsup

unread,
Jun 26, 2022, 9:18:47 PM6/26/22
to
On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
> On 6/26/2022 4:56 PM, MitchAlsup wrote:
> > On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
> >> MitchAlsup <Mitch...@aol.com> wrote:
> >>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> >
> >>> Then you end up with a cartesian product of various ways to do the same thing.
> >>> This makes the compiler have to figure out if::
> >>> <
> >>> INC Rd ; NoOp
> >>> is better or worse than:
> >>> ADD Rd,Rd,#1
> >>> and what set of rules (or heuristics) govern the selection. I have worked inside
> >>> compilers trying to figure this kind of stuff out. It ain't straightforward.
> >>> <
> >>> I submit that compilers are best when they only need to figure how to do
> >>> something exactly one way.
> > <
> >> An optimizer pass to convert all loads that are only used once to ACC ops
> >> is pretty trivial, as a software guy I am not asking for Itanic compiler
> >> changes.
> > <
> > While I can, in general, agree that you are not asking for Titanic additions
> > to the compiler, you are asking for Titanic alterations of the axioms and
> > tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> > you are asking for a complete reset. I am not interested in starting over.
> > However, You are free to design whatever 16-bit instructions you want for
> > your architecture.
> Yeah, there are only so many possible combinations.
<
nearly infinite permutations........
> >>
> >> All the other 16 bit variants use restricted register sets and has deep
> >> effects on register coloring, etc. I tried a dozen of these on paper and
> >> despite my propaganda of the time they all sucked. ;(
> > <
> > You and Quadriblock should get together and compare notes........maybe
> > you could teach him about "what to leave out" as part of your comp.arch
> > meeting.
> Yeah, the funky obsession with non-power-of-2 data sizes and similar
> would probably be high on my list.
<
I was wondering if anyone caught the reversal in his thread "...life the universe
and everything." In Hitchhiker's Guide to the Galaxy the answer to "life the universe
and everything" is/was 42 whereas Quadriblock's thread converges on 24 which
is the simple reverse of 42.
> >>
> >> ACC ops follow the KISS principle. Simple one for one substitution of 16
> >> bit opcodes for the longer ones where the operation tree makes it possible.
> > <
> > I am well aware of how accumulator machines perform. Rather well on the
> > 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> > a 32-bit only ISA has but one way of expressing arithmetic and does not so
> > suffer.
> > <
> > Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
> > code density--significantly better than <¿almost?> all other pure RISC ISAs.
> > I got here by eliminating instruction functionality that could be embodied
> > elsewhere in the data-path and uniformly encoded in ISA. This makes each
> > instruction more powerful without adding delay to its execution and makes
> > the compilers job a bit easier in expressing the necessary semantic.
<
> I will assume in this you *don't* mean the apparent common variation of
> x86-64 which takes upwards of 1.5MB to build Doom...
<
No, in general I am talking about "never having to" in terms of loading
constants, pasting constants together, negating or inverting operands,
and the prologue and epilogue handling instructions.
> >>
> >> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> >> just use a register like R1, and remove that register from normal use, a
> >> reduction of one register for the other compiler passes.
> > <
> > Yech.
> Agreed...
>
> There isn't really a good reason to have hard-coded registers "in
> general" with 32-bit instruction encodings, particularly not for an
> accumulator.
>
The only hard coded registers are::
a) R0 receives the return address when control is delivered to a subroutine.
b) ENTER and EXIT use R31 = SP to build and tear down stack frames.
<
However; there are circumstances where the HW understands that ranges
of registers have certain properties. For example: when Safe-Stack mode
is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
address, and R16-R30 when read before write return zeros.
>
> I will make a partial assumption for loading a big constant to a fixed
> register, where having an instruction for a larger constant load could
> partly offset the drawback of having "not particularly large" immediate
> fields in other contexts.
>
My 66000 never has to load a constant to a register. Somewhere close
to 9% of my instruction count vanishes by uniformly supplying large
constants.
>
> Also at the time, 24 bits would be sufficient in-general for things like
> data/bss loads and stores, whereas something more modest (such as 16 or
> 20 bit) would not have been sufficient (several of these programs
> effectively have several MB of '.bss').
>
In My 66000 memory mapping; a simple application such as cat can have
the .text, .data, .bss, ... each separated by GBs in the virtual address
space and yet only need 1 page of memory mapping tables !
> ...

Ivan Godard

unread,
Jun 26, 2022, 10:57:39 PM6/26/22
to
Doesn't have to, but isn't it advisable when a constant has widespread uses?


Brett

unread,
Jun 26, 2022, 11:00:24 PM6/26/22
to
MitchAlsup <Mitch...@aol.com> wrote:
> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
> <snip>
>>> I looked at this a few years ago and the damage to long term ISA growth
>>> was catastrophic. As it is I have nearly ½ of the OpCode space in each
>>> OpCode group left for the future. and can PARSE instructions in 31 gates
>>> with only 4 gates of delay. All that goes out the window with a meaningful
>>> 16-bit "extension". I pass.
> <
>> I don’t get why you think you need to reserve half your opcode space for
>> future extensions, I would have thought we are at the end of history for
>> opcode extensions.
> <
> a) because I have watched what happens to processors over 50 years and
> how every time you turn around they have more OpCodes--mainly to address
> stuff forgotten earlier.
> <
> b) remember I compacted everything into only 59 actual instructions.
>>
>> What is the cost of reserving 3 bits of one pattern and the same pattern at
>> the 16 bit border, so that you can add 16 bit opcodes in the future?
> <
> The cost of reserving space for 16-bit is that it over-constrains the 32-bit
> OpCode space. For example: I could not give the 16-bit OpCodes a typical
> subgroup (6-bit Major OpCode) because the first instruction would only
> have 10-bits left !! (16-6=10)

I can work with 10 bits, ideally 11 for 59 instructions, which would be two
sub groups.

ACC opcodes only need one register, plus 5 bits for the opcode hits all the
common cases this extension would use, so 10 bits works fine.

Note that this extension will not slow down code like all the other 16 bit
architectures. You will get a small boost from a smaller code footprint. 32
bit instructions are unaffected, and do all the heavy lifting.

With two opcode groups it should wire directly into your existing logic,
which actually happens anyway if you split this extension into smaller
groups spread into the open opcode spaces where appropriate.

BGB

unread,
Jun 26, 2022, 11:14:27 PM6/26/22
to
FWIW: 128/3 ~= 42 ...

But, yeah, in a more practical sense, unusual data sizes aren't much of
a win.


>>>>
>>>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>>>> bit opcodes for the longer ones where the operation tree makes it possible.
>>> <
>>> I am well aware of how accumulator machines perform. Rather well on the
>>> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
>>> a 32-bit only ISA has but one way of expressing arithmetic and does not so
>>> suffer.
>>> <
>>> Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
>>> code density--significantly better than <¿almost?> all other pure RISC ISAs.
>>> I got here by eliminating instruction functionality that could be embodied
>>> elsewhere in the data-path and uniformly encoded in ISA. This makes each
>>> instruction more powerful without adding delay to its execution and makes
>>> the compilers job a bit easier in expressing the necessary semantic.
> <
>> I will assume in this you *don't* mean the apparent common variation of
>> x86-64 which takes upwards of 1.5MB to build Doom...
> <
> No, in general I am talking about "never having to" in terms of loading
> constants, pasting constants together, negating or inverting operands,
> and the prologue and epilogue handling instructions.

My case is pretty much comparable to x86-64 in what you can do with
immediate and displacement fields:
x86-64: 8 and 32; 64 via a dedicated load.
BJX2: 9 and 33, 64 via a dedicated load.

But, differs in that it has 3R, and is Load/Store.

Also, I have both more registers, and encodings that are often smaller.
Though, x86-64 has:
MOV Rn, Imm64 (80-bits)
Where:
LDI Imm64, Rn (96-bits)
So, x86-64 wins this one...


However, compared with some of the other options (eg: SH-4 or Thumb2),
x86-64's code density seems to be "pretty bad".

Even something like ARMv5 (with fixed-length 32-bit instructions) seems
to be able to beat x86-64 on this metric.




>>>>
>>>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>>>> just use a register like R1, and remove that register from normal use, a
>>>> reduction of one register for the other compiler passes.
>>> <
>>> Yech.
>> Agreed...
>>
>> There isn't really a good reason to have hard-coded registers "in
>> general" with 32-bit instruction encodings, particularly not for an
>> accumulator.
>>
> The only hard coded registers is::
> a) R0 receives the return address when control is delivered to a subroutine.
> b) ENTER and EXIT use R31 = SP to build and tear down stack fames.
> <
> However; there are circumstances where the HW understands that ranges
> of registers have certain properties. For example: when Safe-Stack mode
> is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
> address, and R16-R30 when read before write return zeros.

OK.

I have R0, R1, and R15/SP hard-coded in certain contexts.

LR is nominally in CR space, but I sometimes used R1 as a "Secondary
Link-Register" or "Saved-Link-Register", mostly in the context of prolog
and epilog compression.


R0 and R1 can be used as scratch registers (with care), however:
  The assembler may stomp them without warning in some cases;
  For some instructions, they are not allowed:
    In some cases, the encodings are special cases;
    Some instructions are simply not allowed to use them.
  ...


>>
>> I will make a partial assumption for loading a big constant to a fixed
>> register, where having an instruction for a larger constant load could
>> partly offset the drawback of having "not particularly large" immediate
>> fields in other contexts.
>>
> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

Stuff like:
if(x>=100)
x=99;
Is still kind of a thing...

But, yeah:
y=x+12345678;
Can be handled with an Imm33s encoding.



But, as noted, the FAzz_zzzz and FBzz_zzzz encodings predate the
addition of Jumbo encodings. In the original form of the ISA (and in ISA
subsets without Jumbo) they are more useful.


But, as noted, these only exist in one of the "unconditional subspace":
FAzz_zzzz LDIZ Imm24u, R0
FBzz_zzzz LDIN Imm24n, R0
WEX Space (Same spot, just WEX=1):
FEzz_zzzz Jumbo
FFzz_zzzz Op64
Predicate Space:
EAzz_zzzz PrWEX F0?T
EBzz_zzzz PrWEX F2?T
EEzz_zzzz PrWEX F0?F
EFzz_zzzz PrWEX F2?F

So, whether or not these are "still" useful, their "twins" elsewhere in
the encoding space allow for a few other encoding spaces to exist.

Likewise:
FFdd_dddd_FAdd_dddd BRA Abs48
FFdd_dddd_FBdd_dddd BSR Disp48
The encoding is effectively bundled with itself to encode the Abs48 branches.

Had I put pretty much anything else there, I would not have been able to
do this stuff in this way.


>>
>> Also at the time, 24 bits would be sufficient in-general for things like
>> data/bss loads and stores, whereas something more modest (such as 16 or
>> 20 bit) would not have been sufficient (several of these programs
>> effectively have several MB of '.bss').
>>
> In My 66000 memory mapping; a simple application such as cat can have
> the .txt, .data, .bss, ... each separated by GBs in the virtual address
> space and yet only need 1 page of memory mapping tables !

OK.

I am mostly dealing with programs like Doom and Quake effectively
needing ~ 21..23 bits to be able to address across the size of their
'.bss' section.

One can eliminate a lot of bits by using GBR (Global-Base-Register), or
similar, but still need a lot of bits to deal with the size of the section.

One could use less bits by using a GOT, but this just sorta replaces the
large displacements with additional memory loads and tables (not really
a win).


Excluding things like the larger displacements needed for branching and
addressing into '.bss' and similar, most other constants tend to be
a lot smaller.


...

Thomas Koenig

unread,
Jun 27, 2022, 1:33:40 AM6/27/22
to
Ivan Godard <iv...@millcomputing.com> schrieb:

> There are advantages to dedicated base registers, set as a side effect
> of other operations. You can keep them where they are used, instead of
> in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
> saving fan-in. If you restrict addressable regions to not cross a 4Gb
> (or less) boundary then you can have a narrower address adder (Mill: 32
> bits vs 64).

Hmm... does that mean that the Mill has to do special things to
address arrays > 4 GB?

Thomas Koenig

unread,
Jun 27, 2022, 1:37:35 AM6/27/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

It does not have to, but it makes sense to do so (and is
now done) if the same constant is stored multiple times, see
https://github.com/bagel99/llvm-my66000/issues/2 .

BGB

unread,
Jun 27, 2022, 2:51:05 AM6/27/22
to
Agreed, sometimes it makes sense.


FWIW: In BGBCC, constants are generally treated like a sort of read-only
variable, and so may be pulled into a register like that of a variable.

Though, admittedly, my compiler isn't smart enough to make decisions
about when it might be better to use an immediate, or pull the value
into a register (this decision would need to be made at the
code-generation level rather than the assembler level). This would
likely require adding the use of a heuristic of some sort.
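
Something along these lines is what I'd imagine (C sketch; the thresholds
and the Imm9 cutoff are made-up numbers, not anything BGBCC currently
implements):

#include <stdint.h>

typedef struct {
    int64_t value;
    int     use_count;   /* static uses of this constant in the function */
    int     in_loop;     /* any use inside a loop body?                  */
} ConstInfo;

static int fits_imm(int64_t v, int bits)
{
    int64_t lim = (int64_t)1 << (bits - 1);
    return v >= -lim && v < lim;
}

/* Nonzero: keep the constant in a register; zero: re-encode it as an
   immediate at each use. */
int keep_in_register(const ConstInfo *c)
{
    if (fits_imm(c->value, 9))            /* cheap everywhere, never hoist */
        return 0;
    if (c->use_count >= 3)                /* big constant, several uses    */
        return 1;
    if (c->in_loop && c->use_count >= 2)  /* repeated inside a loop        */
        return 1;
    return 0;                             /* default: use immediate forms  */
}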

Though, I would suspect cases like that shown are probably in the
minority of cases.


The specific case shown in the issue-tracker would not come up in BJX2,
because it doesn't support directly storing a constant to memory, so
would be forced in this case to load the value into a register and then
store it to memory, and by extension the register would likely be reused
across the multiple stores (since each time one tries to reload it, the
compiler will see that it was already present in a register).

There are cases where this could be useful, just I don't expect they
would likely come up often enough to justify the encoding.

Ivan Godard

unread,
Jun 27, 2022, 10:27:21 AM6/27/22
to
Arrays can be of any size that mmap is willing to give you. Generated
code for constant offsets bigger than 2^32 builds a pointer by explicit
arithmetic rather than by using the address adder. The arithmetic costs
two instructions (con, addp) and a cycle, but saves in the AA for all
offsets < 2^32. We judge that to be a worthwhile tradeoff.

EricP

unread,
Jun 27, 2022, 1:11:03 PM6/27/22
to
Is there something odd about the assembler output at the top, where
it seems to repeatedly spill a register to the local stack frame
and then use it as a pointer? e.g.

std r26,[sp,216]
.loc 1 268 1 ; fatigue2.f90:268:1
std #-4317352126650676160,[r26]

plus does this 5 other times.
Just checking.

BGB

unread,
Jun 27, 2022, 3:31:19 PM6/27/22
to
On 6/26/2022 10:14 PM, BGB wrote:
> On 6/26/2022 8:18 PM, MitchAlsup wrote:
>> On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
>>> On 6/26/2022 4:56 PM, MitchAlsup wrote:
>>>> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com
>>>>>> wrote:
>>>>

<snip>

>
> My case is pretty much comparable to x86-64 in what you can do with
> immediate and displacement fields:
>   x86-64: 8 and 32; 64 via a dedicated load.
>   BJX2: 9 and 33, 64 via a dedicated load.
>
> But, differs in that it has 3R, and is Load/Store.
>
> Also, I have both more registers, and encodings that are often smaller.
>   Though, x86-64 has:
>     MOV Rn, Imm64 (80-bits)
>   Where:
>     LDI Imm64, Rn (96-bits)
>   So, x86-64 wins this one...
>
>
> However, compared with some of the other options (eg: SH-4 or Thumb2),
> x86-64's code density seems to be "pretty bad".
>
> Even something like ARMv5 (with fixed-length 32-bit instructions) seems
> to be able to beat x86-64 on this metric.
>
>


So, thought is, comparing a few common-case encodings for size between
x86-64 (x64) and BJX2 (mostly for cases where direct analogs exist):
MOV Rxx, Imm64 | LDI Imm64, Rn
10(x64) vs 12(BJX2), x64 wins
MOV Rxx, Imm32 | LDI Imm8, Rn
7(x64) vs 2(BJX2), BJX2 wins
(x86-64 lacking a smaller immediate than Imm32 here)

MOV Rxx, [Rb+Ri*8] | MOV.Q (Rm, Ri), Rn
4(x64) vs 4(BJX2), tie
MOV Rxx, [Rb+Disp8] | MOV.Q (Rm, Disp9u), Rn
4 (x64) vs 4(BJX2), tie

MOV Rxx, [Rb+Ri*Sc+Disp8] | MOV.Q (Rb, Ri*Sc, Disp11), Rn
5(x64) vs 8(BJX2), x64 wins
MOV Rxx, [RIP+Disp32] | MOV.Q (PC, Disp33), Rn
7(x64) vs 8(BJX2), x64 wins

ADD Rxx, Rxx | ADD Rm, Rn
3(x64) vs 2(BJX2), BJX2 wins

ADD Rn, Imm32 | ADD Imm8, Rn
7(x64) vs 2(BJX2), BJX2 wins
(another scenario where x86-64 lacks smaller immeds)
(You don't get byte values unless working on byte registers).
ADD Rn, Imm32 | ADD Imm16s, Rn
7(x64) vs 4(BJX2), BJX2 wins
ADD Rn, Imm32 | ADD Imm33s, Rn
7(x64) vs 8(BJX2), x64 wins
MOV Rt, Imm64; ADD Rn, Rt | ADD Imm64, Rn
13(x64) vs 12(BJX2), BJX2 wins


In many areas where there is a direct 1:1 comparison, x86-64 seems to be
ahead; Except in cases where x86-64 only has a 32-bit immediate.

I am generally assuming encodings here where REX is used (typical case,
basically required to use registers as 64-bit).


I decided to allow comparing REX-prefixed encodings against 16-bit
encodings on the basis that both cases have the same number of usable
registers (16 in this case).

In general, BJX2 has 2x or 4x (XGPR) as many GPRs as x86-64.
Most ops from R16..R31, or R32..R63, will require 32-bit encodings.


In many cases, speed-optimized modes would add a penalty for BJX2, as in
this mode it significantly reduces the number of 16-bit encodings used.
This is partly due to a mix of register assignments (nearly always
enables R16..R31), and because it needs 32-bit encodings for the
WEXifier to be able to do its thing (instruction-level swap-and-bundle
would not be viable with 16-bit encodings thrown in the mix).

The 16-bit encodings remain fairly common in prolog/epilog sequences
though (and in a lot of the hand-written ASM).



Main differences come up in areas where things are not 1:1, eg:
MOV Rnn, Rss; ADD Rnn, Rtt | ADD Rs, Rt, Rn
6(x64) vs 4(BJX2), BJX2 wins
MOV Rxx+0, [RSP+Disp]; MOV Rxx+1, [RSP+Disp] | MOV.X (SP, disp4), Xn
10(x64) vs 2(BJX2), BJX2 wins

Would be a bigger difference for more extreme cases:
MOV Rnn0, Rss0; MOV Rnn1, Rss1;
CLC; ADC Rnn0, Rtt0; ADC Rnn1, Rtt1
Vs:
ADDX Xs, Xt, Xn
13(x64) vs 4(BJX2), BJX2 wins.

3R (Basic):
MOV Rnn, Rss; ADD Rnn, Rtt
Vs:
ADD Rs, Rt, Rn
6(x64), 4(BJX2), BJX2 wins.

3R (Load):
MOV Rnn, Rss; ADD Rnn, [Rtt, Disp8]
Vs:
MOV.Q (Rt, Disp9), Rx; ADD Rs, Rx, Rn
7(x64) vs 8(BJX2), x64 wins.


Or, predication:
CMP Rss, Rtt; JGT .L0; MOV Rnn, Rtt; .L0:
Vs:
CMPQGT Rt, Rs; MOV?F Rt, Rn
8(x64), 6(BJX2), BJX2 wins


...


Will mostly ignore SIMD / SSE here.



One other area of differences is that x86-64 allows the source or
destination to be memory.

This could give an advantage to x86-64 in cases where it operates on
memory, eg:
ADD Rn, [Rs] | MOV.Q (Rs), Rx; ADD Rx, Rn
3 (x64) vs 4 (BJX2), x64 wins

ADD Rn, [RSP+8] | MOV.Q (SP, 8), Rx; ADD Rx, Rn
5 (x64) vs 4 (BJX2), BJX2 wins

ADD [Rn], Rs | MOV.Q (Rn), Rx; ADD Rs, Rx; MOV.Q Rx, (Rn)
3 (x64) vs 6 (BJX2), x64 wins

ADD [RSP+8], Rs | MOV.Q (SP, 8), Rx; ADD Rs, Rx; MOV.Q Rx, (SP, 8)
5 (x64) vs 6 (BJX2), x64 wins

...


Though, BGBCC tries to minimize the number of loads and stores, whereas
a lot of generated x86-64 code uses memory operands fairly often.

Granted, one can make use of having a larger register space (32 or 64)
by statically-assigning commonly used variables to registers, which is
less viable with 16 registers.


Ironically, this is an area where RISC-V could have an advantage with
the 'A' extension, which does allow a limited set of operations to use
direct memory operands. But, I am less a fan of this, as I would assume
sticking with plain Load/Store unless there is a good reason to do
otherwise.

In effect, doing something like this would likely involve needing to
stick an additional ALU into the L1 D$ or similar. One could argue
though that it could make sense on the basis that Load+Op and
Load+Op+Store sequences are still "not particularly rare"...



Looking at it on this level, x86-64 and BJX2 should be more comparable
in terms of code density; and x86-64 a little more competitive on this
metric than what I am often seeing.

This is not what I see in practice though, where the x86-64 binaries
tend to be quite a bit larger (assuming uncompressed binary sizes).


I suspect that quite possibly the compilers are wasting a large amount
of space somewhere (though ".text" still tends to be pretty large, IME,
even for size-optimized builds).

More so given the (often fairly large) size delta between 32 and 64 bit
binaries (somewhat beyond what could be explained away via the REX byte,
PUSH going from 1 to 3 bytes, ...).


While ABIs are different, I can note that both x64 and BJX2 tend to use
a similar ABI design in this area (passing arguments in registers, ...).


Granted, I guess the counter-point would be if other people are not
seeing x86-64 binaries that are seemingly "overly large for no
particularly obvious reason".

At first, I was thinking it was mostly an MSVC thing, but then noted
that GCC also appears to be doing this in my case.

I could speculate on possible causes, but lack anything solid at the moment.

...

MitchAlsup

unread,
Jun 27, 2022, 6:24:48 PM6/27/22
to
An interesting list. It would be informative if you looked not at instruction pairings
(x86-64 versus BJX2) but semantic pairings::
a) -------leaf subroutine with 4 arguments 1 result prologue+epilogue
b) non-leaf subroutine with 8 arguments 1 result prologue+epilogue
for both instruction size and expected latency.
<
>
>
> Though, BGBCC tries to minimize the number of loads and stores, whereas
> a lot of generated x86-64 code uses memory operands fairly often.
>
> Granted, one can make use of having a larger register space (32 or 64)
> by statically-assigning commonly used variables to registers, which is
> less viable with 16 registers.
>
>
> Ironically, this is an area where RISC-V could have an advantage with
> the 'A' extension, which does allow a limited set of operations to use
> direct memory operands. But, I am less a fan of this, as I would assume
> sticking with plain Load/Store unless there is a good reason to do
> otherwise.
<
The A option basically mandates the pipeline grow by 3 cycles.
>
> In effect, doing something like this would likely involve needing to
> stick an additional ALU into the L1 D$ or similar. One could argue
> though that it could make sense on the basis that Load+Op and
> Load+Op+Store sequences are still "not particularly rare"...
>
Yep, you identified those 3 cycles. I have called this the IBM 360 pipeline
model in the past.

BGB

unread,
Jun 27, 2022, 11:11:02 PM6/27/22
to
Hmm, simple scenarios, assuming "reasonably efficient" code generation.

For x86-64, will assume the Win64 ABI.


int add4(int a, int b, int c, int d)
{ return(a+b+c+d); }

BJX2:
add4:
ADD R4, R5, R16 | ADD R6, R7, R17 //8B, 2c (interlock)
ADD R16, R17, R2 //4B, 1c
RTS //2B, 2c
Cost: 14 bytes, 5 cycles.

Or, size-optimizing:
add4:
ADD R4, R5 //2B, 1c
ADD R6, R7 //2B, 1c
ADD R5, R7, R2 //4B, 1c
RTS //2B, 2c
Cost: 10 bytes, 5 cycles


x86-64:
add4:
ADD ECX, EDX //2B, 0c ? (naive model)
ADD R8D, R9D //3B, 1c
MOV EAX, ECX //2B, 1c
ADD EAX, R8D //3B, 1c
RET //1B, 2c
Cost 11 bytes, latency ~ 5 cycles?...


int add8(int a, int b, int c, int d, int e, int f, int g, int h)
{ return(add4(a, b, c, d)+add4(e, f, g, h)); }


BJX2 (assuming we know nothing about the called function):
add8:
ADD -48, SP //2B, 1c
MOV LR, R1 //2B, 1c
MOV.Q R1, (SP, 40) //2B, 1c
MOV.Q R14, (SP, 32) //2B, 1c
MOV.X R12, (SP, 16) //2B, 1c
MOV.X R10, (SP, 0) //2B, 1c

MOVX R20, R10 //4B, 1c
MOVX R22, R12 //4B, 1c
BSR add4 //4B, 2c
MOV R2, R14 //2B, 1c

MOVX R10, R4 //4B, 1c
MOVX R12, R6 //4B, 1c
BSR add4 //4B, 2c

ADD R14, R2 //2B, 1c

MOV.Q (SP, 40), R1 //2B, 1c
MOV.X (SP, 0), R10 //2B, 1c
MOV.X (SP, 16), R12 //2B, 1c
MOV.Q (SP, 32), R14 //2B, 1c
ADD 48, SP //2B, 1c
JMP R1 //2B, 2c

Cost, 52 bytes, 22 cycles (excluding called functions).
32 cycles with called functions.


BGBCC would not currently be able to generate the above though.

This has a penalty case because the function has 8 arguments in
registers, and a function call requires putting 4 of them "somewhere
non-volatile".


I guess, "what if we know which registers the called function stomps"?

add8:
MOV LR, R1 //2B, 1c
BSR add4 //4B, 2c
MOV R2, R19 //2B, 1c
MOVX R20, R4 //4B, 1c
MOVX R22, R6 //4B, 1c
BSR add4 //4B, 2c
ADD R19, R2 //2B, 1c
JMP R1 //2B, 2c

Cost, 24B, 11 cycles.
Combined, 22 cycles.

However, the C ABI rules wouldn't really allow this.



x86-64 (following ABI rules):
add8:
PUSH RBP //3B, ~ 1
MOV RBP, RSP //3B, ~ 1
SUB RSP, 40 //7B, ~ 1
CALL add4 //5B, ~ 2
MOV [RBP-8], RAX //4B, ~ 1
MOV RCX, [RBP+40] //4B, ~ 1
MOV RDX, [RBP+48] //4B, ~ 1
MOV R8, [RBP+56] //4B, ~ 1
MOV R9, [RBP+64] //4B, ~ 1
CALL add4 //5B, ~ 2
ADD EAX, [RBP-8] //4B, ~ 3
MOV RSP, RBP //3B, ~ 1
POP RBP //3B, ~ 1
RET //1B, ~ 2

Cost: 54 bytes, 19 cycles.


x86-64 (lax ABI rules):
add8:
CALL add4 //5B, ~ 2
PUSH RAX //3B, ~ 1
MOV RCX, [RSP+40] //5B, ~ 1
MOV RDX, [RSP+48] //5B, ~ 1
MOV R8, [RSP+56] //5B, ~ 1
MOV R9, [RSP+64] //5B, ~ 1
CALL add4 //5B, ~ 2
POP RCX //3B, ~ 1
ADD EAX, ECX //2B, ~ 1
RET //1B, ~ 2

Cost: 39 bytes, est ~ 13 cycles.

(This breaks several major rules for the C ABI, so probably wouldn't be
allowed).


As noted above, the ABI only having 4 arguments in
registers turns out to be helpful in this case, as it is not necessary
to save or restore them before making another call.


In this scenario though, x86-64 does appear to be putting up a pretty
good fight.

...


> <
>>
>>
>> Though, BGBCC tries to minimize the number of loads and stores, whereas
>> a lot of generated x86-64 code uses memory operands fairly often.
>>
>> Granted, one can make use of having a larger register space (32 or 64)
>> by statically-assigning commonly used variables to registers, which is
>> less viable with 16 registers.
>>
>>
>> Ironically, this is an area where RISC-V could have an advantage with
>> the 'A' extension, which does allow a limited set of operations to use
>> direct memory operands. But, I am less a fan of this, as I would assume
>> sticking with plain Load/Store unless there is a good reason to do
>> otherwise.
> <
> The A option basically mandates the pipeline grow by 3 cycles.

Possibly would need to do something, if one assumes they can't just
casually shove an ALU into the middle of the Load or Store operation.


>>
>> In effect, doing something like this would likely involve needing to
>> stick an additional ALU into the L1 D$ or similar. One could argue
>> though that it could make sense on the basis that Load+Op and
>> Load+Op+Store sequences are still "not particularly rare"...
>>
> Yep, you identified those 3 cycles. I have called this the IBM 360 pipeline
> model in the past.

OK.

MitchAlsup

unread,
Jun 28, 2022, 1:34:46 PM6/28/22
to
Thanks.

Terje Mathisen

unread,
Jun 28, 2022, 4:38:25 PM6/28/22
to
If the constant already exists in a register from a previous load, then
obviously you just reuse it, but when the constant is consumed each time
then you allocate one register up front just to reduce the code size
when iterating a loop.

I suspect the loop count needs to be high, and/or the constant happens
to be the 2/4/8 bytes that increases the loop size past some performance
boundary, i.e. VVM won't work?

The key idea is that all such inner loops fits nicely in $L1 code cache,
so preloading constants just increases the total code size in order to
reduce the loop size by something less.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,
Jun 28, 2022, 4:56:44 PM6/28/22
to
On Tuesday, June 28, 2022 at 3:38:25 PM UTC-5, Terje Mathisen wrote:
> Ivan Godard wrote:
> > On 6/26/2022 6:18 PM, MitchAlsup wrote:
> >> My 66000 never has to load a constant to a register. Somewhere close
> >> to 9% of my instruction count vanishes by uniformly supplying large
> >> constants.
> >
> > Doesn't have to, but isn't it advisble when a constant has widespread uses?
> If the constant already exists in a register from a previous load, then
> obviously you just reuse it, but when the constant is consumed each time
> then you allocate one register up front just to reduce the code size
> when iterating a loop.
>
> I suspect the loop count needs to be high, and/or the constant happens
> to be the 2/4/8 bytes that increases the loop size past some performance
> boundary, i.e. VMM won't work?
>
> The key idea is that all such inner loops fits nicely in $L1 code cache,
> so preloading constants just increases the total code size in order to
> reduce the loop size by something less.
<
Both sides of the argument are correct:
a) there are circumstances where code density increases
b) there are circumstances where code performance increases
BOTH ways (preload constant, use directly)
<
And to a certain extent, dependent on whether you are compiling for
performance or density.
<
With any reasonable code cache, the differences are expected to be
tiny. Processing of constants in PARSE and DECODE does not take cycles
on anything bigger than the most minimal implementation ever to be
considered. This implementation would have to be fetching Words from
the I$. My current lowest end fetches quadwords--which is the same size
as the D$ due to misaligned accesses being "fast".*
<
Apparently, one has to look for this situation in an earlier pass of LLVM
in order for the code generator to make use of it.
<
ALSO NOTE: In most implementations, constants do not count against the
maximal number of instructions in a VVM loop. In effect, like Scalar registers,
they are simply pre-delivered values sitting in the instruction queue.
<
And finally, this is one of those things one can fix in the compiler later..........
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
<
(*) As a design point, I like to configure the I$ and D$ such that one
(the D$) can be designed, and then another instance used as the I$
(with a few sequences simply not being used in the I$ block). Verilog
gate-eater will remove the logic not in use, so there is no waste in area
(however tiny). AND making 2 instances of 1 cache design is vastly
easier in the verification side of things, this also saves SCHEDULE.
<
You may choose to do otherwise.

Brett

unread,
Jun 29, 2022, 5:34:57 PM6/29/22
to
59 instructions is overkill, short ACC instructions will only use a
fraction of that.
I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
bits.
I also have 4 loads and 3 stores with no offsets which would be in the
load/store major blocks. Tempted to expand these out with a very short
offset for the 64 bit ops to fill out the major block.
This can all be added to My 66000 using free opcode major blocks without
breaking anything.

Two source, Reg op ACC = ACC

Add, Sub, Mul, And, Or, XOR, NAND,


Two source, const op ACC = ACC

5 bit add, 5 bit sub, 5 bit shift left, 5 bit shift right, bit set, bit
clear.


One source Reg op = ACC

NOT, move to ACC, move from ACC


Load base [Reg] = ACC

Load8, load8signed, load 32, load 64,


Store ACC = [Reg]

Store 8, store 32, store 64,


This may give less code compression than traditional forms, which would
explain why it is not used so far. But it is simpler and easier to
implement.

Stephen Fuld

unread,
Jun 29, 2022, 6:12:08 PM6/29/22
to
On 6/29/2022 2:34 PM, Brett wrote:

snip

> 59 instructions is overkill, short ACC instructions will only use a
> fraction of that.
> I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
> bits.
> I also have 4 loads and 3 stores with no offsets which would be in the
> load/store major blocks. Tempted to expand these out with a very short
> offset for the 64 bit ops to fill out the major block.
> This can all be added to My 66000 using free opcode major blocks without
> breaking anything.

Not quite. You have reduced the range of the offsets in control
transfer instructions by a factor of 2. You either need a 16 bit NOP,
or the ability to have 32 bit instructions on 16 bit boundaries. If you
have a NOP, you either need the ability to throw this away in an early
pipeline stage, or it costs you a cycle. And, of course, you loose some
scheduling flexibility when multiple instructions have the Accumulator
as a destination.

I may be wrong, but it seems like a lot of work for a very modest benefit.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Jun 29, 2022, 6:12:54 PM6/29/22
to
Well there is the rule where the low order 2-bits of IP have to equal zero.
<
But, there are good reasons accumulator design style fell from fashion.
<
Say:: increment stack pointer::
<
MOV ACC, SP
ADD ACC,4
MOV ACC,SP
<
>
> Two source, Reg op ACC = ACC
<
If you are going to write in left to right arithmetic notation use =: or =>
>
> Add, Sub, Mul, And, Or, XOR, NAND,
>
>
> Two source, const op ACC = ACC
>
> 5 bit add, 5 bit sub, 5 bit shift left, 5 bit shift right, bit set, bit
> clear.
>
>
> One source Reg op = ACC
>
> NOT, move to ACC, move from ACC
>
>
> Load base [Reg] = ACC
>
> Load8, load8signed, load 32, load 64,
>
>
> Store ACC = [Reg]
>
> Store 8, store 32, store 64,
>
>
> This may give less code compression than traditional forms, which would
> explain why it is not used so far. But it is simpler and easier to
> implement.
<
The only thing it seems to buy is if you can leave the loop variable in ACC,
which looks tough. I would rather do PDP-11 style.
<
But, look, you want it, why don't you go spend the time to::
a) invent the ISA
b) port the compiler
c) teach the compiler about accumulator code sequences
d) build a simulator
and come back with a report.
<
As I have stated in the past:
I am not interested in this,
but I encourage you to pursue your interests
<
And as a hint I rarely give out:: you get to the end faster by writing the compiler first
and letting the ISA bend itself to the compiler rather than the other way around.
<
I spent 9 years of my career doing interpreters and compilers. My 66000 ISA
is a result of my encounters with instructions which worked with the compiler
(2-operand 1-result) and what didn't (condition codes). Also note: I did pure RISC
designs and x86-64 designs and used these experiences to moderate my views
on both sides. Thus: [Rbase+Rindex<<scale+Displacement] made the cut while
[Rs++] and *[Rs] did not; why ENTER and EXIT made the cut, but POLY did not.
<

MitchAlsup

unread,
Jun 29, 2022, 6:15:39 PM6/29/22
to
On Wednesday, June 29, 2022 at 5:12:08 PM UTC-5, Stephen Fuld wrote:
> On 6/29/2022 2:34 PM, Brett wrote:
>
> snip
> > 59 instructions is overkill, short ACC instructions will only use a
> > fraction of that.
> > I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
> > bits.
> > I also have 4 loads and 3 stores with no offsets which would be in the
> > load/store major blocks. Tempted to expand these out with a very short
> > offset for the 64 bit ops to fill out the major block.
> > This can all be added to My 66000 using free opcode major blocks without
> > breaking anything.
<
> Not quite. You have reduced the range of the offsets in control
> transfer instructions by a factor of 2. You either need a 16 bit NOP,
> or the ability to have 32 bit instructions on 16 bit boundaries. If you
> have a NOP, you either need the ability to throw this away in an early
> pipeline stage, or it costs you a cycle. And, of course, you loose some
> scheduling flexibility when multiple instructions have the Accumulator
> as a destination.
<
In any seriously wide implementation (6-8-10 wide) an accumulator is
no better than a pure stack, and you make HW jump through tons of hoops
to attain reasonable performance.
>
> I may be wrong, but it seems like a lot of work for a very modest benefit.
<
I agree with you, but there is always the chance I could be wrong; so I encourage
him to go out and spend the effort to design ISA, port compiler, build simulator
and then come back and report.

Brett

unread,
Jun 30, 2022, 4:02:05 AM6/30/22
to
You would use a long opcode to do that in one instruction.

This is NOT an accumulator design, it’s a normal RISC design that adds
short form ACC instructions, that are simple one for one replacements of
long instructions.

This is normal RISC, not an accumulator design, don’t be tricked by the ACC
name which is merely an implied register name for opcode size reduction.

NO complex compiler work. You pick dependency chains and attempt to
substitute ACC as the register name to use the short opcodes.

MOV ACC is just there for crazy special cases, I expect it to NEVER be
used, as it increases opcode count and dependency chain length. And
violates the one for one opcode replacement rule.

You can go 10 wide as ACC is just register 32, and is renamed like any
other register.

This is not just for the loop counter, any code where you do not use a
variable twice can be converted to ACC code. A = B * C + D, is three loads,
two ops and a store with no reuse of names, and so is ACC code. And it does
not matter if one of these variables is used twice, that just means one
less short opcode.

You DO need to add ACC as a dest/source for long load/store instructions as
the short form load/store is completely inadequate. Or you can use R1 as
ACC which is an idea you hated. ACC is just a register name, there is
nothing special about it.

It could also be nice if most/all the long integer instructions have an ACC
form to connect to the short instructions if you do not use R1. You have
the opcode space for this, but this is just an optional overkill max code
compression feature.

I forgot about branches and jumps in my ACC opcode list, mostly because
branches need condition codes and offset bits, and could chew up an entire
major block unless there is a small subset that is doable. The median
branch is like 4 instructions…

An IF-THEN-ELSE has a short unconditional jump that can be made into a
short instruction.

Again, this is normal RISC, not an accumulator design, don’t be tricked by
the ACC name which is merely an implied register name used for opcode size
reduction.

I do expect a 30% code size reduction which is close to other 16 bit opcode
forms.

Brett

unread,
Jun 30, 2022, 6:24:54 AM6/30/22
to
I have a new name for my invention:

Implied Register Short Opcodes (IRSO)

All IRSO opcodes use the IMP register which is typically mapped to R1.
IRSO opcodes are shorter because instead of three register specifiers there
is only one, with IMP used as a source and/or dest, plus optionally another
register. The value in IMP is of course destroyed when a new IRSO opcode is
executed, which works like x86 does.

I did not think this approach was patentable, apparently I am wrong. In any
case this idea is free to use for all.

There is an optional extension which adds a IMP2 register specifier
typically mapped to R32, which is only useful if short IRSO loads are used,
which is almost never. In this mode sequential IRSO loads push the old IMP
value into IMP2, there is a second set of IRSO integer opcodes with no
register specifiers, using instead IMP and IMP2 with the result in IMP.

Brett

unread,
Jun 30, 2022, 3:50:55 PM6/30/22
to
Another extension:

Implied Address Register Extension (IADE) for IRSO. (Implied Register Short
Opcodes)

IADE uses the IAD register which is typically mapped to R2.
This enables a reasonable offset size for short IRSO loads/stores.

A short opcode has 6 bits of major opcode and only 10 bits of instruction
payload, one bit specifies load or store, one bit specifies 32 bit or 64
bit data, and the other 8 bits specify a scaled offset to the data. The
address register is IAD and the data is in or goes to the IMP register.
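
In C terms, decoding one of these short load/store opcodes might look
roughly like this (the exact bit positions are assumed for illustration;
only the field widths follow the description above):

#include <stdint.h>

typedef struct {
    unsigned major;      /* 6-bit major opcode group                      */
    unsigned is_store;   /* 0 = load, 1 = store                           */
    unsigned is_64;      /* 0 = 32 bit data, 1 = 64 bit data              */
    unsigned offset;     /* byte offset = 8-bit field scaled by data size */
} IrsoMem;

/* Data goes to/from IMP; the base address comes from IAD. */
IrsoMem irso_decode_mem(uint16_t insn)
{
    IrsoMem d;
    d.major    = (insn >> 10) & 0x3F;
    d.is_store = (insn >>  9) & 1;
    d.is_64    = (insn >>  8) & 1;
    d.offset   = (insn & 0xFF) * (d.is_64 ? 8u : 4u);
    return d;
}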

If there is a second IADE opcode the address register is IAD2, or IAD3 for
a third implied address register, mapped typically to R3, R4.

The IADE opcodes make the IMP2 register more useful, converting more
instructions to short form. I expect that code compression could exceed
40%, which is BETTER than all the other 16 bit opcode formats.

The cost for implementation in the full form is four major opcode groups
and 16 bit alignment of opcodes, which will cost you an offset bit in
branches/jumps/calls. A minimal implementation needs two major opcode
groups, IRSO integer ops and one set of IADE load/store ops.

The full set of IRSO with three IADE will also cost 4 registers of 32,
which is not significant. However if one adds an additional register
coloring pass to the compiler, then the freed up registers that were mapped
to R1-R4 can be used again meaning there is no register cost. I do not
expect anyone to bother with such compiler changes for the first decade of
this instruction set use.

Here you thought My66000 had plenty of spare opcode bits, I am going to use
half of those reserved major groups? ;) ;) ;)

Brett

unread,
Jul 1, 2022, 2:38:10 AM7/1/22
to
Another change, with useful short loads and more of them I am getting rid
of the wacky Mill-like double load. Instead sacrificing one bit of offset
to decide on IMP or IMP2. it’s not clear that IMP2 should be visible or
not, so maybe it stays mapped to R32. As described IMP2 is only accessible
in concert with IMP for integer ops.

This change to IMP2 makes strands of IMP opcodes work, a major oversight.
Strands of reasonable length basically do not exist so I had overlooked
this, even though I used such an example.

A unified integer/float register file works well with this architecture.
You would need another major block to add the IMP/IMP2 float operations.

Stephen Fuld

unread,
Jul 4, 2022, 12:45:16 PM7/4/22
to
On 6/30/2022 1:02 AM, Brett wrote:

big snip about adding 16 bit instructions to a 32 bit ISA


> I do expect a 30% code size reduction which is close to other 16 bit opcode
> forms.

That seems very optimistic. 30% code size reduction means that 60% of
the instructions have to be 16 bit. I just don't think you can get that
much.
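
(Checking the arithmetic: with a fraction f of instructions at 2 bytes and
the rest at 4 bytes, the average size is 2f + 4(1-f) bytes; a 30% reduction
from 4 bytes means 2.8 bytes, so f = 0.6, i.e. 60% of instructions.)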

I did figure out that, assuming your 16-bit subset is a pure subset of
the 32-bit ISA, then you don't need a 16-bit NOP instruction to "fill
out" 32 bits. When you replace a sequence of 32-bit instructions with
16-bit ones, if the last instruction is on an even boundary, which would
otherwise require a NOP to get back to 32-bit alignment, you simply
replace the last 16-bit instruction with its 32-bit equivalent. This
costs you nothing, as you would have wasted the last 16 bits anyway, and
it doesn't cost any extra cycles.

BGB

unread,
Jul 4, 2022, 1:37:02 PM7/4/22
to
On 7/4/2022 11:45 AM, Stephen Fuld wrote:
> On 6/30/2022 1:02 AM, Brett wrote:
>
> big snip about adding 16 bit instructions to a 32 bit ISA
>
>
>> I do expect a 30% code size reduction which is close to other 16 bit
>> opcode
>> forms.
>
> That seems very optimistic.  30% code size reduction means that 60% of
> the instructions have to be 16 bit.  I just don't think you can get that
> much.
>

60% / 40% (16 vs 32) is about what I am able to get in "size optimized"
mode with my ISA. This seems to be about where it converges.

However, for speed-optimized code, it tends to be closer to around 20% /
40% (16 vs 32).


I don't think one can get much better than this, at least with 16/32.

Had experimented with something like 16/24/32, which would have had
24-bit encodings for Load/Store (Disp5) and 3R/3RI ALU ops (and 2R forms
with access to 32 GPRs). This didn't really save enough to be worthwhile
though (and made a mess of some other things), so I basically dropped it
and later reused this part of the encoding space for the 32-bit XGPR
encodings.

Was, IIRC:
60% / 40% (70% vs Fixed-32b)
Vs:
60% / 20% / 20% (65% vs Fixed-32b)
Makes too much of a mess of things to justify saving ~ 5%...
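
(Working the averages out: 0.6*2B + 0.4*4B = 2.8B, about 70% of 4B, while
0.6*2B + 0.2*3B + 0.2*4B = 2.6B, about 65% of 4B; so the 24-bit forms only
bought roughly 5 percentage points.)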



Some factors which affect this:
Register Selection:
Size optimized code needs to avoid using the higher registers.
Opcode Selection:
The smallest encoding is not necessarily the fastest encoding.
WEX:
Generally can't use WEX and have any hope at small code size;
In size-optimized mode, the WEXifier is effectively disabled.


In the past, I had some 48-bit encodings, which could help, but these don't
exist in the current form of the ISA, where their role is filled by the
64-bit encodings; as noted, though, 64 is greater than 48. The savings would
be fairly small, as the 64-bit encodings represent a small minority of the
total instruction count.

However, this does mean that in some cases, two-op sequences can be more
compact than a single instruction:
LDIZ Imm24u, R0 //4B
MOV.x (GBR, R0), Rn //2B
which is fewer bytes than:
MOV.x (GBR, Disp33s), Rn //8B

In this case, it is a tradeoff between saving 2B or spending an extra
clock-cycle.


> I did figure out that, assuming your 16 bit subset is a pure subset of
> the 32 bit ISA, then you don't need a 16 bit NOP instruction to "fill
> out" 32 bits.  When you replace a sequence of 32 bit instructions with
> 16 bit ones, if the last instruction is on an even boundary, which would
> otherwise require a NOP to get back to 32 bit allignment, you simply
> replace the last 16 bit instruction with its 32 bit equivalent.  This
> costs you nothing as you would have wasted the last 16 bits anyway, and
> it doesn't cost any extra cycles.
>

Yes, BGBCC does this trick in some cases.

I had at one point considered modifying the rules so that bundles (and,
by extension, any WEXified sequences) would require 32-bit alignment,
with a little repacking trickery being used to minimize the number of
NOPs that needed to be added to do so.

I ended up backing off on this idea, after noting that this had a
significant adverse effect on code density:
Speed optimized code would have been essentially the same size as had I
just used purely 32-bit encodings...

Stefan Monnier

unread,
Jul 4, 2022, 2:25:03 PM7/4/22
to
> Was, IIRC:
> 60% / 40% (70% vs Fixed-32b)

I assume this is with 32bit instructions that can be 16bit aligned, right?

If you impose 32bit alignment on 32bit instructions, how much worse does
it get?

Also, if you impose 32bit alignment on 32bit instructions, you can look
at it not as 16bit instructions but as 32bit "instruction pairs".
At that point you might look at what kinds of pairs you're using, what
kinds of pairs you're *not* using and what kind of additional pairs might
be worth having instead.
E.g. maybe it is better to make your pairs asymmetric (14bit + 18bit),
e.g. to allow slightly larger offsets in short load instructions
(assuming the size of the offset is a common reason for loads to be
forced to use the long form rather than the short form).


Stefan

MitchAlsup

unread,
Jul 4, 2022, 2:45:36 PM7/4/22
to
On Monday, July 4, 2022 at 1:25:03 PM UTC-5, Stefan Monnier wrote:
> > Was, IIRC:
> > 60% / 40% (70% vs Fixed-32b)
> I assume this is with 32bit instructions that can be 16bit aligned, right?
>
> If you impose 32bit alignment on 32bit instructions, how much worse does
> it get?
<
What happens if the rule is that pairs of 16-bit instructions have to use
different function units?

luke.l...@gmail.com

unread,
Jul 4, 2022, 3:01:20 PM7/4/22
to
On Monday, July 4, 2022 at 5:45:16 PM UTC+1, Stephen Fuld wrote:

> otherwise require a NOP to get back to 32 bit allignment, you simply
> replace the last 16 bit instruction with its 32 bit equivalent. This
> costs you nothing as you would have wasted the last 16 bits anyway, and
> it doesn't cost any extra cycles.

if you don't mind a 48-bit FIFO it is quite easy to have 32-bit
instructions aligned on 16-bit boundaries. of course, if branches
drop the 2 LSBs of addresses (esp. immediate-relative-branches)
then wark.

l.

MitchAlsup

unread,
Jul 4, 2022, 4:03:36 PM7/4/22
to
I don't understand the meaning associated with "wark"
<
It is not really a 48-bit FIFO; it is a 3×½-word buffer/FIFO.
The unit of movement is ½ word, not 1 bit.
<
The smallest My 66000 implementation will contain at least 5 entries of
instruction buffering (latch array), each entry 4 words long, with
3 ports (1 in PARSE, 1 in DECODE, 1 in StoreData, which is in PARSE)
<
Within this buffer is scan-ahead logic which annotates the
starting point of the next instruction (variable length), and looks
for PIC control transfers so instructions at those targets can be
fetched before the branch is DECODEd, shrinking the minimum
cycle count of the branch from 2 cycles to 1.108 cycles even
without a delay slot or a branch target buffer. Unconditional
branches and RETs may be pulled entirely out of the instruction
stream.
<
So, one would expect such a buffer to already exist and be large
enough. As currently configured it works on 1×word boundaries.
<
All labels would have to be on word boundaries.
>
> l.

BGB

unread,
Jul 4, 2022, 7:14:29 PM7/4/22
to
On 7/4/2022 1:24 PM, Stefan Monnier wrote:
>> Was, IIRC:
>> 60% / 40% (70% vs Fixed-32b)
>
> I assume this is with 32bit instructions that can be 16bit aligned, right?
>

Yes, in BJX2 all of the instructions (and bundles) have 16-bit
alignment, and 32-bit instructions are defined in terms of pairs of
16-bit words.

In my notation, F0nm_ZeoZ, the words are effectively transposed, so if
seen directly as a little-endian DWORD, it would be:
ZeoZF0nm

Or, in bytes: nm_F0_oZ_Ze

There is also a funky bit-ordering effect, because bit-order is
effectively big endian in terms of the instruction words, but little
endian within the instruction words.



> If you impose 32bit alignment on 32bit instructions, how much worse does
> it get?
>

Significantly.

Crude estimate is that, for the 40% of 32-bit instructions, roughly half
would require converting the adjacent instruction to 32-bit, turning it
from 60/40 to around 40/60.

For speed-optimized code, it would nearly entirely knock 16-bit
encodings out of the picture.


This is assuming a fairly ad-hoc mix of 16 and 32-bit encodings.


There would be less impact if things were clustered into separate
instruction blocks (more like how Thumb worked on the ARM7T or similar);
or how it "would have worked" on the SH-5 (which would have been
basically "Sort of like MIPS but with 64 registers", *1).

*1: Not sure if SH-5 was effectively also another casualty of the
Itanic; either way, I think at the time Hitachi folded its CPU design
arm out into Renesas, who were apparently much more invested in keeping
the SH2 and SH4 going.


I didn't really want to go this direction though.


I had at one point considered fixed-length 64 bit bundles, say:
1x 60-bit (large instruction)
2x 30-bit (medium instruction)
3x 20-bit (small instruction)


But, as noted, didn't go this direction either:
When I later partially rebooted my BJX1 project into BJX2, I stayed with
a 16/32 instruction format (so, similar category as Thumb2 or RVC).

I did somewhat reorganize the encodings in an attempt to make things
more consistent (as by the end, the original BJX1 instruction encodings
had turned into an awful mess).

As can be noted, a fair chunk of the original parts were carried over
from one project to another.


> Also, if you impose 32bit alignment on 32bit instructions, you can look
> at it not as 16bit instructions but as 32bit "instruction pairs".
> At that point you might look at what kinds of pairs you're using, what
> kinds of pairs you're *not* using and what kind of additional pairs might
> be worth having instead.
> E.g. maybe it is better to make your pairs asymmetric (14bit + 18bit),
> e.g. to allow slightly larger offsets in short load instructions
> (assuming the size of the offset is a common reason for loads to be
> forced to use the long form rather than the short form).
>

Having 32-bit instruction words which could be split into a logical pair
of 16-bit instructions is effectively what ISAs like Qualcomm Hexagon
and similar did.


MitchAlsup

unread,
Jul 4, 2022, 7:18:38 PM7/4/22
to
On Monday, July 4, 2022 at 6:14:29 PM UTC-5, BGB wrote:

> There is also a funky bit-ordering effect, because bit-order is
> effectively big endian in terms of the instruction words, but little
> endian within the instruction words.
<
Bad call..........go entirely BE or LE.

BGB

unread,
Jul 4, 2022, 8:53:43 PM7/4/22
to
Consistent bit ordering is made more complicated when dealing with
variable length instruction encodings, so a certain level of "dog
chewing" is seemingly inevitable.


But, say:
F0xx_Cyyy
Is actually:
CyyyF0xx (or: xx_F0_yy_Cy)
Which is decoded as, say:
000xxyyy or FFFxxyyy

Or, FEaa_bbcc_FAdd_eeff
Is actually:
ffeeFAddccbbFEaa (or: aa_FE_bb_cc_dd_FA_ee_ff)
But, decoded as:
0000aabbccddeeff


Still, I feel this situation in BJX2 isn't quite as bad as it is in
RISC-V (and nowhere near the same levels of shifting and masking needed
for manual bit twiddling).

Like, the subtlety that may not be immediately obvious with RISC-V is
that while many of the instructions have consecutive runs of bits, those
runs are frequently not consecutive in terms of how one needs to decode
the value.

BGB

unread,
Jul 4, 2022, 8:54:17 PM7/4/22
to
Mine does a 96-bit fetch each cycle from a pair of cache-lines.

Say, 2x128b (2 cache lines).
Select (191:0) or (255:64) based on Bit 3
Select (127:0) or (159:32) based on Bit 2
Select ( 95:0) or (111:16) based on Bit 1
Currently, I$ ignores Bit 0.

Granted, potentially not the most LUT efficient strategy.
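
In C, the three selects compose to a single halfword offset, roughly like
this (a sketch of my reading of the cascade, not the actual Verilog):

#include <stdint.h>
#include <string.h>

/* 'pair' is the 32 bytes from the two fetched 128-bit lines; PC bits
   3..1 pick which of the 8 halfword positions the 96-bit window starts
   at (4*bit3 + 2*bit2 + 1*bit1); bit 0 is ignored by the I$. */
static void fetch96(const uint16_t pair[16], uint32_t pc, uint16_t out[6])
{
    unsigned hw = (pc >> 1) & 7;
    memcpy(out, &pair[hw], 6 * sizeof(uint16_t));
}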

Fetch also determines the bundle length, which is fed (along with the
current PC) through the branch predictor (which operates with the
instruction from the ID1 stage), effectively selecting whether to
advance the PC by one instruction, or to branch to a different location.
The result then comes back to the I$, and becomes the fetch address for
the next clock cycle.


The LSB of PC is not used during fetch, but may be used (in a branch) to
trigger an Inter-ISA jump.

I had used both low-order bits for this, where:
x0: Normal Addr (16b aligned)
01: Jump to "other ISA" (32b aligned).
11: Use link-register semantics (currently used mode).

Does work, but requires 32-bit alignment of the target during an
inter-ISA jump.

Could change this to:
x0: Normal
x1: Use Link-Register Semantics
This can encode Inter-ISA implicitly via bits 63:48.
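
As a sketch (my reading of the two schemes, in C):

#include <stdint.h>

/* Current scheme: x0 -> normal, 01 -> inter-ISA jump, 11 -> link-reg. */
static int is_inter_isa(uint64_t target) { return (target & 3) == 1; }
static int is_link_sem (uint64_t target) { return (target & 3) == 3; }

/* Proposed scheme: only bit 0 carries meaning (link-register semantics),
   with inter-ISA encoded implicitly via bits 63:48 instead. */
static int is_link_sem2(uint64_t target) { return (target & 1) == 1; }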

The issue would be that this would break binary compatibility with any code
currently depending on the original behavior (unless Bit 1 is ignored); my
existing ROM code and similar always use '11' above. Then again, this may
be a "change behavior soon or be forever stuck with it" scenario.

...

Agner Fog

unread,
Jan 30, 2023, 3:18:29 AM1/30/23
to
Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.

I have some questions about the loop vectorization mechanism: How can the CPU hardware convert sequential code to SIMD? If the loop count is not known at the loop entry, the hardware must decode the entire loop before it can know the necessary vector length and what SIMD resources to use. Where does it store the decoded loop? What if the loop is too big to store in the CPU? What about branches? Nested loops? Function calls inside loops?

MitchAlsup

unread,
Jan 30, 2023, 1:50:03 PM1/30/23
to
On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
>
Hello, Agner, and welcome to comp.arch.
>
> Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.
<
Email me. I can send you "Principles of Operation" (a.k.a. the ISA) and
"Software" without NDA; access to "System" is still under NDA--which we can
discuss in e-mail.
>
> I have some questions about the loop vectorization mechanism:
> How can the CPU hardware convert sequential code to SIMD?
>
{The below reads like I am talking with someone who already
has a few glints about how Virtual Vector Method works. Sorry.}
<
Hardware recognizes that the loop has narrow width and that several
iterations can be run in parallel at the same time. For example::
a byte copy loop can be recognized as "dense" and 8 bits wide.
Even the 1-wide in-order machine has 128-bit wide cache access
ports, so the HW can transform the loop into multiple lanes of 1-byte
LDs and STs, and thus perform the loop 16 iterations per loop
latency. This low-end machine only has a single cache port, so
the loop latency is at least 2 cycles (1 LD and 1 ST) and the LOOP
instruction runs in 1 cycle, so one can move 128 bits every
2 cycles (not impressive but illustrative.) The scalar 1-wide
machine takes 5 instructions to perform the loop and has moved
1 byte in 5 cycles.
So, the vectorized loop is 16 bytes every 2 cycles compared to
1 byte every 5 cycles, or 40× faster.
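
The loop in question is just the ordinary scalar source, e.g. (plain C,
nothing vector-specific appears in it):

#include <stddef.h>

/* An ordinary byte-copy loop. Under VVM the hardware observes that the
   body is dense and 8 bits wide and spreads iterations across its
   128-bit cache ports; the source and the ISA stay scalar. */
void byte_copy(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}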
>
Bigger machines with more resources and wider (or more) cache
ports can perform the loop faster. A 4-wide Out-of-Order machine
with two (2) cache ports, each 256 bits wide, can run this loop at
32 iterations per cycle.
>
Why VVM at all ????? This is really the question
<
1st generation RISC machines performed at 0.7 instructions per
cycle. Modern Great Big Massively Out of Order Machines perform
at 2.0 instructions per cycle (only 3× better).
<
A majority of all CPU cycles are spent in loops (say 80%-ish). If VVM
can make ½ of all loops run only 4× faster, then we have a little bitty
in-order machine performing at 1.5 IPC with a die footprint about
10% that of the GBMOoO machine.
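(The arithmetic behind that: 80% of cycles in loops, half of them
vectorized, is 40% of all cycles sped up 4×; total time becomes
0.6 + 0.4/4 = 0.7 of the original, i.e. roughly a 1.4×-1.5× throughput
gain for the little in-order machine.)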
<
We also have the property that vector width is not a part of the architectural
specification; so the same code that runs optimally on the LBIO machine
also runs optimally on the GBMOoO machines. This preserves the software
investment.
<
> If the loop count is not known at the loop entry, the hardware must decode the entire loop before it can know the necessary vector length and what SIMD resources to use.
<
No. Why do you think loop count alters the data-flow analysis the
hardware performs on behalf of the loop?
<
You need to remember: the Virtual Vector Method has no vector
(or SIMD) registers, and no trip count or masking register like vector
machines and SIMD machines have. Masking is performed lane by lane
with predication; vector length is as long as it wants and needs to
be.
<
VVM is not a HW mechanism that translates scalar code to vector
(or SIMD) code (instruction set). It is a data-flow analysis attached
to a SIMD-like set of flip-flops near the cache ports of an
implementation. You can call these registers SIMD, but SW cannot access
them and they disappear on interrupts, exceptions, and loop termination.
<
VVM also has the property that if an exception happens in the middle
of the loop, the loop collapses to scalar form, the exception is
processed, and SW debugging will see a scalar machine; when (if)
control returns, the VVM loop will be reinstalled when the LOOP
instruction is encountered.
<
So, My 66000 provides the 373 vector instructions and the 700-odd
SIMD instructions of RISC-V in exactly 2 instructions. At some point::
a Reduced Instruction Set Computer should have a reduced instruction
set! My 66000 currently has 62 instructions. In comparison, ARM-64
has 1730-ish instructions.
<
> Where does it store the decoded loop?
<
Different implementations are free to choose this for themselves.
<
But, in general, one would expect the instructions to be stored in
the equivalent of the reservation stations*. These stations have
become modified to hold onto the instruction (and scalar operands)
only waiting for the LOOP instruction to signal "another loop", and
wait for any dynamic operand(s). Think:: multi-fire RS.
<
(*) one could imagine a scoreboard, dispatch stack, or any kind
of instruction and data-flow matching instruction dispatch point
in an implementation. How it is done is implementation dependent.
<
> What if the loop is too big to store in the CPU?
<
The front part of the loop is installed, while the back part of the
loop is performed as if the loop remained scalar. This gradually
degrades to "a little faster than scalar" on huge loops.
<
> What about branches?
<
One can use predication in vectorized loops but not branches.
HW assumes that all taken control transfers terminate the loop.
HW assumes predication provides the if-then-else within the loop.
<
> Nested loops?
<
There is no nesting of vectorization; outer loops run scalar.
<
> Function calls inside loops?
<
Same as branches.

MitchAlsup

unread,
Jan 30, 2023, 9:11:11 PM1/30/23
to
On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
> Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.
<
If your Google-fu is out of practice: email is:: Mitch...@aol.com
<

luke.l...@gmail.com

unread,
Jan 31, 2023, 8:00:15 AM1/31/23
to
On Monday, January 30, 2023 at 6:50:03 PM UTC, MitchAlsup wrote:
> On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
> >
> Hello, Agner, and welcome to comp.arch.

likewise

> You need to remember, The Virtual Vector Method has no vector
> (or SIMD) registers, no trip count or masking register like vector
> machines and SIMD machines. Masking is performed lane by lane
> with predication, vector length is as long as it wants and needs to
> be.

as long as the hardware has resources to autoallocate inflight uOps.
one thing Mitch does not mention is that VVM relies on identifying
Load-Process-Store loops. you cannot Vectorise In-register-only
loops (as best i know) because (as best i know) VVM only works
with Load-Process-Store. given that that is a massive percentage
of all general purpose loops you get a very high bang-per-buck.
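
e.g. (toy illustrations of the distinction, mine not Mitch's):

#include <stddef.h>
#include <stdint.h>

/* load-process-store: each iteration loads, computes, and stores --
   the shape VVM can spread across lanes. */
void scale(int32_t *dst, const int32_t *src, size_t n, int32_t k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* in-register-only: no memory traffic in the loop body, so (as best
   understood here) it stays scalar under VVM. */
uint64_t popcount_slow(uint64_t x)
{
    uint64_t c = 0;
    while (x) { c += x & 1; x >>= 1; }
    return c;
}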

> So, My 66000 provides the 373 vector instructions and the 700-odd
> SIMD instructions of RISC-V in exactly 2 instructions.

mmm... a little overstated but pretty much, yes :) the big big
advantage of VVM is you do NOT add one scalar instruction
then also a vector instruction and a tangled morass of
identical SIMD instructions: you just add one Scalar. therefore
anything missing is way easier to add.

> At some point::
> Reduced Instruction Set Computer should have a reduced instruction
> set! My 66000 currently has 62 instructions. In comparison ARM-64
> has 1730-ish instructions.

did you remember to include the 1000s from SVE/2? :)
all gone to hell and not even done well, sigh. non-orthogonal
as they tried to jam into 32bit.

> <
> > Where does it store the decoded loop?
> <
> Different implementations are free to choose this for themselves.
> <
> But, in general, one would expect the instructions to be stored in
> the equivalent of the reservation stations*. These stations have
> become modified to hold onto the instruction (and scalar operands)
> only waiting for the LOOP instruction to signal "another loop", and
> wait for any dynamic operand(s). Think:: multi-fire RS.

another way to think of it is, you have a loop short enough to identify
the start and end, therefore you micro-code back-end SIMD and
multi-issue to the back-end *implicitly*... all without actually needing
any EXPLICIT SIMD instructions.

separate in your mind "SIMD front-end ISA" from "SIMD back-end micro-architecture"
and you will do fine.

> > What about branches?
> <
> One can use predication in vectorized loops but not branches.
> HW assumes that all taken control transfers terminate the loop.
> HW assumes predication provides the if-then-else within the loop.
> <
> > Nested loops?
> <
> There is no nesting of vectorization. outer loops run scalar.

this is possible with SVP64, by storing Vectorisation State on the stack,
but SVP64 is a whole different ballgame and a different paradigm.

l.

BGB

unread,
Jan 31, 2023, 10:47:05 AM1/31/23
to
On 1/31/2023 7:00 AM, luke.l...@gmail.com wrote:
> On Monday, January 30, 2023 at 6:50:03 PM UTC, MitchAlsup wrote:
>> On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
>>>
>> Hello, Agner, and welcome to comp.arch.
>
> likewise
>
>> You need to remember, The Virtual Vector Method has no vector
>> (or SIMD) registers, no trip count or masking register like vector
>> machines and SIMD machines. Masking is performed lane by lane
>> with predication, vector length is as long as it wants and needs to
>> be.
>
> as long as the hardware has resources to autoallocate inflight uOps.
> one thing Mitch does not mention is that VVM relies on identifying
> Load-Process-Store loops. you cannot Vectorise In-register-only
> loops (as best i know) because (as bext i inow) VVM only works
> with Load-Process-Store. given that that is a massive percentage
> of all generl purpose loops you get a very high bang-per-buck.
>
>> So, My 66000 provides the 373 vector instructions and the 700-odd
>> SIMD instructions of RISC-V in exactly 2 instructions.
>
> mmm... a little overstated but pretty much, yes :) the big big
> advantage of VVM is you do NOT add one scalar instruction
> then also a vector instruction and a tangled morass of
> identical SIMD instructions: you just add one Scalar. therefore
> anything missing is way easier to add.
>

But, it does mean that the CPU needs to be smarter.
The main advantage of SIMD is that, while it requires more instructions,
it allows for a comparably simpler and dumber CPU.

The main drawback case is when one needs to fake certain SIMD ops via
pipelining.


>> At some point::
>> Reduced Instruction Set Computer should have a reduced instruction
>> set! My 66000 currently has 62 instructions. In comparison ARM-64
>> has 1730-ish instructions.
>
> did you remember to include the 1000s from SVE/2? :)
> all gone to hell and not even done well, sigh. non-orthogonal
> as they tried to jam into 32bit.
>

It is a tradeoff.

I can note that SIMD in BJX2 has neither packed byte ops nor saturating ops.

If one mostly only does, say:
Packed Int16
Packed Int32
Packed Binary16
Packed Binary32

And partial:
A few Packed Binary64 ops.


That is comparably fewer ops.
If one skips the convert+op forms (as were present in NEON), this also
reduces the amount of encoding space needed.

Some other cases can be handled via converter ops:
Packed Byte;
Packed RGB555 / RGB444A3;
Packed FP8S/FP8U;
...


This does at least slightly reduce instruction cost.
One other trick (used by SuperH) is to encode parts of the ISA by
twiddling mode bits. Technically it sucks though.


>> <
>>> Where does it store the decoded loop?
>> <
>> Different implementations are free to choose this for themselves.
>> <
>> But, in general, one would expect the instructions to be stored in
>> the equivalent of the reservation stations*. These stations have
>> become modified to hold onto the instruction (and scalar operands)
>> only waiting for the LOOP instruction to signal "another loop", and
>> wait for any dynamic operand(s). Think:: multi-fire RS.
>
> another way to think of it is, you have a loop short enough to identify
> the start and end, therefore you micro-code back-end SIMD and
> multi-issue to the back-end *implicitly*... all without actually needing
> any EXPLICIT SIMD instructions.
>
> separate in your mind "SIMD front-end ISA" from "SIMD back-end micro-architecture"
> and you will do fine.
>

Or, one could encode it via prefix encodings, making all of the SIMD ops
64-bit or longer...

Pack4x32 prefix + FADD = Packed 4x Binary32 FADD.

Then one defines which combinations are or are not allowed.


If I were to do this in BJX2, it would probably consist of taking some
instructions which are not allowed in a WEX encoding, and redefining
their use in a WEX encoding as encoding a prefix modifier.

This is possible as I don't really want to burn much more of the 32-bit
encoding space on SIMD ops (as well, most SIMD ops on BJX2 generally do
not allow for operation in bundles, so not being able to encode
them in a bundled form would not be a huge loss).


>>> What about branches?
>> <
>> One can use predication in vectorized loops but not branches.
>> HW assumes that all taken control transfers terminate the loop.
>> HW assumes predication provides the if-then-else within the loop.
>> <
>>> Nested loops?
>> <
>> There is no nesting of vectorization. outer loops run scalar.
>
> this is possible with SVP64, by storing Vectorisation State on the stack,
> but SVP64 is a whole different ballgame and a different paradigm.
>

OK.

> l.
