Why My 66000 is and is not RISC

MitchAlsup

Jun 22, 2022, 9:03:25 PM

I could not find the question asking me to make a list of why My 66000
instruction set architecture is like and unlike the tenets of the original
RISC. So I spent some time looking up what the internet is currently saying
about RISCs. There is a short list, but I will start with a few statements
from Hennessy and Patterson::

Hennessy:: The goal of any instruction format should be: 1. simple decode,
2. simple decode, and 3. simple decode. Any attempts at improved code
density at the expense of CPU performance should be ridiculed at every
opportunity.

Patterson:: more is not better -- microcode is bad
Subroutines need low overhead

RISC axioms:
a) the ISA is primarily designed to make the pipeline simple.
b) the ISA is primarily designed as a target for compilers.
c) instructions only exist if they add performance.
d) frequently accessed data is kept in registers.

RISC tenets:
a) 1 word == 1 instruction
b) 1 instruction flows down the pipeline in 1 cycle
c) 1 instruction can cause 0 or 1 exception
d) instruction encoding uses few patterns
e) there is a large uniformly addressable register space

So where does My 66000 ISA stand with respect to these axioms and
tenets::

RISC axioms: My 66000 ISA embodies all of the RISC axioms
RISC tenets: My 66000 ISA rejects ½ of RISC tenets

With minor exceptions to both::

My 66000 contains 32×64-bit general purpose registers. Some might
think this is too few and a FP register file should be added. Looking
at code such as BLAS, Livermore Loops, and Linpack indicates otherwise
-- as long as one assumes some hints of OoO pipelining. Looking at
various C libraries this seems perfectly sufficient.

My 66000 ISA contains 6 decoding patterns; 1 for each of
{instructions with 16-bit immediates, instructions with 12-bit
immediates, scaled memory reference, 2-operand reg-reg,
1-operand reg-reg, 3-operand reg-reg }

The 12-bit immediate format is used for shift instructions and
for Predicate instructions, and is positioned such that predicate
instructions are only 1 bit different from their corresponding
branch instruction. This saves 6×16-bit immediate encodings.

Scaled memory reference, 1-operand, 2-operand, 3-operand
all have access to 32-bit or 64-bit immediates/displacements
in substitution for a register. This eliminates any need to use
instructions or waste registers pasting constants together.

1-operand, 2-operand, 3-operand instructions all have sign control
over their operands. There is no SUB instruction; My 66000 uses
ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
instructions from execution. The 2-operand group allows the
5-bit register specifier to be used as a 6-bit sign extended
immediate, making ADD Rd,#1,-Rs2 easily encoded.
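
As a sketch of what the sign control buys at the source level (my
illustration, not part of the original post; the comments name the single
hypothetical My 66000 instruction each function could compile to):

#include <stdint.h>

/* d = a - b: one ADD Rd,Rs1,-Rs2; no SUB opcode needed. */
int64_t sub(int64_t a, int64_t b) { return a - b; }

/* d = 1 - b: ADD Rd,#1,-Rs2; the 5-bit register specifier doubles
   as a 6-bit sign-extended immediate. */
int64_t rsb1(int64_t b) { return 1 - b; }

/* d = -a: sign control absorbs the negate, so no NEG is issued. */
int64_t neg(int64_t a) { return -a; }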

There are Compare instructions that return a bit-vector of everything
the compare circuitry can determine, including range checks like:
0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
+zero, +denormal, +normal, +infinity, SNaN, QNaN}. I remain tempted
to add "any byte equal", "any halfword equal", "any word equal".

There are 2 kinds of conditional flow: branching and predication, and
each has 2 principal kinds of instructions:: the condition is determined
from a single bit in a register, or the condition is determined by comparing
a register with 0. In addition there are unconditional branches, jumps,
and a special addition supporting PIC for method calls and switches.
Compare-to-zero-and-branch can access certain HW-known information
that is not capable of being stored in an ISA register--this includes things
like a query to the Memory Unit asking if it has seen any interference
between the start of an ATOMIC sequence and "now". Exception returns,
interrupt returns, and std. returns are also encoded here.

Memory reference instructions enable building of ATOMIC primitives
that can touch as many as 8 cache lines of data in a single ATOMIC
event. This is equivalent to the MIPS LL and SC except it operates
over much larger chunks of data. This is sufficient to move an entry
of a shared data structure from one place to another place in a single
event. This minimizes the number of ATOMIC events that are needed,
and comes with guarantees of forward progress.
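
A hedged sketch of the kind of operation this enables: moving a
doubly-linked-list node between lists touches up to 6 cache lines, which
fits inside one event. The esmLOCK()/esmINTERFERENCE() names below are
hypothetical intrinsics invented for illustration, not the published
programming model:

#include <stddef.h>

typedef struct Node { struct Node *prev, *next; } Node;
typedef struct List { Node *head; } List;

/* Hypothetical intrinsics: esmLOCK() marks a load as participating in
   the ATOMIC event; esmINTERFERENCE() asks the Memory Unit whether any
   3rd party has touched the participating lines. */
extern Node *esmLOCK(Node *loaded);
extern int   esmINTERFERENCE(void);

/* Move node n to the front of dst in one ATOMIC event; up to 6 cache
   lines participate (n, n->prev, n->next, dst, and the old head). */
void move_node(Node *n, List *dst)
{
    for (;;) {
        Node *p = esmLOCK(n->prev);
        Node *s = esmLOCK(n->next);
        Node *h = esmLOCK(dst->head);
        if (esmINTERFERENCE())      /* somebody intervened: retry   */
            continue;
        p->next = s;                /* unlink from the source list  */
        s->prev = p;
        n->next = h;                /* relink at the destination    */
        n->prev = NULL;
        if (h) h->prev = n;
        dst->head = n;              /* final store commits the event */
        return;
    }
}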

The ST instruction can store a constant in either 5-bit sign extended
form, or in 32-bit or 64-bit forms. No need to put a constant into a
register in order to ST it to memory. This complements the ability
to use 32-bit or 64-bit displacement constants.

There are 5 "special" memory reference instructions:: ENTER is used
to setup a new stack, and save registers, EXIT is used to tear down the
stack and restore registers, LDM loads multiple registers, STM stores
multiple registers, and MM moves data from memory to memory.
MM has the property that both cached and uncached memory smaller
than a page is moved as single ATOMIC transfer. {PCIe can do this,
so should CPUs attached to PCIe peripherals.} There is expected to
be a sequencer in the memory unit that performs these out of the
data-path.

The Floating Point group includes Transcendental instructions.
Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
that are only 1 constant different in the calculations. Ln2 takes
only 14 cycles, sin takes 19 cycles. These are included because
they actually do improve performance.

Conversions between FP and FP or FP and INT are provided by
1 instruction (CVT) which has 49 variants to deal with 5 specified
rounding modes and 1 implied rounding mode (current) any time
a rounding could transpire. This falls into the category of "once
you have the HW to do <say> ANINT (of FORTRAN) you have
95% of the logic to do them all".
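
For a sense of the variety one CVT with a rounding-mode field covers,
here are the C-level operations it would subsume (standard C99 calls; the
point is that one rounding datapath serves them all):

#include <math.h>

/* Each line is a separate C operation; on My 66000 each would be one
   CVT variant, all sharing the same rounding hardware. */
void roundings(double x, double r[5], long i[2])
{
    r[0] = round(x);   /* nearest, ties away from zero (FORTRAN's ANINT) */
    r[1] = rint(x);    /* the implied (current) rounding mode            */
    r[2] = trunc(x);   /* toward zero                                    */
    r[3] = floor(x);   /* toward -infinity                               */
    r[4] = ceil(x);    /* toward +infinity                               */
    i[0] = lrint(x);   /* FP -> INT in the current mode                  */
    i[1] = (long)x;    /* FP -> INT, truncating                          */
}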

The exception model is based on message passing (as are SVCs),
rather than wandering through the high-level OS exception
dispatcher. This model supports threads (processes or tasks)
that are paranoid of the OS looking at their data (such as banking
applications running on a home PC), and can indeed restrict the
OS from looking at the address space.

I/O devices are virtualized, and operate in the virtual address
space of the originating requestor. So while the I/O device can DMA
directly into the paranoid application's address space, and while the OS
can verify the given space and bounds are acceptable, the OS cannot
look into that address space. This gets rid of the need for a secured
mode of operation.

Deferred procedure calls are handled as messages (argument
setup + 1 instruction) with continuation. The messaging sub-system
operates over both HyperVisor and GuestOS domains simultaneously.
Anyone with a "method" can call that method and get a response
even if that method is running under a different GuestOS.

There is a 66-bit remapped address space--any thread can access
64 bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
and ROM}. The address space is configured to efficiently transport
requests over a significant network (a la HyperTransport and Intel's
equivalent). DRAM is cache coherent, configuration is strongly ordered,
MMIO is sequentially consistent, ROM is "lax".

The system repeater transports requests from chip to chip, and
amalgamates coherence requests so that the originator counts
responses from cores on his chip, and the number of chips in
the system (rather than counting from every core).

Memory management cannot be turned off--My 66000
implementations come out of reset with the MMUs turned on. HostBridge
is configured with a MMU/TLB that uses exactly the same tables as
CPUs and can share tables as applicable. Levels in the virtual
address space translations can be skipped! So an application as
simple as 'cat' can be managed with a single page of translation
overhead.

Memory management is inherently HyperVisor/GuestOS. Privilege
is determined by the assortment of root pointers in use on a per
invocation basis.

GuestOS can activate a thread (taking it from a waiting state to
running in a core) in a single instruction and remotely. So can
HyperVisor.

Finally, there is no notion of one thread morphing into a different
thread over a series of instructions manipulating control registers
one by one. For example: an ISR cleanup handler takes a thread
off a wait state queue, places it on a run state queue, and signals
GuestOS to see what threads should be running "right now". This
is all 1 instruction and 1 cycle as far as the core performing the
instruction sees.

My 66000 is not just another ISA, it is a rethink of most of the components
that make up a system. A context switch from one thread to another
within a single GuestOS is 10 cycles. A context switch from one thread
to a thread under a different GuestOS remains 10 cycles. The typical
current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
across GuestOSs.

Oh, and BTW, the FP transcendentals are patented.

Terje Mathisen

Jun 23, 2022, 2:00:14 AM

MitchAlsup wrote:
> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".

If you can add the in-reg SIMD compare ops without slowing stuff down,
please do so!

Even having VVM, if you can identify the final \0 byte anywhere in a
64-bit reg, then that's a win for lots of code.
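
For reference, the portable C fallback such an op would replace is the
classic SWAR zero-byte test (a sketch; function names are illustrative):

#include <stdint.h>

/* Nonzero iff any byte of v is 0x00 -- the classic SWAR trick. */
static inline uint64_t any_byte_zero(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* "Any byte equal to c" reduces to the zero test after an XOR broadcast. */
static inline uint64_t any_byte_eq(uint64_t v, uint8_t c)
{
    return any_byte_zero(v ^ (0x0101010101010101ULL * c));
}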

That said, just having your current VVM setup would obviate the need for
SIMD style ops in almost all programs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Marcus

Jun 23, 2022, 3:01:43 AM

On 2022-06-23, Terje Mathisen wrote:
> MitchAlsup wrote:
>> There are Compare instructions that return a bit-vector of everything
>> the compare circuitry can determine, including range checks like:
>> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
>> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
>> to add "any byte equal", "any halfword equal", "any word equal".
>
> If you can add the in-reg SIMD compare ops without slowing stuff down,
> please do so!

In-reg SIMD can be useful. In MRISC32 you can do:

seq.b r2, r1, z ; Byte-wise compare r1 to zero, "Set if EQual"
bz r2, foo1 ; Branch if no byte equal (mask zero)
bnz r2, foo2 ; Branch if any byte equal (mask not zero)
bs r2, foo3 ; Branch if all bytes equal (mask set)
bns r2, foo4 ; Branch if any byte not equal (mask not set)

...and similar with seq.h for half-words. There are also inequality and
gt/lt comparisons, for instance.

Note: "Set" means all bits of the byte/half-word/word are 1. The
opposite (false) outcome of the set instructions is the all bits of the
byte/half-word/word are 0.

The cute part is that I did not have to add special "SIMD" branch
instructions, since the same instructions make sense for both packed and
unpacked comparison results.

/Marcus

MitchAlsup

Jun 23, 2022, 3:30:44 PM

In today's installment I touch on things about My 66000 not covered above.

My 66000 ISA requires an instruction buffer and a 2 stage instruction
processing pipeline I call PARSE and DECODE. Hennessy would be booing
at this point. However, using this, I get branch overhead down to 0.03 cycles
per taken branch without having any delay slot. {This also makes a unified
L1 cache feasible. But since Fetch and MemRef are so far apart on the die
My implementations have chosen not to utilize this capability.}

PARSE finds the instruction boundaries (main job), scans ahead for branches,
determines which function units are needed, and looks for CoIssue opportunities.
The scanned-ahead branches are processed in parallel by DECODE to fetch branch
targets even before the branch instruction is executed. So if a taken prediction
is made, the instructions on the taken path are already ready to enter execution.
PARSE identifies immediates and displacements and cancels register port requests,
providing opportunities for ST to read the register file.

DECODE processes the instructions from PARSE, accesses the register file,
computes forwarding, and starts instructions into the execution pipeline.
DECODE routes immediates and displacements to the required instruction.
ST instructions pass through DECODE twice: the 1st time is for AGEN, the
2nd time is for ST.data when a register file port is available.

---------------------------instruction stuff-----------------------------------------------------------

The shift instructions have 2×6-bit fields dealing with the shift amount and
the width of data being shifted. These are used to access odd-sized data
(a la EXTRACT) and to SMASH data calculated at "machine" size back down
into containers of "language" size, so the value cannot lie outside the
range of its container. When the width field is 0 it is considered
to be 64 bits. When encoded as an immediate, the 2 fields are back-to-back;
when found in a register there are 26 bits separating the 2 fields. In
data<38:32>, both 1000000 and 0000000 are considered to be 64 bits, while
1xxxxxx with any of the x's non-zero is considered an Operand exception.
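
A C sketch of the two fields' semantics as I read them (offset and width,
width 0 meaning 64; my illustration, function names invented):

#include <stdint.h>

/* EXTRACT-like: take 'width' bits starting at 'offset'; width 0 => 64. */
static uint64_t extract_bits(uint64_t x, unsigned offset, unsigned width)
{
    uint64_t mask = width ? ((1ULL << width) - 1) : ~0ULL;
    return (x >> offset) & mask;
}

/* SMASH-like: clamp a machine-width value back into a 'width'-bit signed
   container (width 0 => 64, i.e. a no-op). */
static int64_t smash(int64_t x, unsigned width)
{
    unsigned s = width ? 64 - width : 0;
    return (int64_t)((uint64_t)x << s) >> s;   /* assumes arithmetic >> */
}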

The multiplex instruction MPX basically allows selecting bits from
a pair of registers based on another register:: ( a & b ) | ( ~a & c ); however,
it has other flavors to provide ( !!a & b ) | ( !a & c ), which is CMOV, and by
using the immediate encodings in My 66000 it provides MOV Rd,#IMM32 and
MOV Rd,#IMM64 along with MOV Rd,Rs1 and MOV Rd,Rs2. These fall out
for free, saving MOV opcodes elsewhere.
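
In C, the two flavors look like this (names mine):

#include <stdint.h>

/* Bitwise multiplex: each result bit comes from b where a is 1, else c. */
static uint64_t mpx(uint64_t a, uint64_t b, uint64_t c)
{
    return (a & b) | (~a & c);
}

/* The CMOV flavor first collapses a to all-ones or all-zeros. */
static uint64_t cmov(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t m = a ? ~0ULL : 0ULL;   /* !!a broadcast to a full mask */
    return (m & b) | (~m & c);
}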

Vectorization: My 66000 ISA contains loop vectorization. This allows
vectorized loops to perform several iterations per cycle; even
1-wide machines can perform at 32+ instructions per cycle. My main
(as yet unproven) hope is that this takes the pressure off of the design
width. The basic argument is as follows:
a) 1-wide machines operate at 0.7 IPC
b) 2-wide SuperScalar machines operate at 1.0 IPC
c) GBOoO machines operate at 2.0 IPC
d) programs spend more than ½ their time in loops.
So, if one can get a 2× performance advantage on the 1-wide machine,
this puts it within spitting distance of the GBOoO machine, which in turn
means the Medium OoO machine can be competitive with the GBOoO
machine at significantly lower {cost, design time, area, power}.

AND while investigating loop vectorization, I discovered that a RISC
pipeline with a 3R-1W register file can perform at 1.3 IPC. Branch
instructions (20%) do not use the result register, ST instructions
(10%) can borrow the write port AFTER cache tag and translation
validations, AND in the general code I have seen there is significant
opportunity to perform write-elision in the data path, freeing up even
more ports. This, again, takes pressure off the width of the design.
So, with vectorization, a 3 (or 4)-wide machine is competitive with
a 6-wide machine.

None of this prevents wide GBOoO designs, or makes them more difficult.

----------------------instruction modifiers------------------------------------------

CARRY is the first of the Instruction-Modifiers. An instruction-modifier
supplies "bits" for several future instructions so that one does not need
the Cartesian product of a given subset encoded in the ISA. Thus, there
are shift instructions, and when used with CARRY these perform shifts
as wide as you like: 128, 256, 512, ...; no need to clog up the encoding
space for lightly used but necessary functionality. Even in the FP arena,
CARRY provides access to exact FP arithmetic.

CARRY provides access to multiprecision arithmetic both integer and FP.
CARRY provides a register which can be used as either/both Input and Output
to a set of instructions. This provides a link from one instruction to another
where data is transmitted but not encoded in the instruction itself.
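
For contrast, here is the multi-word add that CARRY makes implicit, written
as ordinary portable C with the carry chain computed by hand (with CARRY
the loop body would be nothing but ADDs, the modifier routing the carry
between them):

#include <stdint.h>

/* 256-bit add: the explicit carry propagation CARRY would hide. */
void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = s < a[i];          /* carry out of the add        */
        r[i]  = s + carry;
        carry = c1 | (r[i] < s);         /* carry out of the carry-in add */
    }
}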

Since we are in the realm of power limited designs, My 66000 ISA has an ABS
instruction. Over on the integer side, this instruction can be performed
by co-opting the sign control built into the data path and be "executed"
without taking any pipeline delay (executes in zero cycles). Over on the
FP side this never adds any latency (executes in zero cycles). ABS always
takes less power than performing the operation in any other way.

DBLE is an instruction modifier that supplies register encodings and
adds 64 bits to the calculation width of the modified instruction. Applied
to a FP instruction: DBLE Rd1,Rs11,Rs21,Rs31; FMAC Rd2,Rs12,Rs22,Rs32
we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
and presto: we get FP128 by adding exactly 1 instruction; the compiler
can pick any 8 registers it desires, alleviating register allocation concerns.
DBLE is a "get by" kind of addition, frowned upon by Hennessy.

I can envision a SIMD instruction modifier that defines the SIMD parameters
of several subsequent instructions and allows 64-bit SIMD to transpire.
I am still thinking about these. What I cannot envision is a wide SIMD
register file--this is what VVM already provides.

These instruction-modifiers, it seems to me, are vastly more efficient
than throwing hundreds to thousands of unique instructions into the ISA.
Especially if those unique instructions <on average> are not used
"that much".

-----------------------------Safe Stack--------------------------------------------------------

Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}.
When Safe Stack is in use, the return address goes directly to the Safe
Stack, and the return address comes directly off the Safe Stack. Preserved
registers are placed on the Safe Stack {ENTER} and their register values
(conceptually) set to 0. Safe Stack is in normal thread memory but
the PTEs are marked RWE = 000 so any access causes page faults.
EXIT reloads the preserved registers from the Safe Stack and transfers
control directly back to the caller. When Safe Stack is not in use, R0
is used to hold the return address. Properly compiled code runs the
same whether Safe Stack is on or off, so one can share dynamic libraries
between modes.

Safe Stack monitors the value in SP and KILLs lines that no longer
need to reach out into the cache hierarchy, Safe Stack can efficiently
use Allocate memory semantics. Much/most of the time, nothing
in safe stack leaves the cache hierarchy.

Buffer overflows on the "stack" do not corrupt the call/return flow of
control. ROP cannot happen as the application has no access to the Return
Address. The application cannot see the values in the preserved registers,
augmenting safety, and certainly cannot modify them.

-------------------------------ABI----------------------------------------------------------------------

Subroutine Calling Convention {A.K.A. ABI}:
Registers R1..R8 contain the first 8 arguments to the subroutine.
SP points at argument[9].
R9..R15 are temporary registers.
R16..R29 are preserved registers.
R30=FP is a preserved register, used as a Frame Pointer when
  language semantics need one.
R31=SP is a preserved register, used as the Stack Pointer. SP must
  remain doubleword aligned at all times.

The ABI is very RISC.

So, let's say we want to call a subroutine that wants to allocate 1024
bytes on the stack for its own local data, is long running and needs
to preserve all 14 preserved registers, and is using a FP along with
SP. Let us further complicate the matter by stating this subroutine
takes a variable number of arguments. Entry Prologue:

ENTRY subroutine_name
subroutine_name:
ENTER R16,R8,#(1024 | 2)

At this point the register-passed arguments have been saved adjacent to the
memory-passed arguments, FP is pointing at the "other" end of local
data on the stack, 1024 bytes have been allocated off the SP after
pushing the registers, and the old FP has been saved and the new FP set up.
{This works both with and without Safe Stack}

Your typical RISC-only ISA would require at least 29 instructions to
do this amount of work getting into the subroutine, and another 17
getting out. If the ISA has both INT and FP register files 29 becomes 37.

The same happens in the Epilogue: 1 instruction.

While the ABI is very RISC, the ISA of the Prologue and Epilogue is not.

As a side note: My 66000 is achieving similar code density as x86-64.

A few other interesting side bits:------------------------------------------------------------

LDM and STM to an unCacheable address are performed as if ATOMIC::
that is:: as a single bus transaction. All interested 3rd parties see the
memory before any writes have been performed or after all writes
have been performed. A device driver can read several MMIO device
control registers and know that nobody else in the system has access
to the device control registers that could cause interference. A device
driver can store multiple control register locations without interference.
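
A sketch of what that buys a driver (the device layout and names below are
hypothetical, purely for illustration):

#include <stdint.h>

/* Hypothetical device with 4 neighboring 64-bit control registers. */
typedef struct {
    volatile uint64_t status, count, head, tail;
} DevRegs;

/* With plain loads these are 4 separate uncacheable bus transactions,
   and a 3rd party could slip writes in between them; an LDM covering
   the same range is one transaction, so the driver gets a consistent
   snapshot of the device state. */
void snapshot(DevRegs *dev, uint64_t out[4])
{
    out[0] = dev->status;
    out[1] = dev->count;
    out[2] = dev->head;
    out[3] = dev->tail;
}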

There is a page ¿in ROM? known to contain zeros. A Memory Move
instruction can cause a page accessing this ¿ROM? data to be zeroed
without even bothering to access ¿ROM?--and the entire page is zeroed
at the target. Thus, pages being reclaimed to the free pool are but 1
instruction away from being in the already zeroed page pool. Zeroing
pages is performed at the DRAM end of the system (coherently). And
no <deleterious> bus activity is utilized.


Brett

Jun 23, 2022, 7:28:53 PM

X86-64 has crap code density, your one instruction stack save restore alone
should make you significantly better, unless perhaps you have gone 32+32.

Add some accumulator ops and most instructions will fit in 16-bit ops with
ease, and you have the extra decode stage to do it anyway.

I would argue that 8 bit opcodes are best when you have an accumulator in
your 32 register RISC design, but that is a bridge too far for most.

How big is the code store needed for an IOT (Internet Of Things smart
toaster) code stack? And what is the savings for the next size down?

BGB

Jun 23, 2022, 9:08:33 PM

On 6/22/2022 8:03 PM, MitchAlsup wrote:
> I could not find the question asking me to make a list of why My 66000
> instruction set architecture is like and unlike the tenets of the original
> RISC. So I spent some time looking up what the internet is currently saying
> about RISCs. There is a short list, but I will start with a few statements
> from Hennessey and Paterson::
>
> Hennessey:: The goal of any instruction format should be: 1. simple decode,
> 2. simple decode, and 3. simple decode. Any attempts at improved code
> density at the expense of CPU performance should be ridiculed at every
> opportunity.
>
> Patterson:: more is not better -- microcode is bad
> Subroutines need low overhead
>
> RISC axioms:
> a) the ISA is primarily designed to make the pipeline simple.
> b) the ISA is primarily designed as a target for compilers.
> c) instructions only exist if they add performance.
> d) frequently accessed data is kept in registers.
>

BJX2 generally upholds the above.

While some instructions are pretty niche, most still tend to have
use-cases, and I am mostly trying to avoid adding stuff that is
(completely) useless.


> RISC tenets:
> a) 1 word == 1 instruction
> b) 1 instructions flows down the pipeline in 1 cycle
> c) 1 instruction can cause 0 or 1 exception
> d) instruction encoding uses few patterns
> e) there is a large uniformly addressable register space
>

My case, 3 out of 5.

a, 16/32 and bundle-encodings break this one.

d, Some extra complexity exists due to the lack of an architectural Zero
Register and similar, and some instructions (early on) which ended up
with both 2R and 3R encodings.

Early on, I wasn't confident, for example, that "ADD R4, R5" and "ADD
R5, R4, R5" would have been semantically equivalent in all cases.


There were some other cases (Mostly 32-bit 2R Load/Store variants) which
were dropped due to being entirely redundant with the 3R encodings (or
which became redundant once predication was added).

Some other parts of the ISA also ended up being dropped and then later
re-added a few times before becoming more-or-less permanent (and some
other features are in limbo due to not really adding enough to
justify their existence).


> So where does My 66000 ISA stand with respect to these axioms and
> tenets::
>
> RISC axioms: My 66000 ISA embodies all of the RISC axioms
> RISC tenets: My 66000 ISA rejects ½ of RISC tenets
>
> With minor exceptions to both::
>
> My 66000 contains 32×64-bit general purpose registers. Some might
> think this is too few and a FP register file should be added. Looking
> at code such as BLASS, Livermore Loops, Linpack indicates otherwise
> -- as long as one assumes some hints of OoO pipelining. Looking at
> various C libraries this seems perfectly sufficient.
>

My case: 32|64 x 64-bit.

I am still on the fence as to whether 32 GPRs is "fully sufficient", or
whether 64 GPRs can offer enough gain (in certain use-cases) to justify
its existence. It "kinda helps" for TKRA-GL but is seemingly kinda moot
for pretty much everything else.


The way the encodings for the 64 GPR case are handled is a bit hacky,
but it was a tradeoff (I came up with something which could be done
without breaking binary compatibility or requiring a separate operating
mode). Ironically, everything still works OK so long as "most of the
code" sticks to only using the low 32 GPRs (otherwise, some of the seams
might start to show).


> My 66000 ISA contains 6 decoding patterns; 1 for each of
> {instructions with 16-bit immediates, instructions with 12-bit
> immediates, scaled memory reference, 2-operand reg-reg,
> 1-operand reg-reg, 3-operand reg-reg }
>

Hmm (8 major for 32-bit):
FZnm_ZeoZ //3R "Rm, Ro, Rn"
FZnm_ZeZZ //2R "Rm, Rn"
FZnm_Zeii //3RI (Imm9/Disp9), "Rm, Imm9, Rn" / "(Rm, Disp9), Rn"
FZnZ_Zeii //2RI (Imm10), "Imm10, Rn"
FZZZ_ZeoZ //1R (Ro treated as Rn for these)
FZZn_iiii //2RI (Imm16), "Imm16, Rn"
FZdd_Zddd //Disp20 (Branch)
FZii_iiii //"LDIz Imm24, R0"

Add a few more if one counts the 16-bit ops:
ZZnm //2R
ZZni //2RI (Imm4)
ZZnZ //1R
Znii //2RI (Imm8)
ZZdd //Disp8 (Branch)

The Jumbo and Op64 encodings may or may not be considered new forms,
however they don't actually add "new" instruction-forms per-se, but
rather modify the existing encodings in predefined ways (and reuse the
existing 32-bit decoder; just with more bits "glued on" to the instruction).

One could potentially also interpret the 32-bit encodings as zero-padded
versions of a longer internal encoding space:
FEii_iiii_FZnm_Zeii //3RI, "Rm, Imm33, Rn"
...

With a few special cases, eg:
FEii_iiii_FAii_iiii //"LDIZ Imm48, R0"
FFii_iiii_FAii_iiii //"BRA Abs48"


There are more forms if one considers "minor" patterns, but these don't
really affect instruction encoding, so much as how the various parts are
interpreted and mapped to the internal pipeline:
Logically, each instruction is decoded as if it had:
3 read ports, 1 write port;
A 33-bit immediate/displacement field;
Op / Sub-Op;
...

This then combines with an outer stage that deals with the bundle as a
whole, mapping SIMD ops to two lanes, along with Abs48 and Imm64
encodings (where the immediate can't fit into a single pipeline lane).

The output of all this being the configuration for the entire pipeline.


> The 12-bit immediate format is used for shift instructions and
> for Predicate instructions and positioned such that predicate
> instructions are only 1-bit different than their corresponding
> branch instruction. This saves 6×16-bit immediate encodings.
>

Differs in my case:
Shifts and friends use Imm9 forms;
However, because one doesn't need all 9 bits for a typical shift, had
also kinda shoe-horned SIMD shuffle instructions into the mix as well.

Or, in effect, shuffle can be imagined sort of like a conjoined twin
stuck onto the shift instruction (and a variable shift imagined as
masking-off the bit that causes it to able to behave like a shuffle).

Or, one can also imagine that there could have been an alternate
universe where passing a sufficiently out-of-range value to the shift
instruction caused it to shuffle the value instead...


Predicate instructions work very differently in my case, having their
own copy of the 32-bit encoding space which mirrors the format of the
normal opcode space (just replacing the WEX bit with a True/False bit),
and the encoding spots that would have normally encoded Imm24 and Jumbo
being repurposed as Predication+WEX / "PrWEX" (but only applying to a
subset of the ISA).


> Scaled memory reference, 1-operand, 2-operand, 3-operand
> all have access to 32-bit or 64-bit immediates/displacements
> in substitution for a register. This eliminates any need to use
> instructions or waste registers pasting constants together.
>

Via Jumbo, these can all expand to 33 bits.

The 64-bit cases are a bit more limited, but not usually a huge issue.

There are also some Imm56 encodings "on paper" (these are in a similar
limbo as the 48-bit instruction encodings).

Seemingly the vast majority of what one needs a larger immediate for can
be handled via Imm33, where, say, only about 4% of the constants
actually go outside of this limit (the vast majority of these being
either MMIO pointers or irrational floating-point constants).

The Imm56 cases looking like they would be too rare to really be worth
bothering with at present.


> 1-operand, 2-operand, 3-operand instructions all have sign control
> over their operands. There is no SUB instruction My 66000 uses
> ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
> instructions from execution. The 2-operand group allows the
> 5-bit register specifier to be used as a 6-bit sign extended
> immediate, making ADD Rd,#1,-Rs2 easily encoded.
>

No equivalent in my case.


> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".
>

No equivalent.

I did the same thing as SuperH here:
CMPxx instructions twiddle the SR.T bit;
Branches / Predication / ... all operate off the SR.T bit.

Ironically, because of the way Verilog works, so much stuff hanging off
a single bit causes it to get something like 1000x more expensive.


> There are 2 kinds of conditional flow: branching and predication and
> each has 2 principle kinds of instructions:: condition is determined
> from a single bit in a register, or condition is determined by comparing
> a register with 0. In addition there are conditionless branches, jumps,
> and a special addition supporting PIC for method calls and switches.
> Compare-to-zero and branch can access certain HW know information
> that is not capable of being stored in a ISA register--this includes things
> like a query to the Memory Unit asking if it has seen any interference
> between the start of an ATOMIC sequence and "now". The exception
> and interrupts and std. return are also encoded here.
>

Internally, conditional branches are equivalent to predicated branches
in my case.

The partial extension is RISC-V style compare-and-branch instructions,
which have sort of a "shortcut" between the ALU and branch-instruction
logic (the ALU behaves as if it were performing a CMPxx, and the EX1
unit like it were doing a Branch, with the short-cut signal taking the
place of the SR.T bit).


These are effectively mandatory for RISC-V decoder support, but still
regarded as optional in BJX2 (and pretty rarely see much use in
practice, if compared with CMPxx+Bx).

It is possible that a cheaper alternative might have been to rig up
something in the decoder to allow the Bxx instruction to have been split
into a CMPxx+Bx pair in the pipeline (but, I lack a mechanism for this
at present).


> Memory reference instructions enable building of ATOMIC primitives
> that can touch as many as 8 cache lines of data in a single ATOMIC
> event. This is equivalent to the MIPS LL and SC except it operates
> over much larger chunks of data. This is sufficient to move an entry
> of a shared data structure from one place to another place in a single
> event. This minimizes the number of ATOMIC events that are needed,
> and comes with guarantees of forward progress.
>

No equivalent.


> The ST instruction can store a constant in either 5-bit sign extended
> form, or in 32-bit or 64-bit forms. No need to put a constant into a
> register in order to ST it to memory. This is along with the ability
> to use 32-bit or 64-bit displacement constants.
>

No equivalent.
Multi-op sequence required.


> There are 5 "special" memory reference instructions:: ENTER is used
> to setup a new stack, and save registers, EXIT is used to tear down the
> stack and restore registers, LDM loads multiple registers, STM stores
> multiple registers, and MM moves data from memory to memory.
> MM has the property that both cached and uncached memory smaller
> than a page is moved as single ATOMIC transfer. {PCIe can do this,
> so should CPUs attached to PCIe peripherals.} There is expected to
> be a sequencer in the memory unit that performs these out of the
> data-path.
>

No equivalent.

All this stuff needs to be handled manually via Load/Store sequences.
However, there is a 128-bit Load/Store that is also used (primarily) to
load/store pairs of registers in prolog/epilog sequences. This was
actually its primary purpose, but 128-bit SIMD was a secondary use-case,
along with it being faster for 'memcpy()' and similar than using
64-bit load/store instructions.


> The Floating Point group includes Transcendental instructions.
> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> that are only 1 constant different in the calculations. Ln2 takes
> only 14 cycles, sin takes 19 cycles. These are included because
> they actually do improve performance.
>

No equivalent, nearly all math functions done in software in my case.

Originally, there were no FDIV or FSQRT instructions either, but these
exist now.

Current timings are:
FDIV: 130 cycles
FSQRT: 384 cycles


The trig functions generally run from around 500 to 1000 cycles or so
(via unrolled Taylor expansion).

There were some older versions (that came with the C library) which
didn't unroll stuff (naive algorithm); these versions seem to weigh in
at closer to around 50000-200000 clock cycles or so (in cases where they
didn't get stuck in an infinite loop).

I replaced these versions with the unrolled versions because they were
basically unusable.


> Conversions between FP and FP or FP and INT are provided by
> 1 instruction (CVT) which has 49 variants to deal with 5 specified
> rounding modes and 1 implied rounding mode (current) any time
> a rounding could transpire. This falls into the category of "once
> you have the HW to do <say> ANINT (of FORTRAN) you have the
> 95% of the logic to do them all".
>

Separate instructions in my case.

Most FP<->FP cases are dedicated converters.

The INT<->FP cases are routed through FADD (noting as how one can do
some exponent tricks and get the FADD logic to do most of the heavy
lifting).

Rounding for the converters is generally hard-wired, usually either
Approx-RNE (scalar path), or Truncate (SIMD paths).


> The exception model is based on message passing (as is SVCs),
> rather than wandering through the high-level OS exception
> dispatcher. This model supports threads (processes or tasks)
> that are paranoid of the OS looking at their data (such as banking
> applications running on a home PC), and can indeed restrict the
> OS from looking at the address space.
>

My case:
Exception Code: { Address, Tag }
Tag is a 16-bit number
(15:12): Category
(11: 8): Target Core
( 7: 0): Magic Number
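
(A quick decode of that tag layout in C, field names per the list above:)

#include <stdint.h>

typedef struct { unsigned category, target_core, magic; } ExTag;

static ExTag decode_tag(uint16_t tag)
{
    ExTag t;
    t.category    = (tag >> 12) & 0xF;    /* (15:12): VBR table index */
    t.target_core = (tag >>  8) & 0xF;    /* (11: 8): which core      */
    t.magic       =  tag        & 0xFF;   /* ( 7: 0): magic number    */
    return t;
}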

The category is used to figure out which offset relative to VBR to
branch to:
VBR basically points somewhere in kernel memory;
Holds a small table of branch instructions;
Exception dispatch branches to one of these branch instructions.

When this happens, the CPU also basically saves PC and SR to special
registers, swaps the stack pointers, and switches to a special operating
mode (which blocks further interrupts and disables the MMU and similar).

Interrupts can be routed along the Ring-Bus, as with memory requests,
intended so that CPU cores can throw interrupts at each other.

The 'RTE' instruction is actually implemented as another type of
"special" exception, which basically tries to pass control back to where
the exception came from (or, at least, where the relevant registers say
the exception came from).


> I/O devices are virtualized, and operate on the virtual address
> space of originating requestor. So while the I/O device can DMA
> directly into paranoid application address space, and while OS
> can verify the given space and bounds are acceptable, OS cannot
> look into that address space. This gets rid of the need of a secured
> mode of operation.
>
> Deferred procedure calls are handled as messages (argument
> setup + 1 instruction) with continuation. The messaging sub-system
> operates over both HyperVisor and GuestOS domains simultaneously.
> Anyone with a "method" can call that method and get a response
> even if that method is running under a different GuestOS.
>

No equivalent.

Basic IO range mapping in the physical address space.


No explicit hypervisor support, but shouldn't be too hard in-theory to
fake this part in software.

Would likely be easier than it would be for x86, given the use of a
software managed TLB.


> There is a 66-bit remapped address space--any thread can access
> 64-bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
> and ROM} The address space is configured to efficiently transport
> requests over a significant network (ala HyperTransport and Intel
> equivalent). DRAM is cache coherent, configuration is strongly ordered,
> MMIO is sequentially consistent, ROM is "lax").
>

My case: 48 or 96 bit virtual, 48 bit physical.

MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
and not allow another request to pass until the former request has
completed.

All MMIO accesses are fully synchronous from the L1 cache down to the
target device (unlike normal memory), though this does mean that
accessing MMIO carries a fairly steep performance penalty relative to
normal memory accesses.


It is possible that there could be a sub-mode to mimic full 64-bit on
top of 96-bit, by treating the addressing as (32,64) rather than
(48,48). This mode would not allow shoving tag bits into the high-order
parts of base-registers though (since these bits would be interpreted as
part of the virtual address).

The Full-64 mode would effectively use 64-bit pointers, unlike 96-bit
requiring the use of 128-bit pointers. The high bits (from the quadrant
address) would effectively still function as an address-range selector
(as when using 48-bit addresses in 96-bit mode; the local 48-bit space
being a sort of isolated island within the much bigger 96-bit space).


Well, and B-Trees for address translation, because the memory overheads
of an 8-level page table would suck.


> The system repeater transports requests from chip to chip, and
> amalgamates coherence requests so that the originator counts
> responses from cores on his chip, and the number of chips in
> the system (rather than counting from every core).
>
> Memory management cannot be turned off--My 66000 imple-
> mentations come out of reset with the MMUs turned on. HostBridge
> is configured with a MMU/TLB that uses exactly the same tables as
> CPUs and can share tables as applicable. Levels in the virtual
> address space translations can be skipped! So an application as
> simple as 'cat' can be managed with a single page of translation
> overhead.
>
> Memory management is inherently HyperVisor/GuestOS. Privilege
> is determined by the assortment of root pointers in use on a per
> invocation basis.
>
> GuestOS can activate a thread (taking it from a waiting state to
> running in a core) in a single instruction and remotely. So can
> HyperVisor.
>

Very different in my case.

I can use B-Trees as page-tables, which saves memory, but this is mostly
due to this being all software, and the CPU itself doesn't really care
what the page-table looks like.

Could almost mimic x86 segmented addressing, except that segments need
to be page-aligned. I guess if one had non-page-aligned address
translation, they could rig up the MMU to pretend to be an 8086 or
something, but this would make the TLB's job a little harder (and
require more bits in the TLBE).

Most obvious would be that such a mode would come at the expense of the
ASID, reinterpreting the ASIDs bits as a displacement to add to the
translated address. Would also require physical memory pages to be
contiguous, and probably some extra hackery to deal with A20 wrap-around
behavior (I guess one would have to look into how much MS-DOS software
depends on A20 wraparound working).


> Finally, there is no notion of one thread morphing into a different
> thread over a series of instructions manipulating control registers
> one by one. For example: an ISR cleanup handler takes a thread
> off a wait state queue, places it on a run state queue, and signals
> GuestOS to see what threads should be running "right now". This
> is all 1 instruction and 1 cycle as far as the core performing the
> instruction sees.
>

Yeah, no equivalent.


> My 66000 is not just another ISA, it is a rethink of most of the components
> that make up a system. A context switch from one thread to another
> within a single GuestOS is 10 cycles. A context switch from one thread
> to a thread under a different GuestOS remains 10 cycles. The typical
> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> across GuestOSs.
>
> OH, and BTW, The FP transcendentals are patented.

I would assume you mean FP transcendentals in hardware (in whatever way
they are implemented), as opposed to in-general.

Their existence in things like "math.h" and so on would likely preclude
any sort of patent protection in the "in general" sense.


Very different, I have doubts about how well a lot of this could be
pulled off in a low-cost implementation. Best I can come up with at the
moment would effectively amount to faking it using lots of microcode or
a software-based emulation layer.


I also consider my ISA to be "fairly unique", albeit in different ways
(and a little more conservative in terms of implementation concerns).

MitchAlsup

Jun 23, 2022, 9:17:25 PM

It [the 1-instruction save/restore] is a major contributor to getting code as small as it got.
>
> Add some accumulator ops and most instructions will fit in 16 bits ops with
> ease, and you have the extra decode stage to do it anyway.
<
I looked at this a few years ago and the damage to long-term ISA growth
was catastrophic. As it is, I have nearly ½ of the OpCode space in each
OpCode group left for the future, and can PARSE instructions in 31 gates
with only 4 gates of delay. All that goes out the window with a meaningful
16-bit "extension". I pass.
>
> I would argue that 8 bit opcodes are best when you have an accumulator in
> your 32 register RISC design, but that is a bridge too far for most.
<
My 66000 only has 59 total instructions. What makes you think you need 256?
>
> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?
<
I have absolutely no interest in things that small. IoT devices don't need
a HyperVisor, or even that much of a supervisor. I have no interest in
register sizes smaller than 64-bits. And quite frankly, say you did get a
design that small and into production: you have to sell billions (maybe
trillions) of them at $0.05 to pay for the design team and recurring
engineering expenses.
<
If you do, more power to you.

MitchAlsup

Jun 23, 2022, 9:38:46 PM

On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
> On 6/22/2022 8:03 PM, MitchAlsup wrote:
<snip>
> > The Floating Point group includes Transcendental instructions.
> > Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> > that are only 1 constant different in the calculations. Ln2 takes
> > only 14 cycles, sin takes 19 cycles. These are included because
> > they actually do improve performance.
> >
> No equivalent, nearly all math functions done in software in my case.
>
> Originally, there were no FDIV or FSQRT instructions either, but these
> exist now.
>
> Current timings are:
> FDIV: 130 cycles
> FSQRT: 384 cycles
>
Mc 88100 did these in: / = 56 cycles and SQRT in ~66
Mc 88120 did these in: / = 17 cycles and SQRT in 22
>
> The trig functions generally run from around 500 to 1000 cycles or so
> (via unrolled Taylor expansion).
<
You need to use Chebyshev coefficients--more accurate, sometimes fewer
terms, always better error bounds.
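
(The evaluation structure is the same either way: a Horner polynomial in
x². Only the coefficients change. The C sketch below uses the plain Taylor
values; a Chebyshev/Remez fit would substitute slightly perturbed constants
for a tighter worst-case error:)

/* sin(x) on |x| <= pi/4 as an odd polynomial in x^2, Horner form.
   Coefficients shown are Taylor's -1/3!, +1/5!, -1/7!. */
static double sin_poly(double x)
{
    double x2 = x * x;
    double p = -1.0 / 5040.0;          /* -1/7! */
    p = p * x2 + 1.0 / 120.0;          /* +1/5! */
    p = p * x2 - 1.0 / 6.0;            /* -1/3! */
    return x + x * x2 * p;
}
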
>
<<snip>
> My case: 48 or 96 bit virtual, 48 bit physical.
>
> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
> and not allow another request to pass until the former request has
> completed.
<
What are you going to do when there are 24 CPUs in a system and
everybody wants to write to the same MMI/O page?
>
> All MMIO accesses are fully synchronous from the L1 cache down to the
> target device (unlike normal memory), though this does mean that
> accessing MMIO carries a fairly steep performance penalty relative to
> normal memory accesses.
>
The penalty is inherent in the requirements. However, My 66000 can ameliorate
the latency by grouping multiple writes to neighboring MMI/O control registers
into a single bus transaction. In theory, one can write all the necessary stuff
into the control registers to cause a disk drive to DMA a disk sector to
wherever desired, in a single write transaction to MMI/O and a single DMA
write transaction when the data returns.
>
<
> > My 66000 is not just another ISA, it is a rethink of most of the components
> > that make up a system. A context switch from one thread to another
> > within a single GuestOS is 10 cycles. A context switch from one thread
> > to a thread under a different GuestOS remains 10 cycles. The typical
> > current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> > across GuestOSs.
> >
> > OH, and BTW, The FP transcendentals are patented.
> I would assume you mean FP transcendentals in hardware (in whatever way
> they are implemented), as opposed to in-general.
<
You might be surprised at what was allowed in the claims.
>
> Their existence in things like "math.h" and so on would likely preclude
> any sort of patent protection in the "in general" sense.
>
Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
different because of what one can do inside a HW function unit compared
to what one can do using only instructions) with several unique features.
They even bother to get the inexact bit set correctly.
>
> Very different, I have doubts about how well a lot of this could be
> pulled off in a low-cost implementation. Best I can come up with at the
> moment would effectively amount to faking it using lots of microcode or
> a software-based emulation layer.
>
Microcode generally refers to a control machine interpreting instructions.
Is a function unit run by ROM sequencer microcode ? What if the ROM got
turned into equivalent gates: Is it still microcode, or just a sequencer ?
In any event there are only 3 different sequences used (reminiscent of
Goldschmidt DIV and SQRT sequences).
>
> I also consider my ISA to be "fairly unique", albeit in different ways
> (and a little more conservative in terms of implementation concerns).
<
And hard to read..............

BGB

Jun 24, 2022, 1:15:18 AM

On 6/23/2022 8:38 PM, MitchAlsup wrote:
> On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
>> On 6/22/2022 8:03 PM, MitchAlsup wrote:
> <snip>
>>> The Floating Point group includes Transcendental instructions.
>>> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
>>> that are only 1 constant different in the calculations. Ln2 takes
>>> only 14 cycles, sin takes 19 cycles. These are included because
>>> they actually do improve performance.
>>>
>> No equivalent, nearly all math functions done in software in my case.
>>
>> Originally, there were no FDIV or FSQRT instructions either, but these
>> exist now.
>>
>> Current timings are:
>> FDIV: 130 cycles
>> FSQRT: 384 cycles
>>
> Mc 88100 did these in / = 56 and SQRT in ~66
> Mc 88120 did these in / = 17 and Sqrt in 22

It is partly based on the strategy used:
Rig the FMUL unit into a feedback loop;
Wait N cycles for answer to converge;
Assume it has converged on the answer.

Generally seems to take roughly this long for the algo to converge on
the answer.
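
(That strategy sounds like a Newton-Raphson reciprocal iteration; in C
the per-trip feedback step would look roughly like this, with the seed
assumed to come from a small lookup table:)

/* Newton-Raphson reciprocal: y' = y * (2 - d*y) doubles the number of
   correct bits per trip through the multiplier, matching the
   "FMUL in a feedback loop, wait N cycles" description above. */
static double fdiv_nr(double n, double d, double y0 /* seed ~ 1/d */)
{
    double y = y0;
    for (int i = 0; i < 5; i++)     /* "wait N cycles" == N iterations */
        y = y * (2.0 - d * y);
    return n * y;                   /* n/d once y has converged to 1/d */
}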

When I first re-added FDIV, it was using the same basic algo (just with
slightly different inputs), and took a similar number of clock-cycles.

Then I had the idea that I could tweak a few things in the Shift-Add
integer divider, and get it to also do FDIV. Though, the way it was
rigged up still needs ~ 130 cycles, but 130 is still less than 384.


>>
>> The trig functions generally run from around 500 to 1000 cycles or so
>> (via unrolled Taylor expansion).
> <
> You need to use Chebyshev coefficients--more accurate sometimes fewer
> terms, always better error bounds..


Possible.

There are also a few faster algos, such as "lookup and interpolate",
but, while faster, these don't give sufficient precision to really be a
good option for the "math.h" functions (assumed to be accurate, even if
not the fastest possible).

There is also CORDIC, but I haven't really messed with it.


In any case, unrolled Taylor expansion is a few orders of magnitude
faster than calculating an exponential and factorial and performing a
floating-point divide and similar every time around the loop...

I was not the person who wrote that code originally; not sure why they
wrote it this way.


>>
> <<snip>
>> My case: 48 or 96 bit virtual, 48 bit physical.
>>
>> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
>> and not allow another request to pass until the former request has
>> completed.
> <
> What are you going to do when there are 24 CPUs in a system and
> everybody wants to write to the same MMI/O page ?

It all gets serialized, with each of them accessing it one at a time.

Though, ideally, only device drivers and similar should be accessing
MMIO, so this isn't likely to be a huge issue.

By the time I get to 24 cores, I will probably have come up with a
different solution.


There is also the option of putting device stuff on the ringbus. I had
partly already started going this way for VRAM (it is faster to write to
the framebuffer by going through the RAM interface than by going through
the MMIO interface).

However, for accessing hardware devices, in general, one kind of wants
"slower but strictly synchronous" IO over "faster but chaotic" IO.

For VRAM, it is a little different, because generally one is trying to
push several MB/sec out to the screen and don't really care if things
are strictly in-order (if things arrive in the framebuffer in a slightly
different order than they were stored into the L1 cache, who cares?...).


>>
>> All MMIO accesses are fully synchronous from the L1 cache down to the
>> target device (unlike normal memory), though this does mean that
>> accessing MMIO carries a fairly steep performance penalty relative to
>> normal memory accesses.
>>
> The penalty is inherent in the requirements. However, My 66000 can ameliorate
> the latency by grouping multiple writes to neighboring MMI/O control registers
> into a single bus transaction. In theory, one can write all the necessary stuff
> into the control registers to cause a disk drive to DMA a disk sector wherever
> in a single write transaction to MMI/O and a single DMA write transaction
> when data returns.

I was generally accessing MMIO 32 or 64 bits at a time (depending on the
device).

No DMA at present, pretty much everything is still polling IO and similar.


So, for example, for SDcard:
Store a byte to Data register.
Load Control register.
OR a bit in loaded value.
Store modified value to Control register.
Dummy load from Status register (1)
Loop:
Load from Status register.
If BUSY, Continue.
Load byte from Data register.
Repeat until bytes have been moved.
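
(That sequence, as C over hypothetical MMIO register pointers; the names
and bit positions are mine, for illustration. Every volatile access is a
full synchronous trip down the bus, which is what caps this scheme:)

#include <stdint.h>

#define CTRL_XFER   (1u << 0)   /* hypothetical "start transfer" bit */
#define STATUS_BUSY (1u << 0)   /* hypothetical "still busy" bit     */

static uint8_t spi_xfer_byte(volatile uint64_t *data,
                             volatile uint64_t *ctrl,
                             volatile uint64_t *status)
{
    *data = 0xFF;                    /* store byte to Data register      */
    *ctrl |= CTRL_XFER;              /* load/OR/store the Control bit    */
    (void)*status;                   /* dummy Status load (see *1 below) */
    while (*status & STATUS_BUSY)    /* poll until not BUSY              */
        ;
    return (uint8_t)*data;           /* load received byte from Data     */
}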

As noted, in the original form, this hit a wall at around 600 K/s.


The modified interface adds a QDATA register (64-bit), and a different
control register bit for "Transfer 8 bytes".

This QDATA version instead hits a wall at around 5 MB/s.

This interface is sufficient for SPI, but if I went to a faster mode,
pretty much as soon as I made the switch, I would be at the bandwidth
limit of this interface (and would then need to come up with something
different).

*1: When operating at "hitting the wall" speeds, the first Status load
will almost invariably be BUSY, but the second load will typically be
"not BUSY", since the SPI transfer will have completed in the time it
has taken for the request to travel all the way around the ring and
back again. So, a dummy load can make it faster.


Say, 13 MHz SPI gives 1.5 MB/s, but 13 MHz in UHS-I mode would boost
this up to 13 MB/s (basically, pushing 4 bits per clock-edge).

Much faster than this, and I may as well consider going "full
hardware" and memory mapping the SDcard...


>>
> <
>>> My 66000 is not just another ISA, it is a rethink of most of the components
>>> that make up a system. A context switch from one thread to another
>>> within a single GuestOS is 10 cycles. A context switch from one thread
>>> to a thread under a different GuestOS remains 10 cycles. The typical
>>> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
>>> across GuestOSs.
>>>
>>> OH, and BTW, The FP transcendentals are patented.
>> I would assume you mean FP transcendentals in hardware (in whatever way
>> they are implemented), as opposed to in-general.
> <
> You might be surprised at what was allowed in the claims.

OK.

>>
>> Their existence in things like "math.h" and so on would likely preclude
>> any sort of patent protection in the "in general" sense.
>>
> Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
> different because of what one can do inside a HW function unit compared
> to what one can do using only instructions) with several unique features.
> They even bother to get the inexact bit set correctly.

OK.

In my case, they don't generally get used that heavily IME, so software
is OK so long as it is not unreasonably slow.


In cases where they would have gotten used more heavily, such as sin/cos
being used for the water-warping effects in Quake, lookup tables had
been used instead to good effect.

It is also possible to reduce these lookup tables to half float
precision, since the water warp effect doesn't seem to mind all that much.


>>
>> Very different, I have doubts about how well a lot of this could be
>> pulled off in a low-cost implementation. Best I can come up with at the
>> moment would effectively amount to faking it using lots of microcode or
>> a software-based emulation layer.
>>
> Microcode generally refers to a control machine interpreting instructions.
> Is a function unit run by ROM sequencer microcode ? What if the ROM got
> turned into equivalent gates: Is it still microcode, or just a sequencer ?
> In any event there are only 3 different sequences used (reminiscent of
> Goldschmidt DIV and SQRT sequences,)


Dunno. I was just sort of imagining doing it as a big ROM on top of a
RISC-style core, with chunks of the ISA being effectively treated like
special function calls into this ROM.

It is likely that parts of the Verilog would be procedurally generated,
such as the entry points into the various functions within this ROM.


I had considered something like this a few times in my case, but
generally ended up taking a different approach:
If I can't do it directly in hardware, I won't do it at all.

Only reason I ended up with the functionality of the RISC-V 'M'
extension was because I had thought up a way to implement it affordably.


Even then, it wasn't until earlier today that I got around to adding
"proper" support for 32-bit integer divide (reducing its latency from 68
to 36 cycles). Mostly because in some cases it was being used often
enough to become significant.

But, then I ended up needing to burn a few more 3R encoding spots to add
DIVx.L and MODx.L instructions.

I would almost have just made them Op64 encodings, but then it seemed
stupid to have DIVS.Q be a 32-bit encoding while DIVS.L is a 64-bit one,
when DIVS.L would be somewhat more common.



>>
>> I also consider my ISA to be "fairly unique", albeit in different ways
>> (and a little more conservative in terms of implementation concerns).
> <
> And hard to read..............


I would suspect a bit less so than IA-64 though.


Granted, writing ASM effectively in this case does generally require
mentally keeping track of the pipeline state and which combinations of
instructions may or may not be bundled (vs x86 which will basically just
accept whatever one throws at it, running at more-or-less full speed).

Though, as an assistance feature, the emulator will generally lint this
stuff and turn illegal combinations into a breakpoint.


Syntax mostly developed out of the SuperH syntax, with a few bits of
TMS320 and similar thrown in; it shouldn't be that difficult to figure out.

...

Timothy McCaffrey

unread,
Jun 24, 2022, 10:27:53 AMJun 24
to
On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:

> X86-64 has crap code density; your one-instruction stack save/restore alone
> should make you significantly better, unless perhaps you have gone 32+32.
>
The X86-64 was left with a lot of baggage because of the bad design decision to
try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
great if the instruction encoding had been refactored, and some other cruft
removed (e.g. only being able to use CL for a dynamic shift count).

It would have also been a great time to be able set up the encodings so that
the instruction parser could figure out the instruction length from the first chunk
(whatever size that was, I suspect 16 bit chunks make sense).
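
As a sketch of the idea: suppose a hypothetical re-encoding where the top
two bits of the first 16-bit chunk directly give the total length, so the
parser never looks past the first chunk. The encoding below is invented
purely for illustration, not any shipping ISA:

#include <stdint.h>

/* Hypothetical: top 2 bits of the first 16-bit chunk select the length. */
static inline int insn_len_bytes(uint16_t first_chunk) {
    switch (first_chunk >> 14) {
    case 0:
    case 1:  return 2;   /* 16-bit instruction */
    case 2:  return 4;   /* 32-bit instruction */
    default: return 8;   /* 64-bit instruction */
    }
}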

- Tim

John Dallman

unread,
Jun 24, 2022, 11:01:49 AMJun 24
to
In article <2f5c8378-de57-4ef2...@googlegroups.com>,
timca...@aol.com (Timothy McCaffrey) wrote:

> The X86-64 was left with a lot of baggage because of the bad design
> decision to try and reuse the X86 decoder. Most of the remaining
> 1 byte opcodes are either barely used (STC, CLC) or deprecated
> (PUSH/POP).

Remember that the design was done by AMD, who have to tread carefully to
avoid giving Intel an excuse to claim they're breaching their X86 license
in some way.

At the time, Intel were still under the impression that Itanium was going
to conquer the world. When they realised better, AMD had Opterons on the
market. Intel wanted to build an AMD-incompatible 64-bit x86 to drive AMD
out of the market. They were restrained by Microsoft, who weren't
interested in supporting two different extended x86 ISAs.

Given how we got here, things could be a lot worse.

John

Thomas Koenig

unread,
Jun 24, 2022, 11:19:55 AMJun 24
to
Brett <gg...@yahoo.com> schrieb:

> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?

It will be hard to beat ARM Cortex-M - based microcontrollers,
which are firmly embedded in the market, for which a lot of
software has been written, and which cost a bit more than four
dollars per unit.

And if that's too expensive and you do not need the performance,
you can always use a MSP430-based one for considerably less,
less than a dollar at quantity.

The ROM on the latter is somewhere between 1KB and whatever you're
willing to pay for, and the RAM 256 bytes or more. But of course
you're still getting some analog hardware thrown in, such as an
ADC or a comparator.

Not a lot of savings, I'd say.

MitchAlsup

unread,
Jun 24, 2022, 12:15:58 PMJun 24
to
Yes, it was MS that made Intel do x86-64. Intel had a model that was within
spitting distance and MS told them the x86-64 port was already done. So,
for the first time in its life, Intel complied.
<
But look at how they have diverged after 2 decades of being almost
identical !!

BGB

unread,
Jun 24, 2022, 12:49:09 PMJun 24
to
On 6/24/2022 10:19 AM, Thomas Koenig wrote:
> Brett <gg...@yahoo.com> schrieb:
>
>> How big is the code store needed for an IOT (Internet Of Things smart
>> toaster) code stack? And what is the savings for the next size down?
>
> It will be hard to beat ARM Cortex-M - based microcontrollers,
> which are firmly embedded in the market, for which a lot of
> software has been written, and which cost a bit more than four
> dollars per unit.
>

Another ISA which could potentially compete with Cortex-M might be
RISC-V RV32IMC or similar.

Pros/cons with the 'C' extension though: it is 'dog chewed' to the point
(somewhat more so than Thumb) where I wonder about the decoding cost.


Something like RV32GC would likely be a bit more expensive, as the A/F/D
extensions do a lot of stuff that I have doubts about being able to pull
off cheaply.

A more cost-effective option might be:
RV32IMZfinxZdinxC
But, not a lot of code is built for this.


> And if that's too expensive and you do not need the performance,
> you can always use a MSP430-based one for considerably less,
> less than a dollar at quantity.
>

For hobbyist use, the ones in DIP packaging (MSP430Gxxxx) were typically
being sold at several $ per chip last I bought any, but dunno about now.

QFP variants were cheaper per-chip, but QFP is much less usable (can't
use it with perfboard or DIP sockets).


They were generally cheaper than AVR8 chips, though the AVR8's typically
had more RAM and ROM space.

Performance per clock seemed to be better on MSP430 than AVR8, IME.
Though, in either case, one is not usually going to be using them for
performance-intensive tasks.

IIRC:
MSP430, 16 registers, each 16 bit, Mem/Mem addressing, Von Neumann
AVR8: 32x8b or 16x16b, Load/Store, Harvard (Split code/data spaces)


> The ROM on the latter is somewhere between 1KB and whatever you're
> willing to pay for, and the RAM 256 bytes or more. But of course
> you're still getting some analog hardware thrown in, such as an
> ADC or a comparator.
>
> Not a lot of savings, I'd say.


From what I remember, for 'G' style MSP430 chips:
ROM: ~ 4K to 32K
RAM: ~ 256B to 2K
Address space, something like:
MMIO 0000..01FF
RAM 0200..09FF (Say, 0200..02FF for 256B)
(More MMIO and/or RAM, depending on device)
ROM 8000..FFFF
For smaller ROM sizes, the lower bound moves upward
FFF0..FFFF is reset/interrupt vectors.

The G chips were typically available in DIP16/20/24 packaging IIRC.

Multi-channel ADC/DAC/... are common.
IO pins are typically capable of both In/Out in digital mode;
ADC/DAC is typically limited to certain pins;
...


The 'X' chips have a larger address space, and may have considerably
more RAM and ROM space (within a 20-bit address space). But, typically
only available in QFP packaging or similar.


One can do bit-banged SPI on the MSP430, but practically one would be
limited to fairly slow IO speeds (kHz territory). Low-speed serial is
also possible.
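
A minimal mode-0 soft-SPI sketch in C; the pin choices (P1.5 = SCLK,
P1.6 = MISO, P1.7 = MOSI) and the G-series target are assumptions. The
several instructions burned per bit are what keep bit-banged rates down
in kHz territory:

#include <msp430.h>

#define SCLK BIT5   /* assumed pin assignments, see above */
#define MISO BIT6
#define MOSI BIT7

/* Assumes P1DIR was set up with SCLK/MOSI as outputs, MISO as input. */
unsigned char spi_xfer(unsigned char out) {
    unsigned char in = 0;
    for (int i = 0; i < 8; i++) {
        if (out & 0x80) P1OUT |= MOSI;   /* present next bit, MSB first */
        else            P1OUT &= ~MOSI;
        out <<= 1;
        P1OUT |= SCLK;                   /* rising edge: both sides sample  */
        in = (unsigned char)((in << 1) | ((P1IN & MISO) ? 1 : 0));
        P1OUT &= ~SCLK;                  /* falling edge: idle low (mode 0) */
    }
    return in;
}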

...

BGB

unread,
Jun 24, 2022, 1:51:56 PMJun 24
to
Though, more extensive redesign would have made it effectively an
entirely new ISA, just with an x86 backward-compatibility mode.

But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
the baseline case (16/32), though one is mostly limited to 32-bit
encodings for WEX bundles.


As can be noted, x86-64 code density ranges from "kinda meh" to
"spectacularly bad", depending mostly on the compiler.

That said, i386 and Thumb2 are both a bit more competitive, kinda harder
to beat them on the code-density front.


I am not entirely sure what exactly is going on here (for x86-64) to
make the code density so bad (it is pretty bad even in size-optimized
modes). The difference is often somewhat outside of what could easily be
explained just by the REX prefix and similar.


Say, for example:
x86 does an Abs32 load, 6 bytes;
x86-64 does a RIP+Disp32 Load, 7 bytes.
Delta: 17% bigger.

Or:
x86 does a 2R-ADD, 2B
x86-64 does a 2R-ADD (w/ REX), 3B
Delta: 50% bigger.

Then again, things like REX prefix and tending to save/restore more
stack variables and similar could be a factor.

Possibly also an increase in 64 bit constant loads, ...

But, often times, the expansion is significantly larger than the
theoretically expected 20-50% or so.

Timothy McCaffrey

unread,
Jun 24, 2022, 3:31:52 PMJun 24
to
You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
you have to waste a register loading the constant and then use it.

The calling ABI is much different than the 386, where you just usually pushed stuff
on the stack. Now you have some stuff in registers (which registers depends on whether
you are running Windows or Linux) and some stuff on the stack.

Some registers you are required to save before the call (caller save)
and others after the call (callee save).

Since you are not saving values with a simple push (usually) in the subroutine, you go
from a 1 byte PUSH to a 5 or 6 byte MOV to stack.

Simple INC/DEC doubled in size, so probably most code now uses ADD instead (which
is probably faster because you don't have a partial CC update).

Due to stack and structure storage expanding for 8 byte values (e.g. pointers),
you can only store so many of them there using an 8 bit offset. Unfortunately,
the next step up is a 32 bit offset.

And probably a bunch of other stuff I've forgotten....

- Tim

MitchAlsup

unread,
Jun 24, 2022, 3:56:28 PMJun 24
to
I worked on some x86 decode mechanisms while at AMD, and learned a lot about
x86 encoding {which I still consider BETTER than SPARC-Vis}
<
My 66000 ISA format and encoding is a direct result of this, and indeed, follows
your tenet of having everything needed to determine size in the first word.
>
> - Tim

MitchAlsup

unread,
Jun 24, 2022, 4:02:53 PMJun 24
to
On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
> >
> > Possibly also an increase in 64 bit constant loads, ...
> >
> > But, often times, the expansion is significantly larger than the
> > theoretically expected 20-50% or so.
<
> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
> you have to waste a register loading the constant and then use it.
<
My 66000 does not have this problem. AND while BGB may be able to get by with
this restriction now, you won't 10 years hence.
>
> The calling ABI is much different than the 386, where you just usually pushed stuff
> on the stack. Now you have some stuff in registers (which registers depends on whether
> you are running Windows or Linux) and some stuff on the stack.
>
> Some registers you are required to save before the call (caller save)
> and others after the call (callee save).
<
With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
like R9-R15 are simply temps used whenever and forgotten.
>
> Since you are not saving values with a simple push (usually) in the subroutine, you go
> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
<
I go to a single instruction that pushes as much stuff as desired (by the compiler)
and then allocates a stack frame for the local variables.

Thomas Koenig

unread,
Jun 24, 2022, 4:40:20 PMJun 24
to
MitchAlsup <Mitch...@aol.com> schrieb:

First, thanks again for the good explanations.

> DBLE is an instruction modifier that supplies register encodings and
> adds 64 bits to the calculation width of the modified instruction. Applied
> to an FP instruction: DBLE Rd1,Rs11,Rs21,Rs31; FMAC Rd2,Rs12,Rs22,Rs32
> we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
> and presto: we get FP128 by adding exactly 1 instruction,

This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
introduced this; POWER still has it as the only option up to POWER8.
POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
from double double to IEEE FP for POWER 9+ (I helped a bit in that
transition, for gfortran).

> the compiler
> can pick any 8 registers it desires, alleviating register allocation concerns.

Eight registers is a lot if there are only 32 to go around...

> DBLE is a "get by" kind of addition, frowned upon by Hennessey.
>
> I can envision a SIMD instruction modifier that defines the SIMD parameters
> of several subsequent instructions and allows 64-bit SIMD to transpire.
> I am still thinking about these. What I cannot envision is a wide SIMD
> register file--this is what VVM already provides.

I think a lot of the use cases could also be covered if the
processor were able to process int8 ... int64 and fp16..fp64
(with fp128 being an exception) at the width of an SIMD unit,
so something like

MOV R4,#0
VEC {R5}
LDUH R6,[R10+R4] ! Load half float into R6
LDUH R7,[R11+R4] ! Second one
FADD.F2 R7,R7,R6
STH R7,[R12+R4]
ADD R4,R4,#2
LOOP (something)

could be executed at full SIMD width. Is this feasible? Or
would it be better to do this kind of thing via SIMD?
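
For reference, the scalar C equivalent of the loop above, assuming a
compiler with the _Float16 extension (GCC/Clang); the question is whether
the hardware could run this at full SIMD width without architected SIMD
registers:

#include <stddef.h>

void vadd_f16(_Float16 *dst, const _Float16 *a,
              const _Float16 *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* one half-float add per element */
}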

Ivan Godard

unread,
Jun 24, 2022, 5:02:17 PMJun 24
to
Binary compatibility is a real bear. Either leave a *lot* of free
entropy (my66), or push the problem to the software and build machinery
(Mill), or suffer bloat and slow decode (x86, RISCV).

Brett

unread,
Jun 24, 2022, 5:54:05 PMJun 24
to
You are missing the I in Internet; no wifi that I can find in that chip.

Talking about a network stack to talk to your phone. Smart color changing
lightbulbs and soon all the appliances in your home, washer, dryer, stove,
microwave, thermostat, security cameras, just everything.

Plus your home router, which uses a much more powerful wifi block and CPU.

There are markets here that will pay for better code density, assuming a
network stack is significant?

MitchAlsup

unread,
Jun 24, 2022, 5:55:11 PMJun 24
to
On Friday, June 24, 2022 at 3:40:20 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> First, thanks again for the good explanations.
> > DBLE is an instruction modifier that supplies register encodings and
> > adds 64 bits to the calculation width of the modified instruction. Applied
> > to an FP instruction: DBLE Rd1,Rs11,Rs21,Rs31; FMAC Rd2,Rs12,Rs22,Rs32
> > we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
> > and presto: we get FP128 by adding exactly 1 instruction,
<
> This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
<
No, this means FP with a 14-bit exponent and 113-bit fraction (if my math is
right); pairs of doubles are available using exact FP arithmetic via CARRY,
not DBLE.
<
> introduced this; POWER still has it as the only option up to POWER8.
> POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
> from double double to IEEE FP for POWER 9+ (I helped a bit in that
> transition, for gfortran).
> > the compiler
> > can pick any 8 registers it desires, alleviating register allocation concerns.
<
> Eight registers is a lot if there are only 32 to go around...
<
It is not a machine designed to crunch FP128 all the time.
It is a machine designed so the occasional use is satisfactory.

MitchAlsup

unread,
Jun 24, 2022, 5:57:42 PMJun 24
to
I don't see it:: a 10G or 100G network interface already has a memory footprint
(for its own buffering concerns) large enough that skimping on the CPU and ROM
seems a waste.

BGB

unread,
Jun 24, 2022, 8:55:52 PMJun 24
to
On 6/24/2022 3:02 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
>> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
>>>
>>> Possibly also an increase in 64 bit constant loads, ...
>>>
>>> But, often times, the expansion is significantly larger than the
>>> theoretically expected 20-50% or so.
> <
>> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
>> you have to waste a register loading the constant and then use it.
> <
> My 66000 does not have this problem. AND while BGB may be able to get by with
> this restriction now, you won't 10 years hence.

Still better in my case than it is in RISC-V where this case would
require a memory load...


As-is (in BJX2), encodings have been defined ("on paper"), e.g., Imm56
encodings for some instructions; they just haven't been put into use yet.

Partly it is a combination of:
Non-zero decoding cost;
It is pretty rare to exceed the existing 33-bit limit for 3RI ops.

Spending 1 extra cycle to load a constant into a register isn't usually
a huge issue.


IME, the vast majority of 64-bit constant loads thus far tend to be
Binary64 constants; usually irrational or repeating 'double' constants
or similar (most other constants will be compacted down to a smaller
format).


Some constant-load stats (from my GLQuake port):
Imm8: 11% (Byte range)
Imm16: 70% (Int16 or UInt16)
Binary16: 12% (Double encoded as Half-Float)
Imm32/33: 4.4%
Imm33s: 2.2% (Int32 or UInt32, Zero/Sign Extend)
Imm32Hi: 1.1% (Int32 into high-order 32 bits, low 32 are 0)
Binary32: 0.8% (Double as a pair of Binary32)
2xBinary16: 0.3% (2xBinary32 as 2xBinary16)
Imm64: 2.5% (Fallback Case)


Some "rarely used" types:
Load value into the high 10 bits of target (rare);
Say: zzz0000000000000
Load bits into the middle of a 64-bit value (rare);
Say: 0000zzzzzzzz0000
Load 4xFP16 encoded as 4xFP8;
...
These cases seem to be rare enough to be mostly ignored.
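
As a rough illustration, shortest-first selection over these classes might
look like the C below; the ranges and ordering are my assumptions, the
Binary32 / 2xBinary16 cases are omitted for brevity, and _Float16
(GCC/Clang) stands in for the half-float round-trip test. Not BGBCC's
actual code:

#include <stdint.h>
#include <string.h>

const char *classify_const64(uint64_t v) {
    int64_t s = (int64_t)v;
    if (s >= -128 && s <= 127)      return "Imm8";     /* 16-bit load */
    if (s >= -32768 && s <= 65535)  return "Imm16";    /* 32-bit load */

    double d;
    memcpy(&d, &v, sizeof d);       /* view the bits as a double      */
    if (d == (double)(_Float16)d)   return "Binary16"; /* 32-bit load */

    if (s == (int32_t)s || v <= 0xFFFFFFFFu)
                                    return "Imm33s";   /* 64-bit load */
    if ((v & 0xFFFFFFFFu) == 0)     return "Imm32Hi";  /* 64-bit load */
    return "Imm64";                                    /* 96-bit load */
}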


Grouping constant loads by instruction length:
16-bit: 11%
32-bit: 82%
64-bit: 4.4%
96-bit: 2.5%

Note that this is only for discrete constant loads, and does not count
immediate values or displacements.

Constant loads reflect ~ 7.4% of the total Jumbo prefixes, with the rest
going into immediate fields.

Calculating stats:
85% Imm9/Disp9
15% Imm33/Disp33

Rough estimate of the upper bound of overflowed immediate cases:
Less than 3% (Excluding Load/Store ops)
Less than 0.4% (Including Load/Store ops)

Where the Imm/Disp balance is roughly:
89% Disp (Load/Store Displacements)
11% Imm (Immediates for ALU instructions and similar).


However, given that most of the 64-bit constants (dumped into a log) are
fairly obviously either MMIO addresses or floating-point constants, the
actual bound for overflowing the 33-bit immediate range is likely much
smaller.


I don't have a stat for the relative use of Jumbo between Imm and Disp
encodings, however (based on what I typically see in disassembly dumps),
I would estimate Disp to be the dominant case.


It can be noted that a significant chunk of the cases which are being
encoded as Imm33/Disp33 could also be handled by Imm17/Disp17 encodings
(my compiler doesn't typically use these unless the instruction is
*also* using XGPR).



While arguably code "could change" here, such as due to ever-expanding
memory usage, I suspect this is less likely to be an issue in a
statistical sense.


The main thing that would be the "likely existential risk" for this,
would be programs exceeding 4GB in the ".bss" section, which would
require a bigger displacement.

For x86-64, one would run into a similar problem if text+data+bss
exceeds 2GB (thus breaking one's ability to use RIP-relative addressing).



>>
>> The calling ABI is much different than the 386, where you just usually pushed stuff
>> on the stack. Now you have some stuff in registers (which registers depends on whether
>> you are running Windows or Linux) and some stuff on the stack.
>>
>> Some registers you are required to save before the call (caller save)
>> and others after the call (callee save).
> <
> With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> like R9-R15 are simply temps used whenever and forgotten.

That is presumably how it is supposed to be...


In my case, it is roughly a 50/50 split between caller save (scratch)
and callee save (preserved) registers.

For leaf functions, one wants a lot of scratch registers, and for
non-leaf functions, a lot of callee-save registers.

But, sadly, no party can be entirely happy:
Leaf functions wishing they could have more registers to play with,
without needing to save them first;
Non-leaf functions wishing they could have more registers for variables
which wont get stomped on the next call;
...


Can note that, IIRC:
Win64 gave a bigger part of this pie to callee-save;
SysV/AMD64 gave a bigger part of the pie to caller-save.

A roughly even split seemed like an easy answer, lacking any good way to
find a general/optimal balance across a range of programs.


Conceivably, it could also be possible to have a certain number of
"flexible" registers which a compiler could use to "fine tune" the
balance in the ABI, but these would be annoying at DLL/SO edges, as it
would require "worst case" handling (treating them like caller-save when
calling an import, and like callee-save for DLL exports).

In such an ABI, likely:
2/3: Nominally Callee Save
1/3: Caller Save / Scratch
With 1/3 of the register space able to be re-balanced from callee-save
to caller save by the compiler.


>>
>> Since you are not saving values with a simple push (usually) in the subroutine, you go
>> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
> <
> I go to a single instruction that pushes as much stuff as desired (by the compiler)
> and then allocates a stack frame for the local variables.
>


I once had PUSH/POP in BJX2, but then I dropped them (mostly for
cost-saving reasons; after noting that adjusting the stack-pointer and
then using a series of stores, or performing a series of loads and then
adjusting the stack pointer, could be similarly effective).

So, it is basically using Load/Store instructions...

However, in most cases:
MOV.X Rn, (SP, Disp4*8)
Can also be encoded in a 16-bit instruction format...


x86-64 would need 2x as many instructions here, and each instruction
would also need 5-bytes to encode, ...

So, roughly a 500% encoding-cost delta in this case for x86-64 vs BJX2.


Then again, can also note that I am often seeing around a 300% delta
between BJX2 and x86-64 in terms of ".text" sizes and similar.

Though, I still tend to fall a bit short of being able to match Thumb2
or similar at this game...

MitchAlsup

unread,
Jun 24, 2022, 9:36:04 PMJun 24
to
On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
> On 6/24/2022 3:02 PM, MitchAlsup wrote:

> > With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
> > I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> > like R9-R15 are simply temps used whenever and forgotten.
> That is presumably how it is supposed to be...
>
>
> In my case, it is roughly a 50/50 split between caller save (scratch)
> and callee save (preserved) registers.
<
I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
can be used to carry arguments and results covering the 98%-ile.
>
> For leaf functions, one wants a lot of scratch registers, and for
> non-leaf functions, a lot of callee-save registers.
>
> But, sadly, no party can be entirely happy:
> Leaf functions wishing they could have more registers to play with,
> without needing to save them first;
> Non-leaf functions wishing they could have more registers for variables
> which wont get stomped on the next call;
> ...
>
>
> Can note that, IIRC:
> Win64 gave a bigger part of this pie to callee-save;
> SysV/AMD64 gave a bigger part of the pie to caller-save.
<
CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
IBM 360 had only preserved registers.
VAX had only preserved registers--both had 16 registers.
>
> A roughly even split seemed like an easy answer, lacking any good way to
> find a general/optimal balance across a range of programs.
>
The choice is a lot easier 50%/50% when you have 32 registers.
>
<snip>
> >
> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
> cost-saving reasons; after noting that adjusting the stack-pointer and
> then using a series of stores, or performing a series of loads and then
> adjusting the stack pointer, could be similarly effective).
<
Push instructions make::
PUSH R1
PUSH R2
PUSH R3
more expensive than:
SUB SP,SP,#12
ST R1,[SP+8]
ST R2,[SP+4]
ST R3,[SP]
due to the serial dependency.
<
The peephole HW optimizer in K9 would perform this transformation.
{Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Ivan Godard

unread,
Jun 24, 2022, 10:30:17 PMJun 24
to
And Mill has only preserved (from the view of the caller) but you don't
have to preserve them. And only has temps (from the viewpoint of the
callee) but you don't have to clear them.

David Brown

unread,
Jun 25, 2022, 4:15:02 AMJun 25
to
Only a /very/ tiny proportion of network interfaces are 10 G or above.
Microcontrollers rarely have more than 100 Mbit Ethernet. The next big
thing in wired networking in the embedded world is two-wire Ethernet, to
get the convenience of Ethernet networking at low cost. It comes in 10
Mb and 100 Mb varieties (a 1 Gb variety may come eventually).

For wireless communication, speeds are usually even lower. Modern NBIOT
cellular systems are designed to be extremely low power, cheap, have
longer range (20 km more than 3G and the rest). You send packets of up
to about 200 bytes of data, perhaps once a day, with a delivery time of
several seconds. Perfect for environmental monitoring, finding your
sheep, and many other tasks.

For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
bandwidth is also fine. You can get away with a few hundred bytes ram
and still have enough to control a lightbulb, thermostat, etc.

The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
MSP430's, and other small devices. Code density matters for many of them.

(Of course it's a different matter for wireless cameras and all the
other devices that need high bandwidth.)

BGB

unread,
Jun 25, 2022, 11:45:17 AMJun 25
to
Seems like it would also be fairly trivial to chop 10/100 Ethernet down
to a 4-wire variant as well, probably using RJ11 plugs or similar.
Advantage of 4-wire is that this could allow for POE (and 4-wire
phone-wire could be cheaper than CAT5E or similar).

Could also be electrically compatible with existing hubs and switches
via an RJ11 to RJ45 adapter.


> For wireless communication, speeds are usually even lower.  Modern NBIOT
> cellular systems are designed to be extremely low power, cheap, have
> longer range (20 km more than 3G and the rest).  You send packets of up
> to about 200 bytes of data, perhaps once a day, with a delivery time of
> several seconds.  Perfect for environmental monitoring, finding your
> sheep, and many other tasks.
>
> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
> bandwidth is also fine.  You can get away with a few hundred bytes ram
> and still have enough to control a lightbulb, thermostat, etc.
>
> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
> MSP430's, and other small devices.  Code density matters for many of them.
>
> (Of course it's a different matter for wireless cameras and all the
> other devices that need high bandwidth.)
>

I would have figured a network stack would have been a bit much for this
class of device...

David Brown

unread,
Jun 25, 2022, 12:31:00 PMJun 25
to
10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
each direction. Passing (non-isolated) DC power over these wires is
extremely simple, and requires nothing more than a few diodes and an LC
filter. Unfortunately, the PoE standards were developed by a committee
of morons that produced a ridiculously over-engineered system that is
too bulky and expensive to have caught on outside a few specific use-cases.

Even easier, however, is simply to pass the power over the spare pairs
in a standard 4-pair Ethernet cable.

The two-wire Ethernet standards already include support for simpler and
cheaper PoE solutions.

> Could also be electrically compatible with existing hubs and switches
> via an RJ11 to RJ45 adapter.
>
>
>> For wireless communication, speeds are usually even lower.  Modern
>> NBIOT cellular systems are designed to be extremely low power, cheap,
>> have longer range (20 km more than 3G and the rest).  You send packets
>> of up to about 200 bytes of data, perhaps once a day, with a delivery
>> time of several seconds.  Perfect for environmental monitoring,
>> finding your sheep, and many other tasks.
>>
>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>> bandwidth is also fine.  You can get away with a few hundred bytes ram
>> and still have enough to control a lightbulb, thermostat, etc.
>>
>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>> MSP430's, and other small devices.  Code density matters for many of
>> them.
>>
>> (Of course it's a different matter for wireless cameras and all the
>> other devices that need high bandwidth.)
>>
>
> I would have figured a network stack would have been a bit much for this
> class of device...

I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
microcontroller.

However, the network stack needed for small Wifi or NB-IOT systems is
vastly smaller than you need for a full IP and TCP/IP stack.

BGB

unread,
Jun 25, 2022, 3:01:30 PMJun 25
to
That is kinda the point of how it would be electrically compatible:
Use the pairs that are in-use in 10/100;
Skip the other wires;
Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.

Normal twisted-pair telephone wire would probably have sufficient
electrical properties for 10/100 in many cases.

Only thing is that it would require an adapter to plug RJ11 into RJ45,
though other options:
Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
Cable which has RJ11 on one end but RJ45 on the other.
(Side-stepping the need for an adapter at the switch).


Probably put the pins in the plug in such a way that it doesn't have
adverse effects if someone tries to plug a telephone into it.

Say:
NC, A+, B-, B+, A-, NC

Normal phone only connecting to the B pair (vs across the A/B pairs).

With the POE system I am imagining, if one did connect across the A/B
pairs, plugging a phone into it would result in it ringing continuously,
whereas if only the B pair is connected (probably the TX pair from the
phone's end), it would be silent and there would be zero net voltage
from the phone's end.


> Even easier, however, is simply to pass the power over the spare pairs
> in a standard 4-pair Ethernet cable.
>

I had assumed doing a thing of running 48 VAC or similar between the two
differential pairs.

This should work OK, but needs at least 4 wires (2 for each pair).
On the device side, there would probably be a bridge rectifier connected
to the center-taps of an isolation transformer.


For PoE with this system, an RJ11<->RJ45 adapter could also function as
the AC injector, say with a pair of isolation transformers (to let the
data through), with the center taps connected (via another transformer)
up to the mains power.

Could make sense in the PoE case to have it as a multi-port block
though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.


> The two-wire Ethernet standards already include support for simpler and
> cheaper PoE solutions.
>

OK, would need to look into it.


But, would assume that a two-wire interface is not likely to be
electrically compatible with traditional Ethernet, at least not without
some additional trickery (additional isolation transformers and probably
a ground wire).


Signaling and power would maybe be done in a similar way to a 2-wire
telephone, but this wouldn't be able to be (passively) connected up to
existing hubs or switches.


>> Could also be electrically compatible with existing hubs and switches
>> via an RJ11 to RJ45 adapter.
>>
>>
>>> For wireless communication, speeds are usually even lower.  Modern
>>> NBIOT cellular systems are designed to be extremely low power, cheap,
>>> have longer range (20 km more than 3G and the rest).  You send
>>> packets of up to about 200 bytes of data, perhaps once a day, with a
>>> delivery time of several seconds.  Perfect for environmental
>>> monitoring, finding your sheep, and many other tasks.
>>>
>>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>>> bandwidth is also fine.  You can get away with a few hundred bytes
>>> ram and still have enough to control a lightbulb, thermostat, etc.
>>>
>>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>>> MSP430's, and other small devices.  Code density matters for many of
>>> them.
>>>
>>> (Of course it's a different matter for wireless cameras and all the
>>> other devices that need high bandwidth.)
>>>
>>
>> I would have figured a network stack would have been a bit much for
>> this class of device...
>
> I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
> microcontroller.
>
> However, the network stack needed for small Wifi or NB-IOT systems is
> vastly smaller than you need for a full IP and TCP/IP stack.

OK.


Once (when I was much younger) I implemented a TCP/IP stack and Ethernet
card driver in a hobby OS project.

Lots of little lesser-known protocols in this mix, like ICMP and ARP and
similar, ...
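
For instance, ARP for IPv4 over Ethernet (RFC 826) is just 28 bytes
following the 14-byte Ethernet header (ethertype 0x0806). A C view of the
wire layout, fields big-endian on the wire (field names are mine):

#include <stdint.h>

struct arp_ip4 {
    uint16_t htype;    /* hardware type: 1 = Ethernet       */
    uint16_t ptype;    /* protocol type: 0x0800 = IPv4      */
    uint8_t  hlen;     /* hardware address length: 6 (MAC)  */
    uint8_t  plen;     /* protocol address length: 4 (IPv4) */
    uint16_t oper;     /* 1 = request, 2 = reply            */
    uint8_t  sha[6];   /* sender MAC                        */
    uint8_t  spa[4];   /* sender IPv4 address               */
    uint8_t  tha[6];   /* target MAC                        */
    uint8_t  tpa[4];   /* target IPv4 address               */
};  /* 28 bytes, no padding needed on typical ABIs */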



In my current projects, I haven't gotten back around to this part yet.
Partly, it looks like to do it from an FPGA, one is basically driving
out the bits themselves, and has to write their own logic for
transmitting and receiving Ethernet frames at the level of bits going
over the wires (IIRC).


With the old card I was targeting, IIRC it was at the level of
abstraction of Ethernet frames getting transmitted and received via a
pair of ring buffers.

The card I am using does have an Ethernet port and similar at least, so
could maybe get to this eventually.

BGB

unread,
Jun 25, 2022, 4:52:19 PMJun 25
to
On 6/24/2022 8:36 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
>> On 6/24/2022 3:02 PM, MitchAlsup wrote:
>
>>> With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
>>> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
>>> like R9-R15 are simply temps used whenever and forgotten.
>> That is presumably how it is supposed to be...
>>
>>
>> In my case, it is roughly a 50/50 split between caller save (scratch)
>> and callee save (preserved) registers.
> <
> I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
> R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
> can be used to carry arguments and results covering the 98%-ile.

Yeah:
R0/R1: Special
R2..R7: Scratch
R8..R14: Preserved
R15: SP
R16..R23: Scratch
R24..R31: Preserved

So: 14 scratch, 15 Preserved.

ABI:
R2/R3 Return Value
R2: Struct Pointer (Struct Return)
R3: 'this'
R4..R7, R20..R23: Arguments

If XGPR:
R32..R39, R48..R55: Scratch
R40..R47, R56..R63: Preserved


If the 128-bit ABI:
R36..R39, R52..R55: Arguments
Some other registers in the ABI are moved around.

SP is at R15 mostly for historical reasons; it does result in some cruft
though.


>>
>> For leaf functions, one wants a lot of scratch registers, and for
>> non-leaf functions, a lot of callee-save registers.
>>
>> But, sadly, no party can be entirely happy:
>> Leaf functions wishing they could have more registers to play with,
>> without needing to save them first;
>> Non-leaf functions wishing they could have more registers for variables
>> which wont get stomped on the next call;
>> ...
>>
>>
>> Can note that, IIRC:
>> Win64 gave a bigger part of this pie to callee-save;
>> SysV/AMD64 gave a bigger part of the pie to caller-save.
> <
> CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
> IBM 360 had only preserved registers.
> VAX had only preserved registers--both had 16 registers.

OK.

IIRC, x86:
EAX: Scratch
ECX: Scratch
EDX: Scratch
EBX: "It Depends" (1)
ESP: Stack
EBP: Base-Pointer / Preserved
ESI: Preserved
EDI: Preserved

1: Compilers didn't really seem to entirely agree on whether EBX was
Scratch or Preserved, but Preserved seemed to be more common.


>>
>> A roughly even split seemed like an easy answer, lacking any good way to
>> find a general/optimal balance across a range of programs.
>>
> The choice is a lot easier 50%/50% when you have 32 registers.

Yeah.


>>
> <snip>
>>>
>> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
>> cost-saving reasons; after noting that adjusting the stack-pointer and
>> then using a series of stores, or performing a series of loads and then
>> adjusting the stack pointer, could be similarly effective).
> <
> Push instructions make::
> PUSH R1
> PUSH R2
> PUSH R3
> more expensive than:
> SUB SP,SP,#12
> ST R1,[SP+8]
> ST R2,[SP+4]
> ST R3,[SP]
> due to the serial dependency.
> <
> The peephole HW optimizer in K9 would perform this transformation.
> {Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Partial issue was that, once I got around to pipelining Load/Store
operations, the Push/Pop would have been generally slower as well, since
they require an extra interlock stage to deal with the SP updates.

No hardware level optimizers in my case.

It was cheaper and easier simply to drop them from the ISA.

Typically, the:
ADD Imm8s, SP
Or:
ADD Imm16s, SP
Instructions are used for stack-pointer adjustments.


Had ended up treating the "LDISH16 Imm16, SP" encoding as a special type
of breakpoint, one carrying a magic number, intended mostly to help with
debugging: if a "__debugbreak()" is hit, I can use the magic number to
figure out which debugbreak was hit. Otherwise I might not know where
exactly the loader has put the loaded program, meaning PC by itself is
not sufficient to identify the offending breakpoint.

But, this does seem like a bit of a hack.

I guess, can also note that 0000 is also interpreted as a breakpoint (so
it will trigger a breakpoint if branching into zeroed memory), but this
is not the official BREAK instruction (3030 or F000_3030).

Technically, it is actually:
MOV.B DLR, (DLR, DLR)
But, like, this encoding is pointless enough, that there is no issue
with interpreting it instead as a breakpoint.

...

Brett

unread,
Jun 25, 2022, 5:17:18 PMJun 25
to
I don’t get why you think you need to reserve half your opcode space for
future extensions; I would have thought we are at the end of history for
opcode extensions.

What is the cost of reserving 3 bits of one pattern and the same pattern at
the 16 bit border, so that you can add 16 bit opcodes in the future?

Besides the cost of rearranging bits, the cost is nothing with half the
opcode space reserved, if you are clever and spread the three bits to fit
your open slots.

This gives your customers choices, customers like choices. You could upsell
this as a future feature like RISCV does and just never deliver, like
RISCV. ;)


You have a two-wide design; even once you implement 16-bit opcodes, handling
the unaligned case of long instructions is not that hard. An extra gate
delay? Two? Depending on which of a dozen approaches you pick for size.

The resulting 13-bit opcode is plenty for one register, a source ACC, and a
dest ACC, and a full set of integer operations. And some short load/store
offsets to/from ACC using one address register.

MitchAlsup

unread,
Jun 25, 2022, 6:29:37 PMJun 25
to
On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
<snip>
> > I looked at this a few years ago and the damage to long term ISA growth
> > was catastrophic. As it is I have nearly ½ of the OpCode space in each
> > OpCode group left for the future. and can PARSE instructions in 31 gates
> > with only 4 gates of delay. All that goes out the window with a meaningful
> > 16-bit "extension". I pass.
<
> I don’t get why you think you need to reserve half your opcode space for
> future extensions; I would have thought we are at the end of history for
> opcode extensions.
<
a) because I have watched what happens to processors over 50 years and
how every time you turn around they have more OpCodes--mainly to address
stuff forgotten earlier.
<
b) remember I compacted everything into only 59 actual instructions.
>
> What is the cost of reserving 3 bits of one pattern and the same pattern at
> the 16 bit border, so that you can add 16 bit opcodes in the future?
<
The cost of reserving space for 16-bit is that it over-constrains the 32-bit
OpCode space. For example: I could not give the 16-bit OpCodes a typical
subgroup (6-bit Major OpCode) because the first instruction would only
have 10-bits left !! (16-6=10)
<
Also note: Where there are instructions in several formats (like ADD with 16-bit
immediate and ADD of 2 registers), in all cases, the bit pattern used to recognize
ADD remains identical.
<
There are several OpCode groups reserved in perpetuity, these were chosen such
that if one branches into data there is very little possibility of finding anything other
than an INVALID instruction decoding sitting there. From the My 66000 ISA document::
<----------------------------------------------------------------------------------------------------------------------------
A number of the Major OpCode specifiers are reserved in perpetuity as illegal
OpCodes. Transferring control into 32-bit or 64-bit integer data, or into 32-bit or
64-bit floating point data will very likely result in the decoding of an illegal
instruction and raise the OPERATION exception. In particular, small 32-bit positive
integers and small 32-bit negative integers are illegal instructions. 32-bit Floating
point values in the range ±[1/128..32) are also illegal OpCodes. Should control be
transferred into typical integer or floating point data, there is little likelihood of
executing for a long time before running into an illegal instruction. Executing data
can also be prevented in the MMU when desired (almost always.)
<----------------------------------------------------------------------------------------------------------------------------
This catches programming errors. When I looked, you can't do this with 16-bit
instructions, as basically every encoding has to be used.
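
For concreteness, the floating-point part of that range property can be
checked as below; this sketches only the numeric test quoted above, not
the actual My 66000 opcode assignments:

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Does this 32-bit word, read as an IEEE-754 float, have magnitude in
 * [1/128, 32)?  Per the quoted text, such words are illegal OpCodes. */
int in_illegal_fp32_range(uint32_t word) {
    float f;
    memcpy(&f, &word, sizeof f);
    float m = fabsf(f);
    return m >= 0x1p-7f && m < 32.0f;   /* 0x1p-7 == 1/128 */
}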
<
That is: some architects think of things other than code density--we think of
code safety, even when the GuestOS (or JavaScript, ...) fights us with
programming attack modes, and design architectures that are substantially
more robust than current ones.
>
In particular, My 66000 is immune to the current laundry list of attack strategies
{Meltdown, Spectre, RowHammer, RoP, Foreshadow, ...}
<
> Besides the cost of rearranging bits, the cost is nothing with half the
> opcode space reserved, if you are clever and spread the three bits to fit
> your open slots.
<
>
> This gives your customers choices, customers like choices. You could upsell
> this as a future feature like RISCV does and just never deliver, like
> RISCV. ;)
>
I give my customers a choice: Do you want a machine that is secure or not ?
>
> You have a two-wide design;
<
Technically, I have a 1-wide design that can CoIssue some pairs of instructions.
It is 1-wide because there is a 3R1W register file. It can CoIssue because some
pairs of instructions do not consume certain register ports. It is not at all
SuperScalar !
<
I also have a 6-wide design. Nothing in ISA or the rest of the architecture makes
either design necessarily harder than it needs to be. For many of the control
logic calculations My 66000 ISA requires, I went to the trouble of implementing
the gate circuitry to perform said duties to verify that the choices were good.
For example, I can look at a 32-bit word (which takes 320 gates to hold in
flip-flops) and determine the instruction length in 31 gates (4 gates of delay).
Thereby, instruction caches for machines less than 16-wide do not need predecoding bits.
FCMP is performed in the same function unit as Integer CMP. The integer part
requires 55 gates (4 gates of delay); the FP addition is 17 gates (remaining at
4 gates of delay). So adding FCMP to CMP is a small price {Oh and BTW, I restrict
myself to 4-in NAND gates and 3-in NOR gates}
<
> even once you implement 16-bit opcodes, handling
> the unaligned case of long instructions is not that hard. An extra gate
> delay? Two? Depending on which of a dozen approaches you pick for size.
<
I can assure you that if I had set out to design a good 16-bit OpCode ISA
I would have succeeded. But over my 40 year career, I have found some
things more valuable, and I pursued those interests. You are free to do
differently.
>
> The resulting 13-bit opcode is plenty for one register, a source ACC, and a
> dest ACC, and a full set of integer operations. And some short load/store
> offsets to/from ACC using one address register.
<
Then you end up with a Cartesian product of various ways to do the same thing.
This makes the compiler have to figure out if::
<
INC Rd ; NoOp
is better or worse than:
ADD Rd,Rd,#1
and what set of rules (or heuristics) govern the selection. I have worked inside
compilers trying to figure this kind of stuff out. It ain't straightforward.
<
I submit that compilers are best when they only need to figure out how to do
something exactly one way.

David Brown

unread,
Jun 26, 2022, 5:27:01 AMJun 26
to
On 25/06/2022 21:01, BGB wrote:
> On 6/25/2022 11:30 AM, David Brown wrote:
>> On 25/06/2022 17:45, BGB wrote:

>>> Seems like it would also be fairly trivial to chop 10/100 Ethernet
>>> down to a 4-wire variant as well, probably using RJ11 plugs or
>>> similar. Advantage of 4-wire is that this could allow for POE (and
>>> 4-wire phone-wire could be cheaper than CAT5E or similar).
>>
>> 10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
>> each direction.  Passing (non-isolated) DC power over these wires is
>> extremely simple, and requires nothing more than a few diodes and an
>> LC filter.  Unfortunately, the PoE standards were developed by a
>> committee of morons that produced a ridiculously over-engineered
>> system that is too bulky and expensive to have caught on outside a few
>> specific use-cases.
>>
>
> That is kinda the point of how it would be electrically compatible:
>   Use the pairs that are in-use in 10/100;
>   Skip the other wires;
>   Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.
>
> Normal twisted-pair telephone wire would probably have sufficient
> electrical properties for 10/100 in many cases.

RJ11 connectors are not going to be much cheaper than RJ45, if anything.
And telephone wire is not going to be good enough for anything here.
In particular, there are no real standards or quality control (except
for long-distance lines that cost more than good Ethernet cables because
no one installs new ones any more). So your cheapo bit of telephone
wire might work in one system, but a different cheapo wire won't. It
might work fine until your motors start, then the ESD interference
disrupts it. Having something that /might/ work or /usually/ works is
not very useful.

>
> Only thing is that it would require an adapter to plug RJ11 into RJ45,
> though other options:
>   Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>   Cable which has RJ11 on one end but RJ45 on the other.
>     (Side-stepping the need for an adapter at the switch).
>
>
> Probably put the pins in the plug in such a way that it doesn't have
> adverse effects if someone tries to plug a telephone into it.
>
> Say:
>   NC, A+, B-, B+, A-, NC
>
> Normal phone only connecting to the B pair (vs across the A/B pairs).
>
> With the POE system I am imagining, if one did connect across the A/B
> pairs, plugging a phone into it would result in it ringing continuously,
> whereas if only the B pair is connected (probably the TX pair from the
> phone's end), it would be silent and there would be zero net voltage
> from the phone's end.
>
>
>> Even easier, however, is simply to pass the power over the spare pairs
>> in a standard 4-pair Ethernet cable.
>>
>
> I had assumed doing a thing of running 48 VAC or similar between the two
> differential pairs.
>

AC is expensive in embedded systems. DC is cheap.

> This should work OK, but needs at least 4 wires (2 for each pair).
> On the device side, there would probably be a bridge rectifier connected
> to the center-taps of an isolation transformer.
>
>
> For PoE with this system, an RJ11<->RJ45 adapter could also function as
> the AC injector, say with a pair of isolation transformers (to let the
> data through), with the center taps connected (via another transformer)
> up to the mains power.
>
> Could make sense in the PoE case to have it as a multi-port block
> though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.
>
>
>> The two-wire Ethernet standards already include support for simpler
>> and cheaper PoE solutions.
>>
>
> OK, would need to look into it.
>
>
> But, would assume that a two-wire interface is not likely to be
> electrically compatible with traditional Ethernet, at least not without
> some additional trickery (additional isolation transformers and probably
> a ground wire).
>

It is not electrically compatible, even with such trickery. The
trickery involved to get full duplex signalling on a single twisted pair
involves a good deal more than a transformer!

Currently, two-wire PHY's are rare and expensive. But that's due to
their novelty - if they become popular and quantities go up, prices will
drop. Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
bus, and does not need a switch. (I'm not sure if it supports PoE.)

BGB

unread,
Jun 26, 2022, 2:43:21 PMJun 26
to
For telephone wire, I was mostly thinking of 2-pair CAT3 (as opposed to
4 pair CAT3).

For 10/100, it should be OK, since usually the other two pairs are just
sitting around mostly doing nothing.

Apparently, 2-pair CAT3 (still with RJ45) was actually a thing at one
point for some LANs.



However, I have noted that for some short/cheap Ethernet cables, the
RJ45 connectors are crimped onto pieces of flat ribbon cable,
implying that one "can" probably get by with cheaper (non-twisted /
CAT1) wire in some cases (I would guess probably for runs of say 1 or 2
meters or similar, would likely need some testing).


Though, I guess one difference between RJ11 and RJ45 is that a lot of
the through-hole RJ45 plugs have built-in isolation transformers,
whereas a board built for RJ11 plugs might need to supply these itself.


>>
>> Only thing is that it would require an adapter to plug RJ11 into RJ45,
>> though other options:
>>    Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>>    Cable which has RJ11 on one end but RJ45 on the other.
>>      (Side-stepping the need for an adapter at the switch).
>>
>>
>> Probably put the pins in the plug in such a way that it doesn't have
>> adverse effects if someone tries to plug a telephone into it.
>>
>> Say:
>>    NC, A+, B-, B+, A-, NC
>>
>> Normal phone only connecting to the B pair (vs across the A/B pairs).
>>
>> With the POE system I am imagining, if one did connect across the A/B
>> pairs, plugging a phone into it would result in it ringing
>> continuously, whereas if only the B pair is connected (probably the TX
>> pair from the phone's end), it would be silent and there would be zero
>> net voltage from the phone's end.
>>
>>
>>> Even easier, however, is simply to pass the power over the spare
>>> pairs in a standard 4-pair Ethernet cable.
>>>
>>
>> I had assumed doing a thing of running 48 VAC or similar between the
>> two differential pairs.
>>
>
> AC is expensive in embedded systems.  DC is cheap.
>

Small transformer and 4 diodes, should be manageable.

With 48VDC, one is going to need a buck converter, which requires an
inductor and sense-and-control circuitry.

One can do AC->DC with 4 diodes and a capacitor, which is not a huge
cost in any sense, and AC allows using a linear transformer to step down
to 5V or similar.


It would likely be simpler and cheaper than traditional DC PoE:
Doesn't care which pair is which;
Doesn't need any logic to detect what is on the other end.

Traditional PoE involves a pointlessly complicated "negotiation" step.


Eg: For this, just sorta stick 48VAC on the line and assume it is good.
Though, this would not be compatible with devices assuming DC PoE.

But, in a way, this would be a potential advantage to using RJ11, so one
doesn't accidentally mix them up (not sure how well the DC devices would
tolerate being plugged into AC).


>> This should work OK, but needs at least 4 wires (2 for each pair).
>> On the device side, there would probably be a bridge rectifier
>> connected to the center-taps of an isolation transformer.
>>
>>
>> For PoE with this system, an RJ11<->RJ45 adapter could also function
>> as the AC injector, say with a pair of isolation transformers (to let
>> the data through), with the center taps connected (via another
>> transformer) up to the mains power.
>>
>> Could make sense in the PoE case to have it as a multi-port block
>> though, say, 4-8 simultaneous connections, rather than 1 adapter per
>> cable.
>>
>>
>>> The two-wire Ethernet standards already include support for simpler
>>> and cheaper PoE solutions.
>>>
>>
>> OK, would need to look into it.
>>
>>
>> But, would assume that a two-wire interface is not likely to be
>> electrically compatible with traditional Ethernet, at least not
>> without some additional trickery (additional isolation transformers
>> and probably a ground wire).
>>
>
> It is not electrically compatible, even with such trickery.  The
> trickery involved to get full duplex signalling on a single twisted pair
> involves a good deal more than a transformer!
>
> Currently, two-wire PHY's are rare and expensive.  But that's due to
> their novelty - if they become popular and quantities go up, prices will
> drop.  Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
> bus, and does not need a switch.  (I'm not sure if it supports PoE.)
>

OK.


Not sure how it works, haven't found much information thus far.


I guess if I were to consider designing something, a few possibilities:
Simple DC signaling, say:
An RX and TX line, possibly at 3.3v or similar;
Using 8b/10b or similar;
Would need a common ground;
Not likely suitable for longer-distance signaling:
Signal integrity and possible ground-loop issues.
Likely point-to-point only (couldn't do a passive bus).
Something sorta like a CAN bus:
Maybe still using 8b/10b or similar;
Basically, one has a wire that is normally pulled high;
Pull low to send bits, half-duplex.
Likely still subject to ground-loop and propagation delays
Differential signaling, likely tri-state (+/0/-):
Would at least avoid ground loop issues and similar;
Back to requiring isolation transformers and similar.
Wired in a star, there might also be an issue with echoes (1)



1: Potentially echoes would be worse with the isolation transformers, as
induction is likely to reflect an inverted version of the signal back
down the wire, which, if it hits the other transformers, might flip back
into a positive image, ... Cases where constructive interference occurs
could potentially raise the "noise floor" enough to interfere with
transmission (such as corrupting transmitted frames).


Echo rate at 10 Mbps would be around 1 bit every 30 meters, probably
manageable, and the reflection should mostly "self annihilate" at the
point it is received. For a "sane" network size, the echo is likely to
almost entirely dissipate within a few bits.


At 100 Mbps, it would be 1 bit every 3 meters, potentially a bigger
issue, as echoes would propagate for comparably longer. A lot would also
depend on the inductive properties of the isolation transformers.

(It might take several hundred bits of "silence" for the echoes to die
down following a transmitted message with, say, a 100 meter network).

This issue would likely be significantly reduced if doing point-to-point
signaling (constructive interference could not occur). It would depend
mostly on the "sensitivity" at each end; one would want sensitivity
low enough that it doesn't detect reflected bits, but high enough that
it does detect transmitted bits.

...

Brett

unread,
Jun 26, 2022, 4:31:25 PMJun 26
to
An optimizer pass to convert all loads that are only used once to ACC ops
is pretty trivial; as a software guy, I am not asking for Itanic compiler
changes.

All the other 16-bit variants use restricted register sets and have deep
effects on register coloring, etc. I tried a dozen of these on paper and
despite my propaganda of the time they all sucked. ;(

ACC ops follow the KISS principle. Simple one for one substitution of 16
bit opcodes for the longer ones where the operation tree makes it possible.

To make best use of it you need a bunch of 32-bit ops that use ACC, or
better yet just use a register like R1 and remove that register from
normal use, a reduction of one register for the other compiler passes.

MitchAlsup
Jun 26, 2022, 5:56:38 PM

While I can, in general, agree that you are not asking for Titanic additions
to the compiler, you are asking for Titanic alterations of the axioms and
tenets underlying the encoding philosophy of My 66000 ISA. Basically,
you are asking for a complete reset. I am not interested in starting over.
However, you are free to design whatever 16-bit instructions you want for
your architecture.
>
> All the other 16-bit variants use restricted register sets and have deep
> effects on register coloring, etc. I tried a dozen of these on paper and
> despite my propaganda of the time they all sucked. ;(
<
You and Quadriblock should get together and compare notes........maybe
you could teach him about "what to leave out" as part of your comp.arch
meeting.
>
> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
<
I am well aware of how accumulator machines perform. Rather well on the
60%+ code sequences, and less well on the 40%- code sequences. Whereas
a 32-bit only ISA has but one way of expressing arithmetic and does not so
suffer.
<
Remember I have an <essentially> pure RISC ISA that is achieving x86-64
code density--significantly better than <¿almost?> all other pure RISC ISAs.
I got here by eliminating instruction functionality that could be embodied
elsewhere in the data-path and uniformly encoded in ISA. This makes each
instruction more powerful without adding delay to its execution and makes
the compiler's job a bit easier in expressing the necessary semantics.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
<
Yech.

BGB
Jun 26, 2022, 6:15:25 PM

A few thoughts:
The restricted set should not be smaller than 1/2 the baseline
"full-sized" set.

So, for 32 GPRs, 4b (16 regs) is OK, 3b (8 regs) a bit less so.

I skipped 3R encodings in the 16-bit space, as they would be
"essentially useless".

For size-optimized code, one does have to bias the compiler towards
using a smaller set of registers (the set usable by 16-bit encodings),
which does typically come at a performance cost (in my case, this is
mixed with the other drawback that one can't predicate or bundle the
16-bit encodings).


So, the general result is that the program is roughly 50% bigger if
built in speed-optimized modes.

Size optimized mode: around 60% 16-bit, 40% 32-bit;
Speed optimized mode: around 20% 16-bit, 80% 32-bit.

Speed-optimized 16/32 is still generally smaller than a fixed-length
32-bit subset though (well, and more so if one disallows Jumbo
encodings, which adds an additional size penalty).



> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
>

Side note:
Why are R0 and R1 "special" in my case?...

Mostly because early on, I removed them from normal use to have
registers which the ASM stage could stomp without warning.


R0 was typically used to load temporary values into when the Immed
field was insufficient.

Say, without Jumbo:
  ADD R4, 123, R5      //OK, can use an immediate form
  ADD R6, 123456, R7   //Not OK, doesn't fit.

So, the assembler would quietly emit it as:
  LDIZ 123456, R0      //OK
  ADD R6, R0, R7       //Also OK
and pretend that a larger immediate existed.

This simplified higher-level code-generation, which could then mostly
pretend that the immediate field was always large enough.

This is also why "LDIz Imm24, R0" exists, despite eating a relatively
huge amount of encoding space.

With Jumbo, these cases can usually be encoded directly (saving 1 or 2
clock cycles).


Though, there are still a lot of edge cases where LDIZ ends up being used:
  Because I have reloc types for LDIZ but not for the Jumbo forms;
  Cases where the instruction-encoding logic lacks the case to deal with
  the jumbo-form encoding;
  ...

Sadly, this part of my compiler has gathered a fair bit of cruft.

Partly, it is a case where my initial strategy:
  Emitter function for each instruction type;
  Big "switch()" blocks to fill out instruction words for each mnemonic;
  ...
didn't scale as well as I had hoped...


At this stage, I might have done better with a different approach:
  Use a big listing (grouped by mnemonic);
  Pattern match against the listing;
  Emit the first-found instruction which matches the pattern.

I had used this approach in my x86 assemblers, which is admittedly a
little more scalable than "giant switch() blocks".

However, "switch()" blocks (with logic paths for every possible
instruction encoding) are initially simpler if one assumes a reasonably
small ISA.
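
For what it's worth, a minimal sketch of that listing-based approach;
the table layout, operand-kind codes, and templates here are invented
for illustration, not BJX2's actual encodings:

  #include <stdio.h>
  #include <string.h>

  enum { OPK_REG, OPK_IMM10, OPK_IMM16 };   /* hypothetical operand kinds */

  typedef struct {
      const char *mnemonic;
      int src_kind, dst_kind;  /* pattern to match against parsed operands */
      unsigned base;           /* instruction-word template (made up) */
  } Pattern;

  /* Grouped by mnemonic; smaller forms listed first, so the first match
     found is also the shortest encoding. */
  static const Pattern table[] = {
      { "ADD", OPK_IMM10, OPK_REG, 0xD4000000u },
      { "ADD", OPK_IMM16, OPK_REG, 0xD8000000u },
      { "ADD", OPK_REG,   OPK_REG, 0x00100000u },
  };

  static const Pattern *match(const char *mn, int src, int dst) {
      for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
          if (!strcmp(table[i].mnemonic, mn) &&
              table[i].src_kind == src && table[i].dst_kind == dst)
              return &table[i];  /* emit the first pattern that fits */
      return NULL;               /* no encoding; caller errors or widens */
  }

  int main(void) {
      const Pattern *p = match("ADD", OPK_IMM10, OPK_REG);
      if (p) printf("template %08X\n", p->base);
      else   printf("no match\n");
      return 0;
  }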



Then of course, some parts of my compiler, such as the WEXifier, do
their work mostly on already-emitted machine code, which is kind of a
stupid approach, but was the easiest approach at the time.

Well, and the way it works is kind of limited:
  Look at 3 instructions (A, B, C);
  Is A or B fixed/immovable?
    If so, advance 1 and continue.
  Can A, B, and C be bundled? If yes:
    Bundle A, B, and C;
    Advance 3;
    Continue.
  Can A and B be bundled? If yes:
    Bundle A and B;
    Advance 2;
    Continue.
  Can B and C be swapped?
    Will swapping B and C allow bundling A and C? If yes:
      Swap B and C;
      Continue.
  Can A and B be swapped?
    Will swapping A and B allow bundling A and C? If yes:
      Swap A and B;
      Continue.
  Advance 1 and continue
    (couldn't do anything here, so leave the instruction as-is).

There are a few more cases than this, but this is the general algorithm.
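
In rough C, that window logic might look something like the following;
Insn and the predicate/mutator functions are stand-ins for whatever the
real compiler uses, so this is a sketch of the algorithm as described,
not the actual code:

  /* Sliding 3-instruction bundling window over an array of Insn pointers. */
  typedef struct Insn Insn;
  extern int is_fixed(const Insn *);                /* reloc, jumbo, already bundled, ... */
  extern int can_swap(const Insn *, const Insn *);  /* register deps, memory ops, SR.T, ... */
  extern int can_bundle2(const Insn *, const Insn *);
  extern int can_bundle3(const Insn *, const Insn *, const Insn *);
  extern void bundle2(Insn *, Insn *);
  extern void bundle3(Insn *, Insn *, Insn *);

  static void swap_ptrs(Insn **x, Insn **y) { Insn *t = *x; *x = *y; *y = t; }

  void wexify(Insn **ops, int n) {
      for (int i = 0; i + 2 < n; ) {
          Insn *a = ops[i], *b = ops[i+1], *c = ops[i+2];
          if (is_fixed(a) || is_fixed(b)) { i++; continue; }
          if (can_bundle3(a, b, c)) { bundle3(a, b, c); i += 3; continue; }
          if (can_bundle2(a, b))    { bundle2(a, b);    i += 2; continue; }
          if (can_swap(b, c) && can_bundle2(a, c)) {
              swap_ptrs(&ops[i+1], &ops[i+2]); /* A,C now adjacent; bundled next pass */
              continue;
          }
          if (can_swap(a, b) && can_bundle2(a, c)) {
              swap_ptrs(&ops[i], &ops[i+1]);   /* A moves next to C */
              i++;                             /* advance so we don't undo the swap */
              continue;
          }
          i++;  /* nothing applied; leave the instruction as-is */
      }
  }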


The logic for "can A and B be swapped" mostly checks for things like
register dependencies, whether they are memory load/store ops, whether
they depend on or modify SR.T, etc...

Immovable cases mostly involve cases which are no longer simple 32-bit
instructions:
  Instructions which have a reloc;
  The LDIz Imm24 op (generally assumed to have a reloc);
  Load/Store ops with a PC-relative encoding;
  Jumbo encodings;
  Stuff that is already bundled;
  ...


There is another step that runs before this, with a similar algorithm;
it just tries to swap instructions if doing so will reduce the number of
pipeline interlocks.

As can be noted, these aren't particularly good at extracting all that
much ILP. They will also not perform swaps if doing so would require an
intermediate swap which is not allowed, ...

(For example, manually optimizing stuff in ASM frequently involves
non-local instruction movements, which aren't really allowed here, since
validating a movement much beyond 2-instruction swaps becomes
exponentially harder.)


It is likely that a better approach would be a more general process, say:
  Generate a permutation index (0..N);
  Look for a permutation which minimizes the number of interlocks and
  maximizes the number of bundled encodings;
  Swap the instructions into the found ordering.

Then say, use either a gradient-descent style approach (a sketch in C
follows below):
  Scan instructions;
  Check if A can be swapped with B;
  Evaluate if doing so would improve the metric (globally);
  If so, swap.
  Continue until no more swaps occur.

Or, maybe a genetic-algorithm style approach:
  Pick instructions at random to try to swap, to breed more patterns;
  Rank each by the metric;
  Continue for 4*N generations;
  Use the best-found as the final ordering.


But, these would be a bit of work to add, not sure how much they would
gain over the current approach (which only operates on a window of 3
instructions).
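
A minimal sketch of the first variant ("gradient descent" here really
being greedy hill-climbing), assuming some global integer cost() metric
such as interlocks minus bundleable pairs; all names are placeholders:

  /* Keep trying adjacent swaps while the global metric strictly improves. */
  typedef struct Insn Insn;
  extern int can_swap(const Insn *, const Insn *);
  extern int cost(Insn **ops, int n);   /* lower is better; integer, bounded below */

  void improve(Insn **ops, int n) {
      int changed = 1;
      while (changed) {                 /* terminates: cost strictly decreases */
          changed = 0;
          for (int i = 0; i + 1 < n; i++) {
              if (!can_swap(ops[i], ops[i+1])) continue;
              int before = cost(ops, n);
              Insn *t = ops[i]; ops[i] = ops[i+1]; ops[i+1] = t;    /* try it */
              if (cost(ops, n) < before) changed = 1;               /* keep */
              else { t = ops[i]; ops[i] = ops[i+1]; ops[i+1] = t; } /* undo */
          }
      }
  }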


Note that before the WEXifier runs, the compiler behaves as if it is
generating code for a plain RISC-style ISA (the compiler has no
understanding of modulo scheduling, which is at present mostly limited
to hand-written ASM).

...


In other news, after tweaking a few things in my compiler to prioritize
the "LDIz Imm10, Rn" and "ADD Imm10, Rn" encodings over their Imm16
equivalents, the Imm10 encodings now dominate over the Imm16 encodings
in terms of instruction count (roughly a 90% / 10% split).

This implies that most of the constant loads are still fairly small (can
fit easily into an Imm10).

The reason for the change was mostly that Imm10 allows a few encoding
cases (such as PrWEX) which the Imm16 versions do not (without this
change, the compiler was missing a lot of cases where it could WEXify
the predicated ops, which happened to include a lot of constant loads
and similar).

This modifies the prior stats some, eg:
  Imm8: 11%
  Imm10: 63%
  Imm16: 7%
  Binary16: 12% (these are now the majority of the F8 block)
  Imm33: 4.4%
  Imm64: 2.5%

...

BGB
Jun 26, 2022, 8:28:11 PM

On 6/26/2022 4:56 PM, MitchAlsup wrote:
> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>
>>> Then you end up with a cartesian product of various ways to do the same thing.
>>> This makes the compiler have to figure out if::
>>> <
>>> INC Rd ; NoOp
>>> is better or worse than:
>>> ADD Rd,Rd,#1
>>> and what set of rules (or heuristics) govern the selection. I have worked inside
>>> compilers trying to figure this kind of stuff out. It ain't straightforward.
>>> <
>>> I submit that compilers are best when they only need to figure how to do
>>> something exactly one way.
> <
>> An optimizer pass to convert all loads that are only used once to ACC ops
>> is pretty trivial, as a software guy I am not asking for Itanic compiler
>> changes.
> <
> While I can, in general, agree that you are not asking for Titanic additions
> to the compiler, you are asking for Titanic alterations of the axioms and
> tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> you are asking for a complete reset. I am not interested in starting over.
> However, you are free to design whatever 16-bit instructions you want for
> your architecture.

Yeah, there are only so many possible combinations.


>>
>> All the other 16-bit variants use restricted register sets and have deep
>> effects on register coloring, etc. I tried a dozen of these on paper and
>> despite my propaganda of the time they all sucked. ;(
> <
> You and Quadriblock should get together and compare notes........maybe
> you could teach him about "what to leave out" as part of your comp.arch
> meeting.

Yeah, the funky obsession with non-power-of-2 data sizes and similar
would probably be high on my list.


>>
>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>> bit opcodes for the longer ones where the operation tree makes it possible.
> <
> I am well aware of how accumulator machines perform. Rather well on the
> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> a 32-bit only ISA has but one way of expressing arithmetic and does not so
> suffer.
> <
> Remember I have an <essentially> pure RISC ISA that is achieving x86-64
> code density--significantly better than <¿almost?> all other pure RISC ISAs.
> I got here by eliminating instruction functionality that could be embodied
> elsewhere in the data-path and uniformly encoded in ISA. This makes each
> instruction more powerful without adding delay to its execution and makes
> the compiler's job a bit easier in expressing the necessary semantics.

I will assume in this you *don't* mean the apparent common variation of
x86-64 which takes upwards of 1.5MB to build Doom...



>>
>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>> just use a register like R1, and remove that register from normal use, a
>> reduction of one register for the other compiler passes.
> <
> Yech.

Agreed...

There isn't really a good reason to have hard-coded registers "in
general" with 32-bit instruction encodings, particularly not for an
accumulator.


I will make a partial exception for loading a big constant to a fixed
register, where having an instruction for a larger constant load could
partly offset the drawback of having "not particularly large" immediate
fields in other contexts.


Also, at the time, 24 bits were sufficient in general for things like
data/bss loads and stores, whereas something more modest (such as 16 or
20 bits) would not have been (several of these programs effectively have
several MB of '.bss').

...

Ivan Godard
Jun 26, 2022, 9:17:30 PM

There are advantages to dedicated base registers, set as a side effect
of other operations. You can keep them where they are used, instead of
in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
saving fan-in. If you restrict addressable regions to not cross a 4GB
(or less) boundary then you can have a narrower address adder (Mill: 32
bits vs 64). You avoid the loads with fat constants. You lower pressure
on the genregs. The address-using instructions need fewer bits to encode
the base (Mill: 3 vs 5). The base can be used to select a particular WKR
for range checking, avoiding the trip to the PLB. The base can be used
as the lwb in its WKR, reducing state for task switch or call.

And so on.

MitchAlsup
Jun 26, 2022, 9:18:47 PM

On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
> On 6/26/2022 4:56 PM, MitchAlsup wrote:
> > On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
> >> MitchAlsup <Mitch...@aol.com> wrote:
> >>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> >
> >>> Then you end up with a cartesian product of various ways to do the same thing.
> >>> This makes the compiler have to figure out if::
> >>> <
> >>> INC Rd ; NoOp
> >>> is better or worse than:
> >>> ADD Rd,Rd,#1
> >>> and what set of rules (or heuristics) govern the selection. I have worked inside
> >>> compilers trying to figure this kind of stuff out. It ain't straightforward.
> >>> <
> >>> I submit that compilers are best when they only need to figure how to do
> >>> something exactly one way.
> > <
> >> An optimizer pass to convert all loads that are only used once to ACC ops
> >> is pretty trivial, as a software guy I am not asking for Itanic compiler
> >> changes.
> > <
> > While I can, in general, agree that you are not asking for Titanic additions
> > to the compiler, you are asking for Titanic alterations of the axioms and
> > tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> > you are asking for a complete reset. I am not interested in starting over.
> > However, you are free to design whatever 16-bit instructions you want for
> > your architecture.
> Yeah, there are only so many possible combinations.
<
nearly infinite permutations........
> >>
> >> All the other 16-bit variants use restricted register sets and have deep
> >> effects on register coloring, etc. I tried a dozen of these on paper and
> >> despite my propaganda of the time they all sucked. ;(
> > <
> > You and Quadriblock should get together and compare notes........maybe
> > you could teach him about "what to leave out" as part of your comp.arch
> > meeting.
> Yeah, the funky obsession with non-power-of-2 data sizes and similar
> would probably be high on my list.
<
I was wondering if anyone caught the reversal in his thread "...life the universe
and everything." In Hitchhiker's Guide to the Galaxy the answer to "life the universe
and everything" is/was 42, whereas Quadriblock's thread converges on 24, which
is the simple reverse of 42.
> >>
> >> ACC ops follow the KISS principle. Simple one for one substitution of 16
> >> bit opcodes for the longer ones where the operation tree makes it possible.
> > <
> > I am well aware of how accumulator machines perform. Rather well on the
> > 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> > a 32-bit only ISA has but one way of expressing arithmetic and does not so
> > suffer.
> > <
> > Remember I have an <essentially> pure RISC ISA that is achieving x86-64
> > code density--significantly better than <¿almost?> all other pure RISC ISAs.
> > I got here by eliminating instruction functionality that could be embodied
> > elsewhere in the data-path and uniformly encoded in ISA. This makes each
> > instruction more powerful without adding delay to its execution and makes
> > the compiler's job a bit easier in expressing the necessary semantics.
<
> I will assume in this you *don't* mean the apparent common variation of
> x86-64 which takes upwards of 1.5MB to build Doom...
<
No, in general I am talking about "never having to" in terms of loading
constants, pasting constants together, negating or inverting operands,
and the prologue and epilogue handling instructions.
> >>
> >> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> >> just use a register like R1, and remove that register from normal use, a
> >> reduction of one register for the other compiler passes.
> > <
> > Yech.
> Agreed...
>
> There isn't really a good reason to have hard-coded registers "in
> general" with 32-bit instruction encodings, particularly not for an
> accumulator.
>
The only hard-coded registers are::
a) R0 receives the return address when control is delivered to a subroutine.
b) ENTER and EXIT use R31 = SP to build and tear down stack frames.
<
However, there are circumstances where the HW understands that ranges
of registers have certain properties. For example: when Safe-Stack mode
is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
address, and R16-R30 when read before write return zeros.
>
> I will make a partial exception for loading a big constant to a fixed
> register, where having an instruction for a larger constant load could
> partly offset the drawback of having "not particularly large" immediate
> fields in other contexts.
>
My 66000 never has to load a constant to a register. Somewhere close
to 9% of my instruction count vanishes by uniformly supplying large
constants.
>
> Also at the time, 24 bits would be sufficient in-general for things like
> data/bss loads and stores, whereas something more modest (such as 16 or
> 20 bit) would not have been sufficient (several of these programs
> effectively have several MB of '.bss').
>
In My 66000 memory mapping, a simple application such as cat can have
the .text, .data, .bss, ... each separated by GBs in the virtual address
space and yet only needs 1 page of memory-mapping tables !
> ...

Ivan Godard
Jun 26, 2022, 10:57:39 PM

Doesn't have to, but isn't it advisable when a constant has widespread uses?


Brett
Jun 26, 2022, 11:00:24 PM

MitchAlsup <Mitch...@aol.com> wrote:
> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
> <snip>
>>> I looked at this a few years ago and the damage to long term ISA growth
>>> was catastrophic. As it is I have nearly ½ of the OpCode space in each
>>> OpCode group left for the future, and can PARSE instructions in 31 gates
>>> with only 4 gates of delay. All that goes out the window with a meaningful
>>> 16-bit "extension". I pass.
> <
>> I don’t get why you think you need to reserve half your opcode space for
>> future extensions, I would have thought we are at the end of history for
>> opcode extensions.
> <
> a) because I have watched what happens to processors over 50 years and
> how every time you turn around they have more OpCodes--mainly to address
> stuff forgotten earlier.
> <
> b) remember I compacted everything into only 59 actual instructions.
>>
>> What is the cost of reserving 3 bits of one pattern and the same pattern at
>> the 16 bit border, so that you can add 16 bit opcodes in the future?
> <
> The cost of reserving space for 16-bit is that it over-constrains the 32-bit
> OpCode space. For example: I could not give the 16-bit OpCodes a typical
> subgroup (6-bit Major OpCode) because the first instruction would only
> have 10-bits left !! (16-6=10)

I can work with 10 bits, ideally 11, for 59 instructions, which would be
two subgroups.

ACC opcodes only need one register field plus 5 bits for the opcode,
which hits all the common cases this extension would use, so 10 bits
works fine.
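
Purely as an illustration (not a layout anyone here has proposed), 10
bits of payload inside a 16-bit word could be carved up as a 6-bit major
group, a 5-bit ACC opcode, and a 5-bit register, which still reaches all
32 GPRs:

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical 16-bit ACC format: [15:10]=major group, [9:5]=op, [4:0]=reg. */
  #define MAJOR(w)  (((w) >> 10) & 0x3F)
  #define ACCOP(w)  (((w) >>  5) & 0x1F)  /* 5 bits: 32 ACC operations */
  #define REG(w)    ( (w)        & 0x1F)  /* 5 bits: all 32 GPRs reachable */

  int main(void) {
      uint16_t w = (uint16_t)((0x2A << 10) | (3 << 5) | 17); /* group 0x2A, op 3, r17 */
      printf("major=%02X op=%u reg=r%u\n",
             (unsigned)MAJOR(w), (unsigned)ACCOP(w), (unsigned)REG(w));
      return 0;
  }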

Note that this extension will not slow down code the way all the other
16-bit architectures do. You will get a small boost from a smaller code
footprint. 32-bit instructions are unaffected, and do all the heavy
lifting.

With two opcode groups it should wire directly into your existing logic,
which actually happens anyway if you split this extension into smaller
groups spread into the open opcode spaces where appropriate.

BGB
Jun 26, 2022, 11:14:27 PM

FWIW: 128/3 ~= 42 ...

But, yeah, in a more practical sense, unusual data sizes aren't much of
a win.


>>>>
>>>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>>>> bit opcodes for the longer ones where the operation tree makes it possible.
>>> <
>>> I am well aware of how accumulator machines perform. Rather well on the
>>> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
>>> a 32-bit only ISA has but one way of expressing arithmetic and does not so
>>> suffer.
>>> <
>>> Remember I have an <essentially> pure RISC ISA that is achieving x86-64
>>> code density--significantly better than <¿almost?> all other pure RISC ISAs.
>>> I got here by eliminating instruction functionality that could be embodied
>>> elsewhere in the data-path and uniformly encoded in ISA. This makes each
>>> instruction more powerful without adding delay to its execution and makes
>>> the compiler's job a bit easier in expressing the necessary semantics.
> <
>> I will assume in this you *don't* mean the apparent common variation of
>> x86-64 which takes upwards of 1.5MB to build Doom...
> <
> No, in general I am talking about "never having to" in terms of loading
> constants, pasting constants together, negating or inverting operands,
> and the prologue and epilogue handling instructions.

My case is pretty much comparable to x86-64 in what you can do with
immediate and displacement fields:
  x86-64: 8 and 32; 64 via a dedicated load.
  BJX2: 9 and 33; 64 via a dedicated load.

But, it differs in that it has 3R, and is Load/Store.

Also, I have both more registers, and encodings that are often smaller.
Though, x86-64 has:
  MOV Rn, Imm64 (80 bits)
where BJX2 has:
  LDI Imm64, Rn (96 bits)
So, x86-64 wins this one...


However, compared with some of the other options (eg: SH-4 or Thumb2),
x86-64's code density seems to be "pretty bad".

Even something like ARMv5 (with fixed-length 32-bit instructions) seems
to be able to beat x86-64 on this metric.




>>>>
>>>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>>>> just use a register like R1, and remove that register from normal use, a
>>>> reduction of one register for the other compiler passes.
>>> <
>>> Yech.
>> Agreed...
>>
>> There isn't really a good reason to have hard-coded registers "in
>> general" with 32-bit instruction encodings, particularly not for an
>> accumulator.
>>
> The only hard-coded registers are::
> a) R0 receives the return address when control is delivered to a subroutine.
> b) ENTER and EXIT use R31 = SP to build and tear down stack frames.
> <
> However; there are circumstances where the HW understands that ranges
> of registers have certain properties. For example: when Safe-Stack mode
> is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
> address, and R16-R30 when read before write return zeros.

OK.

I have R0, R1, and R15/SP hard-coded in certain contexts.

LR is nominally in CR space, but I sometimes used R1 as a "Secondary
Link-Register" or "Saved-Link-Register", mostly in the context of prolog
and epilog compression.


R0 and R1 can be used as scratch registers (with care), however:
  The assembler may stomp them without warning in some cases;
  For some instructions, they are not allowed:
    In some cases, the encodings are special cases;
    Some instructions are simply not allowed to use them.
  ...


>>
>> I will make a partial exception for loading a big constant to a fixed
>> register, where having an instruction for a larger constant load could
>> partly offset the drawback of having "not particularly large" immediate
>> fields in other contexts.
>>
> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

Stuff like:
  if(x>=100)
    x=99;
is still kind of a thing...

But, yeah:
  y=x+12345678;
can be handled with an Imm33s encoding.



But, as noted, the FAzz_zzzz and FBzz_zzzz encodings predate the
addition of Jumbo encodings. In the original form of the ISA (and in ISA
subsets without Jumbo) they are more useful.


But, as noted, these only exist in the "unconditional" subspace:
  FAzz_zzzz  LDIZ Imm24u, R0
  FBzz_zzzz  LDIN Imm24n, R0
WEX Space (same spot, just WEX=1):
  FEzz_zzzz  Jumbo
  FFzz_zzzz  Op64
Predicate Space:
  EAzz_zzzz  PrWEX F0?T
  EBzz_zzzz  PrWEX F2?T
  EEzz_zzzz  PrWEX F0?F
  EFzz_zzzz  PrWEX F2?F

So, whether or not these are "still" useful, their "twins" elsewhere in
the encoding space allow for a few other encoding spaces to exist.

Likewise:
  FFdd_dddd_FAdd_dddd  BRA Abs48
  FFdd_dddd_FBdd_dddd  BSR Disp48
That is, the op is effectively bundled with itself to encode the Abs48
branches.

Had I put pretty much anything else there, I would not have been able to
do this stuff in this way.


>>
>> Also at the time, 24 bits would be sufficient in-general for things like
>> data/bss loads and stores, whereas something more modest (such as 16 or
>> 20 bit) would not have been sufficient (several of these programs
>> effectively have several MB of '.bss').
>>
> In My 66000 memory mapping, a simple application such as cat can have
> the .text, .data, .bss, ... each separated by GBs in the virtual address
> space and yet only need 1 page of memory mapping tables !

OK.

I am mostly dealing with programs like Doom and Quake effectively
needing ~ 21..23 bits to be able to address across the size of their
'.bss' section.

One can eliminate a lot of bits by using GBR (Global-Base-Register) or
similar, but one still needs a lot of bits to deal with the size of the
section.

One could use fewer bits by using a GOT, but this just sorta replaces the
large displacements with additional memory loads and tables (not really
a win).


Excluding things like the larger displacements needed for branching and
addressing into '.bss' and similar, most other constants tend to be a
lot smaller.


...

Thomas Koenig
Jun 27, 2022, 1:33:40 AM

Ivan Godard <iv...@millcomputing.com> wrote:

> There are advantages to dedicated base registers, set as a side effect
> of other operations. You can keep them where they are used, instead of
> in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
> saving fan-in. If you restrict addressable regions to not cross a 4Gb
> (or less) boundary then you can have a narrower address adder (Mill: 32
> bits vs 64).

Hmm... does that mean that the Mill has to do special things to
address arrays > 4 GB?

Thomas Koenig
Jun 27, 2022, 1:37:35 AM

MitchAlsup <Mitch...@aol.com> wrote:

> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

It does not have to, but it makes sense to do so (and is
now done) if the same constant is stored multiple times, see
https://github.com/bagel99/llvm-my66000/issues/2 .
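
The general shape of the case in that issue is something like the
following (illustrative C, not the actual test case from the link):

  /* The same wide constant stored to several places: materializing it in
     a register once and storing the register beats repeating the full
     immediate in every store instruction. */
  void fill(double *a, long long *b) {
      a[0] = 1.0e-9;
      a[5] = 1.0e-9;
      b[0] = 0x123456789ABCDEFLL;
      b[9] = 0x123456789ABCDEFLL;
  }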

BGB
Jun 27, 2022, 2:51:05 AM

Agreed, sometimes it makes sense.


FWIW: In BGBCC, constants are generally treated like a sort of read-only
variable, and so may be pulled into a register just as a variable would be.

Though, admittedly, my compiler isn't smart enough to make decisions
about when it might be better to use an immediate, or pull the value
into a register (this decision would need to be made at the
code-generation level rather than the assembler level). This would
likely require adding a heuristic of some sort.
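
A plausible shape for such a heuristic, purely illustrative (the
thresholds and immediate widths below are invented; BGBCC does not
actually do this):

  /* Decide whether to materialize a constant in a register or use it inline. */
  typedef struct { long long value; int use_count; } ConstInfo;

  static int fits_imm(long long v, int bits) {  /* signed-immediate range check */
      long long lim = 1LL << (bits - 1);
      return v >= -lim && v < lim;
  }

  int should_load_to_register(const ConstInfo *c) {
      if (fits_imm(c->value, 10)) return 0;     /* small: inline is free-ish */
      if (fits_imm(c->value, 16))               /* 32-bit form is still cheap */
          return c->use_count >= 4;
      return c->use_count >= 2;                 /* jumbo/64-bit: reuse pays off */
  }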

Though, I would suspect cases like the one shown are probably in the
minority.


The specific case shown in the issue tracker would not come up in BJX2:
it doesn't support directly storing a constant to memory, so it would be
forced in this case to load the value into a register and then store it,
and by extension the register would likely be reused across the multiple
stores (since each time one tries to reload it, the compiler will see
that it is already present in a register).

There are cases where this could be useful, just that I don't expect
they would come up often enough to justify the encoding.

Ivan Godard
Jun 27, 2022, 10:27:21 AM

Arrays can be of any size that mmap is willing to give you. Generated
code for constant offsets bigger than 2^32 builds a pointer by explicit
arithmetic rather than by using the address adder. The arithmetic costs
two instructions (con, addp) and a cycle, but saves in the AA for all
offsets < 2^32. We judge that to be a worthwhile tradeoff.

EricP
Jun 27, 2022, 1:11:03 PM

Is there something odd about the assembler at the top, where it seems
to repeatedly spill a register to the local stack frame and then use it
as a pointer? E.g.

  std r26,[sp,216]
  .loc 1 268 1 ; fatigue2.f90:268:1
  std #-4317352126650676160,[r26]

plus does this 5 other times.
Just checking.

BGB
Jun 27, 2022, 3:31:19 PM

On 6/26/2022 10:14 PM, BGB wrote:
> On 6/26/2022 8:18 PM, MitchAlsup wrote:
>> On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
>>> On 6/26/2022 4:56 PM, MitchAlsup wrote:
>>>> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com
>>>>>> wrote:
>>>>

<snip>

>
> My case is pretty much comparable to x86-64 in what you can do with
> immediate and displacement fields:
>   x86-64: 8 and 32; 64 via a dedicated load.
>   BJX2: 9 and 33, 64 via a dedicated load.
>
> But, differs in that it has 3R, and is Load/Store.
>
> Also, I have both more registers, and encodings that are often smaller.
>   Though, x86-64 has:
>     MOV Rn, Imm64 (80-bits)
>   Where:
>     LDI Imm64, Rn (96-bits)
>   So, x86-64 wins this one...
>
>
> However, compared with some of the other options (eg: SH-4 or Thumb2),
> x86-64's code density seems to be "pretty bad".
>
> Even something like ARMv5 (with fixed-length 32-bit instructions) seems
> to be able to beat x86-64 on this metric.
>
>


So, thought is, comparing a few common-case encodings for size, in bytes,
between x86-64 (x64) and BJX2 (mostly for cases where direct analogs exist):

  MOV Rxx, Imm64 | LDI Imm64, Rn
    10 (x64) vs 12 (BJX2), x64 wins
  MOV Rxx, Imm32 | LDI Imm8, Rn
    7 (x64) vs 2 (BJX2), BJX2 wins
    (x86-64 lacking a smaller immediate than Imm32 here)

  MOV Rxx, [Rb+Ri*8] | MOV.Q (Rm, Ri), Rn
    4 (x64) vs 4 (BJX2), tie
  MOV Rxx, [Rb+Disp8] | MOV.Q (Rm, Disp9u), Rn
    4 (x64) vs 4 (BJX2), tie

  MOV Rxx, [Rb+Ri*Sc+Disp8] | MOV.Q (Rb, Ri*Sc, Disp11), Rn
    5 (x64) vs 8 (BJX2), x64 wins
  MOV Rxx, [RIP+Disp32] | MOV.Q (PC, Disp33), Rn
    7 (x64) vs 8 (BJX2), x64 wins

  ADD Rxx, Rxx | ADD Rm, Rn
    3 (x64) vs 2 (BJX2), BJX2 wins

  ADD Rn, Imm32 | ADD Imm8, Rn
    7 (x64) vs 2 (BJX2), BJX2 wins
    (another scenario where x86-64 lacks smaller immediates;
    you don't get byte values unless working on byte registers)
  ADD Rn, Imm32 | ADD Imm16s, Rn
    7 (x64) vs 4 (BJX2), BJX2 wins
  ADD Rn, Imm32 | ADD Imm33s, Rn
    7 (x64) vs 8 (BJX2), x64 wins
  MOV Rt, Imm64; ADD Rn, Rt | ADD Imm64, Rn
    13 (x64) vs 12 (BJX2), BJX2 wins


In many areas where there is a direct 1:1 comparison, x86-64 seems to
be ahead, except in cases where x86-64 only has a 32-bit immediate.

I am generally assuming encodings here where REX is used (typical case,
basically required to use registers as 64-bit).


I de