
Why My 66000 is and is not RISC


MitchAlsup

Jun 22, 2022, 9:03:25 PM

I could not find the question asking me to make a list of why My 66000
instruction set architecture is like and unlike the tenets of the original
RISC. So I spent some time looking up what the internet is currently saying
about RISCs. There is a short list, but I will start with a few statements
from Hennessey and Patterson::

Hennessey:: The goal of any instruction format should be: 1. simple decode,
2. simple decode, and 3. simple decode. Any attempts at improved code
density at the expense of CPU performance should be ridiculed at every
opportunity.

Patterson:: more is not better -- microcode is bad
Subroutines need low overhead

RISC axioms:
a) the ISA is primarily designed to make the pipeline simple.
b) the ISA is primarily designed as a target for compilers.
c) instructions only exist if they add performance.
d) frequently accessed data is kept in registers.

RISC tenets:
a) 1 word == 1 instruction
b) 1 instruction flows down the pipeline in 1 cycle
c) 1 instruction can cause 0 or 1 exception
d) instruction encoding uses few patterns
e) there is a large uniformly addressable register space

So where does My 66000 ISA stand with respect to these axioms and
tenets::

RISC axioms: My 66000 ISA embodies all of the RISC axioms
RISC tenets: My 66000 ISA rejects ½ of RISC tenets

With minor exceptions to both::

My 66000 contains 32×64-bit general purpose registers. Some might
think this is too few and that an FP register file should be added. Looking
at code such as BLAS, the Livermore Loops, and Linpack indicates otherwise
-- as long as one assumes some hints of OoO pipelining. Looking at
various C libraries, this seems perfectly sufficient.

My 66000 ISA contains 6 decoding patterns; 1 for each of
{instructions with 16-bit immediates, instructions with 12-bit
immediates, scaled memory reference, 2-operand reg-reg,
1-operand reg-reg, 3-operand reg-reg }

The 12-bit immediate format is used for shift instructions and
for Predicate instructions, and is positioned such that predicate
instructions are only 1 bit different from their corresponding
branch instructions. This saves 6×16-bit immediate encodings.

Scaled memory reference, 1-operand, 2-operand, 3-operand
all have access to 32-bit or 64-bit immediates/displacements
in substitution for a register. This eliminates any need to use
instructions or waste registers pasting constants together.

1-operand, 2-operand, and 3-operand instructions all have sign control
over their operands. There is no SUB instruction; My 66000 uses
ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
instructions from execution. The 2-operand group allows the
5-bit register specifier to be used as a 6-bit sign-extended
immediate, making ADD Rd,#1,-Rs2 easily encoded.

There are Compare instructions that return a bit-vector of everything
the compare circuitry can determine, including range checks like
0 < Rs1 <= Rs2 and classifications {-infinity, -normal, -denormal, -zero,
+zero, +denormal, +normal, +infinity, SNaN, QNaN}. I remain tempted
to add "any byte equal", "any halfword equal", and "any word equal".

There are 2 kinds of conditional flow: branching and predication, and
each has 2 principal kinds of instructions:: the condition is determined
from a single bit in a register, or the condition is determined by comparing
a register with 0. In addition there are unconditional branches, jumps,
and a special addition supporting PIC for method calls and switches.
Compare-to-zero-and-branch can access certain HW-known information
that cannot be stored in an ISA register--this includes things
like a query to the Memory Unit asking if it has seen any interference
between the start of an ATOMIC sequence and "now". The exception,
interrupt, and std. return instructions are also encoded here.

Memory reference instructions enable building of ATOMIC primitives
that can touch as many as 8 cache lines of data in a single ATOMIC
event. This is equivalent to the MIPS LL and SC except it operates
over much larger chunks of data. This is sufficient to move an entry
of a shared data structure from one place to another place in a single
event. This minimizes the number of ATOMIC events that are needed,
and comes with guarantees of forward progress.
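
To make "move an entry of a shared data structure" concrete: unlinking a
node from one doubly-linked list and inserting it into another touches
about six memory locations spread over several cache lines. The C sketch
below shows the accesses involved (illustration only, assuming circular
lists with sentinel headers -- it shows no synchronization; the point of
the primitive above is that these accesses can be wrapped in one ATOMIC
event).

struct node { struct node *prev, *next; };

/* Move node n from wherever it is onto the front of the list headed by dst. */
static void move_node(struct node *n, struct node *dst) {
    n->prev->next = n->next;      /* unlink from the source list           */
    n->next->prev = n->prev;
    n->next = dst->next;          /* link in behind the destination header */
    n->prev = dst;
    dst->next->prev = n;
    dst->next = n;
}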

The ST instruction can store a constant in either 5-bit sign extended
form, or in 32-bit or 64-bit forms. No need to put a constant into a
register in order to ST it to memory. This is along with the ability
to use 32-bit or 64-bit displacement constants.

There are 5 "special" memory reference instructions:: ENTER is used
to setup a new stack, and save registers, EXIT is used to tear down the
stack and restore registers, LDM loads multiple registers, STM stores
multiple registers, and MM moves data from memory to memory.
MM has the property that both cached and uncached memory smaller
than a page is moved as a single ATOMIC transfer. {PCIe can do this,
so should CPUs attached to PCIe peripherals.} There is expected to
be a sequencer in the memory unit that performs these out of the
data-path.

The Floating Point group includes Transcendental instructions.
Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
that are only 1 constant different in the calculations. Ln2 takes
only 14 cycles, sin takes 19 cycles. These are included because
they actually do improve performance.

Conversions between FP and FP or FP and INT are provided by
1 instruction (CVT) which has 49 variants to deal with 5 specified
rounding modes and 1 implied rounding mode (current) any time
a rounding could transpire. This falls into the category of "once
you have the HW to do <say> ANINT (of FORTRAN) you have
95% of the logic to do them all".
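
For reference, the C99 rounding family covers the same ground from the
software side; per the description above, each of these would be a
different CVT variant (the mapping is an illustration, not a statement of
the actual encoding):

#include <math.h>
void round_family(double x, double r[5]) {
    r[0] = trunc(x);      /* round toward zero                     */
    r[1] = floor(x);      /* round toward -infinity                */
    r[2] = ceil(x);       /* round toward +infinity                */
    r[3] = round(x);      /* round to nearest, ties away from zero */
    r[4] = nearbyint(x);  /* current (implied) rounding mode       */
}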

The exception model is based on message passing (as are SVCs),
rather than wandering through the high-level OS exception
dispatcher. This model supports threads (processes or tasks)
that are paranoid of the OS looking at their data (such as banking
applications running on a home PC), and can indeed restrict the
OS from looking at the address space.

I/O devices are virtualized, and operate in the virtual address
space of the originating requestor. So while the I/O device can DMA
directly into a paranoid application's address space, and while the OS
can verify that the given space and bounds are acceptable, the OS cannot
look into that address space. This gets rid of the need for a secure
mode of operation.

Deferred procedure calls are handled as messages (argument
setup + 1 instruction) with continuation. The messaging sub-system
operates over both HyperVisor and GuestOS domains simultaneously.
Anyone with a "method" can call that method and get a response
even if that method is running under a different GuestOS.

There is a 66-bit remapped address space--any thread can access
64 bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
and ROM}. The address space is configured to efficiently transport
requests over a significant network (a la HyperTransport and the Intel
equivalent). DRAM is cache coherent, configuration is strongly ordered,
MMIO is sequentially consistent, and ROM is "lax".

The system repeater transports requests from chip to chip, and
amalgamates coherence requests so that the originator counts
responses from cores on his chip, and the number of chips in
the system (rather than counting from every core).

Memory management cannot be turned off--My 66000 implementations
come out of reset with the MMUs turned on. HostBridge
is configured with an MMU/TLB that uses exactly the same tables as
CPUs and can share tables as applicable. Levels in the virtual
address space translations can be skipped! So an application as
simple as 'cat' can be managed with a single page of translation
overhead.

Memory management is inherently HyperVisor/GuestOS. Privilege
is determined by the assortment of root pointers in use on a per
invocation basis.

GuestOS can activate a thread (taking it from a waiting state to
running in a core) in a single instruction and remotely. So can
HyperVisor.

Finally, there is no notion of one thread morphing into a different
thread over a series of instructions manipulating control registers
one by one. For example: an ISR cleanup handler takes a thread
off a wait state queue, places it on a run state queue, and signals
GuestOS to see what threads should be running "right now". This
is all 1 instruction and 1 cycle as far as the core performing the
instruction sees.

My 66000 is not just another ISA, it is a rethink of most of the components
that make up a system. A context switch from one thread to another
within a single GuestOS is 10 cycles. A context switch from one thread
to a thread under a different GuestOS remains 10 cycles. The typical
current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
across GuestOSs.

OH, and BTW, The FP transcendentals are patented.

Terje Mathisen

Jun 23, 2022, 2:00:14 AM

MitchAlsup wrote:
> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".

If you can add the in-reg SIMD compare ops without slowing stuff down,
please do so!

Even having VMM, if you can identify the final \0 byte anywhere in a
64-bit reg, then that's a win for lots of code.
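
For comparison, the usual portable fallback without such an op is the SWAR
test below (a minimal C sketch; finding *which* byte is zero still takes a
count-leading/trailing-zeros on the resulting mask):

#include <stdint.h>
/* Nonzero iff the 64-bit word x contains at least one zero byte. */
static inline uint64_t has_zero_byte(uint64_t x) {
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}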

That said, just having your current VMM setup would obviate the need for
SIMD style ops in almost all programs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Marcus

Jun 23, 2022, 3:01:43 AM

On 2022-06-23, Terje Mathisen wrote:
> MitchAlsup wrote:
>> There are Compare instructions that return a bit-vector of everything
>> the compare circuitry can determine, including range checks like:
>> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
>> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
>> to add "any byte equal", "any halfword equal", "any word equal".
>
> If you can add the in-reg SIMD compare ops without slowing stuff down,
> please do so!

In-reg SIMD can be useful. In MRISC32 you can do:

seq.b r2, r1, z ; Byte-wise compare r1 to zero, "Set if EQual"
bz r2, foo1 ; Branch if no byte equal (mask zero)
bnz r2, foo2 ; Branch if any byte equal (mask not zero)
bs r2, foo3 ; Branch if all bytes equal (mask set)
bns r2, foo4 ; Branch if any byte not equal (mask not set)

...and similar with seq.h for half-words. There are also inequality and
gt/lt comparisons, for instance.

Note: "Set" means all bits of the byte/half-word/word are 1. The
opposite (false) outcome of the set instructions is that all bits of the
byte/half-word/word are 0.

The cute part is that I did not have to add special "SIMD" branch
instructions, since the same instructions make sense for both packed and
unpacked comparison results.

/Marcus

MitchAlsup

Jun 23, 2022, 3:30:44 PM

In today's installment I touch on things about My 66000 not covered above.

My 66000 ISA requires an instruction buffer and a 2-stage instruction
processing pipeline I call PARSE and DECODE. Hennessey would be booing
at this point. However, using this, I get branch overhead down to 0.03 cycles
per taken branch without having any delay slot. {This also makes a unified
L1 cache feasible. But since Fetch and MemRef are so far apart on the die,
my implementations have chosen not to utilize this capability.}

PARSE finds the instruction boundaries (main job), scans ahead for branches,
determines which function units are needed, and looks for CoIssue opportunities.
The scanned-ahead branches are processed in parallel by DECODE to fetch branch
targets even before the branch instruction is executed. So if a taken prediction
is made, the instructions on the taken path are already ready to enter execution.
PARSE identifies immediates and displacements and cancels register port requests,
providing opportunities for ST to read the register file...

DECODE processes the instructions from PARSE, accesses the register file,
computes forwarding, and starts instructions into the execution pipeline.
DECODE routes immediates and displacements to the required instructions.
ST instructions pass through DECODE twice: the 1st time is for AGEN, the
2nd time is for ST.data when a register file port is available.

---------------------------instruction stuff-----------------------------------------------------------

The shift instructions have 2×6-bit fields dealing with the shift amount and
the width of the data being shifted. These are used to access odd-sized data
(a la EXTRACT) and to SMASH data calculated at "machine" size back down
into containers of "language" size, so a container cannot hold a value
outside of its range. When the width field is 0 it is considered
to be 64 bits. When encoded as an immediate, the 2 fields are back-to-back;
when found in a register there are 26 bits separating the 2 fields. In data<38:32>,
both 1000000 and 0000000 are considered to be 64 bits, while 1xxxxxx
with any of the x's non-zero raises an Operand exception.
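
In C terms, the extract behaviour described above is roughly the following
(a sketch of the semantics only, not of the encoding):

#include <stdint.h>
/* Unsigned extract: 'offset' and 'width' are the two 6-bit fields;
   width == 0 is treated as a full 64-bit width. */
static inline uint64_t extract_u(uint64_t x, unsigned offset, unsigned width) {
    x >>= (offset & 63);
    if ((width & 63) == 0)
        return x;
    return x & ((1ULL << (width & 63)) - 1);
}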

The Multiplex Instruction MPX: MPX basically allows selecting bits from
a pair of registers based on another register:: ( a & b ) | ( ~a & c ). However,
it has other flavors to provide ( !!a & b ) | ( !a & c ), which is CMOV, and by
using the immediate encodings in My 66000 it provides MOV Rd,#IMM32 and
MOV Rd,#IMM64 along with MOV Rd,Rs1 and MOV Rd,Rs2. These fall out
for free, saving MOV opcodes elsewhere.
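
In C, the two flavors amount to the following (a sketch of the semantics only):

#include <stdint.h>
/* Bit-select flavor: bits of b where mask a is 1, bits of c where a is 0. */
static inline uint64_t mpx_bits(uint64_t a, uint64_t b, uint64_t c) {
    return (a & b) | (~a & c);
}
/* CMOV flavor: a is treated as a whole-register boolean. */
static inline uint64_t mpx_cmov(uint64_t a, uint64_t b, uint64_t c) {
    return a ? b : c;
}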

Vectorization: My 66000 ISA contains loop vectorization. This allows
vectorized loops to run at several iterations per cycle, so even
1-wide machines can perform at 32+ instructions per cycle in such
loops. My main (as yet unproven) hope is that this takes the pressure
off of the design width. The basic argument is as follows:
a) 1-wide machines operate at 0.7 IPC
b) 2-wide SuperScalar machines operate at 1.0 IPC
c) GBOoO machines operate at 2.0 IPC
d) programs spend more than ½ their time in loops.
So, if one can get a 2× performance advantage on the 1-wide machine,
this puts it within spitting distance of the GBOoO machine, which in turn
means the Medium OoO machine can be competitive with the GBOoO
machine at significantly lower {cost, design time, area, power}.
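
The loops in question are just ordinary scalar code; a minimal C sketch of
the sort of loop this applies to (nothing here is ISA-specific -- the
overlapping of iterations is the hardware's job):

void daxpy(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}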

AND while investigating loop vectorization, I discovered that a RISC
pipeline with a 3R-1W register file can perform 1.3 IPC. Branch
instructions (20%) do not use the result register, ST instructions
(10%) can borrow the write port AFTER cache tag and translation
validations, AND in the general code I have seen there is significant
opportunity to perform write-elision in the data path, freeing up even
more ports. This, again, takes pressure off the width of the design.
So, with vectorization, a 3 (or 4)-wide machine is competitive with
a 6-wide machine.

None of this prevents wide GBOoO or makes it more difficult.

----------------------instruction modifiers------------------------------------------

CARRY is the first of the Instruction-Modifiers. An instruction-modifier
supplies "bits" for several future instructions so that one does not need
the cartesian product of a given subset encoded in the ISA. Thus, there
are shift instructions, and when used with CARRY these perform shifts
as wide as you like -- 128, 256, 512, ... bits -- with no need to clog up
the encoding space for lightly used but necessary functionality. Even in
the FP arena, CARRY provides access to exact FP arithmetic.

CARRY provides access to multiprecision arithmetic both integer and FP.
CARRY provides a register which can be used as either/both Input and Output
to a set of instructions. This provides a link from one instruction to another
where data is transmitted but not encoded in the instruction itself.
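
In C, the dataflow a CARRY-modified ADD pair expresses looks roughly like
this (a sketch of the semantics; the point above is that the carry link is
supplied by the modifier rather than by extra instructions):

#include <stdint.h>
typedef struct { uint64_t lo, hi; } u128;

/* 128-bit add built from two 64-bit adds with an explicit carry between them. */
static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
    return r;
}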

Since we are in the realm of power-limited designs, My 66000 ISA has an ABS
instruction. On the integer side, this instruction can be performed
by co-opting the sign control built into the data path and be "executed"
without taking any pipeline delay (it executes in zero cycles). On the
FP side it never adds any latency (it executes in zero cycles). ABS always
takes less power than performing the operation in any other way.

DBLE is an instruction modifier that supplies register encodings and
adds 64 bits to the calculation width of the modified instruction. Applied
to an FP instruction: DBLE Rd1,Rs11,Rs21,Rs31 ; FMAC Rd2,Rs12,Rs22,Rs32
we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
and presto: we get FP128 by adding exactly 1 instruction. The compiler
can pick any 8 registers it desires, alleviating register allocation concerns.
DBLE is a "get by" kind of addition, frowned upon by Hennessey.

I can envision a SIMD instruction modifier that defines the SIMD parameters
of several subsequent instructions and allows 64-bit SIMD to transpire.
I am still thinking about these. What I cannot envision is a wide SIMD
register file--this is what VVM already provides.

These instruction-modifiers, it seems to me, are vastly more efficient
than throwing hundreds to thousands of unique instructions into ISA.
Especially if those unique instructions <on average> are not used
"that much".

-----------------------------Safe Stack--------------------------------------------------------

Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}
When Safe Stack is in use, the return address goes directly to the Safe
Stack, and return address comes directly off safe stack. Preserved
registers are placed on Safe Stack {ENTER} and their register values
(conceptually) set to 0. Safe Stack is in normal thread memory but
the PTEs are marked RWE = 000 so any access causes page faults.
EXIT reloads the preserved registers from Safe Stack and transfers
control directly back to caller. When Safe Stack is not in use, R0
is used to hold the return address. Proper compiled code runs the
same when safe stack is on or off, so one can share dynamic libraries
between modes.

Safe Stack monitors the value in SP and KILLs lines that no longer
need to reach out into the cache hierarchy, and Safe Stack can efficiently
use Allocate memory semantics. Much/most of the time, nothing
in the Safe Stack leaves the cache hierarchy.

Buffer overflows on the "stack" do not corrupt the call/return flow of
control. ROP cannot happen, as the application has no access to the Return
Address. The application cannot see the values in the preserved registers,
augmenting safety, and certainly cannot modify them.

-------------------------------ABI----------------------------------------------------------------------

Subroutine Calling Convention {A.K.A. ABI}:
Registers R1..R8 contain the first 8 arguments to the subroutine.
SP points at argument[9]
R9..R15 are considered temporary registers
R16..R29 are preserved registers
R30=FP is a preserved register but used as a Frame Pointer when
..............language semantics need one.
R31=SP is a preserved register and used as a Stack Pointer. SP must
..............remain doubleword aligned at all times.

ABI is very RISC

So, let's say we want to call a subroutine that wants to allocate 1024
bytes on the stack for its own local data, is long running and needs
to preserve all 14 preserved registers, and is using an FP along with the
SP. Let us further complicate the matter by stating this subroutine
takes a variable number of arguments. Entry Prologue:

ENTRY subroutine_name
subroutine_name:
ENTER R16,R8,#(1024 | 2)

At this point the register-passed arguments have been saved with the
memory-passed arguments, FP is pointing at the "other" end of the local
data on the stack, 1024 bytes have been allocated off the SP after pushing
the registers, and the old FP has been saved and the new FP set up.
{This works both with and without Safe Stack}

Your typical RISC-only ISA would require at least 29 instructions to
do this amount of work getting into the subroutine, and another 17
getting out. If the ISA has both INT and FP register files 29 becomes 37.

The same happens in the Epilogue: 1 instruction.

While the ABI is very RISC, the Prologue and Epilogue instructions are not.

As a side note: My 66000 is achieving similar code density as x86-64.

A few other interesting side bits:------------------------------------------------------------

LDM and STM to unCacheable addresses are performed as if ATOMIC::
that is:: as a single bus transaction. All interested 3rd parties see the
memory either before any writes have been performed or after all writes
have been performed. A device driver can read several MMIO device
control registers and know that nobody else in the system has accessed
the device control registers in a way that could cause interference. A
device driver can likewise store to multiple control register locations
without interference.

There is a page ¿in ROM? known to contain zeros. A Memory Move
instruction can cause a page accessing this ¿ROM? data to be zeroed
without even bothering to access ¿ROM?--and the entire page is zeroed
at the target. Thus, pages being reclaimed to the free pool are but 1
instruction away from being in the already zeroed page pool. Zeroing
pages is performed at the DRAM end of the system (coherently). And
no <deleterious> bus activity is utilized.


Brett

Jun 23, 2022, 7:28:53 PM

X86-64 has crap code density; your one-instruction stack save/restore alone
should make you significantly better, unless perhaps you have gone 32+32.

Add some accumulator ops and most instructions will fit in 16-bit ops with
ease, and you have the extra decode stage to do it anyway.

I would argue that 8-bit opcodes are best when you have an accumulator in
your 32-register RISC design, but that is a bridge too far for most.

How big is the code store needed for an IOT (Internet Of Things smart
toaster) code stack? And what is the savings for the next size down?

BGB

Jun 23, 2022, 9:08:33 PM

On 6/22/2022 8:03 PM, MitchAlsup wrote:
> I could not find the question asking me to make a list of why My 66000
> instruction set architecture is like and unlike the tenets of the original
> RISC. So I spent some time looking up what the internet is currently saying
> about RISCs. There is a short list, but I will start with a few statements
> from Hennessey and Paterson::
>
> Hennessey:: The goal of any instruction format should be: 1. simple decode,
> 2. simple decode, and 3. simple decode. Any attempts at improved code
> density at the expense of CPU performance should be ridiculed at every
> opportunity.
>
> Patterson:: more is not better -- microcode is bad
> Subroutines need low overhead
>
> RISC axioms:
> a) the ISA is primarily designed to make the pipeline simple.
> b) the ISA is primarily designed as a target for compilers.
> c) instructions only exist if they add performance.
> d) frequently accessed data is kept in registers.
>

BJX2 generally upholds the above.

While some instructions are pretty niche, most still tend to have
use-cases, and I am mostly trying to avoid adding stuff that is
(completely) useless.


> RISC tenets:
> a) 1 word == 1 instruction
> b) 1 instructions flows down the pipeline in 1 cycle
> c) 1 instruction can cause 0 or 1 exception
> d) instruction encoding uses few patterns
> e) there is a large uniformly addressable register space
>

My case, 3 out of 5.

a, 16/32 and bundle-encodings break this one.

d, Some extra complexity exists due to the lack of an architectural Zero
Register and similar, and some instructions (early on) which ended up
with both 2R and 3R encodings.

Early on, I wasn't confident, for example, that "ADD R4, R5" and "ADD
R5, R4, R5" would have been semantically equivalent in all cases.


There were some other cases (Mostly 32-bit 2R Load/Store variants) which
were dropped due to being entirely redundant with the 3R encodings (or
which became redundant once predication was added).

Some other parts of the ISA also ended up being dropped and then later
re-added a few times before becoming more-or-less permanent (and some
other features are in limbo due to not really adding enough to
justify their existence).


> So where does My 66000 ISA stand with respect to these axioms and
> tenets::
>
> RISC axioms: My 66000 ISA embodies all of the RISC axioms
> RISC tenets: My 66000 ISA rejects ½ of RISC tenets
>
> With minor exceptions to both::
>
> My 66000 contains 32×64-bit general purpose registers. Some might
> think this is too few and a FP register file should be added. Looking
> at code such as BLASS, Livermore Loops, Linpack indicates otherwise
> -- as long as one assumes some hints of OoO pipelining. Looking at
> various C libraries this seems perfectly sufficient.
>

My case: 32|64 x 64-bit.

I am still on the fence as to whether 32 GPRs is "fully sufficient", or
whether 64 GPRs can offer enough gain (in certain use-cases) to justify
its existence. It "kinda helps" for TKRA-GL but is seemingly kinda moot
for pretty much everything else.


The way the encodings for the 64 GPR case are handled is a bit hacky,
but it was a tradeoff (I came up with something which could be done
without breaking binary compatibility or requiring a separate operating
mode). Ironically, everything still works OK so long as "most of the
code" sticks to only using the low 32 GPRs (otherwise, some of the seams
might start to show).


> My 66000 ISA contains 6 decoding patterns; 1 for each of
> {instructions with 16-bit immediates, instructions with 12-bit
> immediates, scaled memory reference, 2-operand reg-reg,
> 1-operand reg-reg, 3-operand reg-reg }
>

Hmm (8 major for 32-bit):
FZnm_ZeoZ //3R "Rm, Ro, Rn"
FZnm_ZeZZ //2R "Rm, Rn"
FZnm_Zeii //3RI (Imm9/Disp9), "Rm, Imm9, Rn" / "(Rm, Disp9), Rn"
FZnZ_Zeii //2RI (Imm10), "Imm10, Rn"
FZZZ_ZeoZ //1R (Ro treated as Rn for these)
FZZn_iiii //2RI (Imm16), "Imm16, Rn"
FZdd_Zddd //Disp20 (Branch)
FZii_iiii //"LDIz Imm24, R0"

Add a few more if one counts the 16-bit ops:
ZZnm //2R
ZZni //2RI (Imm4)
ZZnZ //1R
Znii //2RI (Imm8)
ZZdd //Disp8 (Branch)

The Jumbo and Op64 encodings may or may not be considered new forms,
however they don't actually add "new" instruction-forms per-se, but
rather modify the existing encodings in predefined ways (and reuse the
existing 32-bit decoder; just with more bits "glued on" to the instruction).

One could potentially also interpret the 32-bit encodings as zero-padded
versions of a longer internal encoding space:
FEii_iiii_FZnm_Zeii //3RI, "Rm, Imm33, Rn"
...

With a few special cases, eg:
FEii_iiii_FAii_iiii //"LDIZ Imm48, R0"
FFii_iiii_FAii_iiii //"BRA Abs48"


There are more forms if one considers "minor" patterns, but these don't
really effect instruction encoding, but more how the various parts are
interpreted and mapped to the internal pipeline:
Logically, each instruction is decoded as if it had:
3 read ports, 1 write port;
A 33-bit immediate/displacement field;
Op / Sub-Op;
...

This then combines with an outer stage that deals with the bundle as a
whole, mapping SIMD ops to two lanes, along with Abs48 and Imm64
encodings (where the immediate can't fit into a single pipeline lane).

The output of this then being the configuration for the entire pipeline.


> The 12-bit immediate format is used for shift instructions and
> for Predicate instructions and positioned such that predicate
> instructions are only 1-bit different than their corresponding
> branch instruction. This saves 6×16-bit immediate encodings.
>

Differs in my case:
Shifts and friends use Imm9 forms;
However, because one doesn't need all 9 bits for a typical shift, had
also kinda shoe-horned SIMD shuffle instructions into the mix as well.

Or, in effect, shuffle can be imagined sort of like a conjoined twin
stuck onto the shift instruction (and a variable shift imagined as
masking-off the bit that causes it to be able to behave like a shuffle).

Or, one can also imagine that there could have been an alternate
universe where passing a sufficiently out-of-range value to the shift
instruction caused it to shuffle the value instead...


Predicate instructions work very differently in my case, having their
own copy of the 32-bit encoding space which mirrors the format of the
normal opcode space (just replacing the WEX bit with a True/False bit),
and the encoding spots that would have normally encoded Imm24 and Jumbo
being repurposed as Predication+WEX / "PrWEX" (but only applying to a
subset of the ISA).


> Scaled memory reference, 1-operand, 2-operand, 3-operand
> all have access to 32-bit or 64-bit immediates/displacements
> in substitution for a register. This eliminates any need to use
> instructions or waste registers pasting constants together.
>

Via Jumbo, these can all expand to 33 bits.

The 64-bit cases are a bit more limited, but not usually a huge issue.

There are also some Imm56 encodings "on paper" (these are in a similar
limbo as the 48-bit instruction encodings).

Seemingly the vast majority of what one needs a larger immediate for can
be handled via Imm33, where, say, only about 4% of the constants
actually go outside of this limit (the vast majority of these being
either MMIO pointers or irrational floating-point constants).

The Imm56 cases looking like they would be too rare to really be worth
bothering with at present.


> 1-operand, 2-operand, 3-operand instructions all have sign control
> over their operands. There is no SUB instruction My 66000 uses
> ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
> instructions from execution. The 2-operand group allows the
> 5-bit register specifier to be used as a 6-bit sign extended
> immediate, making ADD Rd,#1,-Rs2 easily encoded.
>

No equivalent in my case.


> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".
>

No equivalent.

I did the same thing as SuperH here:
CMPxx instructions twiddle the SR.T bit;
Branches / Predication / ... all operate off the SR.T bit.

Ironically, because of the way Verilog works, so much stuff hanging off
a single bit causes it to get something like 1000x more expensive.


> There are 2 kinds of conditional flow: branching and predication and
> each has 2 principle kinds of instructions:: condition is determined
> from a single bit in a register, or condition is determined by comparing
> a register with 0. In addition there are conditionless branches, jumps,
> and a special addition supporting PIC for method calls and switches.
> Compare-to-zero and branch can access certain HW know information
> that is not capable of being stored in a ISA register--this includes things
> like a query to the Memory Unit asking if it has seen any interference
> between the start of an ATOMIC sequence and "now". The exception
> and interrupts and std. return are also encoded here.
>

Internally, conditional branches are equivalent to predicated branches
in my case.

The partial extension is RISC-V style compare-and-branch instructions,
which have sort of a "shortcut" between the ALU and branch-instruction
logic (the ALU behaves as if it were performing a CMPxx, and the EX1
unit like it were doing a Branch, with the short-cut signal taking the
place of the SR.T bit).


These are effectively mandatory for RISC-V decoder support, but still
regarded as optional in BJX2 (and pretty rarely see much use in
practice, if compared with CMPxx+Bx).

It is possible that a cheaper alternative might have been to rig up
something in the decoder to allow the Bxx instruction to have been split
into a CMPxx+Bx pair in the pipeline (but, I lack a mechanism for this
at present).


> Memory reference instructions enable building of ATOMIC primitives
> that can touch as many as 8 cache lines of data in a single ATOMIC
> event. This is equivalent to the MIPS LL and SC except it operates
> over much larger chunks of data. This is sufficient to move an entry
> of a shared data structure from one place to another place in a single
> event. This minimizes the number of ATOMIC events that are needed,
> and comes with guarantees of forward progress.
>

No equivalent.


> The ST instruction can store a constant in either 5-bit sign extended
> form, or in 32-bit or 64-bit forms. No need to put a constant into a
> register in order to ST it to memory. This is along with the ability
> to use 32-bit or 64-bit displacement constants.
>

No equivalent.
Multi-op sequence required.


> There are 5 "special" memory reference instructions:: ENTER is used
> to setup a new stack, and save registers, EXIT is used to tear down the
> stack and restore registers, LDM loads multiple registers, STM stores
> multiple registers, and MM moves data from memory to memory.
> MM has the property that both cached and uncached memory smaller
> than a page is moved as single ATOMIC transfer. {PCIe can do this,
> so should CPUs attached to PCIe peripherals.} There is expected to
> be a sequencer in the memory unit that performs these out of the
> data-path.
>

No equivalent.

All this stuff needs to be handled manually via Load/Store sequences.
However, there is a 128-bit Load/Store that is also used (primarily) to
load/store pairs of registers in prolog/epilog sequences. This was
actually its primary purpose, but 128-bit SIMD was a secondary use-case,
along with it able to be faster for 'memcpy()' and similar than using
64-bit load/store instructions.


> The Floating Point group includes Transcendental instructions.
> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> that are only 1 constant different in the calculations. Ln2 takes
> only 14 cycles, sin takes 19 cycles. These are included because
> they actually do improve performance.
>

No equivalent, nearly all math functions done in software in my case.

Originally, there were no FDIV or FSQRT instructions either, but these
exist now.

Current timings are:
FDIV: 130 cycles
FSQRT: 384 cycles


The trig functions generally run from around 500 to 1000 cycles or so
(via unrolled Taylor expansion).

There were some older versions (that came with the C library) which
didn't unroll stuff (naive algorithm); these versions seem to weigh in
at closer to around 50000-200000 clock cycles or so (in cases where they
didn't get stuck in an infinite loop).

I replaced these versions with the unrolled versions because they were
basically unusable.


> Conversions between FP and FP or FP and INT are provided by
> 1 instruction (CVT) which has 49 variants to deal with 5 specified
> rounding modes and 1 implied rounding mode (current) any time
> a rounding could transpire. This falls into the category of "once
> you have the HW to do <say> ANINT (of FORTRAN) you have the
> 95% of the logic to do them all".
>

Separate instructions in my case.

Most FP<->FP cases are dedicated converters.

The INT<->FP cases are routed through FADD (noting as how one can do
some exponent tricks and get the FADD logic to do most of the heavy
lifting).

Rounding for the converters is generally hard-wired, usually either
Approx-RNE (scalar path), or Truncate (SIMD paths).


> The exception model is based on message passing (as is SVCs),
> rather than wandering through the high-level OS exception
> dispatcher. This model supports threads (processes or tasks)
> that are paranoid of the OS looking at their data (such as banking
> applications running on a home PC), and can indeed restrict the
> OS from looking at the address space.
>

My case:
Exception Code: { Address, Tag }
Tag is a 16-bit number
(15:12): Category
(11: 8): Target Core
( 7: 0): Magic Number

The category is used to figure out which offset relative to VBR to
branch to:
VBR basically points somewhere in kernel memory;
Holds a small table of branch instructions;
Exception dispatch branches to one of these branch instructions.

When this happens, the CPU also basically saves PC and SR to special
registers, swaps the stack pointers, and switches to a special operating
mode (which blocks further interrupts and disables the MMU and similar).

Interrupts can be routed along the Ring-Bus, as with memory requests,
intended so that CPU cores can throw interrupts at each other.

The 'RTE' instruction is actually implemented as another type of
"special" exception, which basically tries to pass control back to where
the exception came from (or, at least, where the relevant registers say
the exception came from).


> I/O devices are virtualized, and operate on the virtual address
> space of originating requestor. So while the I/O device can DMA
> directly into paranoid application address space, and while OS
> can verify the given space and bounds are acceptable, OS cannot
> look into that address space. This gets rid of the need of a secured
> mode of operation.
>
> Deferred procedure calls are handled as messages (argument
> setup + 1 instruction) with continuation. The messaging sub-system
> operates over both HyperVisor and GuestOS domains simultaneously.
> Anyone with a "method" can call that method and get a response
> even if that method is running under a different GuestOS.
>

No equivalent.

Basic IO range mapping in the physical address space.


No explicit hypervisor support, but shouldn't be too hard in-theory to
fake this part in software.

Would likely be easier than it would be for x86, given the use of a
software managed TLB.


> There is a 66-bit remapped address space--any thread can access
> 64-bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
> and ROM} The address space is configured to efficiently transport
> requests over a significant network (ala HyperTransport and Intel
> equivalent). DRAM is cache coherent, configuration is strongly ordered,
> MMIO is sequentially consistent, ROM is "lax").
>

My case: 48 or 96 bit virtual, 48 bit physical.

MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
and not allow another request to pass until the former request has
completed.

All MMIO accesses are fully synchronous from the L1 cache down to the
target device (unlike normal memory), though this does mean that
accessing MMIO carries a fairly steep performance penalty relative to
normal memory accesses.


It is possible that there could be a sub-mode to mimic full 64-bit on
top of 96-bit, by treating the addressing as (32,64) rather than
(48,48). This mode would not allow shoving tag bits into the high-order
parts of base-registers though (since these bits would be interpreted as
part of the virtual address).

The Full-64 mode would effectively use 64-bit pointers, unlike 96-bit
requiring the use of 128-bit pointers. The high bits (from the quadrant
address) would effectively still function as an address-range selector
(as when using 48-bit addresses in 96-bit mode; the local 48-bit space
being a sort of isolated island within the much bigger 96-bit space).


Well, and B-Trees for address translation, because the memory overheads
of an 8-level page table would suck.


> The system repeater transports requests from chip to chip, and
> amalgamates coherence requests so that the originator counts
> responses from cores on his chip, and the number of chips in
> the system (rather than counting from every core).
>
> Memory management cannot be turned off--My 66000 imple-
> mentations come out of reset with the MMUs turned on. HostBridge
> is configured with a MMU/TLB that uses exactly the same tables as
> CPUs and can share tables as applicable. Levels in the virtual
> address space translations can be skipped! So an application as
> simple as 'cat' can be managed with a single page of translation
> overhead.
>
> Memory management is inherently HyperVisor/GuestOS. Privilege
> is determined by the assortment of root pointers in use on a per
> invocation basis.
>
> GuestOS can activate a thread (taking it from a waiting state to
> running in a core) in a single instruction and remotely. So can
> HyperVisor.
>

Very different in my case.

I can use B-Trees as page-tables, which saves memory, but this is mostly
due to this being all software, and the CPU itself doesn't really care
what the page-table looks like.

Could almost mimic x86 segmented addressing, except that pages need to
be page-aligned. I guess if one had non-page-aligned address
translation, they could rig up the MMU to pretend to be an 8086 or
something, but this would make the TLB's job a little harder (and
require more bits in the TLBE).

Most obvious would be that such a mode would come at the expense of the
ASID, reinterpreting the ASIDs bits as a displacement to add to the
translated address. Would also require physical memory pages to be
contiguous, and probably some extra hackery to deal with A20 wrap-around
behavior (I guess one would have to look into how much MS-DOS software
depends on A20 wraparound working).


> Finally, there is no notion of one thread morphing into a different
> thread over a series of instructions manipulating control registers
> one by one. For example: an ISR cleanup handler takes a thread
> off a wait state queue, places it on a run state queue, and signals
> GuestOS to see what threads should be running "right now". This
> is all 1 instruction and 1 cycle as far as the core performing the
> instruction sees.
>

Yeah, no equivalent.


> My 66000 is not just another ISA, it is a rethink of most of the components
> that make up a system. A context switch from one thread to another
> within a single GuestOS is 10 cycles. A context switch from one thread
> to a thread under a different GuestOS remains 10 cycles. The typical
> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> across GuestOSs.
>
> OH, and BTW, The FP transcendentals are patented.

I would assume you mean FP transcendentals in hardware (in whatever way
they are implemented), as opposed to in-general.

Their existence in things like "math.h" and so on would likely preclude
any sort of patent protection in the "in general" sense.


Very different, I have doubts about how well a lot of this could be
pulled off in a low-cost implementation. Best I can come up with at the
moment would effectively amount to faking it using lots of microcode or
a software-based emulation layer.


I also consider my ISA to be "fairly unique", albeit in different ways
(and a little more conservative in terms of implementation concerns).

MitchAlsup

Jun 23, 2022, 9:17:25 PM

It [the one-instruction stack save/restore] is a major contributor to getting as small as it got.
>
> Add some accumulator ops and most instructions will fit in 16 bits ops with
> ease, and you have the extra decode stage to do it anyway.
<
I looked at this a few years ago and the damage to long-term ISA growth
was catastrophic. As it is, I have nearly ½ of the OpCode space in each
OpCode group left for the future, and can PARSE instructions in 31 gates
with only 4 gates of delay. All that goes out the window with a meaningful
16-bit "extension". I pass.
>
> I would argue that 8 bit opcodes are best when you have an accumulator in
> your 32 register RISC design, but that is a bridge too far for most.
<
My 66000 only has 59 total instructions. What makes you think you need 256 ?
>
> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?
<
I have absolutely no interest in things that small. IoT devices don't need
a HyperVisor, or even that much of a supervisor. I have no interest in
register sizes smaller than 64-bits. And quite frankly, say you did get a
design that small and into production, you have to sell billions (maybe
trillions) of them at $0.05 to pay for the design team and recurring
engineering expenses.
<
If you do, more power to you.

MitchAlsup

Jun 23, 2022, 9:38:46 PM

On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
> On 6/22/2022 8:03 PM, MitchAlsup wrote:
<snip>
> > The Floating Point group includes Transcendental instructions.
> > Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> > that are only 1 constant different in the calculations. Ln2 takes
> > only 14 cycles, sin takes 19 cycles. These are included because
> > they actually do improve performance.
> >
> No equivalent, nearly all math functions done in software in my case.
>
> Originally, there were no FDIV or FSQRT instructions either, but these
> exist now.
>
> Current timings are:
> FDIV: 130 cycles
> FSQRT: 384 cycles
>
Mc 88100 did these in: divide = 56 cycles and SQRT in ~66
Mc 88120 did these in: divide = 17 cycles and SQRT in 22
>
> The trig functions generally run from around 500 to 1000 cycles or so
> (via unrolled Taylor expansion).
<
You need to use Chebyshev coefficients--more accurate, sometimes fewer
terms, always better error bounds.
>
<<snip>
> My case: 48 or 96 bit virtual, 48 bit physical.
>
> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
> and not allow another request to pass until the former request has
> completed.
<
What are you going to do when there are 24 CPUs in a system and
everybody wants to write to the same MMI/O page ?
>
> All MMIO accesses are fully synchronous from the L1 cache down to the
> target device (unlike normal memory), though this does mean that
> accessing MMIO carries a fairly steep performance penalty relative to
> normal memory accesses.
>
The penalty is inherent in the requirements. However, My 66000 can ameliorate
the latency by grouping multiple writes to neighboring MMI/O control registers
into a single bus transaction. In theory, one can write all the necessary stuff
into the control registers to cause a disk drive to DMA a disk sector to
wherever it is wanted, in a single write transaction to MMI/O and a single
DMA write transaction when the data returns.
>
<
> > My 66000 is not just another ISA, it is a rethink of most of the components
> > that make up a system. A context switch from one thread to another
> > within a single GuestOS is 10 cycles. A context switch from one thread
> > to a thread under a different GuestOS remains 10 cycles. The typical
> > current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> > across GuestOSs.
> >
> > OH, and BTW, The FP transcendentals are patented.
> I would assume you mean FP transcendentals in hardware (in whatever way
> they are implemented), as opposed to in-general.
<
You might be surprised at what was allowed in the claims.
>
> Their existence in things like "math.h" and so on would likely preclude
> any sort of patent protection in the "in general" sense.
>
Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
different because of what one can do inside a HW function unit compared
to what one can do using only instructions) with several unique features.
They even bother to get the inexact bit set correctly.
>
> Very different, I have doubts about how well a lot of this could be
> pulled off in a low-cost implementation. Best I can come up with at the
> moment would effectively amount to faking it using lots of microcode or
> a software-based emulation layer.
>
Microcode generally refers to a control machine interpreting instructions.
Is a function unit run by ROM sequencer microcode ? What if the ROM got
turned into equivalent gates: Is it still microcode, or just a sequencer ?
In any event there are only 3 different sequences used (reminiscent of
Goldschmidt DIV and SQRT sequences).
>
> I also consider my ISA to be "fairly unique", albeit in different ways
> (and a little more conservative in terms of implementation concerns).
<
And hard to read..............

BGB

Jun 24, 2022, 1:15:18 AM

On 6/23/2022 8:38 PM, MitchAlsup wrote:
> On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
>> On 6/22/2022 8:03 PM, MitchAlsup wrote:
> <snip>
>>> The Floating Point group includes Transcendental instructions.
>>> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
>>> that are only 1 constant different in the calculations. Ln2 takes
>>> only 14 cycles, sin takes 19 cycles. These are included because
>>> they actually do improve performance.
>>>
>> No equivalent, nearly all math functions done in software in my case.
>>
>> Originally, there were no FDIV or FSQRT instructions either, but these
>> exist now.
>>
>> Current timings are:
>> FDIV: 130 cycles
>> FSQRT: 384 cycles
>>
> Mc 88100 did these in / = 56 and SQRT in ~66
> Mc 88120 did these in / = 17 and Sqrt in 22

It is partly based on the strategy used:
Rig the FMUL unit into a feedback loop;
Wait N cycles for answer to converge;
Assume it has converged on the answer.

Generally seems to take roughly this long for the algo to converge on
the answer.

When I first re-added FDIV, it was using the same basic algo (just with
slightly different inputs), and took a similar number of clock-cycles.

Then I had the idea that I could tweak a few things in the Shift-Add
integer divider, and get it to also do FDIV. Though, the way it was
rigged up still needs ~ 130 cycles, but 130 is still less than 384.
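
The "feedback loop" scheme is essentially Newton-Raphson refinement of a
reciprocal; a minimal C sketch of the dataflow (the seed and pass count are
placeholders -- real hardware would use a small lookup table for the seed,
and this assumes d is a positive, normal value):

#include <math.h>
/* y converges toward 1/d; the relative error roughly squares on every pass. */
static double recip_nr(double d) {
    int e;
    frexp(d, &e);                    /* d = m * 2^e, with 0.5 <= m < 1       */
    double y = ldexp(1.0, -e);       /* crude seed, within a factor 2 of 1/d */
    for (int i = 0; i < 6; i++)      /* enough passes for double precision   */
        y = y * (2.0 - d * y);       /* one FMUL/FMA trip around the loop    */
    return y;                        /* a/b is then computed as a * (1/b)    */
}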


>>
>> The trig functions generally run from around 500 to 1000 cycles or so
>> (via unrolled Taylor expansion).
> <
> You need to use Chebyshev coefficients--more accurate sometimes fewer
> terms, always better error bounds..


Possible.

There are also a few faster algos, such as "lookup and interpolate",
but, while faster, these don't give sufficient precision to really be a
good option for the "math.h" functions (assumed to be accurate, even if
not the fastest possible).

There is also CORDIC, but I haven't really messed with it.


In any case, unrolled Taylor expansion is a few orders of magnitude
faster than calculating an exponential and factorial and performing a
floating-point divide and similar every time around the loop...
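
For reference, the shape of such an unrolled/Horner evaluation (a sketch;
the coefficients shown are the truncated Taylor ones for illustration --
the Chebyshev/minimax fit Mitch mentions has slightly different values for
the same degree and a smaller worst-case error):

/* Odd polynomial for sin(x) on a reduced range, evaluated with Horner's rule:
   no factorials, divides, or pow() anywhere in the loop. */
static double sin_poly(double x) {
    static const double c[5] = {
         1.0,
        -1.0 / 6.0,        /* -1/3! */
         1.0 / 120.0,      /*  1/5! */
        -1.0 / 5040.0,     /* -1/7! */
         1.0 / 362880.0,   /*  1/9! */
    };
    double x2 = x * x;
    double p = c[4];
    for (int i = 3; i >= 0; i--)
        p = p * x2 + c[i];
    return x * p;
}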

I was not the person who wrote that code originally, not sure why they
originally wrote it this way.


>>
> <<snip>
>> My case: 48 or 96 bit virtual, 48 bit physical.
>>
>> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
>> and not allow another request to pass until the former request has
>> completed.
> <
> What are you going to do when there are 24 CPUs in a system and
> everybody wants to write to the same MMI/O page ?

It all gets serialized to them accessing it one at a time.

Though, ideally, only device drivers and similar should be accessing
MMIO, so this isn't likely to be a huge issue.

By the time I get to 24 cores, will have probably came up with a
different solution.


There is also the option of putting device stuff on the ringbus. I had
partly already started going this way for VRAM (it is faster to write to
the framebuffer by going through the RAM interface than by going through
the MMIO interface).

However, for accessing hardware devices, in general, one kind of wants
"slower but strictly synchronous" IO over "faster but chaotic" IO.

For VRAM, it is a little different, because generally one is trying to
push several MB/sec out to the screen and don't really care if things
are strictly in-order (if things arrive in the framebuffer in a slightly
different order than they were stored into the L1 cache, who cares?...).


>>
>> All MMIO accesses are fully synchronous from the L1 cache down to the
>> target device (unlike normal memory), though this does mean that
>> accessing MMIO carries a fairly steep performance penalty relative to
>> normal memory accesses.
>>
> The penalty is inherent in the requirements. However, My 66000 can ameliorate
> the latency by grouping multiple writes to neighboring MMI/O control registers
> into a single bus transaction. In theory, one can write all the necessary stuff
> into the control registers to cause a disk drive to DMA a disk sector wherever
> in a single write transaction to MMI/O and a single DMA write transaction
> when data returns.

I was generally accessing MMIO 32 or 64 bits at a time (depending on the
device).

No DMA at present, pretty much everything is still polling IO and similar.


So, for example, for SDcard:
Store a byte to Data register.
Load Control register.
OR a bit in loaded value.
Store modified value to Control register.
Dummy load from Status register (1)
Loop:
Load from Status register.
If BUSY, Continue.
Load byte from Data register.
Repeat until bytes have been moved.
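
In C, that sequence is roughly the following (a sketch only: the register
addresses and bit positions are made up for illustration, and only the
original byte-at-a-time form is shown):

#include <stdint.h>
#define SD_DATA   (*(volatile uint8_t  *)0xF0001000u)  /* Data register (address assumed)  */
#define SD_CTRL   (*(volatile uint32_t *)0xF0001004u)  /* Control register (assumed)       */
#define SD_STAT   (*(volatile uint32_t *)0xF0001008u)  /* Status register (assumed)        */
#define CTRL_XFER (1u << 0)                            /* "start transfer" bit (assumed)   */
#define STAT_BUSY (1u << 0)                            /* BUSY bit (assumed)               */

static uint8_t spi_xfer_byte(uint8_t out) {
    SD_DATA = out;                /* store byte to Data register              */
    SD_CTRL |= CTRL_XFER;         /* load Control, OR in the bit, store back  */
    (void)SD_STAT;                /* dummy load from Status register, see (1) */
    while (SD_STAT & STAT_BUSY)   /* poll Status until the transfer completes */
        ;
    return SD_DATA;               /* load result byte from Data register      */
}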

As noted, in the original form, this hit a wall at around 600 K/s.


The modified interface adds a QDATA register (64-bit), and a different
control register bit for "Transfer 8 bytes".

This QDATA version instead hits a wall at around 5 MB/s.

This interface is sufficient for SPI, but if I went to a faster mode,
pretty much as soon as I made the switch, I would be at the bandwidth
limit of this interface (and would then need to come up with something
different).

*1: When operating at "hitting the wall" speeds, the first Status load
will almost invariably be BUSY, but the second load will typically be
"not BUSY", since the SPI transfer would have completed by the time it
has taken for the request to transfer all the way around the ring and
back again. So, a dummy load can make it faster.


Say, 13 MHz SPI gives 1.5 MB/s, but 13 MHz in UHS-I mode would boost
this up to 13 MB/s (basically, pushing 4 bits per clock-edge).

Much faster than this, and I almost may as well consider going "full
hardware" and memory mapping the SDcard...


>>
> <
>>> My 66000 is not just another ISA, it is a rethink of most of the components
>>> that make up a system. A context switch from one thread to another
>>> within a single GuestOS is 10 cycles. A context switch from one thread
>>> to a thread under a different GuestOS remains 10 cycles. The typical
>>> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
>>> across GuestOSs.
>>>
>>> OH, and BTW, The FP transcendentals are patented.
>> I would assume you mean FP transcendentals in hardware (in whatever way
>> they are implemented), as opposed to in-general.
> <
> You might be surprised at what was allowed in the claims.

OK.

>>
>> Their existence in things like "math.h" and so on would likely preclude
>> any sort of patent protection in the "in general" sense.
>>
> Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
> different because of what one can do inside a HW function unit compared
> to what one can do using only instructions) with several unique features.
> They even bother to get the inexact bit set correctly.

OK.

In my case, they don't generally get used that heavily IME, so software
is OK so long as it is not unreasonably slow.


In cases where they would have gotten used more heavily, such as sin/cos
being used for the water-warping effects in Quake, lookup tables had
been used instead to good effect.

It is also possible to reduce these lookup tables to half float
precision, since the water warp effect doesn't seem to mind all that much.


>>
>> Very different, I have doubts about how well a lot of this could be
>> pulled off in a low-cost implementation. Best I can come up with at the
>> moment would effectively amount to faking it using lots of microcode or
>> a software-based emulation layer.
>>
> Microcode generally refers to a control machine interpreting instructions.
> Is a function unit run by ROM sequencer microcode ? What if the ROM got
> turned into equivalent gates: Is it still microcode, or just a sequencer ?
> In any event there are only 3 different sequences used (reminiscent of
> Goldschmidt DIV and SQRT sequences,)


Dunno. I was just sort of imagining doing it as a big ROM on top of a
RISC-style core, with chunks of the ISA being effectively treated like
special function calls into this ROM.

It is likely that parts of the Verilog would be procedurally generated,
such as the entry points into the various functions within this ROM.


I had considered something like this a few times in my case, but
generally ended up taking a different approach:
If I can't do it directly in hardware, I won't do it at all.

Only reason I ended up with the functionality of the RISC-V 'M'
extension was because I had thought up a way to implement it affordably.


Even then, it wasn't until earlier today that I got around to adding
"proper" support for 32-bit integer divide (reducing its latency from 68
to 36 cycles). Mostly because in some cases it was being used often
enough to become significant.

But, then I ended up needing to burn a few more 3R encoding spots to add
DIVx.L and MODx.L instructions.

Would have almost just assumed making them be Op64 encodings, but then
it seemed stupid to have DIVS.Q be a 32-bit encoding but DIVS.L be
64-bit, when DIVS.L would be somewhat more common.



>>
>> I also consider my ISA to be "fairly unique", albeit in different ways
>> (and a little more conservative in terms of implementation concerns).
> <
> And hard to read..............


I would suspect a bit less so than IA-64 though.


Granted, writing ASM effectively in this case does generally require
mentally keeping track of the pipeline state and which combinations of
instructions may or may not be bundled (vs x86 which will basically just
accept whatever one throws at it, running at more-or-less full speed).

Though, as an assistance feature, the emulator will generally lint this
stuff and turn illegal combinations into a breakpoint.


Syntax mostly developed out of the SuperH syntax, with a few bits of
TMS320 and similar thrown in, shouldn't be that difficult to figure out.

...

Timothy McCaffrey

unread,
Jun 24, 2022, 10:27:53 AM6/24/22
to
On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:

> X86-64 has crap code density, your one instruction stack save restore alone
> should make you significantly better, unless perhaps you have gone 32+32.
>
The X86-64 was left with a lot of baggage because of a bad design decision to
try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
great if the instruction encoding had been refactored, and some other cruft
removed (e.g. only being able to use CL for a dynamic shift count).

It would have also been a great time to be able set up the encodings so that
the instruction parser could figure out the instruction length from the first chunk
(whatever size that was, I suspect 16 bit chunks make sense).

- Tim

John Dallman

unread,
Jun 24, 2022, 11:01:49 AM6/24/22
to
In article <2f5c8378-de57-4ef2...@googlegroups.com>,
timca...@aol.com (Timothy McCaffrey) wrote:

> The X86-64 was left with a lot of baggage because of a bad design
> decision to try and reuse the X86 decoder. Most of the remaining
> 1 byte opcodes are either barely used (STC, CLC) or deprecated
> (PUSH/POP).

Remember that the design was done by AMD, who have to tread carefully to
avoid giving Intel an excuse to claim they're breaching their X86 license
in some way.

At the time, Intel were still under the impression that Itanium was going
to conquer the world. When they realised better, AMD had Opterons on the
market. Intel wanted to build an AMD-incompatible 64-bit x86 to drive AMD
out of the market. They were restrained by Microsoft, who weren't
interested in supporting two different extended x86 ISAs.

Given how we got here, things could be a lot worse.

John

Thomas Koenig

unread,
Jun 24, 2022, 11:19:55 AM6/24/22
to
Brett <gg...@yahoo.com> schrieb:

> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?

It will be hard to beat the ARM Cortex-M based microcontrollers, which
are firmly embedded in the market, for which a lot of software has been
written, and which cost a bit more than four dollars per unit.

And if that's too expensive and you do not need the performance,
you can always use a MSP430-based one for considerably less,
less than a dollar at quantity.

The ROM on the latter is somewhere between 1KB and whatever you're
willing to pay for, and the RAM 256 bytes or more. But of course
you're still getting some analog hardware thrown in, such as an
ADC or a comparator.

Not a lot of savings, I'd say.

MitchAlsup

unread,
Jun 24, 2022, 12:15:58 PM6/24/22
to
Yes, it was MS that made intel do x86-64. Intel had a model that was within
spitting distance and MS told them the x86-64 port was already done. So,
for the first time in its life, Intel complied.
<
But look at how they have diverged after 2 decades of being almost
identical !!

BGB

unread,
Jun 24, 2022, 12:49:09 PM6/24/22
to
On 6/24/2022 10:19 AM, Thomas Koenig wrote:
> Brett <gg...@yahoo.com> schrieb:
>
>> How big is the code store needed for an IOT (Internet Of Things smart
>> toaster) code stack? And what is the savings for the next size down?
>
> It will be hard to beat the ARM Cortex-M based microcontrollers, which
> are firmly embedded in the market, for which a lot of software has been
> written, and which cost a bit more than four dollars per unit.
>

Another ISA which could potentially compete with Cortex-M might be
RISC-V RV32IMC or similar.

Pros/cons with the 'C' extension though: it is 'dog chewed' to the point
(somewhat more so than Thumb) where I wonder about the decoding cost.


Something like RV32GC would likely be a bit more expensive, as the A/F/D
extensions do a lot of stuff that I have doubts about being able to pull
off cheaply.

A more cost-effective option might be:
RV32IMZfinxZdinxC
But, not a lot of code is built for this.


> And if that's too expensive and you do not need the performance,
> you can always use a MSP430-based one for considerably less,
> less than a dollar at quantity.
>

For hobbyist use, the ones in DIP packaging (MSP430Gxxxx) were typically
being sold at several $ per chip last I bought any, but dunno about now.

QFP variants were cheaper per-chip, but QFP is much less usable (can't
use it with perfboard or DIP sockets).


They were generally cheaper than AVR8 chips, though the AVR8's typically
had more RAM and ROM space.

Performance per clock seemed to be better on MSP430 than AVR8, IME.
Though, in either case, one is not usually going to be using them for
performance-intensive tasks.

IIRC:
MSP430, 16 registers, each 16 bit, Mem/Mem addressing, Von Neumann
AVR8: 32x8b or 16x16b, Load/Store, Harvard (Split code/data spaces)


> The ROM on the latter is somewhere between 1KB and whatever you're
> willing to pay for, and the RAM 256 bytes or more. But of course
> you're still getting some analog hardware thrown in, such as an
> ADC or a comparator.
>
> Not a lot of savings, I'd say.


From what I remember, for 'G' style MSP430 chips:
ROM: ~ 4K to 32K
RAM: ~ 256B to 2K
Address space, something like:
MMIO 0000..01FF
RAM 0200..09FF (Say, 0200..02FF for 256B)
(More MMIO and/or RAM, depending on device)
ROM 8000..FFFF
For smaller ROM sizes, the lower bound moves upward
FFF0..FFFF is reset/interrupt vectors.

The G chips were typically available in DIP16/20/24 packaging IIRC.

Multi-channel ADC/DAC/... are common.
IO pins are typically capable of both In/Out in digital mode;
ADC/DAC is typically limited to certain pins;
...


The 'X' chips have a larger address space, and may have considerably
more RAM and ROM space (within a 20-bit address space). But, typically
only available in QFP packaging or similar.


One can do bit-banged SPI on the MSP430, but practically one would be
limited to fairly slow IO speeds (kHz territory). Low-speed serial is
also possible.
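A minimal sketch of what bit-banged SPI output looks like in C, mode 0,
MSB first (the pin assignments are assumptions; P1OUT and the BITx macros
are the usual msp430.h definitions):

#include <msp430.h>
#include <stdint.h>

#define PIN_SCLK BIT4   /* hypothetical pin choices */
#define PIN_MOSI BIT5

/* Each bit costs several instructions, which is what keeps the effective
   clock rate low (well below what a hardware SPI block could do). */
static void spi_bitbang_write(uint8_t b)
{
    for (int i = 0; i < 8; i++) {
        if (b & 0x80) P1OUT |= PIN_MOSI;    /* present data bit */
        else          P1OUT &= ~PIN_MOSI;
        P1OUT |= PIN_SCLK;                  /* clock high: slave samples */
        P1OUT &= ~PIN_SCLK;                 /* clock low */
        b <<= 1;
    }
}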

...

BGB

unread,
Jun 24, 2022, 1:51:56 PM6/24/22
to
Though, more extensive redesign would have made it effectively an
entirely new ISA, just with an x86 backward-compatibility mode.

But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
the baseline case (16/32), though one is mostly limited to 32-bit
encodings for WEX bundles.


As can be noted, x86-64 code density ranges from "kinda meh" to
"spectacularly bad", depending mostly on the compiler.

That said, i386 and Thumb2 are both a bit more competitive, kinda harder
to beat them on the code-density front.


I am not entirely sure what exactly is going on here (for x86-64) to
make the code density so bad (it is pretty bad even in size-optimized
modes). The difference is often somewhat outside of what could easily be
explained just by the REX prefix and similar.


Say, for example:
x86 does an Abs32 load, 6 bytes;
x86-64 does a RIP+Disp32 Load, 7 bytes.
Delta: 17% bigger.

Or:
x86 does an 2R-ADD, 2B
x86-64 does a 2R-ADD (w/ REX), 3B
Delta: 50% bigger.

Then again, things like REX prefix and tending to save/restore more
stack variables and similar could be a factor.

Possibly also an increase in 64 bit constant loads, ...

But, often times, the expansion is significantly larger than the
theoretically expected 20-50% or so.

Timothy McCaffrey

unread,
Jun 24, 2022, 3:31:52 PM6/24/22
to
You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
you have to waste a register loading the constant and then use it.
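For example (trivial C, with a made-up constant), most compilers will
materialize the value with a separate MOV into a scratch register and then
do the operation, since the ALU forms only take a sign-extended 32-bit
immediate:

#include <stdint.h>

/* x86-64 ADD cannot encode a full 64-bit immediate, so this typically
   compiles to something like
       movabs rax, 0x123456789ABCDEF0   ; 10-byte instruction
       add    rax, rdi                  ; plus the register it burned
   whereas a constant that fits in a sign-extended 32 bits would fold
   straight into the ADD. */
uint64_t add_big_const(uint64_t x)
{
    return x + 0x123456789ABCDEF0ull;
}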

The calling ABI is much different than the 386, where you just usually pushed stuff
on the stack. Now you have some stuff in registers (which registers depends on whether
you are running Windows or Linux) and some stuff on the stack.

Some registers you are required to save before the call (caller save)
and others after the call (callee save).

Since you are not saving values with a simple push (usually) in the subroutine, you go
from a 1 byte PUSH to a 5 or 6 byte MOV to stack.

Simple INC/DEC doubled in size, so probably most code now uses ADD instead (which
is probably faster because you don't have a partial CC update).

Due to stack and structure storage expanding for 8 byte values (e.g. pointers),
you can only store so many of them there using an 8 bit offset. Unfortunately,
the next step up is a 32 bit offset.

And probably a bunch of other stuff I've forgotten....

- Tim

MitchAlsup

unread,
Jun 24, 2022, 3:56:28 PM6/24/22
to
I worked on some x86 decode mechanisms while at AMD, and learned a lot about
x86 encoding {which I still consider BETTER than SPARC-Vis}
<
My 66000 ISA format and encoding is a direct result of this, and indeed, follows
your tenet of having everything needed to determine size in the first word.
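As a software analogue, determining length from the first word reduces to a
small combinational function of a few bits. A toy sketch in C (the field
layout and major-opcode values here are invented for the example, not the
actual My 66000 encoding):

#include <stdint.h>

/* Returns the instruction length in 32-bit words, looking only at the
   first word.  Because only a handful of bits feed the decision, the
   same function maps to a few gate delays in hardware. */
static unsigned insn_len_words(uint32_t first_word)
{
    unsigned major = (first_word >> 26) & 0x3F;  /* hypothetical field */
    if (major == 0x3E) return 2;   /* word + 32-bit immediate/displacement */
    if (major == 0x3F) return 3;   /* word + 64-bit immediate/displacement */
    return 1;                      /* plain 32-bit instruction */
}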
>
> - Tim

MitchAlsup

unread,
Jun 24, 2022, 4:02:53 PM6/24/22
to
On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
> >
> > Possibly also an increase in 64 bit constant loads, ...
> >
> > But, often times, the expansion is significantly larger than the
> > theoretically expected 20-50% or so.
<
> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
> you have to waste a register loading the constant and then use it.
<
My 66000 does not have this problem. AND while BGB may be able to get by with
this restriction now, you won't in 10 years hence.
>
> The calling ABI is much different than the 386, where you just usually pushed stuff
> on the stack. Now you have some stuff in registers (which registers depends on whether
> you are running Windows or Linux) and some stuff on the stack.
>
> Some registers you are required to save before the call (caller save)
> and others after the call (callee save).
<
With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
like R9-R15 are simply temps used whenever and forgotten.
>
> Since you are not saving values with a simple push (usually) in the subroutine, you go
> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
<
I go to a single instruction that pushes as much stuff as desired (by compiler)
and then allocates a stack frame for the local-variables.

Thomas Koenig

unread,
Jun 24, 2022, 4:40:20 PM6/24/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

First, thanks again for the good explanations.

> DBLE is an instruction modifier that supplies register encodings and
> adds 64-bits to the calculation width of the modified instruction. Applied
> to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> and presto: we get FP128 by adding exactly 1 instruction,

This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
introduced this; POWER still has it as the only option up to POWER8.
POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
from double double to IEEE FP for POWER 9+ (I helped a bit in that
transition, for gfortran).

> the compiler
> can pick any 8 registers it desires alleviating register allocation concerns.

Eight registers is a lot if there are only 32 to go around...

> DBLE is a "get by" kind of addition, frowned upon by Hennessey.
>
> I can envision a SIMD instruction modifier that defines the SIMD parameters
> of several subsequent instructions and allows 64-bit SIMD to transpire.
> I am still thinking about these. What I cannot envision is a wide SIMD
> register file--this is what VVM already provides.

I think a lot of the use cases could also be covered if the
processor were able to process int8 ... int64 and fp16..fp64
(with fp128 being an exception) at the width of an SIMD unit,
so something like

MOV R4,#0
VEC {R5}
LDUH R6,[R10+R4] ! Load half float into R6
LDUH R7,[R11+R4] ! Second one
FADD.F2 R7,R7,R6
STH R7,[R12+R4]
ADD R4,R4,#2
LOOP (something)

could be executed at full SIMD width. Is this feasible? Or
would it be better to do this kind of thing via SIMD?
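For reference, the scalar meaning of the loop above is just an elementwise
half-precision add; a C rendering (assuming a compiler with the _Float16
extension; the names are mine):

#include <stddef.h>

/* What the VEC/LOOP body computes, element by element.  Whether it runs
   one lane at a time or at the full SIMD width of the machine is then
   purely an implementation question. */
void vec_fadd_f16(_Float16 *dst, const _Float16 *a,
                  const _Float16 *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}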

Ivan Godard

unread,
Jun 24, 2022, 5:02:17 PM6/24/22
to
Binary compatibility is a real bear. Either leave a *lot* of free
entropy (my66), or push the problem to the software and build machinery
(Mill), or suffer bloat and slow decode (x86, RISCV).

Brett

unread,
Jun 24, 2022, 5:54:05 PM6/24/22
to
You are missing the I in internet, no wifi I can find in that chip.

Talking about a network stack to talk to your phone. Smart color changing
lightbulbs and soon all the appliances in your home, washer, dryer, stove,
microwave, thermostat, security cameras, just everything.

Plus your home router, which uses a much more powerful wifi block and CPU.

There are markets here that will pay for better code density, assuming a
network stack is significant?

MitchAlsup

unread,
Jun 24, 2022, 5:55:11 PM6/24/22
to
On Friday, June 24, 2022 at 3:40:20 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> First, thanks again for the good explanations.
> > DBLE is an instruction modifier that supplies register encodings and
> > adds 64-bits to the calculation width of the modified instruction. Applied
> > to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> > we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> > and presto: we get FP128 by adding exactly 1 instruction,
<
> This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
<
No this means FP with 14-bit exponent and 113-bit fraction (if my math is
right) pairs of double are available using exact FP arithmetics via CARRY
not DBLE.
<
> introduced this; POWER still has it as the only option up to POWER8.
> POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
> from double double to IEEE FP for POWER 9+ (I helped a bit in that
> transition, for gfortran).
> > the compiler
> > can pick any 8 registers it desires alleviating register allocation concerns.
<
> Eight registers is a lot if there are only 32 to go around...
<
It is not a machine designed to crunch FP128 all the time.
It is a machine designed so the occasional use is satisfactory.

MitchAlsup

unread,
Jun 24, 2022, 5:57:42 PM6/24/22
to
I don't see it:: a 10G or 100G network interface already has a memory footprint
(for its own buffering concerns) large enough that skimping on the CPU
and ROM seems a waste.

BGB

unread,
Jun 24, 2022, 8:55:52 PM6/24/22
to
On 6/24/2022 3:02 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
>> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
>>>
>>> Possibly also an increase in 64 bit constant loads, ...
>>>
>>> But, often times, the expansion is significantly larger than the
>>> theoretically expected 20-50% or so.
> <
>> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
>> you have to waste a register loading the constant and then use it.
> <
> My 66000 does not have this problem. AND while BGB may be able to get buy with
> this restriction now, you won't in 10 years hence.

Still better in my case than it is in RISC-V where this case would
require a memory load...


As-is (in BJX2), encodings have been defined ("on paper"), eg, for Imm56
encodings for some instructions; just they haven't been put into use yet.

Partly it is a combination of:
Non-zero decoding cost;
It is pretty rare to exceed the existing 33-bit limit for 3RI ops.

Spending 1 extra cycle to load a constant into a register isn't usually
a huge issue.


IME, the vast majority of 64-bit constant loads thus far tend to be
Binary64 constants; usually irrational or repeating 'double' constants
or similar (most other constants will be compacted down to a smaller
format).


Some constant-load stats (from my GLQuake port):
Imm8: 11% (Byte range)
Imm16: 70% (Int16 or UInt16)
Binary16: 12% (Double encoded as Half-Float)
Imm32/33: 4.4%
Imm33s: 2.2% (Int32 or UInt32, Zero/Sign Extend)
Imm32Hi: 1.1% (Int32 into high-order 32 bits, low 32 are 0)
Binary32: 0.8% (Double as a pair of Binary32)
2xBinary16: 0.3% (2xBinary32 as 2xBinary16)
Imm64: 2.5% (Fallback Case)


Some "rarely used" types:
Load value into the high 10 bits of target (rare);
Say: zzz0000000000000
Load bits into the middle of a 64-bit value (rare);
Say: 0000zzzzzzzz0000
Load 4xFP16 encoded as 4xFP8;
...
These cases seem to be rare enough to be mostly ignored.


Grouping constant loads by instruction length:
16-bit: 11%
32-bit: 82%
64-bit: 4.4%
96-bit: 2.5%

Note that this is only for discrete constant loads, and does not count
immediate values or displacements.

Constant loads reflect ~ 7.4% of the total Jumbo prefixes, with the rest
going into immediate fields.

Calculating stats:
85% Imm9/Disp9
15% Imm33/Disp33

Rough estimate of upper-bound of overflowed immed cases:
Less than 3% (Excluding Load/Store ops)
Less than 0.4% (Including Load/Store ops)

Where the Imm/Disp balance is roughly:
89% Disp (Load/Store Displacements)
11% Imm (Immediates for ALU instructions and similar).


However, given that most of the 64-bit constants (dumped into a log) are
fairly obviously either MMIO addresses or floating-point constants, the
actual bound for overflowing the 33-bit immediate range is likely much
smaller.


I don't have a stat for the relative use of Jumbo between Imm and Disp
encodings, however (based on what I typically see in disassembly dumps),
I would estimate Disp to be the dominant case.


It can be noted that a significant chunk of the cases which are being
encoded as Imm33/Disp33 could also be handled by Imm17/Disp17 encodings
(my compiler doesn't typically use these unless the instruction is
*also* using XGPR).



While arguably code "could change" here, such as due to ever-expanding
memory usage, I suspect this is less likely to be an issue in a
statistical sense.


The main thing that would be the "likely existential risk" for this,
would be programs exceeding 4GB in the ".bss" section, which would
require a bigger displacement.

For x86-64, one would run into a similar problem if text+data+bss
exceeds 2GB (thus breaking one's ability to use RIP-relative addressing).



>>
>> The calling ABI is much different than the 386, where you just usually pushed stuff
>> on the stack. Now you have some stuff in registers (which registers depends on whether
>> you are running Windows or Linux) and some stuff on the stack.
>>
>> Some registers you are required to save before the call (caller save)
>> and others after the call (callee save).
> <
> With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> like R9-R15 are simply temps used whenever and forgotten.

That is presumably how it is supposed to be...


In my case, it is roughly a 50/50 split between caller save (scratch)
and callee save (preserved) registers.

For leaf functions, one wants a lot of scratch registers, and for
non-leaf functions, a lot of callee-save registers.

But, sadly, no party can be entirely happy:
Leaf functions wishing they could have more registers to play with,
without needing to save them first;
Non-leaf functions wishing they could have more registers for variables
which wont get stomped on the next call;
...


Can note that, IIRC:
Win64 gave a bigger part of this pie to callee-save;
SysV/AMD64 gave a bigger part of the pie to caller-save.

A roughly even split seemed like an easy answer, lacking any good way to
find a general/optimal balance across a range of programs.


Conceivably, it could also be possible to have a certain number of
"flexible" registers which a compiler could use to "fine tune" the
balance in the ABI, but these would be annoying at DLL/SO edges, as it
would require "worst case" handling (treating them like caller-save when
calling an import, and like callee-save for DLL exports).

In such an ABI, likely:
2/3: Nominally Callee Save
1/3: Caller Save / Scratch
With 1/3 of the register space able to be re-balanced from callee-save
to caller save by the compiler.


>>
>> Since you are not saving values with a simple push (usually) in the subroutine, you go
>> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
> <
> I go to a single instruction that pushes as much stuff as desired (by compiler)
> and then allocates a stack frame for the local-variables.
>


I once had PUSH/POP in BJX2, but then I dropped them (mostly for
cost-saving reasons; after noting that adjusting the stack-pointer and
then using a series of stores, or performing a series of loads and then
adjusting the stack pointer, could be similarly effective).

So, it is basically using Load/Store instructions...

However, in most cases:
MOV.X Rn, (SP, Disp4*8)
Can also be encoded in a 16-bit instruction format...


x86-64 would need 2x as many instructions here, and each instruction
would also need 5-bytes to encode, ...

So, roughly a 500% encoding-cost delta in this case for x86-64 vs BJX2
(2 instructions of 5 bytes each, versus a single 2-byte MOV.X).


Then again, can also note that I am often seeing around a 300% delta
between BJX2 and x86-64 in terms of ".text" sizes and similar.

Though, I still tend to fall a bit short of being able to match Thumb2
or similar at this game...

MitchAlsup

unread,
Jun 24, 2022, 9:36:04 PM6/24/22
to
On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
> On 6/24/2022 3:02 PM, MitchAlsup wrote:

> > With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
> > I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> > like R9-R15 are simply temps used whenever and forgotten.
> That is presumably how it is supposed to be...
>
>
> In my case, it is roughly a 50/50 split between caller save (scratch)
> and callee save (preserved) registers.
<
I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
can be used to carry arguments and results covering the 98%-ile.
>
> For leaf functions, one wants a lot of scratch registers, and for
> non-leaf functions, a lot of callee-save registers.
>
> But, sadly, no party can be entirely happy:
> Leaf functions wishing they could have more registers to play with,
> without needing to save them first;
> Non-leaf functions wishing they could have more registers for variables
> which wont get stomped on the next call;
> ...
>
>
> Can note that, IIRC:
> Win64 gave a bigger part of this pie to callee-save;
> SysV/AMD64 gave a bigger part of the pie to caller-save.
<
CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
IBM 360 had only preserved registers.
VAX had only preserved registers--both had 16 registers.
>
> A roughly even split seemed like an easy answer, lacking any good way to
> find a general/optimal balance across a range of programs.
>
The choice is a lot easier 50%/50% when you have 32 registers.
>
<snip>
> >
> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
> cost-saving reasons; after noting that adjusting the stack-pointer and
> then using a series of stores, or performing a series of loads and then
> adjusting the stack pointer, could be similarly effective).
<
Push instructions make::
PUSH R1
PUSH R2
PUSH R3
more expensive than:
SUB SP,SP,#12
ST R1,[SP+8]
> ST R2,[SP+4]
> ST R3,[SP]
due to the serial dependency.
<
> The peephole HW optimizer in K9 would perform this transformation.
{Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Ivan Godard

unread,
Jun 24, 2022, 10:30:17 PM6/24/22
to
And Mill has only preserved (from the view of the caller) but you don't
have to preserve them. And only has temps (from the viewpoint of the
callee) but you don't have to clear them.

David Brown

unread,
Jun 25, 2022, 4:15:02 AM6/25/22
to
Only a /very/ tiny proportion of network interfaces are 10 G or above.
Microcontrollers rarely have more than 100 Mbit Ethernet. The next big
thing in wired networking in the embedded world is two-wire Ethernet, to
get the convenience of Ethernet networking at low cost. It comes in 10
Mb and 100 Mb varieties (a 1 Gb variety may come eventually).

For wireless communication, speeds are usually even lower. Modern NBIOT
cellular systems are designed to be extremely low power, cheap, have
longer range (20 km more than 3G and the rest). You send packets of up
to about 200 bytes of data, perhaps once a day, with a delivery time of
several seconds. Perfect for environmental monitoring, finding your
sheep, and many other tasks.

For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
bandwidth is also fine. You can get away with a few hundred bytes ram
and still have enough to control a lightbulb, thermostat, etc.

The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
MSP430's, and other small devices. Code density matters for many of them.

(Of course it's a different matter for wireless cameras and all the
other devices that need high bandwidth.)

BGB

unread,
Jun 25, 2022, 11:45:17 AM6/25/22
to
Seems like it would also be fairly trivial to chop 10/100 Ethernet down
to a 4-wire variant as well, probably using RJ11 plugs or similar.
Advantage of 4-wire is that this could allow for PoE (and 4-wire
phone-wire could be cheaper than CAT5E or similar).

Could also be electrically compatible with existing hubs and switches
via an RJ11 to RJ45 adapter.


> For wireless communication, speeds are usually even lower.  Modern NBIOT
> cellular systems are designed to be extremely low power, cheap, have
> longer range (20 km more than 3G and the rest).  You send packets of up
> to about 200 bytes of data, perhaps once a day, with a delivery time of
> several seconds.  Perfect for environmental monitoring, finding your
> sheep, and many other tasks.
>
> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
> bandwidth is also fine.  You can get away with a few hundred bytes ram
> and still have enough to control a lightbulb, thermostat, etc.
>
> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
> MSP430's, and other small devices.  Code density matters for many of them.
>
> (Of course it's a different matter for wireless cameras and all the
> other devices that need high bandwidth.)
>

I would have figured a network stack would have been a bit much for this
class of device...

David Brown

unread,
Jun 25, 2022, 12:31:00 PM6/25/22
to
10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
each direction. Passing (non-isolated) DC power over these wires is
extremely simple, and requires nothing more than a few diodes and an LC
filter. Unfortunately, the PoE standards were developed by a committee
of morons that produced a ridiculously over-engineered system that is
too bulky and expensive to have caught on outside a few specific use-cases.

Even easier, however, is simply to pass the power over the spare pairs
in a standard 4-pair Ethernet cable.

The two-wire Ethernet standards already include support for simpler and
cheaper PoE solutions.

> Could also be electrically compatible with existing hubs and switches
> via an RJ11 to RJ45 adapter.
>
>
>> For wireless communication, speeds are usually even lower.  Modern
>> NBIOT cellular systems are designed to be extremely low power, cheap,
>> have longer range (20 km more than 3G and the rest).  You send packets
>> of up to about 200 bytes of data, perhaps once a day, with a delivery
>> time of several seconds.  Perfect for environmental monitoring,
>> finding your sheep, and many other tasks.
>>
>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>> bandwidth is also fine.  You can get away with a few hundred bytes ram
>> and still have enough to control a lightbulb, thermostat, etc.
>>
>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>> MSP430's, and other small devices.  Code density matters for many of
>> them.
>>
>> (Of course it's a different matter for wireless cameras and all the
>> other devices that need high bandwidth.)
>>
>
> I would have figured a network stack would have been a bit much for this
> class of device...

I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
microcontroller.

However, the network stack needed for small Wifi or NB-IOT systems is
vastly smaller than you need for a full IP and TCP/IP stack.

BGB

unread,
Jun 25, 2022, 3:01:30 PM6/25/22
to
That is kinda the point of how it would be electrically compatible:
Use the pairs that are in-use in 10/100;
Skip the other wires;
Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.

Normal twisted-pair telephone wire would probably have sufficient
electrical properties for 10/100 in many cases.

Only thing is that it would require an adapter to plug RJ11 into RJ45,
though other options:
Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
Cable which has RJ11 on one end but RJ45 on the other.
(Side-stepping the need for an adapter at the switch).


Probably put the pins in the plug in such a way that it doesn't have
adverse effects if someone tries to plug a telephone into it.

Say:
NC, A+, B-, B+, A-, NC

Normal phone only connecting to the B pair (vs across the A/B pairs).

With the POE system I am imagining, if one did connect across the A/B
pairs, plugging a phone into it would result in it ringing continuously,
whereas if only the B pair is connected (probably the TX pair from the
phone's end), it would be silent and there would be zero net voltage
from the phone's end.


> Even easier, however, is simply to pass the power over the spare pairs
> in a standard 4-pair Ethernet cable.
>

I had assumed doing a thing of running 48 VAC or similar between the two
differential pairs.

This should work OK, but needs at least 4 wires (2 for each pair).
On the device side, there would probably be a bridge rectifier connected
to the center-taps of an isolation transformer.


For PoE with this system, an RJ11<->RJ45 adapter could also function as
the AC injector, say with a pair of isolation transformers (to let the
data through), with the center taps connected (via another transformer)
up to the mains power.

Could make sense in the PoE case to have it as a multi-port block
though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.


> The two-wire Ethernet standards already include support for simpler and
> cheaper PoE solutions.
>

OK, would need to look into it.


But, would assume that a two-wire interface is not likely to be
electrically compatible with traditional Ethernet, at least not without
some additional trickery (additional isolation transformers and probably
a ground wire).


Signaling and power would maybe be done in a similar way to a 2-wire
telephone, but this wouldn't be able to be (passively) connected up to
existing hubs or switches.


>> Could also be electrically compatible with existing hubs and switches
>> via an RJ11 to RJ45 adapter.
>>
>>
>>> For wireless communication, speeds are usually even lower.  Modern
>>> NBIOT cellular systems are designed to be extremely low power, cheap,
>>> have longer range (20 km more than 3G and the rest).  You send
>>> packets of up to about 200 bytes of data, perhaps once a day, with a
>>> delivery time of several seconds.  Perfect for environmental
>>> monitoring, finding your sheep, and many other tasks.
>>>
>>> For local Wifi (or Zigbee, Z-Wave, etc.) devices, small and low
>>> bandwidth is also fine.  You can get away with a few hundred bytes
>>> ram and still have enough to control a lightbulb, thermostat, etc.
>>>
>>> The IOT world is /full/ of systems running on 8-bit AVR's, 16-bit
>>> MSP430's, and other small devices.  Code density matters for many of
>>> them.
>>>
>>> (Of course it's a different matter for wireless cameras and all the
>>> other devices that need high bandwidth.)
>>>
>>
>> I would have figured a network stack would have been a bit much for
>> this class of device...
>
> I have a book on my shelf describing a TCP/IP stack for an 8-bit PIC
> microcontroller.
>
> However, the network stack needed for small Wifi or NB-IOT systems is
> vastly smaller than you need for a full IP and TCP/IP stack.

OK.


Once (when I was much younger) I implemented a TCP/IP stack and Ethernet
card driver in a hobby OS project.

Lots of little lesser-known protocols in this mix, like ICMP and ARP and
similar, ...
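As one example of the small building blocks such a stack is made of, the
ones-complement checksum used by IPv4 and ICMP headers looks roughly like
this (a generic RFC 1071 style sketch, not the actual code from that old
hobby OS):

#include <stddef.h>
#include <stdint.h>

/* Sum the data as 16-bit big-endian words with end-around carry, then
   return the ones-complement of the sum. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                         /* odd trailing byte */
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)                /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}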



In my current projects, I haven't gotten back around to this part yet.
Partly, it looks like to do it from an FPGA, one is basically driving
out the bits themselves, and has to write their own logic for
transmitting and receiving Ethernet frames at the level of bits going
over the wires (IIRC).


With the old card I was targeting, IIRC it was at the level of
abstraction of Ethernet frames getting transmitted and received via a
pair of ring buffers.

The card I am using does have an Ethernet port and similar at least, so
could maybe get to this eventually.

BGB

unread,
Jun 25, 2022, 4:52:19 PM6/25/22
to
On 6/24/2022 8:36 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
>> On 6/24/2022 3:02 PM, MitchAlsup wrote:
>
>>> With <realistically> 30-64-bit registers in use by compiler and 16 of these preserved,
>>> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
>>> like R9-R15 are simply temps used whenever and forgotten.
>> That is presumably how it is supposed to be...
>>
>>
>> In my case, it is roughly a 50/50 split between caller save (scratch)
>> and callee save (preserved) registers.
> <
> I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
> R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
> can be used to carry arguments and results covering the 98%-ile.

Yeah:
R0/R1: Special
R2..R7: Scratch
R8..R14: Preserved
R15: SP
R16..R23: Scratch
R24..R31: Preserved

So: 14 scratch, 15 Preserved.

ABI:
R2/R3 Return Value
R2: Struct Pointer (Struct Return)
R3: 'this'
R4..R7, R20..R23: Arguments

If XGPR:
R32..R39, R48..R55: Scratch
R40..R47, R56..R63: Preserved


If the 128-bit ABI:
R36..R39, R52..R55: Arguments
Some other registers in the ABI are moved around.

SP is at R15 mostly for historical reasons, does result in some cruft
though.


>>
>> For leaf functions, one wants a lot of scratch registers, and for
>> non-leaf functions, a lot of callee-save registers.
>>
>> But, sadly, no party can be entirely happy:
>> Leaf functions wishing they could have more registers to play with,
>> without needing to save them first;
>> Non-leaf functions wishing they could have more registers for variables
>> which wont get stomped on the next call;
>> ...
>>
>>
>> Can note that, IIRC:
>> Win64 gave a bigger part of this pie to callee-save;
>> SysV/AMD64 gave a bigger part of the pie to caller-save.
> <
> CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
> IBM 360 had only preserved registers.
> VAX had only preserved registers--both had 16 registers.

OK.

IIRC, x86:
EAX: Scratch
ECX: Scratch
EDX: Scratch
EBX: "It Depends" (1)
ESP: Stack
EBP: Base-Pointer / Preserved
ESI: Preserved
EDI: Preserved

1: Compilers didn't really seem to entirely agree on whether EBX was
Scratch or Preserved, but Preserved seemed to be more common.


>>
>> A roughly even split seemed like an easy answer, lacking any good way to
>> find a general/optimal balance across a range of programs.
>>
> The choice is a lot easier 50%/50% when you have 32 registers.

Yeah.


>>
> <snip>
>>>
>> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
>> cost-saving reasons; after noting that adjusting the stack-pointer and
>> then using a series of stores, or performing a series of loads and then
>> adjusting the stack pointer, could be similarly effective).
> <
> Push instructions make::
> PUSH R1
> PUSH R2
> PUSH R3
> more expensive than:
> SUB SP,SP,#12
> ST R1,[SP+8]
> ST R1,[SP+4]
> ST R1,[SP]
> due to the serial dependency.
> <
> The peep hole HW optimizer in K9 would perform this transformation.
> {Yes, the optimizer was a piece of HW the compiler knew nothing about.}

Partial issue was that, once I got around to pipelining Load/Store
operations, the Push/Pop would have been generally slower as well as
they require an extra interlock stage to deal with the SP updates.

No hardware level optimizers in my case.

It was cheaper and easier simply to drop them from the ISA.

Typically, the:
ADD Imm8s, SP
Or:
ADD Imm16s, SP
Instructions are used for stack-pointer adjustments.


Had ended up treating the "LDISH16 Imm16, SP" encoding as a special type
of breakpoint, mostly one with a magic number, intended mostly to help
with debugging (if a "__debugbreak()" is hit, I can use the magic number
to figure out which debugbreak was hit, where otherwise I might not know
where exactly the loader has put the loaded program, meaning PC by
itself is not sufficient to identify the offending breakpoint).

But, this does seem like a bit of a hack.

I guess, can also note that 0000 is also interpreted as a breakpoint (so
it will trigger a breakpoint if branching into zeroed memory), but this
is not the official BREAK instruction (3030 or F000_3030).

Technically, it is actually:
MOV.B DLR, (DLR, DLR)
But, like, this encoding is pointless enough, that there is no issue
with interpreting it instead as a breakpoint.

...

Brett

unread,
Jun 25, 2022, 5:17:18 PM6/25/22
to
I don’t get why you think you need to reserve half your opcode space for
future extensions, I would have thought we are at the end of history for
opcode extensions.

What is the cost of reserving 3 bits of one pattern and the same pattern at
the 16 bit border, so that you can add 16 bit opcodes in the future?

Besides the cost of rearranging bits the cost is nothing with half the
opcode space reserved. If you are clever and spread the three bits to fit
your open slots.

This gives your customers choices, customers like choices. You could upsell
this as a future feature like RISCV does and just never deliver, like
RISCV. ;)


You have a two-wide design; even once you implement 16-bit opcodes, handling
the unaligned case of long instructions is not that hard. An extra gate
delay? Two? Depending on which of a dozen approaches you pick for size.

The resulting 13-bit opcode is plenty for one register, a source ACC and a
dest ACC, and a full set of integer operations. And some short load/store
offsets to/from ACC using one address register.

MitchAlsup

unread,
Jun 25, 2022, 6:29:37 PM6/25/22
to
On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
<snip>
> > I looked at this a few years ago and the damage to long term ISA growth
> > was catastrophic. As it is I have nearly ½ of the OpCode space in each
> > OpCode group left for the future. and can PARSE instructions in 31 gates
> > with only 4 gates of delay. All that goes out the window with a meaningful
> > 16-bit "extension". I pass.
<
> I don’t get why you think you need to reserve half your opcode space for
> future extensions, I would have thought we are at the end of history for
> opcode extensions.
<
a) because I have watched what happens to processors over 50 years and
how every time you turn around they have more OpCodes--mainly to address
stuff forgotten earlier.
<
b) remember I compacted everything into only 59 actual instructions.
>
> What is the cost of reserving 3 bits of one pattern and the same pattern at
> the 16 bit border, so that you can add 16 bit opcodes in the future?
<
The cost of reserving space for 16-bit is that it over-constrains the 32-bit
OpCode space. For example: I could not give the 16-bit OpCodes a typical
subgroup (6-bit Major OpCode) because the first instruction would only
have 10-bits left !! (16-6=10)
<
Also note: Where there are instructions in several formats (like ADD with 16-bit
immediate and ADD of 2 registers), in all cases, the bit pattern used to recognize
ADD remains identical.
<
There are several OpCode groups reserved in perpetuity, these were chosen such
that if one branches into data there is very little possibility of finding anything other
than INVALID instruction decoding sitting there. From the My 66000 ISA document::
<----------------------------------------------------------------------------------------------------------------------------
A number of the Major OpCode specifiers are reserved in perpetuity as illegal
OpCodes. Transferring control into 32-bit or 64-bit integer data, or into 32-bit or
64-bit floating point data will very likely result in the decoding of an illegal
instruction and raise the OPERATION exception. In particular, small 32-bit positive
integers and small 32-bit negative integers are illegal instructions. 32-bit Floating
point values in the range ±[1/128..32) are also illegal OpCodes. Should control be
transferred into typical integer or floating point data, there is little likelihood of
executing for a long time before running into an illegal instruction. Executing data
can also be prevented in the MMU when desired (almost always.)
<----------------------------------------------------------------------------------------------------------------------------
This catches programming errors. When I looked, you can't do this with 16-bit inst-
ructions; as basically every encoding has to be used.
<
That is: some architects think of things other than code density--we think of
code safety--even when the GuestOS (or JavaScript, ...) fights us with
programming attack modes--and design architectures that are substantially
more robust than current ones.
>
In particular, My 66000 is immune to the current laundry list of attack strategies
{Meltdown, Spectré, RowHammer, RoP, ForeShadow, ...}
<
> Besides the cost of rearranging bits the cost is nothing with half the
> opcode space reserved. If you are clever and spread the three bits to fit
> your open slots.
<
>
> This gives your customers choices, customers like choices. You could upsell
> this as a future feature like RISCV does and just never deliver, like
> RISCV. ;)
>
I give my customers a choice: Do you want a machine that is secure or not ?
>
> You have a two wide design,
<
Technically, I have a 1-wide design that can CoIssue some pairs of instructions.
It is 1-wide because there is a 3R1W register file. It can CoIssue because some
pairs of instructions do not consume certain register ports. It is not at all
SuperScalar !
<
I also have a 6-wide design. Nothing in ISA or the rest of the architecture makes
either design necessarily harder than it needs to be. For many of the control
logic calculations My 66000 ISA requires, I went to the trouble of implementing
the gate circuitry to perform said duties to verify that the choices were good.
For example, I can look at a 32-bit word (which takes 320 gates to hold in flip-
flops) and determine the instruction length in 31 gates (4-gates of delay). Thereby
Instruction caches for machines less than 16-wide do not need predecoding bits.
FCMP is performed in the same function unit as Integer CMP. The integer part
requires 55 gates (4 gates of delay) the FP addition is 17 gates (remains 4 gates
of delay) So adding FCMP to CMP is a small price {Oh and BTW, I restrict myself
to 4-in NAND gates and 3-in NOR gates}
<
> even once you implement 16 bit opcodes handling
> the unaligned case of long instructions is not that hard, an extra gate
> delay? Two? Depending on which of a dozen approaches you pick for size.
<
I can assure you that if I had set out to design a good 16-bit OpCode ISA
I would have succeeded. But over my 40 year career, I have found some
things more valuable, and I pursued those interests. You are free to do
differently.
>
> The resulting 13 bit opcode is plenty for one register a source ACC and a
> dest ACC, and a full set of integer operations. And some short loads/store
> offsets to/from ACC using one address register.
<
Then you end up with a cartesian product of various ways to do the same thing.
This makes the compiler have to figure out if::
<
INC Rd ; NoOp
is better or worse than:
ADD Rd,Rd,#1
and what set of rules (or heuristics) govern the selection. I have worked inside
compilers trying to figure this kind of stuff out. It ain't straightforward.
<
I submit that compilers are best when they only need to figure how to do
something exactly one way.
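To make that concrete, even the trivial INC-vs-ADD case forces the backend
to carry a selection rule of roughly this shape (a toy C sketch; the emit
functions and the cost rule are invented):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical emitters for a two-form ISA: a short INC plus the one
   canonical ADD encoding. */
static void emit_inc(int rd)     { printf("INC  R%d\n", rd); }
static void emit_add_imm(int rd, int rs, long imm)
                                 { printf("ADD  R%d,R%d,#%ld\n", rd, rs, imm); }

/* The rule the compiler now has to carry around: use the short form only
   when it is a strict win, otherwise fall back to the canonical one.
   Multiply this by every redundant short encoding in the ISA. */
static void emit_increment(int rd, int rs, long imm, bool optimize_size)
{
    if (optimize_size && rd == rs && imm == 1)
        emit_inc(rd);
    else
        emit_add_imm(rd, rs, imm);
}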

David Brown

unread,
Jun 26, 2022, 5:27:01 AM6/26/22
to
On 25/06/2022 21:01, BGB wrote:
> On 6/25/2022 11:30 AM, David Brown wrote:
>> On 25/06/2022 17:45, BGB wrote:

>>> Seems like it would also be fairly trivial to chop 10/100 Ethernet
>>> down to a 4-wire variant as well, probably using RJ11 plugs or
>>> similar. Advantage of 4-wire as that this could allow for POE (and
>>> 4-wire phone-wire could be cheaper than CAT5E or similar).
>>
>> 10 Mbps and 100 Mbps Ethernet already only use 4 wires - one pair in
>> each direction.  Passing (non-isolated) DC power over these wires is
>> extremely simple, and requires nothing more than a few diodes and an
>> LC filter.  Unfortunately, the PoE standards were developed by a
>> committee of morons that produced a ridiculously over-engineered
>> system that is too bulky and expensive to have caught on outside a few
>> specific use-cases.
>>
>
> That is kinda the point of how it would be electrically compatible:
>   Use the pairs that are in-use in 10/100;
>   Skip the other wires;
>   Maybe use smaller/cheaper RJ11 (6P4C variant) rather than RJ45.
>
> Normal twisted-pair telephone wire would probably have sufficient
> electrical properties for 10/100 in many cases.

RJ11 connectors are not going to be much cheaper than RJ45, if anything.
And telephone wire is not going to be good enough for anything here.
In particular, there are no real standards or quality control (except
for long-distance lines that cost more than good Ethernet cables because
no one installs new ones any more). So your cheapo bit of telephone
wire might work in one system, but a different cheapo wire won't. It
might work fine until your motors start, then the ESD interference
disrupts it. Having something that /might/ work or /usually/ works is
not very useful.

>
> Only thing is that it would require an adapter to plug RJ11 into RJ45,
> though other options:
>   Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>   Cable which has RJ11 on one end but RJ45 on the other.
>     (Side-stepping the need for an adapter at the switch).
>
>
> Probably put the pins in the plug in such a way that it doesn't have
> adverse effects if someone tries to plug a telephone into it.
>
> Say:
>   NC, A+, B-, B+, A-, NC
>
> Normal phone only connecting to the B pair (vs across the A/B pairs).
>
> With the POE system I am imagining, if one did connect across the A/B
> pairs, plugging a phone into it would result in it ringing continuously,
> whereas if only the B pair is connected (probably the TX pair from the
> phone's end), it would be silent and there would be zero net voltage
> from the phone's end.
>
>
>> Even easier, however, is simply to pass the power over the spare pairs
>> in a standard 4-pair Ethernet cable.
>>
>
> I had assumed doing a thing of running 48 VAC or similar between the two
> differential pairs.
>

AC is expensive in embedded systems. DC is cheap.

> This should work OK, but needs at least 4 wires (2 for each pair).
> On the device side, there would probably be a bridge rectifier connected
> to the center-taps of an isolation transformer.
>
>
> For PoE with this system, an RJ11<->RJ45 adapter could also function as
> the AC injector, say with a pair of isolation transformers (to let the
> data through), with the center taps connected (via another transformer)
> up to the mains power.
>
> Could make sense in the PoE case to have it as a multi-port block
> though, say, 4-8 simultaneous connections, rather than 1 adapter per cable.
>
>
>> The two-wire Ethernet standards already include support for simpler
>> and cheaper PoE solutions.
>>
>
> OK, would need to look into it.
>
>
> But, would assume that a two-wire interface is not likely to be
> electrically compatible with traditional Ethernet, at least not without
> some additional trickery (additional isolation transformers and probably
> a ground wire).
>

It is not electrically compatible, even with such trickery. The
trickery involved to get full duplex signalling on a single twisted pair
involves a good deal more than a transformer!

Currently, two-wire PHY's are rare and expensive. But that's due to
their novelty - if they become popular and quantities go up, prices will
drop. Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
bus, and does not need a switch. (I'm not sure if it supports PoE.)

BGB

unread,
Jun 26, 2022, 2:43:21 PM6/26/22
to
For telephone wire, I was mostly thinking of 2-pair CAT3 (as opposed to
4 pair CAT3).

For 10/100, it should be OK, since usually the other two pairs are just
sitting around mostly doing nothing.

Apparently, 2-pair CAT3 (still with RJ45) was actually a thing at one
point for some LANs.



However, I have noted that for some short/cheap Ethernet cables, have
seen the RJ45 connectors crimped onto pieces of flat ribbon cable,
implying that one "can" probably get by with cheaper (non-twisted /
CAT1) wire in some cases (I would guess probably for runs of say 1 or 2
meters or similar, would likely need some testing).


Though, I guess one difference between RJ11 and RJ45 is that a lot of
the through-hole RJ45 plugs have built-in isolation transformers,
whereas a board built for RJ11 plugs might need to supply these itself.


>>
>> Only thing is that it would require an adapter to plug RJ11 into RJ45,
>> though other options:
>>    Use RJ45 but with only 2 pairs (effectively a 10/100-only wire);
>>    Cable which has RJ11 on one end but RJ45 on the other.
>>      (Side-stepping the need for an adapter at the switch).
>>
>>
>> Probably put the pins in the plug in such a way that it doesn't have
>> adverse effects if someone tries to plug a telephone into it.
>>
>> Say:
>>    NC, A+, B-, B+, A-, NC
>>
>> Normal phone only connecting to the B pair (vs across the A/B pairs).
>>
>> With the POE system I am imagining, if one did connect across the A/B
>> pairs, plugging a phone into it would result in it ringing
>> continuously, whereas if only the B pair is connected (probably the TX
>> pair from the phone's end), it would be silent and there would be zero
>> net voltage from the phone's end.
>>
>>
>>> Even easier, however, is simply to pass the power over the spare
>>> pairs in a standard 4-pair Ethernet cable.
>>>
>>
>> I had assumed doing a thing of running 48 VAC or similar between the
>> two differential pairs.
>>
>
> AC is expensive in embedded systems.  DC is cheap.
>

Small transformer and 4 diodes, should be manageable.

With 48VDC, one is going to need a buck converter, which requires an
inductor and a sense and control circuits.

One can do AC->DC with 4 diodes and a capacitor, which is not a huge
cost in any sense, and AC allows using a linear transformer to step down
to 5V or similar.


It would likely be simpler and cheaper than traditional DC PoE:
Doesn't care which pair is which;
Doesn't need any logic to detect what is on the other end.

Traditional PoE involves a pointlessly complicated "negotiation" step.


Eg: For this, just sorta stick 48VAC on the line and assume it is good.
Though, this would not be compatible with devices assuming DC PoE.

But, in a way, this would be a potential advantage to using RJ11, so one
doesn't accidentally mix them up (not sure how well the DC devices would
tolerate being plugged into AC).


>> This should work OK, but needs at least 4 wires (2 for each pair).
>> On the device side, there would probably be a bridge rectifier
>> connected to the center-taps of an isolation transformer.
>>
>>
>> For PoE with this system, an RJ11<->RJ45 adapter could also function
>> as the AC injector, say with a pair of isolation transformers (to let
>> the data through), with the center taps connected (via another
>> transformer) up to the mains power.
>>
>> Could make sense in the PoE case to have it as a multi-port block
>> though, say, 4-8 simultaneous connections, rather than 1 adapter per
>> cable.
>>
>>
>>> The two-wire Ethernet standards already include support for simpler
>>> and cheaper PoE solutions.
>>>
>>
>> OK, would need to look into it.
>>
>>
>> But, would assume that a two-wire interface is not likely to be
>> electrically compatible with traditional Ethernet, at least not
>> without some additional trickery (additional isolation transformers
>> and probably a ground wire).
>>
>
> It is not electrically compatible, even with such trickery.  The
> trickery involved to get full duplex signalling on a single twisted pair
> involves a good deal more than a transformer!
>
> Currently, two-wire PHY's are rare and expensive.  But that's due to
> their novelty - if they become popular and quantities go up, prices will
> drop.  Note that short-range two-wire 10 Mbps Ethernet is a multi-drop
> bus, and does not need a switch.  (I'm not sure if it supports PoE.)
>

OK.


Not sure how it works, haven't found much information thus far.


I guess if I were to consider designing something, a few possibilities:
Simple DC signaling, say:
An RX and TX line, possibly at 3.3v or similar;
Using 8b/10b or similar;
Would need a common ground;
Not likely suitable for longer-distance signaling:
Signal integrity and possible ground-loop issues.
Likely point-to-point only (couldn't do a passive bus).
Something sorta like a CAN bus:
Maybe still using 8b/10b or similar;
Basically, one has a wire that is normally pulled high;
Pull low to send bits, half-duplex.
Likely still subject to ground-loop and propagation delays
Differential signaling, likely tri-state (+/0/-):
Would at least avoid ground loop issues and similar;
Back to requiring isolation transformers and similar.
Wired in a star, there might also be an issue with echoes (1)



1: Potentially echoes would be worse with the isolation transformers, as
induction is likely to reflect an inverted version of the signal back
down the wire, which if it hits the other transformers might flip back
into a positive image, ... Cases where constructive interference occurs
could potentially raise the "noise floor" enough to interfere with
transmission (such as corrupting transmitted frames).


Echo rate at 10 Mbps would be around 1 bit every 30 meters, probably
manageable, and the reflection should mostly "self annihilate" at the
point it is received. For a "sane" network size, the echo is likely to
almost entirely dissipate within a few bits.


At 100 Mbps, it would be 1 bit every 3 meters, potentially a bigger
issue, as echoes would propagate for comparably longer. A lot would also
depend on the inductive properties of the isolation transformers.

(It might take several hundred bits of "silence" for the echoes to die
down following a transmitted message with, say, a 100 meter network).

This issue would likely be significantly reduced if doing point-to-point
signaling (constructive interference could not occur). Would depend
mostly on the "sensitivity" at each end, would want to have sensitivity
low enough that it doesn't detect reflected bits, but high enough that
it does detect transmitted bits.

...

Brett

unread,
Jun 26, 2022, 4:31:25 PM6/26/22
to
An optimizer pass to convert all loads that are only used once to ACC ops
is pretty trivial, as a software guy I am not asking for Itanic compiler
changes.

All the other 16 bit variants use restricted register sets and have deep
effects on register coloring, etc. I tried a dozen of these on paper and
despite my propaganda of the time they all sucked. ;(

ACC ops follow the KISS principle. Simple one for one substitution of 16
bit opcodes for the longer ones where the operation tree makes it possible.

To make best use you need a bunch of 32 bit ops that use ACC, or better yet
just use a register like R1, and remove that register from normal use, a
reduction of one register for the other compiler passes.

MitchAlsup

unread,
Jun 26, 2022, 5:56:38 PM6/26/22
to
While I can, in general, agree that you are not asking for Titanic additions
to the compiler, you are asking for Titanic alterations of the axioms and
tenets underlying the encoding philosophy of My 66000 ISA. Basically,
you are asking for a complete reset. I am not interested in starting over.
However, you are free to design whatever 16-bit instructions you want for
your architecture.
>
> All the other 16 bit variants use restricted register sets and has deep
> effects on register coloring, etc. I tried a dozen of these on paper and
> despite my propaganda of the time they all sucked. ;(
<
You and Quadriblock should get together and compare notes........maybe
you could teach him about "what to leave out" as part of your comp.arch
meeting.
>
> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
<
I am well aware of how accumulator machines perform. Rather well on the
60%+ code sequences, and less well on the 40%- code sequences. Whereas
a 32-bit only ISA has but one way of expressing arithmetic and does not so
suffer.
<
Remember I have an <essentially> pure RISC ISA that is achieving x86-64
code density--significantly better than <¿almost?> all other pure RISC ISAs.
I got here by eliminating instruction functionality that could be embodied
elsewhere in the data-path and uniformly encoded in ISA. This makes each
instruction more powerful without adding delay to its execution and makes
the compilers job a bit easier in expressing the necessary semantic.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
<
Yech.

BGB

unread,
Jun 26, 2022, 6:15:25 PM6/26/22
to
A few thoughts:
The restricted set should not be smaller than 1/2 the baseline
"full-sized" set.

So, for 32 GPRs, 4b (16 regs) is OK, 3b (8 regs) a bit less so.

I skipped 3R encodings in the 16-bit space, as they would be
"essentially useless".

For size-optimized code, one does have to bias the compiler towards
using a smaller set of registers (the set usable by 16-bit encodings),
which does typically come at a performance cost (in my case, this is
mixed with the other drawback that one can't predicate or bundle the
16-bit encodings).


So, the general result is that the program is roughly 50% bigger if
built in speed-optimized modes.

Size optimized mode: around 60% 16-bit, 40% 32-bit;
Speed optimized mode: around 20% 16-bit, 80% 32-bit.

Speed-optimized 16/32 is still generally smaller than a fixed-length
32-bit subset though (well, and more so if one disallows Jumbo
encodings, which adds an additional size penalty).



> ACC ops follow the KISS principle. Simple one for one substitution of 16
> bit opcodes for the longer ones where the operation tree makes it possible.
>
> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> just use a register like R1, and remove that register from normal use, a
> reduction of one register for the other compiler passes.
>

Side note:
Why are R0 and R1 "special" in my case?...

Mostly because early on, I removed them from normal use to have
registers which the ASM stage could stomp without warning.


R0 was used typically to load temporary values into if the Immed field
was insufficient.

Say, without Jumbo:
ADD R4, 123, R5 //OK, can use an immediate form
ADD R6, 123456, R7 //Not OK, doesn't fit.

So, the assembler would quietly emit it as:
LDIZ 123456, R0 //OK
ADD R6, R0, R7 //Also OK
And, pretend that a larger immediate existed.

This simplified higher-level code-generation, which could then mostly
pretend that the immediate field was always large enough.

This is also why "LDIz Imm24, R0" exists, even despite eating a
relatively huge amount of encoding space.

With Jumbo, these cases can usually be encoded directly (usually saving
1 or 2 clock cycles).
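
Roughly, the fallback amounts to something like this (C sketch; the
emitter helpers and the Imm9 cutoff are made-up placeholders rather than
BGBCC's actual code):

#include <stdio.h>

/* Hypothetical emitter stubs -- a real assembler would append encoded
   instruction words to the output section instead of printing text. */
static void emit_ldiz(long imm)                   { printf("LDIZ %ld, R0\n", imm); }
static void emit_add_rrr(int rs, int rt, int rd)  { printf("ADD R%d, R%d, R%d\n", rs, rt, rd); }
static void emit_add_ri(int rs, long imm, int rd) { printf("ADD R%d, %ld, R%d\n", rs, imm, rd); }

/* If the value fits the instruction's own immediate field, use it directly;
   otherwise quietly load it into the scratch register (R0) and use the
   register form.  The 9-bit range here is an assumption for illustration. */
static void emit_add_imm(int rs, long imm, int rd)
{
    if (imm >= -256 && imm <= 255)
        emit_add_ri(rs, imm, rd);
    else {
        emit_ldiz(imm);           /* LDIZ Imm24u, R0 (assumes imm fits 24 bits) */
        emit_add_rrr(rs, 0, rd);  /* ADD Rs, R0, Rd                             */
    }
}

int main(void)
{
    emit_add_imm(4, 123, 5);      /* OK, single instruction             */
    emit_add_imm(6, 123456, 7);   /* too big, falls back to LDIZ + ADD  */
    return 0;
}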


Though, there are still a lot of edge cases where LDIZ ends up being used:
Because I have reloc types for LDIZ but not for the Jumbo forms;
Cases where the instruction-encoding logic lacks the case to deal with
the jumbo-form encoding;
...

Sadly, some of this part of my compiler has gathered a fair bit of cruft.

Partly, it is a case where my initial strategy:
  Emitter function for each instruction type;
  Big "switch()" blocks to fill out instruction words for each mnemonic;
  ...
Didn't scale as well as I had hoped...


At this stage, I might have been better off with a different approach:
  Use a big listing (grouped by mnemonic);
  Pattern match against the listing;
  Emit the first-found instruction which matches the pattern.

I had used this approach in my x86 assemblers, which is admittedly a
little more scalable than "giant switch() blocks".

However, "switch()" blocks (with logic paths for every possible
instruction encoding) are initially simpler if one assumes a reasonably
small ISA.
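
A rough sketch of what the listing-based approach looks like (the opcode
values and immediate widths in the table are made up; the point is just
"first listed form that fits, wins"):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One row per encodable form; rows for a given mnemonic are grouped and
   tried in order. */
typedef struct {
    const char *mnemonic;
    int         imm_bits;     /* 0 = register form, else immediate width */
    uint32_t    base_opcode;  /* fixed bits of the 32-bit encoding       */
} InsnForm;

static const InsnForm forms[] = {
    { "ADD",  0, 0x00000000u },   /* ADD Rs, Rt, Rn     */
    { "ADD",  9, 0x10000000u },   /* ADD Rs, Imm9,  Rn  */
    { "ADD", 16, 0x20000000u },   /* ADD Rs, Imm16s, Rn */
};

/* Return the first (smallest) listed form that can hold the operands,
   or NULL if none fits and a fallback (LDIZ / Jumbo) is needed. */
static const InsnForm *match_form(const char *mn, int has_imm, long imm)
{
    for (size_t i = 0; i < sizeof(forms) / sizeof(forms[0]); i++) {
        const InsnForm *f = &forms[i];
        if (strcmp(f->mnemonic, mn) != 0)
            continue;
        if (!has_imm) {
            if (f->imm_bits == 0)
                return f;
        } else if (f->imm_bits > 0) {
            long lim = 1L << (f->imm_bits - 1);
            if (imm >= -lim && imm < lim)
                return f;
        }
    }
    return NULL;
}

int main(void)
{
    const InsnForm *f = match_form("ADD", 1, 12345);
    printf("picked form with imm_bits=%d\n", f ? f->imm_bits : -1);  /* 16 */
    return 0;
}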



Then of course, some parts of my compiler, such as the WEXifier, do
their work mostly on already-emitted machine code, which is kind of a
stupid approach, but was the easiest approach at the time.

Well, and the way it works is kind of limited:
  Look at 3 instructions (A,B,C);
  Is A or B fixed/immovable?
    If so, Advance 1 and Continue.
  Can A, B, and C be bundled? If yes:
    Bundle A, B, and C;
    Advance 3;
    Continue.
  Can A and B be Bundled? If yes:
    Bundle A and B;
    Advance 2;
    Continue.
  Can B and C be Swapped?
    Will swapping B and C allow bundling A and C? If yes:
      Swap B and C;
      Continue.
  Can A and B be Swapped?
    Will swapping A and B allow bundling A and C? If yes:
      Swap A and B;
      Continue.
  Advance 1 and Continue.
    (Couldn't do anything here, so leave instruction as-is).

There are a few more cases than this, but this is the general algorithm.
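
As a rough C sketch of the above window pass (the Insn fields and the
predicate bodies are stubs standing in for the real dependency and
encoding checks, not the actual WEXifier code):

typedef struct {
    int fixed;     /* reloc, jumbo form, PC-relative, already bundled, ... */
    int bundled;   /* set once the op has been merged into a bundle        */
    /* ... opcode, registers read/written, etc. would live here ...        */
} Insn;

static int is_fixed(const Insn *a)                   { return a->fixed || a->bundled; }
static int can_swap(const Insn *a, const Insn *b)    { (void)a; (void)b; return 0; /* stub */ }
static int can_bundle2(const Insn *a, const Insn *b) { (void)a; (void)b; return 0; /* stub */ }
static int can_bundle3(const Insn *a, const Insn *b, const Insn *c)
                                       { (void)a; (void)b; (void)c; return 0; /* stub */ }
static void bundle2(Insn *a, Insn *b)          { a->bundled = b->bundled = 1; }
static void bundle3(Insn *a, Insn *b, Insn *c) { a->bundled = b->bundled = c->bundled = 1; }
static void swap_insn(Insn *a, Insn *b)        { Insn t = *a; *a = *b; *b = t; }

void wexify(Insn *ins, int n)
{
    int i = 0;
    while (i + 1 < n) {
        Insn *a = &ins[i], *b = &ins[i + 1];
        Insn *c = (i + 2 < n) ? &ins[i + 2] : 0;

        if (is_fixed(a) || is_fixed(b)) { i += 1; continue; }

        if (c && can_bundle3(a, b, c))  { bundle3(a, b, c); i += 3; continue; }
        if (can_bundle2(a, b))          { bundle2(a, b);    i += 2; continue; }

        /* try to move an instruction out of the way, then re-examine the window */
        if (c && can_swap(b, c) && can_bundle2(a, c)) { swap_insn(b, c); continue; }
        if (c && can_swap(a, b) && can_bundle2(a, c)) { swap_insn(a, b); continue; }

        i += 1;   /* couldn't do anything here, leave the instruction as-is */
    }
}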


The logic for "can A and B be swapped" mostly checks for things like
register dependencies, whether they are memory load/store ops, whether
they depend on or modify SR.T, etc...

Immovable cases mostly involve cases which are no longer simple 32-bit
instructions:
Instructions which have a reloc;
The LDIz Imm24 op (generally assumed to have a reloc).
Load/Store ops with a PC-relative encoding;
Jumbo encodings;
Stuff that is already bundled;
...


There is another step that runs before this, similar algo, just tries to
swap instructions if doing so will reduce the number of pipeline interlocks.

As can be noted, these aren't particularly good at extracting all that
much ILP. They will also not perform swaps if doing so would require an
intermediate swap which is not allowed, ...

(For example, manually optimizing stuff in ASM frequently involves
non-local instruction movements, which aren't really allowed here; since
validating a movement much beyond 2-instruction swaps becomes
exponentially harder).


It is likely that a better approach would be a more general process, say:
  Generate a permutation index (0..N);
  Look for a permutation which minimizes the number of interlocks and
    maximizes the number of bundled encodings;
  Swap the instructions into the found ordering.

Then say, use either a gradient-descent approach:
  Scan instructions;
  Check if A can be swapped with B;
  Evaluate if doing so would improve the metric (globally).
    If so, Swap.
  Continue until no more swaps occur.

Or, maybe a genetic-algorithm style approach:
  Pick instructions at random to try to swap to breed more patterns;
  Rank each by the metric;
  Continue for 4*N generations;
  Use the best-found as the final ordering.


But, these would be a bit of work to add, not sure how much they would
gain over the current approach (which only operates on a window of 3
instructions).
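
A sketch of the "swap while the global metric improves" variant
(can_swap() and metric() are stubs here; a real metric would count
interlocks and missed bundling opportunities over the whole block):

typedef struct { int opcode; /* ... operands, flags ... */ } Insn;

static int  can_swap(const Insn *a, const Insn *b) { (void)a; (void)b; return 1; /* stub */ }
static long metric(const Insn *ins, int n)         { (void)ins; (void)n; return 0; /* stub */ }

void schedule_by_descent(Insn *ins, int n)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i + 1 < n; i++) {
            if (!can_swap(&ins[i], &ins[i + 1]))
                continue;
            long before = metric(ins, n);
            Insn t = ins[i]; ins[i] = ins[i + 1]; ins[i + 1] = t;  /* try the swap */
            if (metric(ins, n) < before) {
                changed = 1;                                       /* keep it      */
            } else {
                t = ins[i]; ins[i] = ins[i + 1]; ins[i + 1] = t;   /* undo it      */
            }
        }
    }
}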


Note that before the WEXifier runs, the compiler behaves as if it is
generating code for a plain RISC-style ISA (the compiler has no
understanding of modulo scheduling, which is at present mostly limited
to hand written ASM).

...


In other news, after tweaking a few things in my compiler, to prioritize
the "LDIz Imm10, Rn" and "ADD Imm10, Rn" encodings over their Imm16
equivalents, the Imm10 encodings now dominate over the Imm16 encodings
in terms of instruction count (roughly a 90% / 10% split).

This implies that most of the constant loads are still fairly small (can
fit easily into an Imm10).

The reason for the change was mostly that Imm10 allows a few encoding
cases (such as PrWEX) which the Imm16 versions do not (without this
change, the compiler was missing a lot of cases where it could WEXify
the predicated ops, which happened to include a lot of constant loads
and similar).

This modifies the prior stats some, eg:
Imm8: 11%
Imm10: 63%
Imm16: 7%
Binary16: 12% (These are now the majority of the F8 block)
Imm33: 4.4%
Imm64: 2.5%

...

BGB

unread,
Jun 26, 2022, 8:28:11 PM6/26/22
to
On 6/26/2022 4:56 PM, MitchAlsup wrote:
> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>
>>> Then you end up with a cartesian product of various ways to do the same thing.
>>> This makes the compiler have to figure out if::
>>> <
>>> INC Rd ; NoOp
>>> is better or worse than:
>>> ADD Rd,Rd,#1
>>> and what set of rules (or heuristics) govern the selection. I have worked inside
>>> compilers trying to figure this kind of stuff out. It ain't straightforward.
>>> <
>>> I submit that compilers are best when they only need to figure how to do
>>> something exactly one way.
> <
>> An optimizer pass to convert all loads that are only used once to ACC ops
>> is pretty trivial, as a software guy I am not asking for Itanic compiler
>> changes.
> <
> While I can, in general, agree that you are not asking for Titanic additions
> to the compiler, you are asking for Titanic alterations of the axioms and
> tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> you are asking for a complete reset. I am not interested in starting over.
> However, You are free to design whatever 16-bit instructions you want for
> your architecture.

Yeah, there are only so many possible combinations.


>>
>> All the other 16 bit variants use restricted register sets and has deep
>> effects on register coloring, etc. I tried a dozen of these on paper and
>> despite my propaganda of the time they all sucked. ;(
> <
> You and Quadriblock should get together and compare notes........maybe
> you could teach him about "what to leave out" as part of your comp.arch
> meeting.

Yeah, the funky obsession with non-power-of-2 data sizes and similar
would probably be high on my list.


>>
>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>> bit opcodes for the longer ones where the operation tree makes it possible.
> <
> I am well aware of how accumulator machines perform. Rather well on the
> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> a 32-bit only ISA has but one way of expressing arithmetic and does not so
> suffer.
> <
> Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
> code density--significantly better than <¿almost?> all other pure RISC ISAs.
> I got here by eliminating instruction functionality that could be embodied
> elsewhere in the data-path and uniformly encoded in ISA. This makes each
> instruction more powerful without adding delay to its execution and makes
> the compilers job a bit easier in expressing the necessary semantic.

I will assume in this you *don't* mean the apparent common variation of
x86-64 which takes upwards of 1.5MB to build Doom...



>>
>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>> just use a register like R1, and remove that register from normal use, a
>> reduction of one register for the other compiler passes.
> <
> Yech.

Agreed...

There isn't really a good reason to have hard-coded registers "in
general" with 32-bit instruction encodings, particularly not for an
accumulator.


I will make a partial exception for loading a big constant to a fixed
register, where having an instruction for a larger constant load could
partly offset the drawback of having "not particularly large" immediate
fields in other contexts.


Also at the time, 24 bits would be sufficient in general for things like
data/bss loads and stores, whereas something more modest (such as 16 or
20 bit) would not have been sufficient (several of these programs
effectively have several MB of '.bss').

...

Ivan Godard

unread,
Jun 26, 2022, 9:17:30 PM6/26/22
to
There are advantages to dedicated base registers, set as a side effect
of other operations. You can keep them where they are used, instead of
in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
saving fan-in. If you restrict addressable regions to not cross a 4Gb
(or less) boundary then you can have a narrower address adder (Mill: 32
bits vs 64). You avoid the loads with fat constants. You lower pressure
on the genregs. The address-using instructions need fewer bits to encode
the base (Mill: 3 vs 5). The base can be used to select a particular WKR
for range checking, avoiding the trip to the PLB. The base can be used
as the lwb in its WKR, reducing state for task switch or call.

And so on.

MitchAlsup

unread,
Jun 26, 2022, 9:18:47 PM6/26/22
to
On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
> On 6/26/2022 4:56 PM, MitchAlsup wrote:
> > On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
> >> MitchAlsup <Mitch...@aol.com> wrote:
> >>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
> >
> >>> Then you end up with a cartesian product of various ways to do the same thing.
> >>> This makes the compiler have to figure out if::
> >>> <
> >>> INC Rd ; NoOp
> >>> is better or worse than:
> >>> ADD Rd,Rd,#1
> >>> and what set of rules (or heuristics) govern the selection. I have worked inside
> >>> compilers trying to figure this kind of stuff out. It ain't straightforward.
> >>> <
> >>> I submit that compilers are best when they only need to figure how to do
> >>> something exactly one way.
> > <
> >> An optimizer pass to convert all loads that are only used once to ACC ops
> >> is pretty trivial, as a software guy I am not asking for Itanic compiler
> >> changes.
> > <
> > While I can, in general, agree that you are not asking for Titanic additions
> > to the compiler, you are asking for Titanic alterations of the axioms and
> > tenets underlying the encoding philosophy of My 66000 ISA. Basically,
> > you are asking for a complete reset. I am not interested in starting over.
> > However, You are free to design whatever 16-bit instructions you want for
> > your architecture.
> Yeah, there are only so many possible combinations.
<
nearly infinite permutations........
> >>
> >> All the other 16 bit variants use restricted register sets and has deep
> >> effects on register coloring, etc. I tried a dozen of these on paper and
> >> despite my propaganda of the time they all sucked. ;(
> > <
> > You and Quadriblock should get together and compare notes........maybe
> > you could teach him about "what to leave out" as part of your comp.arch
> > meeting.
> Yeah, the funky obsession with non-power-of-2 data sizes and similar
> would probably be high on my list.
<
I was wondering if anyone caught the reversal in his thread "...life the universe
and everything." In Hitchhiker's Guide to the Galaxy the answer to "life the universe
and everything" is/was 42 whereas Quadriblock's thread converges on 24 which
is the simple reverse of 42.
> >>
> >> ACC ops follow the KISS principle. Simple one for one substitution of 16
> >> bit opcodes for the longer ones where the operation tree makes it possible.
> > <
> > I am well aware of how accumulator machines perform. Rather well on the
> > 60%+ code sequences, and less well on the 40%- code sequences. Whereas
> > a 32-bit only ISA has but one way of expressing arithmetic and does not so
> > suffer.
> > <
> > Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
> > code density--significantly better than <¿almost?> all other pure RISC ISAs.
> > I got here by eliminating instruction functionality that could be embodied
> > elsewhere in the data-path and uniformly encoded in ISA. This makes each
> > instruction more powerful without adding delay to its execution and makes
> > the compilers job a bit easier in expressing the necessary semantic.
<
> I will assume in this you *don't* mean the apparent common variation of
> x86-64 which takes upwards of 1.5MB to build Doom...
<
No, in general I am talking about "never having to" in terms of loading
constants, pasting constants together, negating or inverting operands,
and the prologue and epilogue handling instructions.
> >>
> >> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
> >> just use a register like R1, and remove that register from normal use, a
> >> reduction of one register for the other compiler passes.
> > <
> > Yech.
> Agreed...
>
> There isn't really a good reason to have hard-coded registers "in
> general" with 32-bit instruction encodings, particularly not for an
> accumulator.
>
The only hard coded registers are::
a) R0 receives the return address when control is delivered to a subroutine.
b) ENTER and EXIT use R31 = SP to build and tear down stack frames.
<
However; there are circumstances where the HW understands that ranges
of registers have certain properties. For example: when Safe-Stack mode
is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
address, and R16-R30 when read before write return zeros.
>
> I will make a partial assumption for loading a big constant to a fixed
> register, where having an instruction for a larger constant load could
> partly offset the drawback of having "not particularly large" immediate
> fields in other contexts.
>
My 66000 never has to load a constant to a register. Somewhere close
to 9% of my instruction count vanishes by uniformly supplying large
constants.
>
> Also at the time, 24 bits would be sufficient in-general for things like
> data/bss loads and stores, whereas something more modest (such as 16 or
> 20 bit) would not have been sufficient (several of these programs
> effectively have several MB of '.bss').
>
In My 66000 memory mapping; a simple application such as cat can have
the .text, .data, .bss, ... each separated by GBs in the virtual address
space and yet only need 1 page of memory mapping tables !
> ...

Ivan Godard

unread,
Jun 26, 2022, 10:57:39 PM6/26/22
to
Doesn't have to, but isn't it advisable when a constant has widespread uses?


Brett

unread,
Jun 26, 2022, 11:00:24 PM6/26/22
to
MitchAlsup <Mitch...@aol.com> wrote:
> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
> <snip>
>>> I looked at this a few years ago and the damage to long term ISA growth
>>> was catastrophic. As it is I have nearly ½ of the OpCode space in each
>>> OpCode group left for the future. and can PARSE instructions in 31 gates
>>> with only 4 gates of delay. All that goes out the window with a meaningful
>>> 16-bit "extension". I pass.
> <
>> I don’t get why you think you need to reserve half your opcode space for
>> future extensions, I would have thought we are at the end of history for
>> opcode extensions.
> <
> a) because I have watched what happens to processors over 50 years and
> how every time you turn around they have more OpCodes--mainly to address
> stuff forgotten earlier.
> <
> b) remember I compacted everything into only 59 actual instructions.
>>
>> What is the cost of reserving 3 bits of one pattern and the same pattern at
>> the 16 bit border, so that you can add 16 bit opcodes in the future?
> <
> The cost of reserving space for 16-bit is that it over-constrains the 32-bit
> OpCode space. For example: I could not give the 16-bit OpCodes a typical
> subgroup (6-bit Major OpCode) because the first instruction would only
> have 10-bits left !! (16-6=10)

I can work with 10 bits, ideally 11 for 59 instructions, which would be two
sub groups.

ACC opcodes only need one register, plus 5 bits for the opcode hits all the
common cases this extension would use, so 10 bits works fine.

Note that this extension will not slow down code like all the other 16 bit
architectures. You will get a small boost from a smaller code footprint. 32
bit instructions are unaffected, and do all the heavy lifting.

With two opcode groups it should wire directly into your existing logic,
which actually happens anyway if you split this extension into smaller
groups spread into the open opcode spaces where appropriate.

BGB

unread,
Jun 26, 2022, 11:14:27 PM6/26/22
to
FWIW: 128/3 ~= 42 ...

But, yeah, in a more practical sense, unusual data sizes aren't much of
a win.


>>>>
>>>> ACC ops follow the KISS principle. Simple one for one substitution of 16
>>>> bit opcodes for the longer ones where the operation tree makes it possible.
>>> <
>>> I am well aware of how accumulator machines perform. Rather well on the
>>> 60%+ code sequences, and less well on the 40%- code sequences. Whereas
>>> a 32-bit only ISA has but one way of expressing arithmetic and does not so
>>> suffer.
>>> <
>>> Remember I have a <essentially> pure IRSC ISA that is achieving x86-64
>>> code density--significantly better than <¿almost?> all other pure RISC ISAs.
>>> I got here by eliminating instruction functionality that could be embodied
>>> elsewhere in the data-path and uniformly encoded in ISA. This makes each
>>> instruction more powerful without adding delay to its execution and makes
>>> the compilers job a bit easier in expressing the necessary semantic.
> <
>> I will assume in this you *don't* mean the apparent common variation of
>> x86-64 which takes upwards of 1.5MB to build Doom...
> <
> No, in general I am talking about "never having to" in terms of loading
> constants, pasting constants together, negating or inverting operands,
> and the prologue and epilogue handling instructions.

My case is pretty much comparable to x86-64 in what you can do with
immediate and displacement fields:
x86-64: 8 and 32; 64 via a dedicated load.
BJX2: 9 and 33, 64 via a dedicated load.

But, differs in that it has 3R, and is Load/Store.

Also, I have both more registers, and encodings that are often smaller.
Though, x86-64 has:
MOV Rn, Imm64 (80-bits)
Where:
LDI Imm64, Rn (96-bits)
So, x86-64 wins this one...


However, compared with some of the other options (eg: SH-4 or Thumb2),
x86-64's code density seems to be "pretty bad".

Even something like ARMv5 (with fixed-length 32-bit instructions) seems
to be able to beat x86-64 on this metric.




>>>>
>>>> To make best use you need a bunch of 32 bit ops that use ACC, or better yet
>>>> just use a register like R1, and remove that register from normal use, a
>>>> reduction of one register for the other compiler passes.
>>> <
>>> Yech.
>> Agreed...
>>
>> There isn't really a good reason to have hard-coded registers "in
>> general" with 32-bit instruction encodings, particularly not for an
>> accumulator.
>>
> The only hard coded registers is::
> a) R0 receives the return address when control is delivered to a subroutine.
> b) ENTER and EXIT use R31 = SP to build and tear down stack fames.
> <
> However; there are circumstances where the HW understands that ranges
> of registers have certain properties. For example: when Safe-Stack mode
> is in effect, R16-R31 are saved on Safe Stack, R0 does not receive the return
> address, and R16-R30 when read before write return zeros.

OK.

I have R0, R1, and R15/SP hard-coded in certain contexts.

LR is nominally in CR space, but I sometimes used R1 as a "Secondary
Link-Register" or "Saved-Link-Register", mostly in the context of prolog
and epilog compression.


R0 and R1 can be used as scratch registers (with care), however:
  The assembler may stomp them without warning in some cases;
  For some instructions, they are not allowed:
    In some cases, the encodings are special cases;
    Some instructions are simply not allowed to use them.
  ...


>>
>> I will make a partial assumption for loading a big constant to a fixed
>> register, where having an instruction for a larger constant load could
>> partly offset the drawback of having "not particularly large" immediate
>> fields in other contexts.
>>
> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

Stuff like:
if(x>=100)
x=99;
Is still kind of a thing...

But, yeah:
y=x+12345678;
Can be handled with an Imm33s encoding.



But, as noted, the FAzz_zzzz and FBzz_zzzz encodings predate the
addition of Jumbo encodings. In the original form of the ISA (and in ISA
subsets without Jumbo) they are more useful.


But, as noted, these only exist in one of the "unconditional subspace":
FAzz_zzzz LDIZ Imm24u, R0
FBzz_zzzz LDIN Imm24n, R0
WEX Space (Same spot, just WEX=1):
FEzz_zzzz Jumbo
FFzz_zzzz Op64
Predicate Space:
EAzz_zzzz PrWEX F0?T
EBzz_zzzz PrWEX F2?T
EEzz_zzzz PrWEX F0?F
EFzz_zzzz PrWEX F2?F

So, whether or not these are "still" useful, their "twins" elsewhere in
the encoding space allow for a few other encoding spaces to exist.

Likewise:
FFdd_dddd_FAdd_dddd BRA Abs48
FFdd_dddd_FBdd_dddd BSR Disp48
The encoding is effectively bundled with itself to encode the Abs48 branches.

Had I put pretty much anything else there, I would not have been able to
do this stuff in this way.


>>
>> Also at the time, 24 bits would be sufficient in-general for things like
>> data/bss loads and stores, whereas something more modest (such as 16 or
>> 20 bit) would not have been sufficient (several of these programs
>> effectively have several MB of '.bss').
>>
> In My 66000 memory mapping; a simple application such as cat can have
> the .txt, .data, .bss, ... each separated by GBs in the virtual address
> space and yet only need 1 page of memory mapping tables !

OK.

I am mostly dealing with programs like Doom and Quake effectively
needing ~ 21..23 bits to be able to address across the size of their
'.bss' section.

One can eliminate a lot of bits by using GBR (Global-Base-Register), or
similar, but still need a lot of bits to deal with the size of the section.

One could use less bits by using a GOT, but this just sorta replaces the
large displacements with additional memory loads and tables (not really
a win).


Excluding things like the larger displacements needed for branching and
addressing into '.bss' and similar, most other constants tend to be
a lot smaller.


...

Thomas Koenig

unread,
Jun 27, 2022, 1:33:40 AM6/27/22
to
Ivan Godard <iv...@millcomputing.com> schrieb:

> There are advantages to dedicated base registers, set as a side effect
> of other operations. You can keep them where they are used, instead of
> in a regfile, saving wire delay. You need fewer of them (Mill: 8 vs 32)
> saving fan-in. If you restrict addressable regions to not cross a 4Gb
> (or less) boundary then you can have a narrower address adder (Mill: 32
> bits vs 64).

Hmm... does that mean that the Mill has to do special things to
address arrays > 4 GB?

Thomas Koenig

unread,
Jun 27, 2022, 1:37:35 AM6/27/22
to
MitchAlsup <Mitch...@aol.com> schrieb:

> My 66000 never has to load a constant to a register. Somewhere close
> to 9% of my instruction count vanishes by uniformly supplying large
> constants.

It does not have to, but it makes sense to do so (and is
now done) if the same constant is stored multiple times, see
https://github.com/bagel99/llvm-my66000/issues/2 .

BGB

unread,
Jun 27, 2022, 2:51:05 AM6/27/22
to
Agreed, sometimes it makes sense.


FWIW: In BGBCC, constants are generally treated like a sort of read-only
variable, and so may be pulled into a register like that of a variable.

Though, admittedly, my compiler isn't smart enough to make decisions
about when it might be better to use an immediate, or pull the value
into a register (this decision would need to be made at the
code-generation level rather than the assembler level). This would
likely require adding the use of a heuristic of some sort.
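
Something along these lines is what I'd imagine (C sketch; the thresholds
and the Imm9 cutoff are made-up numbers, not anything BGBCC currently
implements):

#include <stdint.h>

typedef struct {
    int64_t value;
    int     use_count;   /* static uses of this constant in the function */
    int     in_loop;     /* any use inside a loop body?                  */
} ConstInfo;

static int fits_imm(int64_t v, int bits)
{
    int64_t lim = (int64_t)1 << (bits - 1);
    return v >= -lim && v < lim;
}

/* Nonzero: keep the constant in a register; zero: re-encode it as an
   immediate at each use. */
int keep_in_register(const ConstInfo *c)
{
    if (fits_imm(c->value, 9))            /* cheap everywhere, never hoist */
        return 0;
    if (c->use_count >= 3)                /* big constant, several uses    */
        return 1;
    if (c->in_loop && c->use_count >= 2)  /* repeated inside a loop        */
        return 1;
    return 0;                             /* default: use immediate forms  */
}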

Though, I would suspect cases like that shown are probably in the
minority of cases.


The specific case shown in the issue-tracker would not come up in BJX2,
because it doesn't support directly storing a constant to memory, so
would be forced in this case to load the value into a register and then
store it to memory, and by extension the register would likely be reused
across the multiple stores (since each time one tries to reload it, the
compiler will see that it was already present in a register).

There are cases where this could be useful, just I don't expect they
would likely come up often enough to justify the encoding.

Ivan Godard

unread,
Jun 27, 2022, 10:27:21 AM6/27/22
to
Arrays can be of any size that mmap is willing to give you. Generated
code for constant offsets bigger than 2^32 builds a pointer by explicit
arithmetic rather than by using the address adder. The arithmetic costs
two instructions (con, addp) and a cycle, but saves in the AA for all
offsets < 2^32. We judge that to be a worthwhile tradeoff.

EricP

unread,
Jun 27, 2022, 1:11:03 PM6/27/22
to
Is there something odd about the assembler output at the top, where
it seems to repeatedly spill a register to the local stack frame
and then use it as a pointer? e.g.

std r26,[sp,216]
.loc 1 268 1 ; fatigue2.f90:268:1
std #-4317352126650676160,[r26]

plus does this 5 other times.
Just checking.

BGB

unread,
Jun 27, 2022, 3:31:19 PM6/27/22
to
On 6/26/2022 10:14 PM, BGB wrote:
> On 6/26/2022 8:18 PM, MitchAlsup wrote:
>> On Sunday, June 26, 2022 at 7:28:11 PM UTC-5, BGB wrote:
>>> On 6/26/2022 4:56 PM, MitchAlsup wrote:
>>>> On Sunday, June 26, 2022 at 3:31:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>>>> On Saturday, June 25, 2022 at 4:17:18 PM UTC-5, gg...@yahoo.com
>>>>>> wrote:
>>>>

<snip>

>
> My case is pretty much comparable to x86-64 in what you can do with
> immediate and displacement fields:
>   x86-64: 8 and 32; 64 via a dedicated load.
>   BJX2: 9 and 33, 64 via a dedicated load.
>
> But, differs in that it has 3R, and is Load/Store.
>
> Also, I have both more registers, and encodings that are often smaller.
>   Though, x86-64 has:
>     MOV Rn, Imm64 (80-bits)
>   Where:
>     LDI Imm64, Rn (96-bits)
>   So, x86-64 wins this one...
>
>
> However, compared with some of the other options (eg: SH-4 or Thumb2),
> x86-64's code density seems to be "pretty bad".
>
> Even something like ARMv5 (with fixed-length 32-bit instructions) seems
> to be able to beat x86-64 on this metric.
>
>


So, thought is, comparing a few common-case encodings for size between
x86-64 (x64) and BJX2 (mostly for cases where direct analogs exist):
MOV Rxx, Imm64 | LDI Imm64, Rn
10(x64) vs 12(BJX2), x64 wins
MOV Rxx, Imm32 | LDI Imm8, Rn
7(x64) vs 2(BJX2), BJX2 wins
(x86-64 lacking a smaller immediate than Imm32 here)

MOV Rxx, [Rb+Ri*8] | MOV.Q (Rm, Ri), Rn
4(x64) vs 4(BJX2), tie
MOV Rxx, [Rb+Disp8] | MOV.Q (Rm, Disp9u), Rn
4 (x64) vs 4(BJX2), tie

MOV Rxx, [Rb+Ri*Sc+Disp8] | MOV.Q (Rb, Ri*Sc, Disp11), Rn
5(x64) vs 8(BJX2), x64 wins
MOV Rxx, [RIP+Disp32] | MOV.Q (PC, Disp33), Rn
7(x64) vs 8(BJX2), x64 wins

ADD Rxx, Rxx | ADD Rm, Rn
3(x64) vs 2(BJX2), BJX2 wins

ADD Rn, Imm32 | ADD Imm8, Rn
7(x64) vs 2(BJX2), BJX2 wins
(another scenario where x86-64 lacks smaller immeds)
(You don't get byte values unless working on byte registers).
ADD Rn, Imm32 | ADD Imm16s, Rn
7(x64) vs 4(BJX2), BJX2 wins
ADD Rn, Imm32 | ADD Imm33s, Rn
7(x64) vs 8(BJX2), x64 wins
MOV Rt, Imm64; ADD Rn, Rt | ADD Imm64, Rn
13(x64) vs 12(BJX2), BJX2 wins


In many areas where there is a direct 1:1 comparison, x86-64 seems to be
ahead; Except in cases where x86-64 only has a 32-bit immediate.

I am generally assuming encodings here where REX is used (typical case,
basically required to use registers as 64-bit).


I decided to allow comparing REX-prefixed encodings against 16-bit
encodings on the basis that both cases have the same number of usable
registers (16 in this case).

In general, BJX2 has 2x or 4x (XGPR) as many GPRs as x86-64.
Most ops from R16..R31, or R32..R63, will require 32-bit encodings.


In many cases, speed-optimized modes would add a penalty for BJX2, as in
this mode it significantly reduces the number of 16-bit encodings used.
This is partly due to a mix of register assignments (nearly always
enables R16..R31), and because it needs 32-bit encodings for the
WEXifier to be able to do its thing (instruction-level swap-and-bundle
would not be viable with 16-bit encodings thrown in the mix).

The 16-bit encodings remain fairly common in prolog/epilog sequences
though (and in a lot of the hand-written ASM).



Main differences come up in areas where things are not 1:1, eg:
MOV Rnn, Rss; ADD Rnn, Rtt | ADD Rs, Rt, Rn
6(x64) vs 4(BJX2), BJX2 wins
MOV Rxx+0, [RSP+Disp]; MOV Rxx+1, [RSP+Disp] | MOV.X (SP, disp4), Xn
10(x64) vs 2(BJX2), BJX2 wins

Would be a bigger difference for more extreme cases:
MOV Rnn0, Rss0; MOV Rnn1, Rss1;
CLC; ADC Rnn0, Rtt0; ADC Rnn1, Rtt1
Vs:
ADDX Xs, Xt, Xn
13(x64) vs 4(BJX2), BJX2 wins.

3R (Basic):
MOV Rnn, Rss; ADD Rnn, Rtt
Vs:
ADD Rs, Rt, Rn
6(x64), 4(BJX2), BJX2 wins.

3R (Load):
MOV Rnn, Rss; ADD Rnn, [Rtt, Disp8]
Vs:
MOV.Q (Rt, Disp9), Rx; ADD Rs, Rx, Rn
7(x64) vs 8(BJX2), x64 wins.


Or, predication:
CMP Rss, Rtt; JGT .L0; MOV Rnn, Rtt; .L0:
Vs:
CMPQGT Rt, Rs; MOV?F Rt, Rn
8(x64), 6(BJX2), BJX2 wins


...


Will mostly ignore SIMD / SSE here.



One other area of differences is that x86-64 allows the source or
destination to be memory.

This could give an advantage to x86-64 in cases where it operates on
memory, eg:
ADD Rn, [Rs] | MOV.Q (Rs), Rx; ADD Rx, Rn
3 (x64) vs 4 (BJX2), x64 wins

ADD Rn, [RSP+8] | MOV.Q (SP, 8), Rx; ADD Rx, Rn
5 (x64) vs 4 (BJX2), BJX2 wins

ADD [Rn], Rs | MOV.Q (Rn), Rx; ADD Rs, Rx; MOV.Q Rx, (Rn)
3 (x64) vs 6 (BJX2), x64 wins

ADD [RSP+8], Rs | MOV.Q (SP, 8), Rx; ADD Rs, Rx; MOV.Q Rx, (SP, 8)
5 (x64) vs 6 (BJX2), x64 wins

...


Though, BGBCC tries to minimize the number of loads and stores, whereas
a lot of generated x86-64 code uses memory operands fairly often.

Granted, one can make use of having a larger register space (32 or 64)
by statically-assigning commonly used variables to registers, which is
less viable with 16 registers.


Ironically, this is an area where RISC-V could have an advantage with
the 'A' extension, which does allow a limited set of operations to use
direct memory operands. But, I am less a fan of this, as I would assume
sticking with plain Load/Store unless there is a good reason to do
otherwise.

In effect, doing something like this would likely involve needing to
stick an additional ALU into the L1 D$ or similar. One could argue
though that it could make sense on the basis that Load+Op and
Load+Op+Store sequences are still "not particularly rare"...



Looking at it on this level, x86-64 and BJX2 should be more comparable
in terms of code density; and x86-64 a little more competitive on this
metric than what I am often seeing.

This is not what I see in practice though, where the x86-64 binaries
tend to be quite a bit larger (assuming uncompressed binary sizes).


I suspect that quite possibly the compilers are wasting a large amount
of space somewhere (though ".text" still tends to be pretty large, IME,
even for size-optimized builds).

More so given the (often fairly large) size delta between 32 and 64 bit
binaries (somewhat beyond what could be explained away via the REX byte,
PUSH going from 1 to 3 bytes, ...).


While ABIs are different, I can note that both x64 and BJX2 tend to use
a similar ABI design in this area (passing arguments in registers, ...).


Granted, I guess the counter-point would be if other people are not
seeing x86-64 binaries that are seemingly "overly large for no
particularly obvious reason".

At first, I was thinking it was mostly an MSVC thing, but then noted
that GCC also appears to be doing this in my case.

I could speculate on possible causes, but lack anything solid at the moment.

...

MitchAlsup

unread,
Jun 27, 2022, 6:24:48 PM6/27/22
to
An interesting list. It would be informative if you looked not at instruction pairings
(x86-64 versus BJX2) but semantic pairings::
a) -------leaf subroutine with 4 arguments 1 result prologue+epilogue
b) non-leaf subroutine with 8 arguments 1 result prologue+epilogue
for both instruction size and expected latency.
<
>
>
> Though, BGBCC tries to minimize the number of loads and stores, whereas
> a lot of generated x86-64 code uses memory operands fairly often.
>
> Granted, one can make use of having a larger register space (32 or 64)
> by statically-assigning commonly used variables to registers, which is
> less viable with 16 registers.
>
>
> Ironically, this is an area where RISC-V could have an advantage with
> the 'A' extension, which does allow a limited set of operations to use
> direct memory operands. But, I am less a fan of this, as I would assume
> sticking with plain Load/Store unless there is a good reason to do
> otherwise.
<
The A option basically mandates the pipeline grow by 3 cycles.
>
> In effect, doing something like this would likely involve needing to
> stick an additional ALU into the L1 D$ or similar. One could argue
> though that it could make sense on the basis that Load+Op and
> Load+Op+Store sequences are still "not particularly rare"...
>
Yep, you identified those 3 cycles. I have called this the IBM 360 pipeline
model in the past.

BGB

unread,
Jun 27, 2022, 11:11:02 PM6/27/22
to
Hmm, simple scenarios, assuming "reasonably efficient" code generation.

For x86-64, will assume the Win64 ABI.


int add4(int a, int b, int c, int d)
{ return(a+b+c+d); }

BJX2:
add4:
ADD R4, R5, R16 | ADD R6, R7, R17 //8B, 2c (interlock)
ADD R16, R17, R2 //4B, 1c
RTS //2B, 2c
Cost: 14 bytes, 5 cycles.

Or, size-optimizing:
add4:
ADD R4, R5 //2B, 1c
ADD R6, R7 //2B, 1c
ADD R5, R7, R2 //4B, 1c
RTS //2B, 2c
Cost: 10 bytes, 5 cycles


x86-64:
add4:
ADD ECX, EDX //2B, 0c ? (naive model)
ADD R8D, R9D //3B, 1c
MOV EAX, ECX //2B, 1c
ADD EAX, R8D //3B, 1c
RET //1B, 2c
Cost 11 bytes, latency ~ 5 cycles?...


int add8(int a, int b, int c, int d, int e, int f, int g, int h)
{ return(add4(a, b, c, d)+add4(e, f, g, h)); }


BJX2 (assuming we know nothing about the called function):
add8:
ADD -48, SP //2B, 1c
MOV LR, R1 //2B, 1c
MOV.Q R1, (SP, 40) //2B, 1c
MOV.Q R14, (SP, 32) //2B, 1c
MOV.X R12, (SP, 16) //2B, 1c
MOV.X R10, (SP, 0) //2B, 1c

MOVX R20, R10 //4B, 1c
MOVX R22, R12 //4B, 1c
BSR add4 //4B, 2c
MOV R2, R14 //2B, 1c

MOVX R10, R4 //4B, 1c
MOVX R12, R6 //4B, 1c
BSR add4 //4B, 2c

ADD R14, R2 //2B, 1c

MOV.Q (SP, 40), R1 //2B, 1c
MOV.X (SP, 0), R10 //2B, 1c
MOV.X (SP, 16), R12 //2B, 1c
MOV.Q (SP, 32), R14 //2B, 1c
ADD 48, SP //2B, 1c
JMP R1 //2B, 2c

Cost, 52 bytes, 22 cycles (excluding called functions).
32 cycles with called functions.


BGBCC would not currently be able to generate the above though.

This has a penalty case because the function has 8 arguments in
registers, and a function call requires putting 4 of them "somewhere
non-volatile".


I guess, "what if we know which registers the called function stomps"?

add8:
MOV LR, R1 //2B, 1c
BSR add4 //4B, 2c
MOV R2, R19 //2B, 1c
MOVX R20, R4 //4B, 1c
MOVX R22, R6 //4B, 1c
BSR add4 //4B, 2c
ADD R19, R2 //2B, 1c
JMP R1 //2B, 2c

Cost, 24B, 11 cycles.
Combined, 22 cycles.

However, the C ABI rules wouldn't really allow this.



x86-64 (following ABI rules):
add8:
PUSH RBP //3B, ~ 1
MOV RBP, RSP //3B, ~ 1
SUB RSP, 40 //7B, ~ 1
CALL add4 //5B, ~ 2
MOV [RBP-8], RAX //4B, ~ 1
MOV RCX, [RBP+40] //4B, ~ 1
MOV RDX, [RBP+48] //4B, ~ 1
MOV R8, [RBP+56] //4B, ~ 1
MOV R9, [RBP+64] //4B, ~ 1
CALL add4 //5B, ~ 2
ADD EAX, [RBP-8] //4B, ~ 3
MOV RSP, RBP //3B, ~ 1
POP RBP //3B, ~ 1
RET //1B, ~ 2

Cost: 54 bytes, 19 cycles.


x86-64 (lax ABI rules):
add8:
CALL add4 //5B, ~ 2
PUSH RAX //3B, ~ 1
MOV RCX, [RSP+40] //5B, ~ 1
MOV RDX, [RSP+48] //5B, ~ 1
MOV R8, [RSP+56] //5B, ~ 1
MOV R9, [RSP+64] //5B, ~ 1
CALL add4 //5B, ~ 2
POP RCX //3B, ~ 1
ADD EAX, ECX //2B, ~ 1
RET //1B, ~ 2

Cost: 39 bytes, est ~ 13 cycles.

(This breaks several major rules for the C ABI, so probably wouldn't be
allowed).


As noted above, the ABI only having 4 arguments in
registers turns out to be helpful in this case, as it is not necessary
to save or restore them before making another call.


In this scenario though, x86-64 does appear to be putting up a pretty
good fight.

...


> <
>>
>>
>> Though, BGBCC tries to minimize the number of loads and stores, whereas
>> a lot of generated x86-64 code uses memory operands fairly often.
>>
>> Granted, one can make use of having a larger register space (32 or 64)
>> by statically-assigning commonly used variables to registers, which is
>> less viable with 16 registers.
>>
>>
>> Ironically, this is an area where RISC-V could have an advantage with
>> the 'A' extension, which does allow a limited set of operations to use
>> direct memory operands. But, I am less a fan of this, as I would assume
>> sticking with plain Load/Store unless there is a good reason to do
>> otherwise.
> <
> The A option basically mandates the pipeline grow by 3 cycles.

Possibly would need to do something, if one assumes they can't just
casually shove an ALU into the middle of the Load or Store operation.


>>
>> In effect, doing something like this would likely involve needing to
>> stick an additional ALU into the L1 D$ or similar. One could argue
>> though that it could make sense on the basis that Load+Op and
>> Load+Op+Store sequences are still "not particularly rare"...
>>
> Yep, you identified those 3 cycles. I have called this the IBM 360 pipeline
> model in the past.

OK.

MitchAlsup

unread,
Jun 28, 2022, 1:34:46 PM6/28/22
to
Thanks.

Terje Mathisen

unread,
Jun 28, 2022, 4:38:25 PM6/28/22
to
If the constant already exists in a register from a previous load, then
obviously you just reuse it, but when the constant is consumed each time
then you allocate one register up front just to reduce the code size
when iterating a loop.

I suspect the loop count needs to be high, and/or the constant happens
to be the 2/4/8 bytes that increases the loop size past some performance
boundary, i.e. VVM won't work?

The key idea is that all such inner loops fits nicely in $L1 code cache,
so preloading constants just increases the total code size in order to
reduce the loop size by something less.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,
Jun 28, 2022, 4:56:44 PM6/28/22
to
On Tuesday, June 28, 2022 at 3:38:25 PM UTC-5, Terje Mathisen wrote:
> Ivan Godard wrote:
> > On 6/26/2022 6:18 PM, MitchAlsup wrote:
> >> My 66000 never has to load a constant to a register. Somewhere close
> >> to 9% of my instruction count vanishes by uniformly supplying large
> >> constants.
> >
> > Doesn't have to, but isn't it advisble when a constant has widespread uses?
> If the constant already exists in a register from a previous load, then
> obviously you just reuse it, but when the constant is consumed each time
> then you allocate one register up front just to reduce the code size
> when iterating a loop.
>
> I suspect the loop count needs to be high, and/or the constant happens
> to be the 2/4/8 bytes that increases the loop size past some performance
> boundary, i.e. VMM won't work?
>
> The key idea is that all such inner loops fits nicely in $L1 code cache,
> so preloading constants just increases the total code size in order to
> reduce the loop size by something less.
<
Both sides of the argument are correct:
a) there are circumstances where code density increases
b) there are circumstances where code performance increases
BOTH ways (preload constant, use directly)
<
And to a certain extent, dependent on whether you are compiling for
performance or density.
<
With any reasonable code cache, the differences are expected to be
tiny. Processing of constants in PARSE and DECODE does not take cycles
on anything bigger than the most minimal implementation ever to be
considered. This implementation would have to be fetching Words from
the I$. My current lowest end fetches quadwords--which is the same size
as the D$ due to misaligned accesses being "fast".*
<
Apparently, one has to look for this situation in an earlier pass of LLVM
in order for the code generator to make use of it.
<
ALSO NOTE: In most implementations, constants do not count against the
maximal number of instructions in a VVM loop. In effect, like Scalar registers,
they are simply pre-delivered values sitting in the instruction queue.
<
And finally, this is one of those things one can fix in the compiler later..........
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
<
(*) As a design point, I like to configure the I$ and D$ such that one
(the D$) can be designed, and then another instance used as the I$
(with a few sequences simply not being used in the I$ block). Verilog
gate-eater will remove the logic not in use, so there is no waste in area
(however tiny). AND making 2 instances of 1 cache design is vastly
easier in the verification side of things, this also saves SCHEDULE.
<
You may choose to do otherwise.

Brett

unread,
Jun 29, 2022, 5:34:57 PM6/29/22
to
59 instructions is overkill, short ACC instructions will only use a
fraction of that.
I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
bits.
I also have 4 loads and 3 stores with no offsets which would be in the
load/store major blocks. Tempted to expand these out with a very short
offset for the 64 bit ops to fill out the major block.
This can all be added to My 66000 using free opcode major blocks without
breaking anything.

Two source, Reg op ACC = ACC

Add, Sub, Mul, And, Or, XOR, NAND,


Two source, const op ACC = ACC

5 bit add, 5 bit sub, 5 bit shift left, 5 bit shift right, bit set, bit
clear.


One source Reg op = ACC

NOT, move to ACC, move from ACC


Load base [Reg] = ACC

Load8, load8signed, load 32, load 64,


Store ACC = [Reg]

Store 8, store 32, store 64,


This may give less code compression than traditional forms, which would
explain why it is not used so far. But it is simpler and easier to
implement.

Stephen Fuld

unread,
Jun 29, 2022, 6:12:08 PM6/29/22
to
On 6/29/2022 2:34 PM, Brett wrote:

snip

> 59 instructions is overkill, short ACC instructions will only use a
> fraction of that.
> I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
> bits.
> I also have 4 loads and 3 stores with no offsets which would be in the
> load/store major blocks. Tempted to expand these out with a very short
> offset for the 64 bit ops to fill out the major block.
> This can all be added to My 66000 using free opcode major blocks without
> breaking anything.

Not quite. You have reduced the range of the offsets in control
transfer instructions by a factor of 2. You either need a 16 bit NOP,
or the ability to have 32 bit instructions on 16 bit boundaries. If you
have a NOP, you either need the ability to throw this away in an early
pipeline stage, or it costs you a cycle. And, of course, you loose some
scheduling flexibility when multiple instructions have the Accumulator
as a destination.

I may be wrong, but it seems like a lot of work for a very modest benefit.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Jun 29, 2022, 6:12:54 PM6/29/22
to
Well there is the rule where the low order 2-bits of IP have to equal zero.
<
But, there are good reasons accumulator design style fell from fashion.
<
Say:: increment stack pointer::
<
MOV ACC, SP
ADD ACC,4
MOV ACC,SP
<
>
> Two source, Reg op ACC = ACC
<
If you are going to write in left to right arithmetic notation use =: or =>
>
> Add, Sub, Mul, And, Or, XOR, NAND,
>
>
> Two source, const op ACC = ACC
>
> 5 bit add, 5 bit sub, 5 bit shift left, 5 bit shift right, bit set, bit
> clear.
>
>
> One source Reg op = ACC
>
> NOT, move to ACC, move from ACC
>
>
> Load base [Reg] = ACC
>
> Load8, load8signed, load 32, load 64,
>
>
> Store ACC = [Reg]
>
> Store 8, store 32, store 64,
>
>
> This may give less code compression than traditional forms, which would
> explain why it is not used so far. But it is simpler and easier to
> implement.
<
The only thing it seems to buy is if you can leave the loop variable in ACC,
which looks tough. I would rather do PDP-11 style.
<
But, look, you want it, why don't you go spend the time to::
a) invent the ISA
b) port the compiler
c) teach the compiler about accumulator code sequences
d) build a simulator
and come back with a report.
<
As I have stated in the past:
I am not interested in this,
but I encourage you to pursue your interests
<
And as a hint I rarely give out:: you get to the end faster by writing the compiler first
and letting the ISA bend itself to the compiler rather than the other way around.
<
I spent 9 years of my career doing interpreters and compilers. My 66000 ISA
is a result of my encounters with instructions which worked with the compiler
(2-operand 1-result) and what didn't (condition codes). Also note: I did pure RISC
designs and x86-64 designs and used these experiences to moderate my views
on both sides. Thus: [Rbase+Rindex<<scale+Displacement] made the cut while
[Rs++] and *[Rs] did not; why ENTER and EXIT made the cut, but POLY did not.
<

MitchAlsup

unread,
Jun 29, 2022, 6:15:39 PM6/29/22
to
On Wednesday, June 29, 2022 at 5:12:08 PM UTC-5, Stephen Fuld wrote:
> On 6/29/2022 2:34 PM, Brett wrote:
>
> snip
> > 59 instructions is overkill, short ACC instructions will only use a
> > fraction of that.
> > I have 16 integer instructions which is 4 bits plus a 5 bit register for 9
> > bits.
> > I also have 4 loads and 3 stores with no offsets which would be in the
> > load/store major blocks. Tempted to expand these out with a very short
> > offset for the 64 bit ops to fill out the major block.
> > This can all be added to My 66000 using free opcode major blocks without
> > breaking anything.
<
> Not quite. You have reduced the range of the offsets in control
> transfer instructions by a factor of 2. You either need a 16 bit NOP,
> or the ability to have 32 bit instructions on 16 bit boundaries. If you
> have a NOP, you either need the ability to throw this away in an early
> pipeline stage, or it costs you a cycle. And, of course, you loose some
> scheduling flexibility when multiple instructions have the Accumulator
> as a destination.
<
In any seriously wide implementation (6-8-10 wide) an accumulator is
no better than a pure stack, and you make HW jump through tons of hoops
to attain reasonable performance.
>
> I may be wrong, but it seems like a lot of work for a very modest benefit.
<
I agree with you, but there is always the chance I could be wrong; so I encourage
him to go out and spend the effort to design ISA, port compiler, build simulator
and then come back and report.

Brett

unread,
Jun 30, 2022, 4:02:05 AM6/30/22
to
You would use a long opcode to do that in one instruction.

This is NOT an accumulator design, it’s a normal RISC design that adds
short form ACC instructions, that are simple one for one replacements of
long instructions.

This is normal RISC, not an accumulator design, don’t be tricked by the ACC
name which is merely an implied register name for opcode size reduction.

NO complex compiler work. You pick dependency chains and attempt to
substitute ACC as the register name to use the short opcodes.

MOV ACC is just there for crazy special cases, I expect it to NEVER be
used, as it increases opcode count and dependency chain length. And
violates the one for one opcode replacement rule.

You can go 10 wide as ACC is just register 32, and is renamed like any
other register.

This is not just for the loop counter, any code where you do not use a
variable twice can be converted to ACC code. A = B * C + D, is three loads,
two ops and a store with no reuse of names, and so is ACC code. And it does
not matter if one of these variables is used twice, that just means one
less short opcode.

You DO need to add ACC as a dest/source for long load/store instructions as
the short form load/store is completely inadequate. Or you can use R1 as
ACC which is an idea you hated. ACC is just a register name, there is
nothing special about it.

It could also be nice if most/all the long integer instructions have an ACC
form to connect to the short instructions if you do not use R1. You have
the opcode space for this, but this is just an optional overkill max code
compression feature.

I forgot about branches and jumps in my ACC opcode list, mostly because
branches need condition codes and offset bits, and could chew up an entire
major block unless there is a small subset that is doable. The median
branch is like 4 instructions…

An IF-THEN-ELSE has a short unconditional jump that can be made into a
short instruction.

Again, this is normal RISC, not an accumulator design, don’t be tricked by
the ACC name which is merely an implied register name used for opcode size
reduction.

I do expect a 30% code size reduction which is close to other 16 bit opcode
forms.

Brett

unread,
Jun 30, 2022, 6:24:54 AM6/30/22
to
I have a new name for my invention:

Implied Register Short Opcodes (IRSO)

All IRSO opcodes use the IMP register which is typically mapped to R1.
IRSO opcodes are shorter because instead of three register specifiers there
is only one, with IMP used as a source and/or dest, plus optionally another
register. The value in IMP is of course destroyed when a new IRSO opcode is
executed, which works like x86 does.

I did not think this approach was patentable, apparently I am wrong. In any
case this idea is free to use for all.

There is an optional extension which adds a IMP2 register specifier
typically mapped to R32, which is only useful if short IRSO loads are used,
which is almost never. In this mode sequential IRSO loads push the old IMP
value into IMP2, there is a second set of IRSO integer opcodes with no
register specifiers, using instead IMP and IMP2 with the result in IMP.

Brett

unread,
Jun 30, 2022, 3:50:55 PM6/30/22
to
Another extension:

Implied Address Register Extension (IADE) for IRSO. (Implied Register Short
Opcodes)

IADE uses the IAD register which is typically mapped to R2.
This enables a reasonable offset size for short IRSO loads/stores.

A short opcode has 6 bits of major opcode and only 10 bits of instruction
payload, one bit specifies load or store, one bit specifies 32 bit or 64
bit data, and the other 8 bits specify a scaled offset to the data. The
address register is IAD and the data is in or goes to the IMP register.
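
In C terms, decoding one of these short load/store opcodes might look
roughly like this (the exact bit positions are assumed for illustration;
only the field widths follow the description above):

#include <stdint.h>

typedef struct {
    unsigned major;      /* 6-bit major opcode group                      */
    unsigned is_store;   /* 0 = load, 1 = store                           */
    unsigned is_64;      /* 0 = 32 bit data, 1 = 64 bit data              */
    unsigned offset;     /* byte offset = 8-bit field scaled by data size */
} IrsoMem;

/* Data goes to/from IMP; the base address comes from IAD. */
IrsoMem irso_decode_mem(uint16_t insn)
{
    IrsoMem d;
    d.major    = (insn >> 10) & 0x3F;
    d.is_store = (insn >>  9) & 1;
    d.is_64    = (insn >>  8) & 1;
    d.offset   = (insn & 0xFF) * (d.is_64 ? 8u : 4u);
    return d;
}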

If there is a second IADE opcode the address register is IAD2, or IAD3 for
a third implied address register, mapped typically to R3, R4.

The IADE opcodes make the IMP2 register more useful, converting more
instructions to short form. I expect that code compression could exceed
40%, which is BETTER than all the other 16 bit opcode formats.

The cost for implementation in the full form is four major opcode groups
and 16 bit alignment of opcodes, which will cost you an offset bit in
branches/jumps/calls. A minimal implementation needs two major opcode
groups, IRSO integer ops and one set of IADE load/store ops.

The full set of IRSO with three IADE will also cost 4 registers of 32,
which is not significant. However if one adds an additional register
coloring pass to the compiler, then the freed up registers that were mapped
to R1-R4 can be used again meaning there is no register cost. I do not
expect anyone to bother with such compiler changes for the first decade of
this instruction set use.

Here you thought My66000 had plenty of spare opcode bits, I am going to use
half of those reserved major groups? ;) ;) ;)

Brett

unread,
Jul 1, 2022, 2:38:10 AM7/1/22
to
Another change, with useful short loads and more of them I am getting rid
of the wacky Mill-like double load. Instead sacrificing one bit of offset
to decide on IMP or IMP2. it’s not clear that IMP2 should be visible or
not, so maybe it stays mapped to R32. As described IMP2 is only accessible
in concert with IMP for integer ops.

This change to IMP2 makes strands of IMP opcodes work, a major oversight.
Strands of reasonable length basically do not exist so I had overlooked
this, even though I used such an example.

A unified integer/float register file works well with this architecture.
You would need another major block to add the IMP/IMP2 float operations.

Stephen Fuld

unread,
Jul 4, 2022, 12:45:16 PM7/4/22
to
On 6/30/2022 1:02 AM, Brett wrote:

big snip about adding 16 bit instructions to a 32 bit ISA


> I do expect a 30% code size reduction which is close to other 16 bit opcode
> forms.

That seems very optimistic. 30% code size reduction means that 60% of
the instructions have to be 16 bit. I just don't think you can get that
much.
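
(Checking the arithmetic: with a fraction f of instructions at 2 bytes and
the rest at 4 bytes, the average size is 2f + 4(1-f) bytes; a 30% reduction
from 4 bytes means 2.8 bytes, so f = 0.6, i.e. 60% of instructions.)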

I did figure out that, assuming your 16-bit subset is a pure subset of
the 32-bit ISA, then you don't need a 16-bit NOP instruction to "fill
out" 32 bits. When you replace a sequence of 32-bit instructions with
16-bit ones, if the last instruction is on an even boundary, which would
otherwise require a NOP to get back to 32-bit alignment, you simply
replace the last 16-bit instruction with its 32-bit equivalent. This
costs you nothing, as you would have wasted the last 16 bits anyway, and
it doesn't cost any extra cycles.

BGB

unread,
Jul 4, 2022, 1:37:02 PM7/4/22
to
On 7/4/2022 11:45 AM, Stephen Fuld wrote:
> On 6/30/2022 1:02 AM, Brett wrote:
>
> big snip about adding 16 bit instructions to a 32 bit ISA
>
>
>> I do expect a 30% code size reduction which is close to other 16 bit
>> opcode
>> forms.
>
> That seems very optimistic.  30% code size reduction means that 60% of
> the instructions have to be 16 bit.  I just don't think you can get that
> much.
>

60% / 40% (16 vs 32) is about what I am able to get in "size optimized"
mode with my ISA. This seems to be about where it converges.

However, for speed-optimized code, it tends to be closer to around 20% /
40% (16 vs 32).


I don't think one can get much better than this, at least with 16/32.

Had experimented with something like 16/24/32, which would have had
24-bit encodings for Load/Store (Disp5) and 3R/3RI ALU ops (and 2R forms
with access to 32 GPRs). This didn't really save enough to be worthwhile
though (and made a mess of some other things), so I basically dropped it
and later reused this part of the encoding space for the 32-bit XGPR
encodings.

Was, IIRC:
60% / 40% (70% vs Fixed-32b)
Vs:
60% / 20% / 20% (65% vs Fixed-32b)
Makes too much of a mess of things to justify saving ~ 5%...
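
(Working the averages out: 0.6*2B + 0.4*4B = 2.8B, about 70% of 4B, while
0.6*2B + 0.2*3B + 0.2*4B = 2.6B, about 65% of 4B; so the 24-bit forms only
bought roughly 5 percentage points.)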



Some factors which affect this:
Register Selection:
Size optimized code needs to avoid using the higher registers.
Opcode Selection:
The smallest encoding is not necessarily the fastest encoding.
WEX:
Generally can't use WEX and have any hope at small code size;
In size-optimized mode, the WEXifier is effectively disabled.


In the past, I had some 48-bit encodings, which could help, but these don't
exist in the current form of the ISA, where their role is filled by the
64-bit encodings; as noted, though, 64 is greater than 48. The savings would
be fairly small, as the 64-bit encodings represent a small minority of the
total instruction count.

However, this does mean that in some cases, two-op sequences can be more
compact than a single instruction:
LDIZ Imm24u, R0 //4B
MOV.x (GBR, R0), Rn //2B
which is fewer bytes than:
MOV.x (GBR, Disp33s), Rn //8B

In this case, it is a tradeoff between saving 2B or spending an extra
clock-cycle.


> I did figure out that, assuming your 16 bit subset is a pure subset of
> the 32 bit ISA, then you don't need a 16 bit NOP instruction to "fill
> out" 32 bits.  When you replace a sequence of 32 bit instructions with
> 16 bit ones, if the last instruction is on an even boundary, which would
> otherwise require a NOP to get back to 32 bit allignment, you simply
> replace the last 16 bit instruction with its 32 bit equivalent.  This
> costs you nothing as you would have wasted the last 16 bits anyway, and
> it doesn't cost any extra cycles.
>

Yes, BGBCC does this trick in some cases.

I had at one point considered modifying the rules so that bundles (and,
by extension, any WEXified sequences) would require 32-bit alignment,
with a little repacking trickery being used to minimize the number of
NOPs that needed to be added to do so.

I ended up backing off on this idea, after noting that this had a
significant adverse effect on code density:
Speed optimized code would have been essentially the same size as had I
just used purely 32-bit encodings...

Stefan Monnier

unread,
Jul 4, 2022, 2:25:03 PM7/4/22
to
> Was, IIRC:
> 60% / 40% (70% vs Fixed-32b)

I assume this is with 32bit instructions that can be 16bit aligned, right?

If you impose 32bit alignment on 32bit instructions, how much worse does
it get?

Also, if you impose 32bit alignment on 32bit instructions, you can look
at it not as 16bit instructions but as 32bit "instruction pairs".
At that point you might look at what kinds of pairs you're using, what
kinds of pairs you're *not* using and what kind of additional pairs might
be worth having instead.
E.g. maybe it is better to make your pairs asymmetric (14bit + 18bit),
e.g. to allow slightly larger offsets in short load instructions
(assuming the size of the offset is a common reason for loads to be
forced to use the long form rather than the short form).


Stefan

MitchAlsup

unread,
Jul 4, 2022, 2:45:36 PM7/4/22
to
On Monday, July 4, 2022 at 1:25:03 PM UTC-5, Stefan Monnier wrote:
> > Was, IIRC:
> > 60% / 40% (70% vs Fixed-32b)
> I assume this is with 32bit instructions that can be 16bit aligned, right?
>
> If you impose 32bit alignment on 32bit instructions, how much worse does
> it get?
<
What happens if the rule is that pairs of 16-bit instructions have to use
different function units?

luke.l...@gmail.com

unread,
Jul 4, 2022, 3:01:20 PM7/4/22
to
On Monday, July 4, 2022 at 5:45:16 PM UTC+1, Stephen Fuld wrote:

> otherwise require a NOP to get back to 32 bit allignment, you simply
> replace the last 16 bit instruction with its 32 bit equivalent. This
> costs you nothing as you would have wasted the last 16 bits anyway, and
> it doesn't cost any extra cycles.

if you don't mind a 48-bit FIFO it is quite easy to have 32-bit
instructions aligned on 16-bit boundaries. of course, if branches
drop the 2 LSBs of addresses (esp. immediate-relative-branches)
then wark.

l.

MitchAlsup

unread,
Jul 4, 2022, 4:03:36 PM7/4/22
to
I don't understand the meaning associated with "wark"
<
It is not really a 48-bit FIFO; it is a 3×½-word buffer/FIFO.
The unit of movement is ½ word, not 1 bit.
<
The smallest My 66000 implementation will contain at least 5 entries of
instruction buffering (latch array), each entry 4 words long, with
3 ports (1 in PARSE, 1 in DECODE, 1 in StoreData, which is in PARSE)
<
Within this buffer is scan-ahead logic which annotates the
starting point of the next instruction (variable length), and looks
for PIC control transfers so instructions at those targets can be
fetched before the branch is DECODEd, shrinking the minimum
cycle count of the branch from 2 cycles to 1.108 cycles even
without a delay slot or a branch target buffer. Unconditional
branches and RETs may be pulled entirely out of the instruction
stream.
<
So, one would expect such a buffer to already exist and be large
enough. As currently configured it works on 1×word boundaries.
<
All labels would have to be on word boundaries.
>
> l.

BGB

unread,
Jul 4, 2022, 7:14:29 PM7/4/22
to
On 7/4/2022 1:24 PM, Stefan Monnier wrote:
>> Was, IIRC:
>> 60% / 40% (70% vs Fixed-32b)
>
> I assume this is with 32bit instructions that can be 16bit aligned, right?
>

Yes, in BJX2 all of the instructions (and bundles) have 16-bit
alignment, and 32-bit instructions are defined in terms of pairs of
16-bit words.

In my notation, F0nm_ZeoZ, the words are effectively transposed, so if
seen directly as a little-endian DWORD, it would be:
ZeoZF0nm

Or, in bytes: nm_F0_oZ_Ze

There is also a funky bit-ordering effect, because bit-order is
effectively big endian in terms of the instruction words, but little
endian within the instruction words.



> If you impose 32bit alignment on 32bit instructions, how much worse does
> it get?
>

Significantly.

Crude estimate is that, for the 40% of 32-bit instructions, roughly half
would require converting the adjacent instruction to 32-bit, turning it
from 60/40 to around 40/60.

For speed-optimized code, it would nearly entirely knock 16-bit
encodings out of the picture.


This is assuming a fairly ad-hoc mix of 16 and 32-bit encodings.


There would be less impact if things were clustered into separate
instruction blocks (more like how Thumb worked on the ARM7T or similar);
or how it "would have worked" on the SH-5 (which would have been
basically "Sort of like MIPS but with 64 registers", *1).

*1: Not sure if SH-5 was effectively also another casualty of the
Itanic; either way, I think at the time Hitachi folded its CPU design
arm out into Renesas, who were apparently much more invested in keeping
the SH2 and SH4 going.


I didn't really want to go this direction though.


I had at one point considered fixed-length 64 bit bundles, say:
1x 60-bit (large instruction)
2x 30-bit (medium instruction)
3x 20-bit (small instruction)


But, as noted, didn't go this direction either:
When I later partially rebooted my BJX1 project into BJX2, I stayed with
a 16/32 instruction format (so, similar category as Thumb2 or RVC).

I did somewhat reorganize the encodings in an attempt to make things
more consistent (as by the end, the original BJX1 instruction encodings
had turned into an awful mess).

As can be noted, a fair chunk of the original parts were carried over
from one project to another.


> Also, if you impose 32bit alignment on 32bit instructions, you can look
> at it not as 16bit instructions but as 32bit "instruction pairs".
> At that point you might look at what kinds of pairs you're using, what
> kinds of pairs you're *not* using and what kind of additional pairs might
> be worth having instead.
> E.g. maybe it is better to make your pairs asymmetric (14bit + 18bit),
> e.g. to allow slightly larger offsets in short load instructions
> (assuming the size of the offset is a common reason for loads to be
> forced to use the long form rather than the short form).
>

Having 32-bit instruction words which could be split into a logical pair
of 16-bit instructions is effectively what ISAs like Qualcomm Hexagon
and similar did.


MitchAlsup

unread,
Jul 4, 2022, 7:18:38 PM7/4/22
to
On Monday, July 4, 2022 at 6:14:29 PM UTC-5, BGB wrote:

> There is also a funky bit-ordering effect, because bit-order is
> effectively big endian in terms of the instruction words, but little
> endian within the instruction words.
<
Bad call..........go entirely BE or LE.

BGB

unread,
Jul 4, 2022, 8:53:43 PM7/4/22
to
Consistent bit ordering is made more complicated when dealing with
variable length instruction encodings, so a certain level of "dog
chewing" is seemingly inevitable.


But, say:
F0xx_Cyyy
Is actually:
CyyyF0xx (or: xx_F0_yy_Cy)
Which is decoded as, say:
000xxyyy or FFFxxyyy

Or, FEaa_bbcc_FAdd_eeff
Is actually:
ffeeFAddccbbFEaa (or: aa_FE_bb_cc_dd_FA_ee_ff)
But, decoded as:
0000aabbccddeeff


Still, I feel this situation in BJX2 isn't quite as bad as it is in
RISC-V (and nowhere near the same levels of shifting and masking needed
for manual bit twiddling).

Like, the subtlety that may not be immediately obvious with RISC-V is
that while many of the instructions have consecutive runs of bits, those
runs are frequently not consecutive in terms of how one needs to decode
the value.

BGB

unread,
Jul 4, 2022, 8:54:17 PM7/4/22
to
Mine does a 96-bit fetch each cycle from a pair of cache-lines.

Say, 2x128b (2 cache lines).
Select (191:0) or (255:64) based on Bit 3
Select (127:0) or (159:32) based on Bit 2
Select ( 95:0) or (111:16) based on Bit 1
Currently, I$ ignores Bit 0.

Granted, potentially not the most LUT efficient strategy.
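
In C, the three selects compose to a single halfword offset, roughly like
this (a sketch of my reading of the cascade, not the actual Verilog):

#include <stdint.h>
#include <string.h>

/* 'pair' is the 32 bytes from the two fetched 128-bit lines; PC bits
   3..1 pick which of the 8 halfword positions the 96-bit window starts
   at (4*bit3 + 2*bit2 + 1*bit1); bit 0 is ignored by the I$. */
static void fetch96(const uint16_t pair[16], uint32_t pc, uint16_t out[6])
{
    unsigned hw = (pc >> 1) & 7;
    memcpy(out, &pair[hw], 6 * sizeof(uint16_t));
}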

Fetch also determines the bundle length, which is fed (along with the
current PC) through the branch predictor (which operates with the
instruction from the ID1 stage), effectively selecting whether to
advance the PC by one instruction, or to branch to a different location.
The result then comes back to the I$, and becomes the fetch address for
the next clock cycle.


The LSB of PC is not used during fetch, but may be used (in a branch) to
trigger an Inter-ISA jump.

I had used both low-order bits for this, where:
x0: Normal Addr (16b aligned)
01: Jump to "other ISA" (32b aligned).
11: Use link-register semantics (currently used mode).

Does work, but requires 32-bit alignment of the target during an
inter-ISA jump.

Could change this to:
x0: Normal
x1: Use Link-Register Semantics
This can encode Inter-ISA implicitly via bits 63:48.
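
As a sketch (my reading of the two schemes, in C):

#include <stdint.h>

/* Current scheme: x0 -> normal, 01 -> inter-ISA jump, 11 -> link-reg. */
static int is_inter_isa(uint64_t target) { return (target & 3) == 1; }
static int is_link_sem (uint64_t target) { return (target & 3) == 3; }

/* Proposed scheme: only bit 0 carries meaning (link-register semantics),
   with inter-ISA encoded implicitly via bits 63:48 instead. */
static int is_link_sem2(uint64_t target) { return (target & 1) == 1; }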

The issue would be that this would break binary compatibility with any code
currently depending on the original behavior (unless Bit 1 is ignored); my
existing ROM code and similar always use '11' above. Then again, this may
be a "change behavior soon or be forever stuck with it" scenario.

...

Agner Fog

unread,
Jan 30, 2023, 3:18:29 AM1/30/23
to
Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.

I have some questions about the loop vectorization mechanism: How can the CPU hardware convert sequential code to SIMD? If the loop count is not known at the loop entry, the hardware must decode the entire loop before it can know the necessary vector length and what SIMD resources to use. Where does it store the decoded loop? What if the loop is too big to store in the CPU? What about branches? Nested loops? Function calls inside loops?

MitchAlsup

unread,
Jan 30, 2023, 1:50:03 PM1/30/23
to
On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
>
Hello, Agner, and welcome to comp.arch.
>
> Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.
<
Email me. I can send you "Principles of Operation" (a.k.a. the ISA) and
"Software" without NDA; access to "System" is still under NDA--which we can
discuss in e-mail.
>
> I have some questions about the loop vectorization mechanism:
> How can the CPU hardware convert sequential code to SIMD?
>
{The below reads like I am talking with someone who already
has a few glints about how Virtual Vector Method works. Sorry.}
<
Hardware recognizes that the loop has narrow width and that several
iterations can be run in parallel at the same time. For example::
a byte copy loop can be recognized as "dense" and 8 bits wide.
Even the 1-wide in-order machine has 128-bit wide cache access
ports, so the HW can transform the loop into multiple lanes of 1-byte
LDs and STs, and thus perform the loop 16 iterations per loop
latency. This low-end machine only has a single cache port, so
the loop latency is at least 2 cycles (1 LD and 1 ST) and the LOOP
instruction runs in 1 cycle, so one can move 128 bits every
2 cycles (not impressive but illustrative.) The scalar 1-wide
machine takes 5 instructions to perform the loop and has moved
1 byte in 5 cycles.
So, the vectorized loop is 16 bytes every 2 cycles compared to
1 byte every 5 cycles, or 40× faster.
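
The loop in question is just the ordinary scalar source, e.g. (plain C,
nothing vector-specific appears in it):

#include <stddef.h>

/* An ordinary byte-copy loop. Under VVM the hardware observes that the
   body is dense and 8 bits wide and spreads iterations across its
   128-bit cache ports; the source and the ISA stay scalar. */
void byte_copy(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}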
>
Bigger machines with more resources and wider (or more) cache
ports can perform the loop faster. A 4-wide Out-of-Order machine
with two (2) cache ports, each 256 bits wide, can run this loop at
32 iterations per cycle.
>
Why VVM at all ????? This is really the question
<
1st generation RISC machines performed at 0.7 instructions per
cycle. Modern Great Big Massively Out of Order Machines perform
at 2.0 instructions per cycle (only 3× better).
<
A majority of all CPU cycles are spent in loops (say 80%-ish). If VVM
can make ½ of all loops run only 4× faster, then we have a little bitty
in-order machine performing at 1.5 IPC with a die footprint about
10% that of the GBMOoO machine.
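(The arithmetic behind that: 80% of cycles in loops, half of them
vectorized, is 40% of all cycles sped up 4×; total time becomes
0.6 + 0.4/4 = 0.7 of the original, i.e. roughly a 1.4×-1.5× throughput
gain for the little in-order machine.)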
<
We also have the property that vector width is not a part of the architectural
specification; so the same code that runs optimally on the LBIO machine
also runs optimally on the GBMOoO machines. This preserves the software
investment.
<
> If the loop count is not known at the loop entry, the hardware must decode the entire loop before it can know the necessary vector length and what SIMD resources to use.
<
No. Why do you think loop count alters the data-flow analysis the
hardware performs on behalf of the loop?
<
You need to remember: the Virtual Vector Method has no vector
(or SIMD) registers, and no trip count or masking register like vector
machines and SIMD machines have. Masking is performed lane by lane
with predication; vector length is as long as it wants and needs to
be.
<
VVM is not a HW mechanism that translates scalar code to vector
(or SIMD) code (instruction set). It is a data-flow analysis attached
to a SIMD-like set of flip-flops near the cache ports of an
implementation. You can call these registers SIMD, but SW cannot access
them and they disappear on interrupts, exceptions, and loop termination.
<
VVM also has the property that if an exception happens in the middle
of the loop, the loop collapses to scalar form, the exception is
processed, and SW debugging will see a scalar machine; when (if)
control returns, the VVM loop will be reinstalled when the LOOP
instruction is encountered.
<
So, My 66000 provides the 373 vector instructions and the 700-odd
SIMD instructions of RISC-V in exactly 2 instructions. At some point::
a Reduced Instruction Set Computer should have a reduced instruction
set! My 66000 currently has 62 instructions. In comparison, ARM-64
has 1730-ish instructions.
<
> Where does it store the decoded loop?
<
Different implementations are free to choose this for themselves.
<
But, in general, one would expect the instructions to be stored in
the equivalent of the reservation stations*. These stations have
become modified to hold onto the instruction (and scalar operands)
only waiting for the LOOP instruction to signal "another loop", and
wait for any dynamic operand(s). Think:: multi-fire RS.
<
(*) one could imagine a scoreboard, dispatch stack, or any kind
of instruction and data-flow matching instruction dispatch point
in an implementation. How it is done is implementation dependent.
<
> What if the loop is too big to store in the CPU?
<
The front part of the loop is installed, while the back part of the
loop is performed as if the loop remained scalar. This gradually
degrades to "a little faster than scalar" on huge loops.
<
> What about branches?
<
One can use predication in vectorized loops but not branches.
HW assumes that all taken control transfers terminate the loop.
HW assumes predication provides the if-then-else within the loop.
<
> Nested loops?
<
There is no nesting of vectorization; outer loops run scalar.
<
> Function calls inside loops?
<
Same as branches.

MitchAlsup

unread,
Jan 30, 2023, 9:11:11 PM1/30/23
to
On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
> Where can I find documentation on My 66000? I can only find bits and pieces here in comp.arch.
<
If your Google-fu is out of practice: email is:: Mitch...@aol.com
<

luke.l...@gmail.com

unread,
Jan 31, 2023, 8:00:15 AM1/31/23
to
On Monday, January 30, 2023 at 6:50:03 PM UTC, MitchAlsup wrote:
> On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
> >
> Hello, Agner, and welcome to comp.arch.

likewise

> You need to remember, The Virtual Vector Method has no vector
> (or SIMD) registers, no trip count or masking register like vector
> machines and SIMD machines. Masking is performed lane by lane
> with predication, vector length is as long as it wants and needs to
> be.

as long as the hardware has resources to autoallocate inflight uOps.
one thing Mitch does not mention is that VVM relies on identifying
Load-Process-Store loops. you cannot Vectorise In-register-only
loops (as best i know) because (as best i know) VVM only works
with Load-Process-Store. given that that is a massive percentage
of all general purpose loops you get a very high bang-per-buck.
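
e.g. (toy illustrations of the distinction, mine not Mitch's):

#include <stddef.h>
#include <stdint.h>

/* load-process-store: each iteration loads, computes, and stores --
   the shape VVM can spread across lanes. */
void scale(int32_t *dst, const int32_t *src, size_t n, int32_t k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* in-register-only: no memory traffic in the loop body, so (as best
   understood here) it stays scalar under VVM. */
uint64_t popcount_slow(uint64_t x)
{
    uint64_t c = 0;
    while (x) { c += x & 1; x >>= 1; }
    return c;
}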

> So, My 66000 provides the 373 vector instructions and the 700-odd
> SIMD instructions of RISC-V in exactly 2 instructions.

mmm... a little overstated but pretty much, yes :) the big big
advantage of VVM is you do NOT add one scalar instruction
then also a vector instruction and a tangled morass of
identical SIMD instructions: you just add one Scalar. therefore
anything missing is way easier to add.

> At some point::
> Reduced Instruction Set Computer should have a reduced instruction
> set! My 66000 currently has 62 instructions. In comparison ARM-64
> has 1730-ish instructions.

did you remember to include the 1000s from SVE/2? :)
all gone to hell and not even done well, sigh. non-orthogonal
as they tried to jam into 32bit.

> <
> > Where does it store the decoded loop?
> <
> Different implementations are free to choose this for themselves.
> <
> But, in general, one would expect the instructions to be stored in
> the equivalent of the reservation stations*. These stations have
> become modified to hold onto the instruction (and scalar operands)
> only waiting for the LOOP instruction to signal "another loop", and
> wait for any dynamic operand(s). Think:: multi-fire RS.

another way to think of it is, you have a loop short enough to identify
the start and end, therefore you micro-code back-end SIMD and
multi-issue to the back-end *implicitly*... all without actually needing
any EXPLICIT SIMD instructions.

separate in your mind "SIMD front-end ISA" from "SIMD back-end micro-architecture"
and you will do fine.

> > What about branches?
> <
> One can use predication in vectorized loops but not branches.
> HW assumes that all taken control transfers terminate the loop.
> HW assumes predication provides the if-then-else within the loop.
> <
> > Nested loops?
> <
> There is no nesting of vectorization. outer loops run scalar.

this is possible with SVP64, by storing Vectorisation State on the stack,
but SVP64 is a whole different ballgame and a different paradigm.

l.

BGB

unread,
Jan 31, 2023, 10:47:05 AM1/31/23
to
On 1/31/2023 7:00 AM, luke.l...@gmail.com wrote:
> On Monday, January 30, 2023 at 6:50:03 PM UTC, MitchAlsup wrote:
>> On Monday, January 30, 2023 at 2:18:29 AM UTC-6, Agner Fog wrote:
>>>
>> Hello, Agner, and welcome to comp.arch.
>
> likewise
>
>> You need to remember, The Virtual Vector Method has no vector
>> (or SIMD) registers, no trip count or masking register like vector
>> machines and SIMD machines. Masking is performed lane by lane
>> with predication, vector length is as long as it wants and needs to
>> be.
>
> as long as the hardware has resources to autoallocate inflight uOps.
> one thing Mitch does not mention is that VVM relies on identifying
> Load-Process-Store loops. you cannot Vectorise In-register-only
> loops (as best i know) because (as bext i inow) VVM only works
> with Load-Process-Store. given that that is a massive percentage
> of all generl purpose loops you get a very high bang-per-buck.
>
>> So, My 66000 provides the 373 vector instructions and the 700-odd
>> SIMD instructions of RISC-V in exactly 2 instructions.
>
> mmm... a little overstated but pretty much, yes :) the big big
> advantage of VVM is you do NOT add one scalar instruction
> then also a vector instruction and a tangled morass of
> identical SIMD instructions: you just add one Scalar. therefore
> anything missing is way easier to add.
>

But, it does mean that the CPU needs to be smarter.
The main advantage of SIMD is that, while it requires more instructions,
it allows for a comparably simpler and dumber CPU.

The main drawback case is when one needs to fake certain SIMD ops via
pipelining.


>> At some point::
>> Reduced Instruction Set Computer should have a reduced instruction
>> set! My 66000 currently has 62 instructions. In comparison ARM-64
>> has 1730-ish instructions.
>
> did you remember to include the 1000s from SVE/2? :)
> all gone to hell and not even done well, sigh. non-orthogonal
> as they tried to jam into 32bit.
>

It is a tradeoff.

I can note that SIMD in BJX2 has neither packed byte ops nor saturating ops.

If one mostly only does, say:
Packed Int16
Packed Int32
Packed Binary16
Packed Binary32

And partial:
A few Packed Binary64 ops.


That is comparably fewer ops.
If one skips the convert+op forms (as were present in NEON), this also
reduces the amount of encoding space needed.

Some other cases can be handled via converter ops:
Packed Byte;
Packed RGB555 / RGB444A3;
Packed FP8S/FP8U;
...


This does at least slightly reduce instruction cost.
One other trick (used by SuperH) is to encode parts of the ISA by
twiddling mode bits. Technically it sucks though.


>> <
>>> Where does it store the decoded loop?
>> <
>> Different implementations are free to choose this for themselves.
>> <
>> But, in general, one would expect the instructions to be stored in
>> the equivalent of the reservation stations*. These stations have
>> become modified to hold onto the instruction (and scalar operands)
>> only waiting for the LOOP instruction to signal "another loop", and
>> wait for any dynamic operand(s). Think:: multi-fire RS.
>
> another way to think of it is, you have a loop short enough to identify
> the start and end, therefore you micro-code back-end SIMD and
> multi-issue to the back-end *implicitly*... all without actually needing
> any EXPLICIT SIMD instructions.
>
> separate in your mind "SIMD front-end ISA" from "SIMD back-end micro-architecture"
> and you will do fine.
>

Or, one could encode it via prefix encodings, making all of the SIMD ops
64-bit or longer...

Pack4x32 prefix + FADD = Packed 4x Binary32 FADD.

Then one defines which combinations are or are not allowed.


If I were to do this in BJX2, it would probably consist of taking some
instructions which are not allowed in a WEX encoding, and redefining
their use in a WEX encoding as encoding a prefix modifier.

This is possible as I don't really want to burn much more of the 32-bit
encoding space on SIMD ops (as well, most SIMD ops on BJX2 generally do
not allow for operation in bundles, so not being able to encode
them in a bundled form would not be a huge loss).


>>> What about branches?
>> <
>> One can use predication in vectorized loops but not branches.
>> HW assumes that all taken control transfers terminate the loop.
>> HW assumes predication provides the if-then-else within the loop.
>> <
>>> Nested loops?
>> <
>> There is no nesting of vectorization. outer loops run scalar.
>
> this is possible with SVP64, by storing Vectorisation State on the stack,
> but SVP64 is a whole different ballgame and a different paradigm.
>

OK.

> l.
