It is a tradeoff.
As-is, I have a 96-bit address space, but this is likely overkill.
As noted, the registers themselves would be unstructured, but operations
on those registers could assume a certain amount of structure.
Sort of like how 4x Binary32 SIMD ops assume that the vector holds
4x Binary32 values.
As noted, pointer formats in my case are basically:
(47: 0): Address
(59:48): Type-Tag / Etc
(63:60): Top-Level Tag
But, for the most part, Load/Store ops, etc ignore the high 16 bits.
But, say, the top 4 bits are:
0000: Basic (Object) pointer
0001: Small Value Spaces
0010: Bound Array
0011: Bound Array
01xx: Fixnum (62 bit)
10xx: Flonum (62 bit)
1100: TagArray + Base Offset
1101: Dense Vector
1110: Typed Pointer
1111: Extended 60-bit Linear Address (Optional, 1)
1: This format would effectively map a 64-bit pointer to a range of
multiple "quadrants" (or, possibly a table of locations within the
larger 96-bit address space).
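As a rough sketch, decoding the 64-bit tagged-pointer layout above could
look like the following (helper names are made up for illustration; only
the field positions come from the description):

```c
#include <stdint.h>

// Illustrative accessors for the 64-bit tagged-pointer layout.
static inline uint64_t ptr_address(uint64_t p)  { return p & 0xFFFFFFFFFFFFull; }          // bits (47: 0)
static inline unsigned ptr_type_tag(uint64_t p) { return (unsigned)((p >> 48) & 0xFFF); }  // bits (59:48)
static inline unsigned ptr_top_tag(uint64_t p)  { return (unsigned)(p >> 60); }            // bits (63:60)

// Fixnum: top two bits are '01', low 62 bits hold the signed value.
static inline int is_fixnum(uint64_t p) { return (p >> 62) == 1; }
static inline int64_t fixnum_value(uint64_t p) {
    return ((int64_t)(p << 2)) >> 2;  // shift out the tag, sign-extend the 62-bit payload
}
```

Load/Store ops ignoring the high 16 bits then falls out naturally: they
just use `ptr_address()` and never look at the tag fields.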
There is a 128-bit format:
( 47: 0): Address (Low 48-bits)
( 59: 48): Type-Tag / Etc
( 63: 60): Top-Level Tag
(111: 64): Address (High 48-bits)
(127:112): Additional Tag Metadata
The formats for bounded-arrays are expanded somewhat.
The Fixnum and Flonum formats would expand to 124 bits in this case.
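One possible in-memory view of the 128-bit format (little-endian pair of
64-bit halves; struct and accessor names are made up for illustration,
only the bit positions come from the layout above):

```c
#include <stdint.h>

// Hypothetical view of the 128-bit tagged-pointer format.
typedef struct {
    uint64_t lo;   // (47:0) address low, (59:48) type tag, (63:60) top-level tag
    uint64_t hi;   // (111:64) address high 48 bits, (127:112) extra tag metadata
} TagPtr128;

static inline unsigned tp128_top_tag(TagPtr128 p) { return (unsigned)(p.lo >> 60); }
static inline uint64_t tp128_addr_lo(TagPtr128 p) { return p.lo & 0xFFFFFFFFFFFFull; }
static inline uint64_t tp128_addr_hi(TagPtr128 p) { return p.hi & 0xFFFFFFFFFFFFull; }
static inline unsigned tp128_meta(TagPtr128 p)    { return (unsigned)(p.hi >> 48); }
```

The full 96-bit address is then the concatenation of the two 48-bit
halves.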
Current format for function pointers and link register is:
( 0): Inter-ISA Bit
(47: 1): Address
(63:48): Mode Flags
If Addr(0) is 0, the high 16 bits are ignored (or Trap if the Link
Register is expected). If 1, the high bits encode the operating mode and
similar.
High bits:
* (63:56), Saved SR(15: 8), U0..U7
* (55:52), Saved SR(23:20), WX3, WX4, WM1, WM2
* (51:50), Saved SR(27:26), WXE, WX2
* (49:48), Saved SR( 1: 0), S and T
The WXn and WMn bits encode the operating mode (BJX2 vs XG2 vs RV64 vs
XG2RV).
U0..U7 are user-defined or context-dependent flag bits (2).
S and T are the values of the S and T flag bits (in the current form of
the ISA, these are saved across function calls as this makes it possible
to predicate blocks with function calls).
2: In an x86 emulator, it is possible that these could be used as a
stand-in for ALU status flags or similar. These are also preserved
across function calls. They may also be used for more complex
predication (predicating based on logical relations between U-bits).
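Pulling the pieces apart, a decode of the function-pointer / link-register
format could be sketched as below (accessor names invented; bit positions
per the layout above):

```c
#include <stdint.h>

// Illustrative decode of the function-pointer / link-register format.
static inline int      lr_inter_isa(uint64_t lr) { return (int)(lr & 1); }          // bit 0
static inline uint64_t lr_target(uint64_t lr)    { return lr & 0xFFFFFFFFFFFEull; } // bits (47:1), bit 0 cleared
// The fields below are only meaningful if the inter-ISA bit is set.
static inline unsigned lr_u_bits(uint64_t lr) { return (unsigned)(lr >> 56) & 0xFF; } // U0..U7 -> SR(15:8)
static inline unsigned lr_wx_wm(uint64_t lr)  { return (unsigned)(lr >> 50) & 0x3F; } // WXE/WX2/WX3/WX4/WM1/WM2
static inline unsigned lr_s_t(uint64_t lr)    { return (unsigned)(lr >> 48) & 0x3;  } // S and T -> SR(1:0)
```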
> I am not so sure about the slowness of a superscalar. I think it may have more
> to do with how full the FPGA is, the size and style of the design. I have been
> able to hit close to 50 MHz timing according to the tools. I think 50 MHz is on
> par with many other FPGA designs. Even if it is a few MHz slower the ability of
> OoO to hide latency may make it worth it.
>
Issue is mostly "net delay" (and to an extent, "fanout"):
The bulkier/wider/etc the logic gets, the slower it gets...
And, say, one can do a "simple 16-bit machine" and run it at 200 MHz.
But, not so much anything much bigger than said simple 16-bit machine.
Or a 32-bit machine running at 100 MHz (say, MicroBlaze falls in here).
But, a 3-wide 64-bit machine is limited to around 50 MHz.
Had gotten 1-wide variants running at 100 MHz, but not reliably.
A 1-wide core running at 75 MHz is a little easier to pull off though.
Making the caches bigger also makes them slower, etc.
But, going from 64-bit to 128-bit is likely to result in a notable
increase in area (particularly for superscalar or VLIW), which is likely
going to make the "net delay" issues a lot worse.
For 128-bit, one might end up, say, with a scalar machine that runs at
50 MHz, but if they want to go 2-wide, they need to drop it to 33 or 25 MHz.
On the other side, 32-bit machines, while one can get them running at
potentially 100 or (maybe) 150 MHz, are more limited in some areas (and
32-bit is starting to look a little "stale" at this point).
While a 16-bit core can get a higher clock-speed, a 16-bit machine is
too limited to really accomplish a whole lot (and pretty much anything
one can do to make it "actually useful" would either come at the expense
of reducing clock speed, or taking too many clock cycles to give it much
of an advantage).
But, yeah, the limitation for MHz on FPGA seems to be more about how
long it takes signals to propagate around the FPGA, rather than about
the speed of the individual LUTs and similar.
So, if a given piece of FPGA logic is small, it can be internally run at
a higher clock-speed. Though, for external IO, one is limited some in
that one can't drive the pins much faster than around 100 MHz without
using SERDES.
Decided to leave out stuff about clock speeds and external wiring
(signal integrity over wiring gets finicky as MHz increases).
> *****
> I cannot believe I ignored Cordic for so long, having discovered its beauty.
> Answering my own engineering question, 10-bits max micro-code it is; that
> should be overkill. It should not take very many, if any micro-code words for
> basic trig. Previously seven bits were used for Thor2021, but only about a half
> dozen non-math instructions were micro-coded.
>
> Having done some head scratching over cordic and how to calculate the tan,
> I am thinking of just providing a cordic instruction that takes all three
> arguments and calculates away, and then leave it up to the programmer to
> decide on operands and what is being calculated. I get how to calculate sin
> and cosine via cordic and I think tan finally. Specifying that x, y must be
> fractions between 0 and 1, and that the angle must be radians between 0
> and 2 pi.
>
> I gather that modern processor do not use Cordic, and use polynomial
> expansions instead.
>
Yeah.
I mostly used Taylor Series expansions and similar...
In my own fiddling, I didn't find much that is both faster than a Taylor
expansion and also gives similarly good accuracy.
"Lookup-Table and Interpolate" is faster, but falls short in terms of
accuracy.
A "reasonable balance" being to use a few table lookups followed by
cubic-spline interpolation.
This would "mostly be good enough" for use in games or similar, but the
C library doesn't really define any "fast but approximate" math
functions, leaving this sort of thing as non-standard extensions.
Say:
//do 'sin(x)', but faster and less accurate...
double sin_fast(double x);
One wouldn't do this with the normal "sin(x)" though as the assumption
is that these give an accurate value, rather than a fast value (and
programs that need "fast but inaccurate" sin/cos usually provide their
own lookup-table versions).
And, it is also not always clear where the optimal balance point should
be for "fast" variants (programs likely still using their own version if
the C library's is slower; or it being unusable if the accuracy isn't
good enough for a given calculation).
Faster but less accurate:
Single lookup, don't bother with interpolation.
Slower but more accurate:
Taylor expansion but with reduced stages (vs the full version).
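A reduced-stage Taylor version could be sketched as below (5 terms,
Horner-evaluated; assumes the argument has already been range-reduced to
roughly [-pi, pi], which a full version would handle first):

```c
// Taylor-series sin with a reduced number of stages:
// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated Horner-style.
// Assumes x is already range-reduced to roughly [-pi, pi].
double sin_taylor(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 +
                x2 * (-1.0 / 5040 + x2 * (1.0 / 362880)))));
}
```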
> For Thor Cordic is calculated out to 61-bits as that is eight more than the
> 53-bit significand.
>
> 13k LUTs to implement parallel cordic. So, switched to sequential cordic:
> 1.5 k LUTs. Might be supportable.
>
Possible. Not sure how easy it would be to glue onto my shift-add unit.
As noted, my FPU doesn't do trigonometric functions itself though, but
leaves all this up to software.
From what I can gather, it doesn't really look like Cordic is
(particularly) likely to beat out a Taylor expansion in terms of speed.
But, if it were up to me, the fundamental FPU operations would mostly be:
FADD, FSUB, FMUL
Maybe: FMAC (Z=Z+X*Y)
I can do FDIV with the shift-add unit, but:
It is slower than approximate versions;
Say, a crude approximation followed by 1 or 2 Newton-Raphson stages.
It is only slightly faster than the "full Newton-Raphson version".
But, does give correct results in the low 4 bits of the result.
Software N-R seemingly unable to fully converge the last 4 bits (1).
Though, it being slightly faster in the generic "double x,y,z; z=x/y;"
case, and slightly more accurate, is at least "sort of useful".
So, ironically, the "fast" version is "do it in software".
1: Once it gets sufficiently close to the target value, "Brownian
Motion" seems to take over instead (and switches back to heading towards
the exact value whenever it gets outside of ~ +/- 7 ULP or so).
Software Shift-Add could give an exact FDIV, but would be slower than
either of the above.
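The "crude approximation followed by Newton-Raphson stages" scheme could
be sketched as below (a generic software model using frexp/ldexp for the
seed; an FPU or soft-float version would seed from the exponent and
mantissa bits directly, and as noted above the last few bits may still
wander rather than fully converge):

```c
#include <math.h>

// Division via reciprocal Newton-Raphson: seed a rough guess for 1/y,
// refine with r = r*(2 - y*r), then multiply by x.
// Assumes y > 0 and finite; each refinement step roughly doubles the
// number of correct bits.
double fdiv_nr(double x, double y) {
    int e;
    double m = frexp(y, &e);                        // y = m * 2^e, m in [0.5, 1)
    double r = ldexp(48.0/17 - (32.0/17) * m, -e);  // linear seed for 1/y (~1/17 rel. error)
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - y * r);                      // Newton-Raphson refinement
    return x * r;
}
```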