
RISC-V adding 48-bit instructions


Quadibloc

Jul 9, 2022, 12:25:25 AM
...to make the use of 32-bit immediate values in instructions more convenient.

John Savard

Brett

Jul 9, 2022, 4:18:35 PM
Quadibloc <jsa...@ecn.ab.ca> wrote:
> ...to make the use of 32-bit immediate values in instructions more convenient.
>
> John Savard
>

Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
Note that you have a branch about every 4 instructions, so with 128 bits
you can pack in different combinations of loads/stores/ops and optionally a
branch.

Costs an extra cycle for decode, but the rest of the pipeline is easier, as
you are effectively 4 wide without most of the hassles and die size of 4
wide OoO.
And you can scale up to dual issue which is 8 wide for OoOe. Note this
tends to be in-order.

With 48 bits you are going to pack 1-3 actual instructions in that packet
to prevent waste and poor performance.

For 32 bit immediates MIPS used load immediate low and load immediate high,
which can be combined in a long pipeline to execute in one cycle.
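
Roughly, in C terms (a sketch of the two-instruction split; on MIPS the
pair is spelled lui and ori):

  #include <assert.h>
  #include <stdint.h>

  /* build a 32-bit constant from two 16-bit immediates, as the
     lui + ori pair does */
  static uint32_t build_const32(uint16_t hi, uint16_t lo) {
      uint32_t r = (uint32_t)hi << 16;  /* lui rt, hi     */
      return r | lo;                    /* ori rt, rt, lo */
  }

  int main(void) {
      assert(build_const32(0x1234, 0x5678) == 0x12345678u);
      return 0;
  }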

I think My 66000 is using 64 bit immediates for long calls, and those calls
would fit in 48 bit instructions.

MitchAlsup

Jul 9, 2022, 5:29:14 PM
My 66000::
Short conditional control transfers have a range of 18-bits in a 32-bit instruction.
Long unconditional control transfers have a 32-bit form with 28-bit address range.
Long constant control transfers have a 34-bit range with 64-bit instructions,
and a 64-bit range with 96-bit instructions.

BGB

Jul 10, 2022, 12:43:03 AM
On 7/9/2022 3:18 PM, Brett wrote:
> Quadibloc <jsa...@ecn.ab.ca> wrote:
>> ...to make the use of 32-bit immediate values in instructions more convenient.
>>
>> John Savard
>>
>
> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
> Note that you have a branch about every 4 instructions, so with 128 bits
> you can pack in different combinations of loads/stores/ops and optionally a
> branch.
>

Not every VLIW uses fixed-sized bundles.
ESP32 (Xtensa), Hexagon, TMS320C6xxx, etc, tended to use a
variable-length approach.

I suspect the IA-64 approach (large fixed size bundles) may actually be
the minority, but it is more a well-known example in any case.


> Costs an extra cycle for decode, but the rest of the pipeline is easier, as
> you are effectively 4 wide without most of the hassles and die size of 4
> wide OoO.
> And you can scale up to dual issue which is 8 wide for OoOe. Note this
> tends to be in-order.
>

In my case, BJX2 uses a similar encoding for both VLIW and for 32+bit
variable-length instructions (currently up to a limit of 96 bits).

But, yeah, I went the way I did because it allowed a 3-wide core with
only a modest resource cost increase over a 1-wide core (with a similar
feature set). This approach also allows for a bunch of stuff that would
be difficult or expensive to provide with a RISC style design.

Though, small incremental costs add up. There is a pretty big difference
between a core with FPU, SIMD, MMU, ...

Vs a more minimal (integer-only) RISC with pipeline and 2R1W or 3R1W
register file. But, then one might still need some way to "effectively"
load constants (and PC-relative memory loads are "kinda crap").


> With 48 bits you are going to pack 1-3 actual instructions in that packet
> to prevent waste and poor performance.
>

I think the main idea was mostly for them to have 48-bit 2RI
encodings, such as:
LI Rd, Imm32
And similar...



But, Imm32 would burn through a 48-bit space pretty fast (a 32-bit
immediate leaves only 16 bits for opcode and register fields); it can
last a little longer with mostly 17 or 24 bit imm/disp fields (this is
what my original 48-bit encodings had).

Though, this original encoding space was lost due to a partial redesign
of the ISA encoding (the same one that effectively added PrWEX and Jumbo
prefixes).

At present, I have 64-bit encodings and a "possible" 48-bit space, but
not much of anything has been put into the 48-bit space yet, as I
haven't decided if/how exactly it will be used.


Some previous ideas would have put a whole new encoding space there.


A more sane option might be to find a way to treat it as a "more modest"
extension to the existing 32-bit space (likely behaving as-if the 48 bit
form were unpacked into a corresponding Op64 encoding).


From ISA spec:
* 78nm_ZeoZ_Ywvi
Will be unpacked as if it were:
* FFw0_0vxi_FYnm_ZeoZ
Where: Y=0..3

In this case, this Op48 would exist more as a shorthand form for a
subset of Op64 encodings.
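
As a rough C sketch of that nibble shuffle (note: the 'x' nibble is not
defined in the fragment above, so it is assumed zero here):

  #include <stdint.h>

  /* unpack 78nm_ZeoZ_Ywvi (48 bits) as-if FFw0_0vxi_FYnm_ZeoZ (64 bits);
     nibble labels follow the spec fragment, 'x' assumed zero */
  static uint64_t unpack_op48(uint64_t op48) {
      uint64_t n  = (op48 >> 36) & 15, m  = (op48 >> 32) & 15;
      uint64_t Z1 = (op48 >> 28) & 15, e  = (op48 >> 24) & 15;
      uint64_t o  = (op48 >> 20) & 15, Z2 = (op48 >> 16) & 15;
      uint64_t Y  = (op48 >> 12) & 15;  /* Y = 0..3 */
      uint64_t w  = (op48 >>  8) & 15;
      uint64_t v  = (op48 >>  4) & 15, i = op48 & 15;
      return (0xFFull << 56) | (w << 52) |                        /* FFw0 */
             (v << 40) | (i << 32) |                              /* 0vxi */
             (0xFull << 28) | (Y << 24) | (n << 20) | (m << 16) | /* FYnm */
             (Z1 << 12) | (e << 8) | (o << 4) | Z2;               /* ZeoZ */
  }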


> For 32 bit immediates MIPS used load immediate low and load immediate high,
> which can be combined in a long pipeline to execute in one cycle.
>

RISC-V also uses this approach.
Sadly, not really any good way to extend it to 64 bits.

An early approach I had used in both BJX1 and BJX2 was an LDISH16
instruction:
LDISH16 Imm16u, Rn
Which did:
Rn=(Rn<<16)|Imm16u

This could load a 32-bit constant in 2 cycles, or a 64-bit constant in 4
cycles.
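
In C terms, the semantics would be roughly (a sketch; assumes whatever
gets shifted out the top doesn't matter to the consumer):

  #include <assert.h>
  #include <stdint.h>

  /* LDISH16 Imm16u, Rn  =>  Rn = (Rn << 16) | Imm16u */
  static uint64_t ldish16(uint64_t rn, uint16_t imm) {
      return (rn << 16) | imm;
  }

  int main(void) {
      uint64_t r = 0;
      r = ldish16(r, 0x1122);  /* op 1 */
      r = ldish16(r, 0x3344);  /* op 2: 32-bit constant done */
      r = ldish16(r, 0x5566);  /* op 3 */
      r = ldish16(r, 0x7788);  /* op 4: 64-bit constant done */
      assert(r == 0x1122334455667788ull);
      return 0;
  }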

This has been (mostly) replaced by the use of Jumbo prefixes, which can
load a 64-bit constant in 1 cycle (using a 96 bit instruction).



> I think My 66000 is using 64 bit immediates for long calls, and those calls
> would fit in 48 bit instructions.

Immediate can't be bigger than the instruction.


In BJX2, FWIW:
Disp8 in 16-bits, +/- 256B
Disp20 in 32-bits, +/- 1MB
Disp33 in 64-bits, +/- 8GB
Abs48 in 64-bits, Entire Quadrant (Low 48 bits of address)

Thomas Koenig

Jul 10, 2022, 7:06:24 AM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> ...to make the use of 32-bit immediate values in instructions more convenient.

I now have access to a computer based on a SiFive U74-MC (currently
bootstrapping gcc trunk on it), and I am just playing around with
the system compiler (gcc 11.2), plus browsing the CPU handbook,
and "what a mess" does not even start to describe it.

I wonder which version of the combinatorial explosion of extensions
libraries should be compiled for. Oh, and the handbook also does
not describe which options to pass to gcc for which feature
to be enabled.

Not an architecture for somebody who does not control the
complete tool chain, I'd say.

luke.l...@gmail.com

Jul 10, 2022, 9:09:19 AM
On Saturday, July 9, 2022 at 9:18:35 PM UTC+1, gg...@yahoo.com wrote:
> Quadibloc <jsa...@ecn.ab.ca> wrote:
> > ...to make the use of 32-bit immediate values in instructions more convenient.
> >
> > John Savard
> >
> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.

my understanding of VLIW is that it actually packs completely independent
instructions into one "word", often/usually where there is zero register hazard
interaction between them. more sort-of sequential parallel processor.
for example TMS320 DSPs have a 2-wide VLIW where odd and even FP
MACs can be issued in the same clock cycle.

does that sound about right?

l.

John Levine

Jul 10, 2022, 1:19:26 PM
>> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
>
>my understanding of VLIW is that it actually packs completely independent
>instructions into one "word", often/usually where there is zero register hazard
>interaction between them. more sort-of sequential parallel processor.
>for example TMS320 DSPs have a 2-wide VLIW where odd and even FP
>MACs can be issued in the same clock cycle.
>
>does that sound about right?

Yes. VLIW made more sense in the 1980s when we didn't have enough transistors
to do dynamic checks for register and memory bank conflicts. Now it's mostly
a historic curiosity.

I wouldn't call adding extra instruction bits for immediate operands
VLIW. That would make the VAX VLIW.
--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

BGB

Jul 10, 2022, 1:51:32 PM
On 7/10/2022 12:19 PM, John Levine wrote:
>>> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
>>
>> my understanding of VLIW is that it actually packs completely independent
>> instructions into one "word", often/usually where there is zero register hazard
>> interaction between them. more sort-of sequential parallel processor.
>> for example TMS320 DSPs have a 2-wide VLIW where odd and even FP
>> MACs can be issued in the same clock cycle.
>>
>> does that sound about right?
>
> Yes. VLIW made more sense in the 1980s when we didn't have enough transistors
> to do dynamic checks for register and memory bank conflicts. Now it's mostly
> a historic curiosity.
>

Still makes sense on FPGA though...
Can also still make sense as a cost-effective alternative to in-order
superscalar.


For fancier cores (such as those built around OoO), it would likely
make sense for the CPU to be able to discard the compile-time bundling
and figure it out for itself.

However, OoO is rare on FPGAs, and usually requires big/expensive FPGAs
(as opposed to a Spartan or Artix or similar).


> I wouldn't call adding extra instruction bits for immediate operands
> VLIW. That would make the VAX VLIW.

Probably depends on how the decoding works.


In my case, the bundling mechanism was also used for Jumbo prefixes, but
is not the only use of this mechanism.


It is also possible that the prefixes could be implemented in a way
which does not depend on this mechanism, say:
Using R0 as decoder scratchpad and using a 'J bit' or similar.

Though, in this case some level of architectural state would need to be
"visible" in order for such an implementation to be able to survive
interrupts (the only real alternative being to disallow interrupts to
have a return point immediately following a Jumbo prefix).

MitchAlsup

Jul 10, 2022, 5:32:53 PM
On Sunday, July 10, 2022 at 12:51:32 PM UTC-5, BGB wrote:
> On 7/10/2022 12:19 PM, John Levine wrote:
> >>> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
> >>
> >> my understanding of VLIW is that it actually packs completely independent
> >> instructions into one "word", often/usually where there is zero register hazard
> >> interaction between them. more sort-of sequential parallel processor.
> >> for example TMS320 DSPs have a 2-wide VLIW where odd and even FP
> >> MACs can be issued in the same clock cycle.
> >>
> >> does that sound about right?
> >
> > Yes. VLIW made more sense in the 1980s when we didn't have enough transistors
> > to do dynamic checks for register and memory bank conflicts. Now it's mostly
> > a historic curiosity.
> >
> Still makes sense on FPGA though...
> Can also still make sense as a cost-effective alternative to in-order
> superscalar.
>
>
> For fancier cores (such as those built around OoO), it would likely
> makes sense for the CPU to be able to discard the compile-time bundling
> and allow the CPU to figure it out for itself.
<
Is this not an indictment of VLSI !! When a later implementation requires
that many of the tenets of a prior generation be altered to make forward
progress, was not the original seriously faulty ??
>
> However, OoO is rare on FPGAs, and usually requires big/expensive FPGAs
> (as opposed to a Spartan or Artix or similar).
> > I wouldn't call adding extra instruction bits for immediate operands
> > VLIW. That would make the VAX VLIW.
<
VLIW does not have any notion that the length of an issue is only determined
by serially decoding byte after byte after byte.

John Levine

Jul 10, 2022, 6:00:29 PM
According to MitchAlsup <Mitch...@aol.com>:
>> For fancier cores (such as those built around OoO), it would likely
>> makes sense for the CPU to be able to discard the compile-time bundling
>> and allow the CPU to figure it out for itself.
><
>Is this not an indictment of VLSI !! When a later implementation requires
>that many of the tenets of a prior generation be altered to make forward
>progress, was not the original seriously faulty ??

Assuming you mean VLIW, I wouldn't say it's faulty, I'd say that's not one of its goals.

The Multiflow machine came in various widths with more or fewer
functional units, and you had to recompile your program to match the
width of the machine you were running it on.

If VLIW were still interesting, I'd think it'd be straightforward to
do something like what IBM AS/400 aka series i and Android phone apps
do. Compile to a low level intermediate code, and translate that to
match the hardware the first time you run a program on a particular
machine.

Ivan Godard

Jul 10, 2022, 6:26:10 PM
On 7/10/2022 3:00 PM, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
>>> For fancier cores (such as those built around OoO), it would likely
>>> makes sense for the CPU to be able to discard the compile-time bundling
>>> and allow the CPU to figure it out for itself.
>> <
>> Is this not an indictment of VLSI !! When a later implementation requires
>> that many of the tenets of a prior generation be altered to make forward
>> progress, was not the original seriously faulty ??
>
> Assuming you mean VLIW, I wouldn't say it's faulty, I'd say that's not one of its goals.
>
> The Multiflow machine came in various widths with more or fewer
> functional units, and you had to recompile your program to match the
> width of the machine you were running it on.
>
> If VLIW were still interesting, I'd think it'd be straightforward to
> so something like what IBM AS/400 aka series i and Android phone apps
> do. Compile to a low level intermediate code, and translate that to
> match the hardware the first time you run a program on a particular
> machine.
>

Amen!

BGB

Jul 10, 2022, 8:40:18 PM
On 7/10/2022 4:32 PM, MitchAlsup wrote:
> On Sunday, July 10, 2022 at 12:51:32 PM UTC-5, BGB wrote:
>> On 7/10/2022 12:19 PM, John Levine wrote:
>>>>> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
>>>>
>>>> my understanding of VLIW is that it actually packs completely independent
>>>> instructions into one "word", often/usually where there is zero register hazard
>>>> interaction between them. more sort-of sequential parallel processor.
>>>> for example TMS320 DSPs have a 2-wide VLIW where odd and even FP
>>>> MACs can be issued in the same clock cycle.
>>>>
>>>> does that sound about right?
>>>
>>> Yes. VLIW made more sense in the 1980s when we didn't have enough transistors
>>> to do dynamic checks for register and memory bank conflicts. Now it's mostly
>>> a historic curiosity.
>>>
>> Still makes sense on FPGA though...
>> Can also still make sense as a cost-effective alternative to in-order
>> superscalar.
>>
>>
>> For fancier cores (such as those built around OoO), it would likely
>> makes sense for the CPU to be able to discard the compile-time bundling
>> and allow the CPU to figure it out for itself.
> <
> Is this not an indictment of VLSI !! When a later implementation requires
> that many of the tenets of a prior generation be altered to make forward
> progress, was not the original seriously faulty ??

There are several possible routes:
RISC -> Superscalar -> OoO
VLIW -> OoO

Then, there is the debate as to how "inevitable" it is that one will end
up at OoO, or whether it will likely spend a lot of time in "not OoO"
territory.


Given the large numbers of microcontrollers and DSPs still in use, and
many embedded devices still using in-order processors, it is likely that
VLIW could still have a lot of potential applicability in these areas
(say, if they could give good performance relative to their cost).


Well, or at least assuming the ability to cost-effectively produce ASICs.


>>
>> However, OoO is rare on FPGAs, and usually requires big/expensive FPGAs
>> (as opposed to a Spartan or Artix or similar).
>>> I wouldn't call adding extra instruction bits for immediate operands
>>> VLIW. That would make the VAX VLIW.
> <
> VLIW does not have any notion that the length of an issue is only determined
> by serially decoding byte after byte after byte.

Yeah.

It is more about detecting that each instruction in a series of
instructions has a "wide" or "parallel" status flag set.


Checking whether or not some bits are set up a certain way is cheaper
than checking whether some register fields, instruction type, ..., would
allow a pair of instructions to execute in parallel.
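
Say, as a hypothetical sketch in C (field positions are made up for
illustration, not the actual BJX2 layout):

  #include <stdbool.h>
  #include <stdint.h>

  #define WEX_BIT (1u << 27)  /* assumed position of the wide/parallel flag */

  /* VLIW-style check: one bit test per instruction word */
  static bool wex_parallel(uint32_t insn) {
      return (insn & WEX_BIT) != 0;
  }

  /* superscalar-style check: compare register fields (real hardware
     also has to classify opcodes, ports, memory ops, ...) */
  static bool ss_parallel(uint32_t a, uint32_t b) {
      uint32_t a_rd  = (a >> 20) & 31;  /* assumed field positions */
      uint32_t b_rs1 = (b >> 15) & 31;
      uint32_t b_rs2 = (b >> 10) & 31;
      return (a_rd != b_rs1) && (a_rd != b_rs2);  /* no RAW hazard */
  }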


Likewise, the compiler can be free to do checks and reordering that
would be difficult to do cost-effectively in hardware.



As I have also noted, adding "alias checking" heuristics, and doing some
more debugging work related to instruction reordering, has increased the
number of bundled instructions.

Say, increasing the percentage of instructions being WEXified from ~ 9%
(Avg Bundle = ~ 1.18) up to around 13% (Avg Bundle = ~ 1.26).

Though, part of this was also due to tweaking the heuristics which try
to optimize for minimizing interlocks to also add in a bias based on
whether or not adjacent instructions depend on each other (it will try
to push dependent and interlocking instructions away from each other).

Adding alias-check heuristics also allowed for more mobility here.


Theoretically, something like a WEXifier could be turned into a
load-time or install-time specializer, but would require an ability to
easily determine the location of every label in the program binary (it
can't be allowed to migrate instructions across label boundaries).

Marcus

Jul 12, 2022, 2:29:48 AM
On 2022-07-10, Thomas Koenig wrote:
> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> ...to make the use of 32-bit immediate values in instructions more convenient.
>
> I now have access to a computer based on a SiFive U74-MC (currently
> boostraping gcc trunk on it), and I am just playing around with
> the system compiler (gcc 11.2), plus browsing the CPU handbook,
> and "what a mess" does not even start to describe it.
>
> I wonder which version of the combinatiorial explosion of extensions
> libraries should be compiled for. Oh, and the handbook also does
> not describe which options to pass to gcc for which feature
> to be enabled.

They came up with this naming scheme, e.g. RV32IMC, RV64G, etc, that you
pass to the compiler to tell which extensions to use, and you need to
know the conventions for how to put together valid names. I never get
them right.

And of course your toolchain (including newlib etc) typically only
supports a handful of extension combinations. The explosion of extension
combinations is definitely one of the main drawbacks of the RISC-V
architecture, and it's only made worse by custom extensions that are
likely to limit developers to use one-off patched toolchains rather than
the latest-and-greatest official toolchains.

What "CPU handbook" are you referring to? I usually read the "RISC-V
Instruction Set Manual", but it's written in a very annoying way, more
like prose than a reference manual.

/Marcus

BGB

Jul 12, 2022, 5:01:32 AM
On 7/12/2022 1:29 AM, Marcus wrote:
> On 2022-07-10, Thomas Koenig wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> ...to make the use of 32-bit immediate values in instructions more
>>> convenient.
>>
>> I now have access to a computer based on a SiFive U74-MC (currently
>> boostraping gcc trunk on it), and I am just playing around with
>> the system compiler (gcc 11.2), plus browsing the CPU handbook,
>> and "what a mess" does not even start to describe it.
>>
>> I wonder which version of the combinatiorial explosion of extensions
>> libraries should be compiled for.  Oh, and the handbook also does
>> not describe which options to pass to gcc for which feature
>> to be enabled.
>
> They came up with this naming scheme, e.g. RV32IMC, RV64G, etc, that you
> pass to the compiler to tell which extensions to use, and you need to
> know the conventions for how to put together valid names. I never get
> them right.
>

Then one may realize that the names are essentially a fixed list rather
than dynamically parsed, so one is limited to the particular
combinations of features that GCC has graced us with...

Then again, I guess it does bring at least a little order to things to
assume that there is a set of tiers in the feature sets.


Some of this differs from BGBCC where, rather than controlling target
features at build time, these sorts of things are generally controlled
at run-time via feature-flags or similar (eg: one doesn't necessarily
need to build a bunch of different versions of BGBCC).


> And of course your toolchain (including newlib etc) typically only
> supports a handful of extension combinations. The explosion of extension
> combinations is definitely one of the main drawbacks of the RISC-V
> architecture, and it's only made worse by custom extensions that are
> likely to limit developers to use one-off patched toolchains rather than
> the latest-and-greatest official toolchains.
>

One either has this issue, or the other alternative is to assume that
everything needs to be built with the same baseline feature set.


> What "CPU handbook" are you referring to? I usually read the "RISC-V
> Instruction Set Manual", but it's written in a very annoying way, more
> like prose than a reference manual.
>
> /Marcus
>

I am not sure either.

I am aware of the RISC-V Spec, the Privileged Spec, and typically other
specs for specific extensions...

Agree on the writing style, it is sometimes annoying trying to figure
stuff out from the manual.



At least in my case, I am trying to have a coherent set of specs.

But, I have several different sub-spec's:
A main spec which describes architectural features and contains an ISA
listing;
An "ISA Descriptions" spec, which is more about trying to describe the
behavior of the various CPU instructions, and to a lesser extent the
interrelationships between the various optional features;
A few other minor specs, such as one that describes how RISC-V is mapped
on top of BJX2, ...


Always a risk though of things being:
Out of date;
Out of sync;
Ambiguous;
Important information being absent;
Stuff that was described but not actually implemented as such;
...

It is kind of made harder when one only has their own perspective, and
has little idea of how potential readers might interpret things.

But, this is true of writing in general; always a risk of potentially
being overly ambiguous.

Anton Ertl

Jul 12, 2022, 7:22:08 AM
Marcus <m.de...@this.bitsnbites.eu> writes:
>On 2022-07-10, Thomas Koenig wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> ...to make the use of 32-bit immediate values in instructions more convenient.
>>
>> I now have access to a computer based on a SiFive U74-MC (currently
>> boostraping gcc trunk on it), and I am just playing around with
>> the system compiler (gcc 11.2), plus browsing the CPU handbook,
>> and "what a mess" does not even start to describe it.
>>
>> I wonder which version of the combinatiorial explosion of extensions
>> libraries should be compiled for. Oh, and the handbook also does
>> not describe which options to pass to gcc for which feature
>> to be enabled.

cat /proc/cpuinfo

on such a CPU says:

processor : 0
hart : 0
isa : rv64imafdc
mmu : sv39
uarch : sifive,u74-mc
...

so here you have it: rv64imafdc

You find a list of the extensions here:
<https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions>

>The explosion of extension
>combinations is definitely one of the main drawbacks of the RISC-V
>architecture

From the POV of someone trying to develop software for it, certainly.
They apparently believe that the advantage of allowing a wide variety
of implementations is worth that drawback. And the architects of
other successful architectures apparently think so, too:

For an Intel Rocket Lake /proc/cpuinfo tells me:

flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm
pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single
ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept
vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx
avx512f avx512dq rdseed adx smap avx512ifma clflushopt intel_pt
avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves
dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
md_clear flush_l1d arch_capabilities

vmx flags: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad
ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid
unrestricted_guest vapic_reg vid ple shadow_vmcs pml
ept_mode_based_exec tsc_scaling

In my recent experiment, I tried to tell gcc that I want AVX512 with
-mavx512, but that option does not exist. gcc suggested -mavx512f,
and that worked, but there are several other AVX512 options for gcc.

For ARM there are two main instruction sets, one of them with two
still-supported encodings, with lots of extensions that are combined
into three profiles (A, M, and R) and various architecture levels. If
you stick with the A64 instruction set, the variations are not so bad
yet, but for the 32-bit instructions, it's pretty extreme.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

Jul 12, 2022, 1:42:25 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2022-07-10, Thomas Koenig wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> ...to make the use of 32-bit immediate values in instructions more convenient.
>>
>> I now have access to a computer based on a SiFive U74-MC (currently
>> boostraping gcc trunk on it), and I am just playing around with
>> the system compiler (gcc 11.2), plus browsing the CPU handbook,
>> and "what a mess" does not even start to describe it.
>>
>> I wonder which version of the combinatiorial explosion of extensions
>> libraries should be compiled for. Oh, and the handbook also does
>> not describe which options to pass to gcc for which feature
>> to be enabled.

[...]

> What "CPU handbook" are you referring to? I usually read the "RISC-V
> Instruction Set Manual", but it's written in a very annoying way, more
> like prose than a reference manual.

The one to be found under https://www.sifive.com/cores/u74-mc, the
direct link is
https://sifive.cdn.prismic.io/sifive/2dd11994-693c-4360-8aea-5453d8642c42_u74mc_core_complex_manual_21G3.pdf

Stephen Fuld

Jul 12, 2022, 1:56:00 PM
snipped long lists of optional features in X86 and ARM CPUs

Yes, but there is a big difference. Many, perhaps even most of the
options for these two architectures arose over a long time, as
requirements and technology changed. It didn't start out that way.

Some of the options are subsumed in newer options, and are maintained in
the compiler only for compatibility. So for new program developers, the
"effective list" (the ones they need to care about) are shorter,
especially if they don't care about running on older, now obsolete systems.

So, the architects of these systems did not set out from the start to
have such a long list. In contrast, it appears that RISC-V is starting
out with a pretty long list of options/variations.

If history is a guide, the lists for all of these architectures will
get longer. But if you start with a long list, it will only get worse. :-(

I do note there are a few exceptions to the inevitable growth of
options. S/360 had floating point and decimal arithmetic as options,
and, in a sense, the presence of an X87 chip was an option until the
486. There are probably others, but they are fairly rare.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)


Thomas Koenig

Jul 12, 2022, 1:58:40 PM
BGB <cr8...@gmail.com> schrieb:
> On 7/12/2022 1:29 AM, Marcus wrote:
>> On 2022-07-10, Thomas Koenig wrote:
>>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>>> ...to make the use of 32-bit immediate values in instructions more
>>>> convenient.
>>>
>>> I now have access to a computer based on a SiFive U74-MC (currently
>>> boostraping gcc trunk on it), and I am just playing around with
>>> the system compiler (gcc 11.2), plus browsing the CPU handbook,
>>> and "what a mess" does not even start to describe it.
>>>
>>> I wonder which version of the combinatiorial explosion of extensions
>>> libraries should be compiled for.  Oh, and the handbook also does
>>> not describe which options to pass to gcc for which feature
>>> to be enabled.
>>
>> They came up with this naming scheme, e.g. RV32IMC, RV64G, etc, that you
>> pass to the compiler to tell which extensions to use, and you need to
>> know the conventions for how to put together valid names. I never get
>> them right.
>>
>
> Then one may realize that the names are essentially a fixed list rather
> than dynamically parsed, so one is limited to the particular
> combinations of features that GCC has graced us with...

The way gcc is set up, it is actually fairly easy to switch
features on or off in the RTL - if a pattern is not enabled,
it will not be matched.

MitchAlsup

Jul 12, 2022, 3:49:21 PM
Yes, it did not start out that way, and a bunch of less than optimal choices
along the way did not help either.
>
> Some of the options are subsumed in newer options, and are maintained in
> the compiler only for compatibility. So for new program developers, the
> "effective list" (the ones they need to care about) are shorter,
> especially if they don't care about running on older, now obsolete systems.
>
> So, the architects of these systems did not set out from the start to
> have such a long list. In contrast, it appears that RISC-V is starting
> out with a pretty long list of options/variations.
<
Which I consider a bad omen--but more importantly I am trying to
counsel Quadibloc not to go this route.
>
> If history is a guide, the lists for all of these architectures will
> get longer. But if you start with a long list, it will only get worse. :-(
>
> I do note there are a few exceptions to the inevitable growth of
> options. S/360 had floating point and decimal arithmetic as options,
<
And Quad FP option (on the 360/65 and 360/67.)
<
Apparently if you bought the FP package Quad was delivered, but 2 wires
on the backplane needed to be connected to put it "in" ISA. I discovered
this about 2:30 one morning in the machine room at CMU.

BGB

Jul 12, 2022, 3:51:30 PM
When I was trying to build it before, I had tried some customized
configuration strings, but these didn't build.

When I looked around (partly via grep), I had found somewhere that there
were lists of basically every possible target, with RISC-V basically
dealing with the configurations via a large number of targets with every
combination of features as a different target (well, except the ones I
wanted in this case).

Gave in at the time, and configured it for RV64I, but now could probably
go for RV64IM.


Kind of annoying that they build copies of GCC one-off for each target,
vs say as a collection of optional modules for different types of
targets (say, if GCC were configured more like how one configures the
Linux kernel).

Then, say, one edits a runtime config file or uses command-line options
to configure the backend for the specific target (with premade config
files for more well-known target machines).



An alternate approach would be, say, the compiler is configured with
FOURCC codes for major frontends and backends (source languages and
target machines; with secondary FOURCCs for language variants, target
baseline profiles, and an array of "f options", ...), with the compiler
being able to load in DLL's or SO's for major targets and components
(say, using a system similar to how A/V codecs work on Windows).

One then building the combination of front-end and backend modules which
are useful to them, or with 3rd party modules usable as plug-ins, ...


Some other shared components could potentially also be folded off into
sub-libraries (say, ELF and PE/COFF, which likely only needs minor
customization on a per target basis).

In this case, the core part of the compiler would mostly deal with file
management, typesystem management, and IR stages.

Well, also better would be if one were not launching and terminating the
compiler for every translation unit, ...


Though, I guess one could argue that responsibilities are not as cleanly
divided with a compiler as they are with A/V codecs, say:
Backend choice may effect PP defines in the frontend;
Backend type will effect the behavior of the typesystem;
...


Some of this can be handled some, say:
Backend has an interface to register defines with the frontend;
Various context parameters for configuring the typesystem behavior
(sizeof(long), sizeof(void*), ...).

Backend interface basically consists of a way for querying for support
for a given target, and then creating an instance of that target
(configuring the primary and secondary contexts).

Frontend is similarly about creating a translation instance that can
parse a given source-language and turn it to the AST format, and another
semi-shared stage which can translate ASTs into the IR format.

In BGBCC, the language frontend turns the source code into an XML
parse-tree, which is then converted into the stack-machine IR. The
middle section mostly communicates with the backend in terms of a 3AC
IR, so the stack bytecode is mostly limited to the middle stage.

It is possible one could make a case for splitting the backend into
separate code-generation and assembler phases, but the current backends
for BGBCC don't do this (my early x86 backend did have a separate
assembler, but this backend was dropped long ago).


Say, stages, if things were more cleanly modularized:
  Language Frontend (Source -> AST)
    Preprocessor (Semi Shared Module)
    Parser
  Upper-Middle (Shared, AST->IL)
  Lower-Middle (Shared, IL->3AC)
  Backend (Codegen)
    3AC -> ASM
  Assembler (Configured by Backend)
    ASM -> ObjectModule
  Linker (Configured by Backend)
    ObjectModule -> Binary

In the current BGBCC, there is only a single middle stage, and the
backend also handles the Assembler and Linker parts.

Though, it could make sense if these could potentially be separate
modules configured by the backend (with an API interface), which may or
may not be part of the same DLL as the backend.

Idea here is that ObjectModule is not a serialized object (COFF or ELF
format), but instead an in-memory structure representing the contents of
such an object (section buffers, symbol and reloc tables, etc), with the
option of being flattened to or parsed from an object file (Eg, COFF).



Note, for example, passing ASM to the compiler might go something like:
Frontend invokes preprocessor;
Gets turned into something like:
  <module ...>
    <body>
      <asm>
        <![CDATA[ big blob of ASM code ]]>
      </asm>
    </body>
  </module>

Whether or not using XML for ASTs makes sense, it is basically "what I
have". Some of my other stuff had used CONS-lists and JSON-style
dictionary-objects instead, but BGBCC was forked off of an early VM of
mine which had used XML here (even if arguably a JSON style approach
better fits the AST use-case than using XML here; CONS lists have a
different problem in that by the time they become "usefully flexible",
one is basically using them to awkwardly mimic the behavior of
dictionary objects).


Then, in the stack IL, gets essentially passed along as a big text blob.

In the 3AC IR, turns back to an 'ASM' object, with the ASM still
represented as a big string literal.

(Then again, can also note that in some stages of my compiler, large
numeric types are also passed as string literals).

At present, the backend sees the ASM object and then passes the string
literal to an ASM parser. A backend with a separate ASM stage might
instead emit the ASM blob to the output (then leave it for the assembler
module to deal with).


Configuring modules would generally use an interface similar to the
"DriverProc" style interface used in codec drivers, but with configured
modules typically accessed via VTable pointers.

Eg:
(*CGenCtx)->DoSomething(CGenCtx, ...);
...
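
A hypothetical shape for such an interface (names illustrative, not
BGBCC's actual API):

  #include <stdint.h>

  typedef struct CGenVtbl_s CGenVtbl;
  typedef CGenVtbl **CGenCtx;  /* object = pointer to a vtable pointer */

  struct CGenVtbl_s {
      int (*QueryTarget)(CGenCtx self, uint32_t target_fourcc);
      int (*DoSomething)(CGenCtx self, void *args);
      int (*Shutdown)(CGenCtx self);
  };

  /* call form, matching the example above */
  static int do_something(CGenCtx cgen, void *args) {
      return (*cgen)->DoSomething(cgen, args);
  }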


It might make sense to keep the assembler and linker stages managed by
the backend rather than managed by the main compiler stage.

As-is, the main stage just sort of expects the backend to produce output
in the requested format (itself also given as FOURCC values).

However, in these "modern times", might make sense to consider expanding
the use of FOURCC's to EIGHTCC's (allowing for a FOURCC to be cast to an
EIGHTCC and still interpreted in a sensible way).

...
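
Say (illustrative macros; assumes a FOURCC zero-extends into the low
bytes of an EIGHTCC, which keeps old codes distinct and comparable):

  #include <stdint.h>

  /* pack four characters little-endian style into a 32-bit code */
  #define FOURCC(a,b,c,d) ((uint32_t)(uint8_t)(a) | \
                           ((uint32_t)(uint8_t)(b) << 8) | \
                           ((uint32_t)(uint8_t)(c) << 16) | \
                           ((uint32_t)(uint8_t)(d) << 24))

  /* casting a FOURCC to an EIGHTCC: the upper four characters read as
     NUL bytes (assumed scheme) */
  static uint64_t fourcc_to_eightcc(uint32_t fcc) {
      return (uint64_t)fcc;
  }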


Though, one other property I think that is relevant to a compiler:
Not needing absurd amounts of RAM or hour+ build times;
This is basically a big "fail" for LLVM/Clang IMO.

This, and also my disfavor of "Java style" C++, is part of why I went
with reusing my old/crufty C compiler (BGBCC) as a base, rather than
jump over to Clang.

Better IMO if the whole compiler is written in "Ye Olde C" (nevermind
occasional cruft possibly following COM/OLE style conventions or
similar; but this stuff is at least "sort of usable").

...

John Dallman

Jul 12, 2022, 5:51:30 PM
In article <2022Jul1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> so here you have it: rv64imafdc

That's a little poor: it does not apparently implement all of RV64G,
which is intended as a general-purpose 64-bit RISC-V.

> <https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions>

That has RV64G as being "Shorthand for the IMAFDZicsr Zifencei base and
extensions, intended to represent a standard general-purpose ISA"

That's my intended target if I get customers wanting RISC-V support on
Linux. I don't have much need for vector extensions, but memory fences
are well worthwhile. The other plausible RISC-V target for me is Android,
where a standard set of extensions will presumably be defined as part of
the platform.

> They apparently believe that the advantage of allowing a wide
> variety of implementations is worth that drawback. And the
> architects of other successful architectures apparently think so,
> too:

As best I can deduce, ARM decided that it was worth addressing the
embedded market and the general-purpose computing market separately, with
their Cortex-M and Cortex-A lines. Cortex-A is heading towards being A64
only, while Cortex-M remains T32.

> For an Intel Rocket Lake /proc/cpuinfo tells me:
>
> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> . . .

In the early stage of Intel's extension-creation, when my employers had a
close relationship with them, they regularly tried to get me to support
the extension-of-the-month. They seemed to believe that application
software was supplied like motherboard firmware: supporting the latest
architectural features would be seen as necessary, and a given version of
software would only be used on a narrow range of Intel processor models.

This simply isn't true: the range of hardware on which commercial
software gets run is very wide, and would not be much more assorted if
there was an international conspiracy to make it so. We felt able to
demand SSE2 on 32-bit Windows /five years/ after Intel and AMD had
stopped selling processors that lacked it to the desktop market, and
still got some complaints. For those five years, we supplied separate x87
and SSE2 versions, doubling the costs of supporting 32-bit Windows.

John

John Dallman

Jul 12, 2022, 5:51:30 PM
In article <d5ed7207-41c8-415b...@googlegroups.com>,
Mitch...@aol.com (MitchAlsup) wrote:

> My 66000::

Good design:

> Short conditional control transfers have a range of 18-bits in a
> 32-bit instruction.

Will do for internal branches within (almost) any function.

> Long unconditional control transfers have a 32-bit form with 28-bit
> address range.

Will do for internal calls within (almost) any library.

> Long constant control transfers have a 34-bit range with 64-bit
> instructions,

Will do for calls from executable to libraries and between libraries in
(almost) any process.

> .......................................................and a 64-bit
> range with 96-bit instructions.

For the unreasonable cases.

Neat work.

John

BGB

Jul 12, 2022, 7:24:44 PM
On 7/12/2022 4:50 PM, John Dallman wrote:
> In article <2022Jul1...@mips.complang.tuwien.ac.at>,
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> so here you have it: rv64imafdc
>
> That's a little poor: it does not apparently implement all of RV64G,
> which is intended as a general-purpose 64-bit RISC-V.
>
>> <https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions>
>
> That has RV64G as being "Shorthand for the IMAFDZicsr Zifencei base and
> extensions, intended to represent a standard general-purpose ISA"
>

Presumably Zicsr and Zifencei are only really needed for OS level usage
though. Normal application code might not even have reason to
notice/care if they don't exist.


In my case, A/F/D are unlikely to be supported for now,
mostly as they can't be cheaply mapped onto the BJX2 core (nor align
well with "my general design direction").

But, granted, this does mean that stuff built for RV64G or similar will
not run on it.


> That's my intended target if I get customers wanting RISC-V support on
> Linux. I don't have much need for vector extensions, but memory fences
> are well worthwhile. The other plausible RISC-V target for me is Android,
> where a standard set of extensions will presumably be defined as part of
> the platform.
>

I guess my case differs as I don't expect anyone would actually want to
buy my stuff...

It is more an "if anyone might find it useful" kind of thing.
But, like, if people found it useful enough that they wanted to pay my
cost of living or similar, this would also be good...


>> They apparently believe that the advantage of allowing a wide
>> variety of implementations is worth that drawback. And the
>> architects of other successful architectures apparently think so,
>> too:
>
> As best I can deduce, ARM decided that it was worth addressing the
> embedded market and the general-purpose computing market separately, with
> their Cortex-M and Cortex-A lines. Cortex-A is heading towards being A64
> only, while Cortex-M remains T32.
>

Seems true enough.


Cortex-M works OK if one doesn't need much performance.
They seem to claim pretty good Dhrystone numbers, but "actual code
performance" on Cortex-M devices (relative to clock speed) "actually
kinda sucks".

For real-time image/video processing tasks, a lot of the Cortex-M
devices are basically unusable.



>> For an Intel Rocket Lake /proc/cpuinfo tells me:
>>
>> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> . . .
>
> In the early stage of Intel's extension-creation, when my employers had a
> close relationship with them, they regularly tried to get me to support
> the extension-of-the-month. They seemed to believe that application
> software was supplied like motherboard firmware: supporting the latest
> architectural features would be seen as necessary, and a given version of
> software would only be used on a narrow range of Intel processor models.
>
> This simply isn't true: the range of hardware on which commercial
> software gets run is very wide, and would not be much more assorted if
> there was an international conspiracy to make it so. We felt able to
> demand SSE2 on 32-bit Windows /five years/ after Intel and AMD had
> stopped selling processors that lacked it to the desktop market, and
> still got some complaints. For those five years, we supplied separate x87
> and SSE2 versions, doubling the costs of supporting 32-bit Windows.
>


In my case, I never really worked in the software industry (so I have
basically spent my whole life being a hobbyist).

In some sense, commercial software seemed like it was steadily eroding
in the face of open-source, like the whole thing would possibly implode
in a few years or something, ...

Then again, people have been claiming "Year of the Linux Desktop" since
when I was back in high-school (also x86-64 came out around this time as
well; as the world was slowly moving from Win98 to WinXP, ...).

Well, and at the time, I was dual booting Linux and Win2K.


But, much of a lifetime has passed since then (aging millennial, late 30s).


And, somehow, people seem to continue buying software, despite it
seeming like there should be little reason for people to do so, when
pretty much anything one could want (software-wise) is available for free.


Well, except that Linux still doesn't really match Windows on the
"general user experience" front; and MS has been "just short" of
annoying me enough to cause me to jump-ship back over to Linux (where I
had originally migrated back from Linux to Windows, due at the time, to
persistent frustrations with 3D graphics drivers and audio playback).

Last I have heard, 3D acceleration and audio playback issues are still
common problem areas, despite all the time that has passed since then.

But, I still (mostly) run FOSS stuff for things beyond the basic OS and
similar.


...

Stefan Monnier

Jul 12, 2022, 11:18:04 PM
> Short conditional control transfers have a range of 18-bits in
> a 32-bit instruction.
> Long unconditional control transfers have a 32-bit form with 28-bit
> address range.
> Long constant control transfers have a 34-bit range with 64-bit
> instructions,
> .......................................................and a 64-bit
> range with 96-bit instructions.

I suspect that you get this 34-bit case almost "for free" from other parts
of your instruction encoding, but otherwise it looks to be of dubious
value: I'd expect the number of cases that need more than 28-bit but
less than 35-bit to be negligible.


Stefan

BGB

Jul 12, 2022, 11:50:01 PM
I am inclined to agree.


Say, 20 to 24 bits should be sufficient for the vast majority of
binaries (except "very large" ones).

Much bigger than this should be able to encode pretty much any size of
binary (at least excluding wonk like MIPS style addressing modes).

Anything beyond this is probably a non-local branch and is better served
by some sort of "absolute branch" encoding, whether this be a 64-bit
Abs48 encoding (as in BJX2) or a 96-bit Abs64 encoding (*).


*: Partly N/A for BJX2 as the ISA uses a 48-bit logical address space
for branches. Could be relevant for Inter-ISA branches.

Within the larger 96-bit space, there is not currently any way to branch
outside the current 48-bit quadrant (apart from getting an interrupt or
similar to load something else into the PCH register). Likewise,
expanding the core to support 96-bit branching was "too expensive", so I
ended up not doing this.


As for dealing with branches beyond the 20-bit limit, there are a few
possibilities:
Switching to 64-bit Disp33 encodings;
Inserting branches every 1MB or so which can be used for longer
"trampoline branches".

Say, a call from one end of a large text-section to another goes through
a series of branches to get to the target. This is slower and uglier
than using longer branch encodings, but could potentially lead to a
smaller binary by consolidating the cost of multiple long-range branches.

...

Thomas Koenig

Jul 13, 2022, 2:01:04 AM
John Dallman <j...@cix.co.uk> schrieb:
> In article <2022Jul1...@mips.complang.tuwien.ac.at>,
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> so here you have it: rv64imafdc
>
> That's a little poor: it does not apparently implement all of RV64G,
> which is intended as a general-purpose 64-bit RISC-V.

It seems like Linux (Ubuntu, in this case) is not advertising
everything the processor has...

>
>> <https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions>
>
> That has RV64G as being "Shorthand for the IMAFDZicsr Zifencei base and
> extensions, intended to represent a standard general-purpose ISA"

... because the handbook says

# Each U7 core is configured to support the RV64I base ISA, as
# well as the Multiply (M), Atomic (A), Single-Precision Floating
# Point (F), Double-Precision Floating Point (D), Compressed (C),
# CSR Instructions (Zicsr), Instruction-Fetch Fence (Zifencei),
# Address Calculation (Zba), Basic Bit Manipulation (Zbb), and
# Count Overflow and Mode-Based Filtering (Sscofpmf) RISC‐V
# extensions. This is captured by the RISC‐V extension string:
# RV64GC_Zba_Zbb_Sscofpmf. The base ISA and instruction extensions
# are described in Chapter 6.


> That's my intended target if I get customers wanting RISC-V support on
> Linux. I don't have much need for vector extensions, but memory fences
> are well worthwhile. The other plausible RISC-V target for me is Android,
> where a standard set of extensions will presumably be defined as part of
> the platform.

Standard set of extensions sounds a bit like an oxymoron, doesn't it?

Regarding compiler support: In an ideal world, it would be possible
to pass that option string to a compiler, which would then parse
it and enable / disable the individual features. With gcc, it
would be straightforward to do; I have no idea why the RISC-V
maintainers chose to do otherwise.
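
A toy sketch of what such parsing could look like (single-letter
extensions only, with well-formed input assumed; the multi-letter Z*/S*
groups would need more care):

  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *isa = "rv64imafdc";   /* e.g. from /proc/cpuinfo */
      unsigned ext = 0;
      if (strncmp(isa, "rv", 2) != 0) return 1;
      const char *p = isa + 2;
      while (isdigit((unsigned char)*p))
          p++;                          /* skip XLEN: 32/64/128 */
      for (; *p && *p != '_' && *p != 'z' && *p != 's'; p++)
          ext |= 1u << (*p - 'a');      /* one flag per single-letter ext */
      for (int c = 'a'; c <= 'y'; c++)
          if (ext & (1u << (c - 'a')))
              printf("enable extension '%c'\n", c);
      return 0;
  }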

>> They apparently believe that the advantage of allowing a wide
>> variety of implementations is worth that drawback. And the
>> architects of other successful architectures apparently think so,
>> too:
>
> As best I can deduce, ARM decided that it was worth addressing the
> embedded market and the general-purpose computing market separately, with
> their Cortex-M and Cortex-A lines. Cortex-A is heading towards being A64
> only, while Cortex-M remains T32.
>
>> For an Intel Rocket Lake /proc/cpuinfo tells me:
>>
>> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> . . .
>
> In the early stage of Intel's extension-creation, when my employers had a
> close relationship with them, they regularly tried to get me to support
> the extension-of-the-month. They seemed to believe that application
> software was supplied like motherboard firmware: supporting the latest
> architectural features would be seen as necessary, and a given version of
> software would only be used on a narrow range of Intel processor models.

Only viable for self-compiled software, really. Fine if you have
the source and regularly compile yourself (which is mostly true
for some scientific software, I suppose), not at all fine if you
use binary software.

> This simply isn't true: the range of hardware on which commercial
> software gets run is very wide, and would not be much more assorted if
> there was an international conspiracy to make it so. We felt able to
> demand SSE2 on 32-bit Windows /five years/ after Intel and AMD had
> stopped selling processors that lacked it to the desktop market, and
> still got some complaints. For those five years, we supplied separate x87
> and SSE2 versions, doubling the costs of supporting 32-bit Windows.

Doubling space, surely. Doubling costs: For such a major difference,
it would be feasible to supply two versions and switch on
CPU feature for performance-critical parts. We did that for
libgfortran for matmul - compile the same source with different
options and switch at run-time. It's annoying to write, sure,
but not expensive to maintain.
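
For x86 and gcc, a minimal sketch of that run-time switch
(matmul_generic and matmul_avx512f standing in for the two builds of
the same source):

  /* the two builds of the same source, selected at run time */
  void matmul_generic(double *c, const double *a, const double *b, int n);
  void matmul_avx512f(double *c, const double *a, const double *b, int n);

  typedef void (*matmul_fn)(double *, const double *, const double *, int);

  matmul_fn select_matmul(void) {
      __builtin_cpu_init();                   /* must run before the test */
      if (__builtin_cpu_supports("avx512f"))  /* the -mavx512f feature */
          return matmul_avx512f;
      return matmul_generic;
  }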

With x86, there is at least some sort of progression of the features
that the processors have. You can often (but not always, because
features get dropped) assume a certain level for features, the ones
that come later only tend to be available if the ones that came
earlier are present.

This is cancer, but RISC-V is far worse, and worse by design.

Maybe obfuscated source could make a come-back again - only one step
removed from the Mill's specializers :-)

MitchAlsup

Jul 13, 2022, 12:18:00 PM
It depends on whether you put .bss gigabytes away from .text, gigabytes
away from .data, ...
<
The thing is, that you can put these areas that far away from each other
and still map everything from a single page of MMU tables.....
>
>
> Stefan

MitchAlsup

Jul 13, 2022, 12:23:59 PM
On Wednesday, July 13, 2022 at 1:01:04 AM UTC-5, Thomas Koenig wrote:
> John Dallman <j...@cix.co.uk> schrieb:
> > In article <2022Jul1...@mips.complang.tuwien.ac.at>,
> > an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >
> >> so here you have it: rv64imafdc
> >
> > That's a little poor: it does not apparently implement all of RV64G,
> > which is intended as a general-purpose 64-bit RISC-V.
> It seems like Linux (Ubuntu, in this case) is not advocating
> everything the processor has...
> >
> >> <https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions>
> >
> > That has RV64G as being "Shorthand for the IMAFDZicsr Zifencei base and
> > extensions, intended to represent a standard general-purpose ISA"
> ... because the handbook says
>
> # Each U7 core is configured to support the RV64I base ISA, as
> # well as the Multiply (M), Atomic (A), Single-Precision Floating
> # Point (F), Double-Precision Floating Point (D), Compressed (C),
> # CSR Instructions (Zicsr), Instruction-Fetch Fence (Zifencei),
> # Address Calculation (Zba), Basic Bit Manipulation (Zbb), and
> # Count Overflow and Mode-Based Filtering (Sscofpmf) RISC‐V
> # extensions. This is captured by the RISC‐V extension string:
> # RV64GC_Zba_Zbb_Sscofpmf. The base ISA and instruction extensions
> # are described in Chapter 6.
<
In comparison, the only optional instruction in My 66000 is BMM
{Bit Matrix Multiply}.
My previous machine was bought in 2012 and died in 2021, it was running
SW from 1996 (CorelDraw) and from 2003 (Office). My current machine
is running the same SW.

Stefan Monnier

Jul 13, 2022, 12:48:12 PM
MitchAlsup [2022-07-13 09:17:58] wrote:
> On Tuesday, July 12, 2022 at 10:18:04 PM UTC-5, Stefan Monnier wrote:
>> > Short conditional control transfers have a range of 18-bits in
>> > a 32-bit instruction.
>> > Long unconditional control transfers have a 32-bit form with 28-bit
>> > address range.
>> > Long constant control transfers have a 34-bit range with 64-bit
>> > instructions,
>> > .......................................................and a 64-bit
>> > range with 96-bit instructions.
>> I suspect that you get this 34-bit case almost "for free" from other parts
>> of your instruction encoding, but otherwise it looks to be of dubious
>> value: I'd expect the number of cases that need more than 28-bit but
>> less than 35-bit to be negligible.
> It depends if you put .bss gigabytes away from .txt gigabytes away from
> .data,...

But if you do put these "far apart" (e.g. for ASLR) then it seems likely
that it will be more than "a handful of GB apart" (i.e. past the 34bit
limit).

My intuition says that the only reason why offsets between 28bit and
35bit would be common is if the compiler&linker is tuned specifically to
target that range, but I can't think of a good reason why that range
would be particularly desirable (it seems targeting 26-28bit would give
the same benefits in most cases and would result in tighter code, no?).


Stefan

BGB

Jul 13, 2022, 1:49:25 PM
Neither ELF nor PE/COFF will tend to do this though in baseline forms.
Almost invariably, the sections will be right next to each other.

Only way one is likely to get multi-GB distances is with multi-GB
sections, which is probably unlikely, unless the program is doing
something particularly stupid (eg, gigantic arrays).

Eg, programmer:
int somearr[1024][1024][1024]={{{1}}};
Compiler:
"Oh Shi...".



It can happen with ELF-FDPIC, but then one is accessing all
'.data'/'.bss' via the GOT, so PC distance does not matter.


The PBO ABI in my case may also have this, but then '.data'/'.bss' will
be accessed relative to GBR, so the distance between '.text' and '.data'
does not matter in this case.



In other news, I have noted that I can get a pretty big speedup in Doom
and similar mostly by switching the L2 cache back to direct-mapped mode
(from 2-way).

I had been looking once again into a victim buffer as an alternative to
a 2-way L2, but once again it seems to be being "kinda useless" in this
case (very low hit rate).

It also seems that while my memcpy benchmark is faster with the L2 in
2-way mode, both the memset and memload tests are faster in
direct-mapped mode (with these tests basically reaching the theoretical
DRAM speeds).

Some of this is likely due to some "compromises" made to the design of
the 2-way cache to limit costs.


The 256K direct-mapped L2 case does particularly well in Doom.
Quake doesn't really seem to care much either way.


2-way associativity helps L1 I$, but not by much, and seems to be "less"
with bigger I$ (eg: 32K L1 I$ gets ~ 99.9% either way; so seems to be
"more relevant" to 8K or 16K I$).

Some of this seems to come down more to the relative costs between LUTs
and Block-RAM.

...



>>
>>
>> Stefan

MitchAlsup

unread,
Jul 13, 2022, 2:08:07 PM7/13/22
to
On Wednesday, July 13, 2022 at 11:48:12 AM UTC-5, Stefan Monnier wrote:
> MitchAlsup [2022-07-13 09:17:58] wrote:
> > On Tuesday, July 12, 2022 at 10:18:04 PM UTC-5, Stefan Monnier wrote:
> >> > Short conditional control transfers have a range of 18-bits in
> >> > a 32-bit instruction.
> >> > Long unconditional control transfers have a 32-bit form with 28-bit
> >> > address range.
> >> > Long constant control transfers have a 34-bit range with 64-bit
> >> > instructions,
> >> > .......................................................and a 64-bit
> >> > range with 96-bit instructions.
> >> I suspect that you get this 34-bit case almost "for free" from other parts
> >> of your instruction encoding, but otherwise it looks to be of dubious
> >> value: I'd expect the number of cases that need more than 28-bit but
> >> less than 35-bit to be negligible.
<
Note:: 34-bits not 35-bits.
<
> > It depends if you put .bss gigabytes away from .txt gigabytes away from
> > .data,...
<
> But if you do put these "far apart" (e.g. for ASLR) then it seems likely
> that it will be more than "a handful of GB apart" (i.e. past the 34bit
> limit).
<
I don't want, in particular, my ISA to limit {compiler or linker or GuestOS}
choice of how to organize SW, whether it be ASLR per invocation
of the application/GuestOS, sharing re-entrant libraries across GuestOSs
or Hypervisors, or applications.
<
Also note: Even when you place these sections that far apart, you can still
map entire applications in one page of MMU tables; so there is little
reason not to spew things widely apart.
>
> My intuition says that the only reason why offsets between 28bit and
> 35bit would be common is if the compiler&linker is tuned specifically to
> target that range, but I can't think of a good reason why that range
> would be particularly desirable (it seems targeting 26-28bit would give
> the same benefits in most cases and would result in tighter code, no?).
<
I have never seen an application that would benefit from 34-bit addressing
over 28-bit addressing in terms of flow control. However, I have been told
that database applications can do strange stuff like this; on the other hand,
databases probably use a layer of indirection when doing stuff like this.
<
And yes, this addressing falls out for "free" in my ISA encoding, so there is
no particular reason to avoid using it if it provides that service.
>
>
> Stefan

EricP

unread,
Jul 13, 2022, 2:47:08 PM7/13/22
to
The compiler doesn't know how far away callees will be;
the linker determines that.

I assume all compilers emit 64-bit global offsets and a
compacting linker eliminates the excess when it determines
which calls and mem refs are intra-dll and which are inter.
The ISA supported offset sizes just determines the frequency
of each compaction.

The ability for the linker to compact also depends on exactly
how OS memory sections are allocated and relocated.
If data sections can be arbitrarily relocated relative to code
then linker has to leave offsets at 64 bits, just in case.

If an OS and EXE format allows a set of multiple memory sections
to be specified at fixed relative offset to each other,
then linker can use that knowledge to compact PC-relative offsets
to data objects, since linker knows that if load-time relocation
is required that the set of memory sections relocate together.




BGB

unread,
Jul 13, 2022, 3:18:54 PM7/13/22
to
On 7/13/2022 11:47 AM, Stefan Monnier wrote:
> MitchAlsup [2022-07-13 09:17:58] wrote:
>> On Tuesday, July 12, 2022 at 10:18:04 PM UTC-5, Stefan Monnier wrote:
>>>> Short conditional control transfers have a range of 18-bits in
>>>> a 32-bit instruction.
>>>> Long unconditional control transfers have a 32-bit form with 28-bit
>>>> address range.
>>>> Long constant control transfers have a 34-bit range with 64-bit
>>>> instructions,
>>>> .......................................................and a 64-bit
>>>> range with 96-bit instructions.
>>> I suspect that you get this 34-bit case almost "for free" from other parts
>>> of your instruction encoding, but otherwise it looks to be of dubious
>>> value: I'd expect the number of cases that need more than 28-bit but
>>> less than 35-bit to be negligible.
>> It depends if you put .bss gigabytes away from .txt gigabytes away from
>> .data,...
>
> But if you do put these "far apart" (e.g. for ASLR) then it seems likely
> that it will be more than "a handful of GB apart" (i.e. past the 34bit
> limit).
>

Yes, and ASLR is typically "per loaded image" and not "per section".

Linker isn't usually going to be like:
01100000 .text
7C000000 .data
7FE00000 .bss

Like, just, no...

Except unless maybe if setting up a ROM image, but this is its own thing.



My PBO ABI does something a little funky:
.text (Code)
.strtab (Strings)
.rodata (Read-Only Data, Eg, const arrays)
.idata (Imports)
.edata (Exports)
.pdata (Exception Unwind Tables)
.rsrc (Resource Data; RWAD, *1)
.reloc (Relocation Table)
...
(GBR points here)
.data (Initialized data)
.bss (Uninitialized data)

The image omits .bss from the total image size, whereas "Global Pointer"
in the Data-Directory defines the combined size of '.data' and '.bss'
(with anything past the end of the image being implicitly initialized to
zero).

Everything outside of the "Global Pointer" area is implicitly read-only
as far as the ABI is concerned (typically Read+Execute); Whereas stuff
inside Global-Pointer is Read+Write to the running program.


*1: The original format for the PE/COFF '.rsrc' section was pretty much
useless garbage; I replaced it effectively with a WAD image. Locations
are specified essentially as RVA's, but the header uses a "self-pointer"
to allow figuring the relative base address if given a pointer to the
RWAD header.


The .idata and .edata sections would contain imports and exports.

The "Import Address Table" in this case is basically a table of
long-distance branch instructions, keyed to a table of import names
(which point into the string table). These will fix-up the imports to
the corresponding exports.

The exports table basically keys "ordinal numbers" to exported
addresses, and another table to map between string-based names and
ordinal numbers (in TestKern, only the export names are used).



Generally, the format does not allow importing or exporting variables
(usually, these are local to a given image).

This sort of thing tends to work implicitly on Linux with ELF, but not
generally with PE/COFF.


But, say, one way of dealing with these is a hidden translation step, eg
(from memory, not sure of exact notation, but something like this):
(Lib 1)
__declspec(dllexport) int somevar;
(Lib 2)
__declspec(dllimport) int somevar;

Turns into something like:
(Lib 1)
int somevar;
__declspec(dllexport) int *__get$somevar()
{ return &somevar; }
(Lib 2)
__declspec(dllimport) int *__get$somevar();
int *__ref$somevar = __get$somevar(); //*

*: Called implicitly when DLL or similar is instantiated in a program.


With any use of:
somevar
Getting quietly translated to, say:
(*__ref$somevar)
Or, alternatively:
(*(__get$somevar()))


Though, generally, this is one of those "don't do this" sort of things.


Though, can note that in the C library I am using:
stdin, stdout, stderr, errno, ...
Are actually macros which perform function calls, with a similar
strategy to the above.

Partly this was to allow the C library to be "dynamic linking friendly".

For DLLs, also, one doesn't have a full C library, but rather only some
of the "string.h" and "math.h" functions and similar are local, with
things like "stdio.h" functions actually being wrappers built on top of
VTables.

Say:
int fread(void *buf, size_t n1, size_t n2, FILE *fd)
{ return((*VtStdio)->fread(VtStdio, buf, n1, n2, fd)); }

In this case, the main EXE would supply the "actual" C library instance.


Ironically, this can avoid the annoyance (in Windows) where doing a
malloc() in one DLL and the free() in another, or an fopen() in one DLL
and the accesses in another, is prone to cause MSVCRT to violently explode...


If I got around to rewriting the "stdio()" part of the C library, would
likely switch over to a design where 'FILE' is itself basically a VTable.

typedef VT_FILE_t *FILE;
...
int fread(void *buf, size_t n1, size_t n2, FILE *fd)
{ return((*fd)->fread(buf, n1, n2, fd)); }

Where the 'FILE' typically contains a hidden structure consisting of the
VTable pointer followed by any other data relevant to managing the file
in question (not relevant to, and hidden from, the calling program).
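
A minimal sketch of the hidden object that implies (member names past
the vtable pointer are hypothetical; the only hard rule is that the
vtable pointer comes first, so that (*fd) recovers it):

#include <stddef.h>

typedef struct VT_FILE_s VT_FILE_t;

struct VT_FILE_s {
    int (*fread)(void *buf, size_t n1, size_t n2, void *fd);
    int (*fclose)(void *fd);
    /* ... the remaining stdio-like operations ... */
};

struct hidden_file {
    VT_FILE_t *vt;     /* must be the first member                 */
    int        sysfd;  /* implementation-private state: OS handle, */
    char      *buf;    /*   buffering, position, flags, ...        */
};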


This would also dissolve the difference between the main-program C
library and DLL C library.

But, still haven't gotten around to this.


As noted, the current C library is very much "not structured this way".


> My intuition says that the only reason why offsets between 28bit and
> 35bit would be common is if the compiler&linker is tuned specifically to
> target that range, but I can't think of a good reason why that range
> would be particularly desirable (it seems targeting 26-28bit would give
> the same benefits in most cases and would result in tighter code, no?).
>

Agreed.


I can note that pretty much everything I am compiling right now is small
enough to fit within a 20-bit PC displacement, though a few programs
(such as ROTT) are pretty close to the limits of this displacement.


Next size up (due to instruction encoding) is 33 bits.
While arguably overkill, there isn't any intermediate option at present.

Where:
* FFw0_0ddd_F0dd_Cddd BRA (PC, Disp33s)
* FFw0_0ddd_F0dd_Dddd BSR (PC, Disp33s)
* FFw0_0ddd_E0dd_Cddd BT (PC, Disp33s)
* FFw0_0ddd_E4dd_Cddd BF (PC, Disp33s)

Where w.i gives the sign bit.


Could have theoretically had:
* FEdd_dddd_F0dd_Cddd BRA (PC, Disp44s)
But, no way for pipeline to handle this easily (this is not currently
valid).

And:
FFdd_dddd_FAdd_dddd BRA (Abs48)

Basically achieves the same purpose, and I actually had a way to deal
with this case in the pipeline.

BGB

unread,
Jul 13, 2022, 3:40:50 PM7/13/22
to
The simpler option is that the compiler assumes a "memory model" (could
be manually specified on the command line if needed; with the compiler
assuming something "sane" by default, say: .text is less than 1MB and
.text+.data+.bss are less than 16MB, ...).

Linker either links successfully, or if it can't perform a relocation,
dies, and the programmer needs to specify a larger model.


Though, because of the way BGBCC works, it can use a multi-pass strategy
here, with the first pass determining an upper limit for the binary (and
thus which memory-model parameters it needs to use).

The code for performing relocs basically terminates the compiler if it
encounters a reloc that goes out of range, so the code-generator needs
to be conservative here (but, may flag small functions to use smaller
branch encodings for local branches).


> The ability for the linker to compact also depends on exactly
> how OS memory sections are allocated and relocated.
> If data sections can be arbitrarily relocated relative to code
> then linker has to leave offsets at 64 bits, just in case.
>

This approach seems like it would get very complicated vs the "memory
model" approach.


> If an OS and EXE format allows a set of multiple memory sections
> to be specified at fixed relative offset to each other,
> then linker can use that knowledge to compact PC-relative offsets
> to data objects, since linker knows that if load-time relocation
> is required that the set of memory sections relocate together.
>

FWIW:
I only use PC-relative addressing for read-only sections...


John Levine

unread,
Jul 13, 2022, 3:42:26 PM7/13/22
to
According to EricP <ThatWould...@thevillage.com>:
>I assume all compilers emit 64-bit global offsets and a
>compacting linker eliminates the excess when it determines
>which calls and mem refs are intra-dll and which are inter.
>The ISA supported offset sizes just determines the frequency
>of each compaction.

Perhaps on some systems. On systems that use ELF libraries, including
linux and the BSDs, each executable or library has a table called the
Procedure Linkage Table (PLT), which contains an entry for each
external routine called from the library. Initially each PLT entry
contains a call to the dynamic linker which is patched on the first
call to be a jump to the actual routine. So all the calls in the code
are actually calls to the PLT which can use a short address.
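
As a hedged C-level sketch of that lazy-binding flow (a real PLT entry
is a couple of assembly instructions, and dynlink_resolve here is a
hypothetical stand-in for the dynamic linker, but the control flow is
the same):

typedef int foo_fn(int);

extern void *dynlink_resolve(const char *name);  /* assumed resolver */

static int foo_first_call(int x);                /* lazy stub, below */
static foo_fn *foo_plt = foo_first_call;         /* the "PLT slot"   */

static int foo_first_call(int x)
{
    /* First call only: look up the real routine, patch the slot
       (the usual dlsym-style cast), then forward this call. */
    foo_plt = (foo_fn *)dynlink_resolve("foo");
    return foo_plt(x);
}

/* Every call site in the image is just foo_plt(x): a short,
   fixed-distance call no matter where "foo" finally lands. */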

I have written assemblers that do address shortening, and the way it
would work in a linker would be similar. Doing it well is NP-complete
and can require a lot of code patching, since it can change the
offsets of data references, so it's not a feature I would want to use
a lot.

Ivan Godard

unread,
Jul 13, 2022, 4:17:16 PM7/13/22
to
On 7/13/2022 12:42 PM, John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> I assume all compilers emit 64-bit global offsets and a
>> compacting linker eliminates the excess when it determines
>> which calls and mem refs are intra-dll and which are inter.
>> The ISA supported offset sizes just determines the frequency
>> of each compaction.
>
> Perhaps on some systems. On systems that use ELF libraries, including
> linux and the BSDs, each executable or library has a table called the
> Procedure Linkage Table (PLT), which contains an entry for each
> external routine called from the library. Initially each PLT entry
> contains a call to the dynamic linker which is patched on the first
> call to be a jump to the actual routine. So all the calls in the code
> are actually calls to the PLT which can use a short address.
>
> I have written assemblers that do address shortening, and the way it
> would work in a linker would be similar. Doing it well is NP-complete
> and can require a lot of code patching, since it can change the
> offsets of data references, so it's not a feature I would want to use
> a lot.


Don't think it's NP.

Walk the code, keeping two distances for each unresolved branch. One is
how far the target would be if all unresolved were resolved as near; the
other is if all unresolved were resolved as far. For each unresolved, if
the as-far is less than the near range then resolve as near, and if the
as-near is greater than the near range then resolve as far; if neither
then leave it unresolved and go to the next unresolved. Repeat until a
pass resolves nothing additional. At that point, resolve all unresolved
as near. Done. The algorithm generalizes in the obvious way to ISAs with
more than two range categories.

It helps if the code is in internal form that doesn't represent the
branches as instructions, i.e. short sequences of fixed length
containing non-branch instructions, together with the max and min
distances of the terminating branch; that way you walk the sequence
descriptors and not the actual instructions. Resolution of a descriptor
generates the binary of the resolved branch and glues it to the sequence
on either side to make a longer sequence; gluing may be actual copy or
just an adjustment of links.

I have never seen actual code take more than three iterations to get to
the fix point, but in any case the number of iterations cannot be
greater than the number of branches, so not NP.
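
As a compact C sketch of that pass, under the simplifications stated
(two sizes only, targets are plain labels, no alignment constraints;
NEAR_RANGE and DELTA are x86-flavored placeholder numbers, as for a
2-byte short form versus a 5-byte long form):

enum state { UNRESOLVED, NEAR, FAR };  /* caller starts all UNRESOLVED */

struct branch {
    long pos, tgt;    /* byte offsets, measured with every branch */
    enum state st;    /*   counted at its NEAR size               */
};

#define NEAR_RANGE 127
#define DELTA      3  /* growth of one branch going NEAR -> FAR   */

/* Count branches in state s strictly between offsets lo and hi. */
static long between(struct branch *b, long n, long lo, long hi,
                    enum state s)
{
    long i, k = 0;
    for (i = 0; i < n; i++)
        if (b[i].st == s && b[i].pos > lo && b[i].pos < hi)
            k++;
    return k;
}

static void relax(struct branch *b, long n)
{
    long i, changed = 1;
    while (changed) {
        changed = 0;
        for (i = 0; i < n; i++) {
            long lo, hi, dmin, dmax;
            if (b[i].st != UNRESOLVED)
                continue;
            lo = b[i].pos < b[i].tgt ? b[i].pos : b[i].tgt;
            hi = b[i].pos < b[i].tgt ? b[i].tgt : b[i].pos;
            /* distance if every unresolved branch in between goes
               NEAR (dmin) versus FAR (dmax) */
            dmin = hi - lo + DELTA * between(b, n, lo, hi, FAR);
            dmax = dmin + DELTA * between(b, n, lo, hi, UNRESOLVED);
            if (dmax <= NEAR_RANGE)      { b[i].st = NEAR; changed = 1; }
            else if (dmin > NEAR_RANGE)  { b[i].st = FAR;  changed = 1; }
        }
    }
    for (i = 0; i < n; i++)    /* fixed point: the rest resolve NEAR */
        if (b[i].st == UNRESOLVED)
            b[i].st = NEAR;
}

Each resolving pass removes at least one branch from consideration, so
the pass count is bounded by the branch count, as argued above.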

BGB

unread,
Jul 13, 2022, 4:33:22 PM7/13/22
to
On 7/13/2022 2:42 PM, John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> I assume all compilers emit 64-bit global offsets and a
>> compacting linker eliminates the excess when it determines
>> which calls and mem refs are intra-dll and which are inter.
>> The ISA supported offset sizes just determines the frequency
>> of each compaction.
>
> Perhaps on some systems. On systems that use ELF libraries, including
> linux and the BSDs, each executable or library has a table called the
> Procedure Linkage Table (PLT), which contains an entry for each
> external routine called from the library. Initially each PLT entry
> contains a call to the dynamic linker which is patched on the first
> call to be a jump to the actual routine. So all the calls in the code
> are actually calls to the PLT which can use a short address.
>

Offhand, I am not aware of a compiler which does link-time shortening
in this way. Seems like it would probably add too many complexities to
be "worth it".


GOT/PLT is simpler, since these are usually simple-case memory loads
followed by calling via a register or similar, don't need to care where
or how far away the called function is.


Though, ELF FDPIC adds a bit of complexity and overhead here (this is
part of why I skipped FDPIC and instead went for my own ABI design).
Didn't want a bunch of extra memory loads and stores for every function
call and similar (in effect, calling a function requires saving the
current GOT, then loading the new GOT for the function being called,
then restoring the old GOT after this function returns).



> I have written assemblers that do address shortening, and the way it
> would work in a linker would be similar. Doing it well is NP-complete
> and can require a lot of code patching, since it can change the
> offsets of data references, so it's not a feature I would want to use
> a lot.


When I did it before in an x86 assembler:
Do an assembly pass, emit everything at first as largest branch size;
Do another pass, with each branch whose distance is in range of a
smaller encoding being replaced by that encoding;
Continue until the code stops shrinking any more.

Usually takes around 3 to 5 iterations or so.

But, only really applies "locally".


For other ISA designs, especially those which require PC-relative
constant loads (which may move around based on relative distance), this
turns into a mess, and isn't really "worth it". SuperH, Thumb, and RV64I
would likely fall in this category.


Also doesn't play as well with having the per-function code generation
emit machine-code directly (vs using a separate assembler stage).



A simpler strategy:
If pointing back to a previous symbol, see if distance is small enough
to use a small encoding;
Check for other "use a small branch" cases, using a small branch if true;
Else, assume maximum-length branch (as determined by the current memory
model).


Though, one drawback here is that, as soon as one exceeds, say, 1MB of
".text", then nearly every non-local or forward branch in the program
will pretty much immediately double in size.

Not really an ideal solution to this...


Opus

unread,
Jul 13, 2022, 5:16:00 PM7/13/22
to
On 09/07/2022 at 06:25, Quadibloc wrote:
> ...to make the use of 32-bit immediate values in instructions more convenient.
>
> John Savard

Can you point us to any recent document that describes this?

RISC-V has been made extensible right from the start to support
instruction lengths from 16-bit to 192-bit in 16-bit increments. So this
possibility is nothing new, it was there from the start.

But, only the 16-bit instructions (for the C extension) and the 32-bit
instructions were defined in official specs until now, at least AFAIK.
Is there any recent spec that defines a set of 48-bit instructions
explicitly? If so, which is it?


MitchAlsup

unread,
Jul 13, 2022, 5:23:51 PM7/13/22
to
On Wednesday, July 13, 2022 at 2:42:26 PM UTC-5, John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
> >I assume all compilers emit 64-bit global offsets and a
> >compacting linker eliminates the excess when it determines
> >which calls and mem refs are intra-dll and which are inter.
> >The ISA supported offset sizes just determines the frequency
> >of each compaction.
> Perhaps on some systems. On systems that use ELF libraries, including
> linux and the BSDs, each executable or library has a table called the
> Procedure Linkage Table (PLT), which contains an entry for each
> external routine called from the library. Initially each PLT entry
> contains a call to the dynamic linker which is patched on the first
> call to be a jump to the actual routine. So all the calls in the code
> are actually calls to the PLT which can use a short address.
>
<----------------------------------------------------------------------------------------------------------------
<
> I have written assemblers that do address shortening, and the way it
> would work in a linker would be similar. Doing it well is NP-complete
> and can require a lot of code patching, since it can change the
> offsets of data references, so it's not a feature I would want to use
> a lot.
<
Doing it perfectly is NP-complete.
Getting 95%-99% is linear.
<
Compiler makes everything in the largest available mode.
Linker looks at the sizes of the various segments, and makes a preliminary
address assignment.
Linker then makes a pass over the code, and if the address can be shortened
the linker shortens it.
<
The cases this algorithm does not cover is the very tiny proportion where
the address "looks" to still be long when it gets resolved, but after all resolutions
have been performed, it just barely fits in a smaller size.
<
Back in the Mc 88100 and 88110 days, we were seeing the vast majority
of all addresses resolve down into a small form.
<
If the address/access cannot be shortened, it still runs perfectly; only its
code image (and possibly run time) is suboptimal. If this represents 1%,
nobody will care; if it represents 5%, only a few will.
<-------------------------------------------------------------------------------------------------------------------------------

MitchAlsup

unread,
Jul 13, 2022, 5:28:58 PM7/13/22
to
Mc 88110 just did one fix up pass and left the rest using large addresses.
<
I think, technically, getting it perfect requires a pass for every resolved address*
and this smells a bit like NP-complete to me. In practice, Mc 88100 used one
pass and saw virtually no excess code path dynamically.
<
(*) Every time the linker shortens an access, it creates the potential that
some already-passed-over access that was left long might now fit in the
shorter form. Hitting this would require a very carefully laid out access
order and a corresponding inverted placement order.

Brett

unread,
Jul 13, 2022, 8:03:06 PM7/13/22
to
MitchAlsup <Mitch...@aol.com> wrote:
> On Saturday, July 9, 2022 at 3:18:35 PM UTC-5, gg...@yahoo.com wrote:
>> Quadibloc <jsa...@ecn.ab.ca> wrote:
>>> ...to make the use of 32-bit immediate values in instructions more convenient.
>>>
>>> John Savard
>>>
>> Also known as Very Long Instruction Words (VLIW), 128 bits is popular.
>> Note that you have a branch about every 4 instructions, so with 128 bits
>> you can pack in different combinations of loads/stores/ops and optionally a
>> branch.
>>
>> Costs an extra cycle for decode, but the rest of the pipeline is easier, as
>> you are effectively 4 wide without most of the hassles and die size of 4
>> wide OoO.
>> And you can scale up to dual issue which is 8 wide for OoOe. Note this
>> tends to be in-order.
>>
>> With 48 bits you are going to pack 1-3 actual instructions in that packet
>> to prevent waste and poor performance.
>>
>> For 32 bit immediates MIPS used load immediate low and load immediate high,
>> which can be combined in a long pipeline to execute in one cycle.
>>
>> I think My 66000 is using 64 bit immediates for long calls, and those calls
>> would fit in 48 bit instructions.
> <
> My 66000::
> Short conditional control transfers have a range of 18-bits in a 32-bit instruction.
> Long unconditional control transfers have a 32-bit form with 28-bit address range.

CPU manufacturers recommend that important call targets be aligned on cache
lines so that eight instructions are loaded in the first cycle.

This would give you 4 more bits of range for that call variant, effectively
eliminating far calls.

MitchAlsup

unread,
Jul 13, 2022, 8:50:30 PM7/13/22
to
Instruction-to-cache-line alignment is a SW problem; SW can specify anything
it wants. It is recommended that entries are on large boundaries, but not mandated.
FORTRAN allows for multiple entry points per function; making all of them cache-line
aligned might be a bit........
<
Secondarily, there is no such restriction on branches and both instructions are
decoded from the same subGroup and pass through to the branch adder with
a <logic free> 2-bit upshift.
>
> This would give you 4 more bits of range for that call variant, effectively
> eliminating far calls.
<
Not for 1TB sized applications.

John Levine

unread,
Jul 13, 2022, 8:53:33 PM7/13/22
to
It appears that Ivan Godard <iv...@millcomputing.com> said:
>Walk the code, keeping two distances for each unresolved branch. One is
>how far the target would be if all unresolved were resolved as near; the
>other is if all unresolved were resolved as far. For each unresolved, if
>the as-far is less than the near range then resolve as near, and if the
>as-near is greater than the near range then resolve as far; if neither
>then leave it unresolved and go to the next unresolved. ...

There's a bunch of variations on that. I think mine started with all
the branches long and then it iterated looking for ones it could
shorten until it couldn't find any more, which gets you a good result
but not necessarily optimal.

Here's the paper with the proof, from the CACM in 1978:

https://dl.acm.org/doi/10.1145/359460.359474

Ivan Godard

unread,
Jul 13, 2022, 9:43:56 PM7/13/22
to
Yes, as I described, a maximal contrived example will need one pass over
the remaining unresolved branches per branch. That is O(branches), which
is linear. If the code is represented as a sequence of descriptors, each
containing length and max/min distances, each pass over the remaining
descriptors will consolidate at least one pair, so the maximal number of
descriptors examined is (n^2)/2 where n is again the number of branches,
which is quadratic. Neither is anything like NP, which requires
exponentiality or worse.

In practice, the number of passes over real code is ~3, and the number
of descriptor examinations is ~1, both roughly constant and growing with
n at a very low rate, ~0.

MitchAlsup

unread,
Jul 14, 2022, 10:12:14 AM7/14/22
to
You also need to consider LDs and STs as these have several lengths depending...

MitchAlsup

unread,
Jul 14, 2022, 11:16:59 AM7/14/22
to
But, yes, far less than NP.

John Levine

unread,
Jul 14, 2022, 11:24:11 AM7/14/22
to
According to MitchAlsup <Mitch...@aol.com>:
>> In practice, the number of passes over real code is ~3, and the number
>> of descriptor examinations is ~1, both roughly constant and growing with
>> n at a very low rate, ~0.
><
>But, yes, far less than NP.

Like many NP complete problems, this one has approximate solutions
that are pretty close and are adequate except perhaps in the occasional
embedded application where you're trying to squeeze everything into a
small fixed amount of ROM.

EricP

unread,
Jul 14, 2022, 11:51:32 AM7/14/22
to
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> I assume all compilers emit 64-bit global offsets and a
>> compacting linker eliminates the excess when it determines
>> which calls and mem refs are intra-dll and which are inter.
>> The ISA supported offset sizes just determines the frequency
>> of each compaction.
>
> Perhaps on some systems. On systems that use ELF libraries, including
> linux and the BSDs, each executable or library has a table called the
> Procedure Linkage Table (PLT), which contains an entry for each
> external routine called from the library. Initially each PLT entry
> contains a call to the dynamic linker which is patched on the first
> call to be a jump to the actual routine. So all the calls in the code
> are actually calls to the PLT which can use a short address.

Yes, I looked at your Linkers & Loaders book to see how
*nix does some of this image load relocation.
Windows uses a similar jump table at the end of the code section
to bind to DLL export routines at image load.

I don't mean that every call in an exe would have a 64-bit offset that
would be patched at image load. It would still use a call-jump table
because that is most efficient.

However it is possible to be flexible and not locked into a single
memory model dictated by the ISA. For example, my ISA has 16, 32 and 64
bit offsets. A fixed size model would force the choice 32 bit offsets.
A flexible model could adapt as needed.

The compiler could emit calls with 64-bit offsets,
which allows the linker to locate that call-jump table anywhere,
and then the compactor optimizes the code for that specific ISA.

Similarly, the Global Offset Table (GOT) deals with references to
program-scope variables, but a fixed-size N-bit offset model
allows static objects only within +/- that fixed range.
Outside that you have to compile with a larger memory model.
A compacting linker should eliminate these memory models.

> I have written assemblers that do address shortening, and the way it
> would work in a linker would be similar. Doing it well is NP-complete
> and can require a lot of code patching, since it can change the
> offsets of data references, so it's not a feature I would want to use
> a lot.

Is it NP-complete? Because it doesn't look like it needs to try all
permutations of offset lengths. Anyway, even if it was it doesn't
need to be perfect *provided* the ISA behaves the same both ways.
If you start with all code and data offsets 64 bits,
3 sweeps of forward and backward should get almost all compaction,
and the rest is just noise.

There are complications. Code alignment optimizations by the compiler
mean that it may have already arranged loop entries on optimal boundaries.
This would restrict a linker's moves to only, say, 16-byte lumps.

Also the order that memory sections are laid out matters.
If code is at low address and static data at higher, as is usual now,
as code contracts, the distance from a memory reference to its static
data gets longer by the same amount, so data offsets grow.
Instead we could switch the order so data sections are lower address,
code higher. Then as code for call offset contracts, it gets closer
to its data too which may allow even more contractions.

My concern isn't the algorithm, it is for the size of the data
structures the linker would have to maintain to shuffle all this about.
But then again, my browser regularly uses 250+ MB virtual space so
any such linker working set concerns are likely misplaced.

Stefan Monnier

unread,
Jul 14, 2022, 12:01:39 PM7/14/22
to
> Here's the paper with the proof, from the CACM in 1978:
>
> https://dl.acm.org/doi/10.1145/359460.359474

Notice that this relies crucially on jumping to address which are the
result of "complex" computations (the canonical example in the paper
uses targets of the form "260 - (L1 - L2)"). As long as every branch
target is restricted to the form "label + offset" the problem is O(N),
as shown in that same paper :-)


Stefan

Ivan Godard

unread,
Jul 14, 2022, 12:03:27 PM7/14/22
to
On 7/14/2022 8:24 AM, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
>>> In practice, the number of passes over real code is ~3, and the number
>>> of descriptor examinations is ~1, both roughly constant and growing with
>>> n at a very low rate, ~0.
>> <
>> But, yes, far less than NP.
>
> Like many NP complete problems, this one has approximate solutions
> that are pretty close and are adequate except perhaps in the occasional
> embedded application where you're trying to squeeze everything into a
> small fixed amount of ROM.
>
>
>
>

Worst case is linear in number of passes, quadratic in number of
instructions read. That's not an approximation, that's Deterministic
Polynomial. https://en.wikipedia.org/wiki/NP_(complexity)

Stefan Monnier

unread,
Jul 14, 2022, 12:32:53 PM7/14/22
to
BTW, I guess you can still get into worse than O(N) (probably back to
NP-complete) with "simple" branches to "label + offset" if your labels
can be constrained by alignment, in which case you can have things like:

L1: ...
J L1
...
J L2
...
.align <blabla>
L2: ...
...

where using a short-form for `J L1` causes the distance between `J L2`
and L2 to increase because L2 stays put due to the alignment constraint.

Of course, it's still only of anecdotal importance: there's still an
O(N) algorithm which, in practice, gets you close enough to the
optimal result.


Stefan

EricP

unread,
Jul 14, 2022, 1:29:29 PM7/14/22
to
It might work better to run compaction ignoring all alignment restrictions,
then do one pass to put alignment back and adjust offsets accordingly.


Terje Mathisen

unread,
Jul 14, 2022, 1:36:18 PM7/14/22
to
John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
>>> In practice, the number of passes over real code is ~3, and the number
>>> of descriptor examinations is ~1, both roughly constant and growing with
>>> n at a very low rate, ~0.
>> <
>> But, yes, far less than NP.
>
> Like many NP complete problems, this one has approximate solutions
> that are pretty close and are adequate except perhaps in the occasional
> embedded application where you're trying to squeeze everything into a
> small fixed amount of ROM.

On those systems you were much more likely to manually optimize for
size, to the point where _all_ branches were short, all stack-relative
offsets were in the -128 to 127 range, and calls would use extremely
non-standard ABIs.

Take a look at the 256-byte asm challenges, I am very impressed by what
has been shown to be possible.

Personally I was _very_ happy with my PC-DOS TSR which replaced
IBM/Microsoft's kbd & screen (font) drivers: The originals needed 20-60
KB depending upon version and capabilities, my replacement which
provided a true superset (it included a replacement font for 43 and
50-line EGA/VGA text modes) made do with just under 700 bytes. Far
branches were never an issue. :-)

Terje
PS. HP/Compaq Norway liked it as well, to the point where they stole it
twice...

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

unread,
Jul 14, 2022, 2:55:38 PM7/14/22
to
EricP wrote:
> John Levine wrote:
>
>> I have written assemblers that do address shortening, and the way it
>> would work in a linker would be similar. Doing it well is NP-complete
>> and can require a lot of code patching, since it can change the
>> offsets of data references, so it's not a feature I would want to use
>> a lot.
>
> Is it NP-complete? Because it doesn't look like it needs to try all
> permutations of offset lengths.

Ah, yes: getting _maximum_ compaction is NP-complete, and
this alt.lang.asm 2006 msg explains why.
In a nutshell, it can get stuck in situations where two or more
compactions are dependent on each other. To guarantee that you have
found all possible sticky points you must try all 2^n permutations.

Branch displacement Optimization
https://groups.google.com/g/alt.lang.asm/c/_YoDDwXQopc/m/08qsQtSSEdMJ


EricP

unread,
Jul 14, 2022, 3:18:19 PM7/14/22
to
This interacts with ISA design because x86 has 8- or 32-bit offsets.
Small-offset branches are much more frequent than large-offset ones,
AND there is such a large difference between the 2-byte short
sequence and the 5-byte long sequence.
So there are many more sticky point opportunities and
it becomes much more important to catch those occurrences.

In other words, this is an interaction between the ISA branch
range of +- 128 bytes and the frequency of such small branches.

If this was a difference between 16 and 32 bit offsets then
the occurrences of these sticky points would be so few that
it would not be worth searching for them.

Terje Mathisen

unread,
Jul 14, 2022, 3:45:29 PM7/14/22
to
To fix this you re-shuffle all your functions/basic blocks, ordering
them so that you can either fall through or use a short branch to get
where you need to be.

A good heuristic is to start in the middle so that the full negative and
positive byte offset range is available.

Terje

MitchAlsup

unread,
Jul 14, 2022, 4:42:05 PM7/14/22
to
In effect, NP-complete becomes O(1) ? O(3) ?

MitchAlsup

unread,
Jul 14, 2022, 4:44:30 PM7/14/22
to
Both of these options are more a statement of how x86 damages how
people think of computer architecture than of what one would want
to build into a compiler.

John Levine

unread,
Jul 14, 2022, 5:52:44 PM7/14/22
to
According to MitchAlsup <Mitch...@aol.com>:
>> A good heuristic is to start in the middle so that the full negative and
>> positive byte offset range is available.
><
>Both of these options are more a statement of how x86 damages how
>people think of computer architecture than of what one would want
>to build into a compiler.

Not really. You run into the same issues on any machine with long and
short addresses. I wrote a branch address shortener for the assembler
for AIX on the ROMP back in the 1980s.

Ivan Godard

unread,
Jul 14, 2022, 7:31:21 PM7/14/22
to
No, not that either: if a pass makes no changes, just set all unresolved
to short.

This is not the bin packing problem, unless you use a binning algorithm.
But that says something about the algorithm, not the problem.

Ivan Godard

unread,
Jul 14, 2022, 7:33:28 PM7/14/22
to
No, linear or quadratic (depending on what you are measuring) becomes
O(1) or O(3)

Andy Valencia

unread,
Jul 14, 2022, 8:27:49 PM7/14/22
to
EricP <ThatWould...@thevillage.com> writes:
> Windows uses a similar jump table at the end of the code section
> to bind to DLL export routines at image load.

(I had to go back and look at my source.) With my VSTa OS,
I wanted to avoid the costs of PIC code generation on the x86
(which was being adopted pretty much everywhere). System shared
libraries were gathered at system build time, and got assigned
addresses with leading jump tables, along with stubs which knew how
to jump via the table to the actual routine for each entry.

The source says the shared libs were: libc, libm, libtermcap,
libregexp, libregex (plus some system stuff).

VSTa library code continued to be faster and more efficient than
mainstream OS code for the lifespan of the x86. But the "shared
library tax" seemed acceptable-at least by default--in the industry.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html

Quadibloc

unread,
Jul 15, 2022, 1:42:41 AM7/15/22
to
Isn't quadratic O(n^2), with cubic being O(n^3)? And NP-complete, of course,
is O(2^n)...

John Savard

Quadibloc

unread,
Jul 15, 2022, 1:44:19 AM7/15/22
to
...and Mitch wasn't actually _claiming_ that NP-complete was O(3), but that
the post to which he was replying seemed to be making that incredible claim.

John Savard

Ivan Godard

unread,
Jul 15, 2022, 2:08:17 AM7/15/22
to
Not just O(2^n); any exponential function. The math is well established,
if not well known.

MitchAlsup

unread,
Jul 15, 2022, 10:58:53 AM7/15/22
to
No, I was observing that a potentially NP-complete algorithm with certain
applied restrictions appeared to have become "not very hard".
<
Not that it WAS O(N^3)
>
> John Savard

EricP

unread,
Jul 15, 2022, 12:03:43 PM7/15/22
to
This 'stuck straddle' effect can occur if there are multiple branch sizes.
For a particular ISA, the question is whether it occurs frequently
enough to warrant checking for them.

I'm looking at the H&P graph branch distances,
frequency vs #bits of branch displacement
(unsigned - the graph plots absolute distance in bits)
which shows for integer code a peak of 18% at 4 bits,
for fp code a peak of 25% at 5 bits, and both quickly fall to
almost zero by 9 bits. Those are absolute distances.

The H&P graph is at the top of this search
https://www.google.ca/search?q=%22branch+distances%22+instructions&hl=en-CA&gbv=2&tbm=isch&ei=Q4_RYpiqHL2g5NoPoKeNmAc&start=0&sa=N

I note that the x86 short branch distance of 7 bits (unsigned)
falls just before where the knee of those graphs falls off.
That causes the frequency of occurrence to be high enough
that it is worthwhile for the assembler to check for it.

I'm suggesting that if an ISA has multiple branch sizes then
this stuck straddle effect can theoretically still occur.
However if the short branch has 9 bits (unsigned) or more then
compaction will work on 99.9999% of branches on the first pass
so smarter algorithms would be unnecessary, and is effectively O(1).

In my case a short conditional branch is 4 bytes (2 byte operation
specifier, 2 bytes immediate) and a long unconditional branch is
8 or 12 bytes, 4 byte specifier with 4 or 8 byte immediate.
The 16-bit short signed immediate means I don't need to worry
about compaction being hard to do.




MitchAlsup

unread,
Jul 15, 2022, 12:53:07 PM7/15/22
to
2 things are harming x86 here:
1) is that instructions can start on byte boundaries.
2) therefore the displacement is in bytes instead of instructions.
<
RISC machines, on the other hand, have a more uniform displacement
available (every branch has at least 16-bits) and that the displacement
is scaled to the minimum boundary (<<2) making that 16-bit displacement
useful in transferring control -2^17..+2^17-1. This seems so much more
than simply sufficient, as there are vanishingly few subroutines larger
than 2^18 bytes.
>
> I'm suggesting that if an ISA has multiple branch sizes then
> this stuck straddle effect can theoretically still occur.
> However if the short branch has 9 bits (unsigned) or more then
> compaction will work on 99.9999% of branches on the first pass
> so smarter algorithms would be unnecessary, and is effectively O(1).
<
This was my point many responses up above.
>
> In my case a short conditional branch is 4 bytes (2 byte operation
> specifier, 2 bytes immediate) and a long unconditional branch is
> 8 or 12 bytes, 4 byte specifier with 4 or 8 byte immediate.
<
Same here: but with the addition of unconditional branches having
26-bit displacements instead of 16-bit displacements while remaining
in one word (32-bits).

Ivan Godard

unread,
Jul 15, 2022, 1:08:14 PM7/15/22
to
The algorithm works for multiple sizes too.

For each pass, compute what the offset would be if every unresolved
branch used the smallest offset size possible for its (partly) known
distance, and if every branch used the largest. If for any branch the
computed possible offsets fall in the same offset-size bucket then
resolve that branch to that size. Repeat passes until a pass does not
resolve any additional branch. Then resolve every remaining branch to
the smaller offset-size.

Just as for the two-size condition, this is worst case O(n) in number of
passes, and O(n^2/2) in number of branches tested.

MitchAlsup

unread,
Jul 15, 2022, 2:16:20 PM7/15/22
to
On Friday, July 15, 2022 at 12:08:14 PM UTC-5, Ivan Godard wrote:
> On 7/15/2022 9:03 AM, EricP wrote:
> > MitchAlsup wrote:
<snip, people, snip>
> >> In effect, NP-complete becomes O(1) ? O(3) ?
> >
> > This 'stuck straddle' effect can occur if there are multiple branch sizes.
> > For a particular ISA, the question is whether it occurs frequently
> > enough to warrant checking for them.
> The algorithm works for multiple sizes too.
>
> For each pass, compute what the offset would be if every unresolved
> branch used the smallest offset size possible for its (partly) known
> distance, and if every branch used the largest. If for any branch the
> computed possible offsets fall in the same offset-size bucket then
> resolve that branch to that size.
<
And remove those branches from any further consideration.

Brett

unread,
Jul 15, 2022, 7:51:36 PM7/15/22
to
In a large function there are typically only three long destinations for
branches: the loop back, the loop early exit, and the function early exit.

With only three locations needing long alignment, all the short branches
are happy with 16 bit alignment to short instruction boundaries.

You can always unpack a short instruction back to a long instruction in
front of these long destinations to restore 4 byte alignment.

MitchAlsup

unread,
Jul 15, 2022, 8:57:38 PM7/15/22
to
Livermore loops is an application with 22 kernels inside a single function.
Just over ½ of these have more than one loop in the kernel. It is true that,
except for the kernel looping, it is a straight-line function.
>
> With only three locations needing long alignment, all the short branches
> are happy with 16 bit alignment to short instruction boundaries.
<
If you can support 3 longer branches, you can support 300,000 long branches.

Paul A. Clayton

unread,
Jul 16, 2022, 10:20:21 AM7/16/22
to
MitchAlsup wrote:
> On Wednesday, July 13, 2022 at 7:03:06 PM UTC-5, gg...@yahoo.com wrote:

[snip]

> >> CPU manufacturers recommend that important call targets be aligned on
> >> cache lines so that eight instructions are loaded in the first cycle.
> <
> Instruction-to-cache-line alignment is a SW problem; SW can specify anything
> it wants. It is recommended that entries are on large boundaries, but not mandated.
> FORTRAN allows for multiple entry points per function; making all of them cache-line
> aligned might be a bit........
> <
> Secondarily, there is no such restriction on branches and both instructions are
> decoded from the same subGroup and pass through to the branch adder with
> a <logic free> 2-bit upshift.

One could use the same shift if the less significant offset bits
for the (typically) inside-function control flow were opcode or
other non-offset bits in the call/long-range-jump instructions.
Such highly-aligned calls would need to have lower bit masking
anyway since call instructions would not have the alignment
requirement.

(By the way, Mitch, I wonder if the ENTER instruction could be
used in a special mode to guard against arbitrary call targets.
If a call had to target an ENTER instruction, there might be a
slight reduction in danger with only a modest extra cost. This
is similar to Itanium's Enter Privileged Code operation which
allowed an ordinary jump/call to be a system call — the EPC
operation had to be in a page with the right to boost privilege.)

Technically possible (or plausible) does not mean wise or even
sensible.

(I admit I dislike offset-based addressing because the base
address is typically fixed (at least at run time) and the
repeated work of addition of (run time) constants does not
provide much value. Even just comparing the most significant
two inset bits with the same bits of the instruction address
to determine carry/borrow would only slightly reduce range but
more significantly reduce work. Such presumably has other issues
keeping it from being used, though being a change that adds
very little (*I think* >0) value could be sufficient reason.)

Paul A. Clayton

unread,
Jul 16, 2022, 10:20:25 AM7/16/22
to
John Dallman wrote:
[snip]

> As best I can deduce, ARM decided that it was worth addressing the
> embedded market and the general-purpose computing market separately, with
> their Cortex-M and Cortex-A lines. Cortex-A is heading towards being A64
> only, while Cortex-M remains T32.

ARM Cortex also has an R profile targeting real-time uses.

I have not looked closely at what the architectural differences
are, but I suspect they mainly relate to memory protection. (I
vaguely recall reading that an R processor could be included in
a mixed system that used paging in a virtualization layer.) The
R profile did not adopt the 64-bit architecture when it advanced
to Architecture Version 8, but I do not think there is much
preventing the adoption of a minor variant of the AArch64.

(Maybe I should look at how things stand now; the most recent
document I have about R profile is a white paper from 2013.)

John Dallman

unread,
Jul 16, 2022, 11:14:43 AM7/16/22
to
In article <tauhf6$3d4qc$2...@dont-email.me>, paaron...@gmail.com (Paul
A. Clayton) wrote:

> The [ARM Cortex-]R profile did not adopt the 64-bit architecture
> when it advanced to Architecture Version 8, but I do not think
> there is much preventing the adoption of a minor variant of the
> AArch64.

The first 64-bit ARM Cortex-R processor, the Cortex-R82, was announced
in September 2020.

John

MitchAlsup

unread,
Jul 16, 2022, 11:38:03 AM7/16/22
to
This thought has come up more than a few times. But consider,
at least ½ of all dynamically called subroutines are leaf level
and these (98%-ile) do not use ENTER/EXIT epilogues/prologues.
These leaf level subroutines are ABIed to have no instructions in
the prologue and only RET in the epilogue.
>
> Technically possible (or plausible) does not mean wise or even
> sensible.
>
> (I admit I dislike offset-based addressing because the base
> address is typically fixed (at least at run time) and the
> repeated work of addition of (run time) constants does not
> provide much value.
<
There is more than "some" value in position independence even
at the cost of adding the current IP.

JimBrakefield

unread,
Jul 16, 2022, 11:35:57 PM7/16/22
to
And for a leaf subroutine that is called from many places,
the compiler is able to make the operands and results be
in the exact same set of registers?
Otherwise the operands and results are moved to/from the
subroutine calling convention locations?

Terje Mathisen

unread,
Jul 17, 2022, 7:01:01 AM7/17/22
to
I haven't studied exactly how My 66000 does it but I would expect that
the leaf function ABI is something like

reg n..n+7 carries up to 8 parameters
reg m..m+7 belong to the caller and must be saved/restored if modified
by this function

either reg i or [SP] holds the return address

reg j is the function return value

All other regs are scratch and can be used/modified at will.

The parameter regs can be either read-only or read-write, that's up to
the ABI designer.

If the n++ and m++ ranges are the same for all types of calls, then the
caller does not need to know if this is a simple leaf function, with no
need for ENTER, or something more complicated, right?

MitchAlsup

unread,
Jul 17, 2022, 10:57:20 AM7/17/22
to
On Sunday, July 17, 2022 at 6:01:01 AM UTC-5, Terje Mathisen wrote:
> JimBrakefield wrote:
> > On Saturday, July 16, 2022 at 10:38:03 AM UTC-5, MitchAlsup wrote:
<snip, people, snip>
> > And for a leaf subroutine that is called from many places,
> > the compiler is able to make the operands and results be
> > in the exact same set of registers?
> > Otherwise the operands and results are moved to/from the
> > subroutine calling convention locations?
> >
> I haven't studied exactly how My 66000 does it but I would expect that
> the leaf function ABI is something like
>
> reg n..n+7 carries up to 8 parameters
R1..R8
> reg m..m+7 belong to the caller and must be saved/restored if modified
> by this function
R16..R31
When Safe Stack is in use, R16..R30 first read as 0 after ENTER
>
> either reg i or [SP] holds the return address
R0 when Safe Stack is not in use
[SSP] when Safe Stack is in use.
SSP = Safe Stack Pointer
>
> reg j is the function return value
R1..R8
>
> All other regs are scratch and can be used/modified at will.
R8..R15 and those not used as arguments/results
>
> The parameter regs can be either read-only or read-write, that's up to
> the ABI designer.
RW
>
> If the n++ and m++ ranges are the same for all types of calls, then the
> caller does not need to know if this is a simple leaf function, with no
> need for ENTER, or something more complicated, right?
<
Caller does not know
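
Putting those assignments together, a hedged illustration (the register
mapping in the comments follows the reply above; this is C, not actual
My 66000 assembly):

/* Leaf function under the ABI above: a..d arrive in R1..R4,
   temporaries live in the scratch range, the result is left in R1,
   and the return goes through R0 (or [SSP] under Safe Stack).  No
   ENTER/EXIT, no stack traffic, and R16..R31 are untouched, so the
   caller neither knows nor cares that this is a leaf. */
int dot2(int a, int b, int c, int d)
{
    return a*b + c*d;
}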

MitchAlsup

unread,
Jul 17, 2022, 10:58:55 AM7/17/22
to
On Saturday, July 16, 2022 at 10:35:57 PM UTC-5, JimBrakefield wrote:
> On Saturday, July 16, 2022 at 10:38:03 AM UTC-5, MitchAlsup wrote:
<snip>
> > > Even just comparing the most significant
> > > two inset bits with the same bits of the instruction address
> > > to determine carry/borrow would only slightly reduce range but
> > > more significantly reduce work. Such presumably has other issues
> > > keeping it from being used, though being a change that adds
> > > very little (*I think* >0) value could be sufficient reason.)
<
> And for a leaf subroutine that is called from many places,
> the compiler is able to make the operands and results be
> in the exact same set of registers?
> Otherwise the operands and results are moved to/from the
> subroutine calling convention locations?
<
moved to/from ABI locations.

Paul A. Clayton

unread,
Jul 18, 2022, 11:23:09 AM7/18/22
to
MitchAlsup wrote:
> On Saturday, July 16, 2022 at 9:20:21 AM UTC-5, Paul A. Clayton wrote:
[snip]
>> (I admit I dislike offset-based addressing because the base
>> address is typically fixed (at least at run time) and the
>> repeated work of addition of (run time) constants does not
>> provide much value.
>
> There is more than "some" value in position independence even
> at the cost of adding the current IP.


While position independent code can be useful, it is not
necessary for all workloads. I suspect microcontrollers
could get away without supporting position independent code
without losing customers. (I readily grant my ignorance.
Perhaps 8-bit and 16-bit ISAs have uses that benefit
noticeably from PIC. There may also be benefits from not
special-casing address generation for control flow
operations that outweigh the minor benefit of reduced work
for such operations, at least for some reasonable and
attractive microarchitectures.)

Also, it would seem worth noting that position independence
is typically (always in practice?) at the module level.
Short intra-function branches and jumps could, I believe,
accept forced 64KiB (e.g.) alignment with the two most
significant bit comparison mechanism to generate carry or
borrow. In a 64-bit address space, 32-bit call instructions
(26 'offset' bits for 6-bit opcode) would not *seem* to have
a problem either; even with only 48 bits available for
virtual addresses (and 4-byte target alignment), one would
have 20 bits that could have arbitrary values.

The 26-bit inset for MIPS' J and JAL instructions was a
little less flexible than the 2-bit-carry/borrow-
determination I proposed; forcing code not to cross a
256 MiB boundary. Also with a 32-bit address space,
relocation would be more limited. While this added
software complexity, this did not seem to make MIPS
unusable (even with MIPS using segments/regions which
reduce the user-mode address space to 32 bits). (I grant
that "usable with hair-pulling" is problematic in a
competitive market.)

At least one POWER microarchitecture predecoded branch
instructions, caching the address generation in the
instruction cache. My suggestion is more radical —
architecturally caching the address generation in memory —
and has significant implications for PIC. I am not yet
convinced that there is not place for such a design, though
I can agree that getting such accepted might be as
difficult as using a single address space (which is a
design with some advantages but significant difficulties
from software designed around fork and multiple address
space/alias-supporting assumptions).

Of course, addition is not exactly an extremely expensive
operation. (I admit I had forgotten about PIC.)

Tim Rentsch

unread,
Jul 19, 2022, 8:16:43 PM7/19/22
to
John Levine <jo...@taugh.com> writes:

> It appears that Ivan Godard <iv...@millcomputing.com> said:
>
>> Walk the code, keeping two distances for each unresolved branch. One is
>> how far the target would be if all unresolved were resolved as near; the
>> other is if all unresolved were resolved as far. For each unresolved, if
>> the as-far is less than the near range then resolve as near, and if the
>> as-near is greater than the near range then resolve as far; if neither
>> then leave it unresolved and go to the next unresolved. ...
>
> There's a bunch of variations on that. I think mine started with all
> the branches long and then it iterated looking for ones it could
> shorten until it couldn't find any more, which gets you a good result
> but not necessarily optimal.
>
> Here's the paper with the proof, from the CACM in 1978:
>
> https://dl.acm.org/doi/10.1145/359460.359474

After looking at this paper, and also the article mentioned by
EricP, and after doing some additional reading, my understanding
is this.

If all branch targets are simply to fixed labels in the code, and
the contents and ordering of basic blocks is fixed, then an
optimal solution can be found in polynomial time (probably no
worse than quadratic, but certainly polynomial).

If any of those three conditions is not met, then we have a
different problem, and finding an optimal solution to that
problem is indeed NP-complete.

Tim Rentsch

unread,
Jul 19, 2022, 8:38:01 PM7/19/22
to
Ivan Godard <iv...@millcomputing.com> writes:

> On 7/14/2022 10:42 PM, Quadibloc wrote:
>
>> On Thursday, July 14, 2022 at 5:33:28 PM UTC-6, Ivan Godard wrote:
>>
>>> On 7/14/2022 1:42 PM, MitchAlsup wrote:
>>>
>>>> In effect, NP-complete becomes O(1) ? O(3) ?
>>>
>>> No, linear or quadratic (depending on what you are measuring) becomes
>>> O(1) or O(3)
>>
>> Isn't quadratic O(n^2), with cubic being O(n^3)? And NP-complete,
>> of course, is O(2^n)...
>
> Not just O(2^n); any exponential function. The math is well
> established, if not well known.

Perhaps some exponential function, but not /any/ exponential
function (assuming P != NP). All NP-complete problems are
equivalent in the sense that if any one of them can be done in
polynomial time then they all can. However they are not all
equivalent in terms of what their performance bounds are. For
example, the knapsack problem clearly can be done in O(2**n)
time, but the traveling salesman problem appears to need more,
namely O( n factorial ), which actually grows faster than any
purely exponential function. (I haven't checked but I believe
both knapsack and TSP are NP-complete.)

Stefan Monnier

unread,
Jul 19, 2022, 11:22:43 PM7/19/22
to
> If all branch targets are simply to fixed labels in the code, and
> the contents and ordering of basic blocks is fixed, then an
> optimal solution can be found in polynomial time (probably no
> worse than quadratic, but certainly polynomial).

I think you need the distance between labels to never get larger when an
instruction is shrunk (see my counterexample with alignment constraints).

Still, I haven't seen any good reason to think that optimal results are
important. You can always opt for a greedy algorithm which starts with
all branches short and then grows one by one all the branches that are
"too short". As long as you never shrink a branch back (even if other
changes made it possible), that gives you an O(N²) algorithm in the
worst case which is not always optimal but should be pretty damn close
in practice.
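
A minimal C sketch of that grow-only fixpoint (the sizes and reach
are invented example values, not any particular ISA; the caller is
assumed to have initialized every size field, with branches starting
at SHORT_SIZE):

#include <stdlib.h>

enum { SHORT_SIZE = 2, LONG_SIZE = 6, SHORT_REACH = 127 };

struct insn {
    int is_branch;   /* 1 if this is a branch */
    int target;      /* index of the target instruction */
    int is_long;     /* starts 0 (short); may only grow to 1 */
    int size;        /* current encoded size in bytes */
};

void size_branches(struct insn *code, int n)
{
    long *addr = malloc((n + 1) * sizeof *addr);
    int changed = 1;

    while (changed) {        /* at most n+1 passes: O(N^2) worst case */
        changed = 0;

        /* Recompute addresses under the current size assignment. */
        addr[0] = 0;
        for (int i = 0; i < n; i++)
            addr[i + 1] = addr[i] + code[i].size;

        /* Promote any short branch whose target is out of reach.
           Promotions are never undone, which guarantees termination
           (and is also why the result is not always optimal). */
        for (int i = 0; i < n; i++) {
            if (code[i].is_branch && !code[i].is_long) {
                long disp = addr[code[i].target] - addr[i + 1];
                if (disp < -SHORT_REACH - 1 || disp > SHORT_REACH) {
                    code[i].is_long = 1;
                    code[i].size = LONG_SIZE;
                    changed = 1;
                }
            }
        }
    }
    free(addr);
}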

> If any of those three conditions is not met, then we have a
> different problem, and finding an optimal solution to that
> problem is indeed NP-complete.

IIUC we now know that NP-complete problems can always be solved in
polynomial time if you replace "optimal" with "within some ϵ of the
optimal solution". I think in the case of branch-size optimization,
this ϵ needs to be vanishingly small before the complexity starts
creeping up, which is why the NP-complete result is so disconnected from
the experience of practitioners.


Stefan

Ivan Godard

unread,
Jul 19, 2022, 11:57:50 PM7/19/22
to
Yes, they vary. I should have said "any *at least* exponential function".

Thomas Koenig

unread,
Jul 20, 2022, 2:39:27 AM7/20/22
to
Tim Rentsch <tr.1...@z991.linuxsc.com> schrieb:

> Perhaps some exponential function, but not /any/ exponential
> function (assuming P != NP). All NP-complete problems are
> equivalent in the sense that if any one of them can be done in
> polynomial time then they all can. However they are not all
> equivalent in terms of what their performance bounds are. For
> example, the knapsack problem clearly can be done in O(2**n)
> time, but the traveling salesman problem appears to need more,
> namely O( n factorial ), which actually grows faster than any
> purely exponential function.

But only boringly so for very large numbers of n. (OK, these are
so large that they are likely not to matter for practical
computations).

BGB

unread,
Jul 20, 2022, 3:38:51 AM7/20/22
to
It seems like the computational cost of "traveling salesman" could be
reduced a fair bit (from O(n!) ):
If one has a data structure to efficiently find the closest points first;
If one stops a given search path as soon as the current distance exceeds
a previously best-seen distance.
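
That is essentially classic branch-and-bound. A small C sketch of the
cutoff (this one scans cities in index order; visiting nearest points
first, as suggested, would tighten the bound sooner and prune more):

#include <float.h>

#define MAXN 16

static double dist[MAXN][MAXN];  /* symmetric distance matrix */
static int    n;                 /* number of cities, n <= MAXN */
static double best;              /* best complete tour seen so far */

static void search(int city, int visited, int depth, double len)
{
    if (len >= best)             /* prune: cannot beat the incumbent */
        return;
    if (depth == n) {            /* close the tour back to city 0 */
        double total = len + dist[city][0];
        if (total < best)
            best = total;
        return;
    }
    for (int next = 1; next < n; next++)
        if (!(visited & (1 << next)))
            search(next, visited | (1 << next), depth + 1,
                   len + dist[city][next]);
}

double tsp(void)
{
    best = DBL_MAX;
    search(0, 1, 1, 0.0);        /* start at city 0, having visited it */
    return best;
}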



Though, I remember in earlier forms of my ISA, I had an issue with
vector shuffles:
Early on, I only had a fixed set of vector shuffle operations;
One would need to daisy-chain them for more complex shuffles;
It was computationally "very expensive" to find the optimal sequence of
shuffle operators to get the elements into the order specified by an
arbitrary shuffle mask.

So, I had a long-running tool whose job was mostly to fill a table of
shuffle patterns with the corresponding sequence of shuffle ops.

This wasn't too bad in the 4-element case, but the 8 byte-element case
was "a bit extreme" (find the patterns for a 24-bit shuffle mask). Also
the lookup table is too large to put on GitHub, ...


I later added general purpose shuffle instructions (which accepted a
shuffle mask directly), rendering the existing 8-byte table "kinda moot".

Though, the "fake an arbitrary 8B shuffle using only 4-element shuffle
ops" thing is still kinda annoying.


So, the challenge is now in a way:
A 4-element byte shuffle "PSHUF.B Rm, Imm8, Rn" and a 4-element word
shuffle "PSHUF.W Rm, Imm8, Rn", figure an effective way to fake an
8-byte "PSHUF.8B Rm, Imm24, Rn" operation, for an arbitrary mask, using
the minimal number of shuffle operations (preferably in a way that
doesn't require building and using an implausibly large lookup table or
similar).
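
To pin down the target semantics, a reference sketch in C of the
8-byte case as I read it (three mask bits selecting the source lane
for each destination byte, 8 * 3 = 24 mask bits; this is my
reconstruction of the described operation, not a spec):

#include <stdint.h>

uint64_t pshuf_8b(uint64_t src, uint32_t mask24)
{
    uint64_t dst = 0;
    for (int i = 0; i < 8; i++) {
        int sel = (mask24 >> (3 * i)) & 7;          /* source lane */
        uint64_t byte = (src >> (8 * sel)) & 0xFF;  /* fetch that byte */
        dst |= byte << (8 * i);                     /* place in lane i */
    }
    return dst;
}

The search problem is then: which compositions of the 4-element
shuffles reproduce this function for a given mask24, in the fewest
steps.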

Or, if this is solved, what about arbitrary 16-byte shuffles in a
128-bit vector using a 64-bit shuffle mask?... Good luck using a lookup
table for this one (here, one has 4-element PSHUF.B and PSHUF.W, along
with the MOVLLD/MOVLHD/MOVHLD/MOVHHD instructions being used to fake the
existence of a "PSHUF.L Xm, Imm8, Xn" instruction).

Then again, maybe there is a simple algo for these cases I just haven't
realized yet.

...

Thomas Koenig

unread,
Jul 20, 2022, 6:07:20 AM7/20/22
to
BGB <cr8...@gmail.com> schrieb:

> It seems like the computational cost of "traveling salesman" could be
> reduced a fair bit (from O(n!) ):
> If one has a data structure to efficiently find the closest points first;
> If one stops a given search path as soon as the current distance exceeds
> a previously best-seen distance.

The best bound reached so far by an exact algorithm is actually
O(n^2 2^n) (Held-Karp dynamic programming).
A lot of work has been done on the travelling salesman
problem, and there is ample literature on it. See
https://en.wikipedia.org/wiki/Travelling_salesman_problem for an
overview. Numerical Recipes, for example, has an implementation
using simulated annealing.

No need to re-invent that particular wheel, even though it is pretty
clunky.
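
For reference, a compact C sketch of the Held-Karp dynamic program
behind that O(n^2 2^n) bound (limited here to n <= 16 so the
visited-set bitmask and the dp table stay small; illustrative, not
tuned):

#include <float.h>
#include <stdlib.h>

/* dp[S][j] = shortest path that starts at city 0, visits exactly
   the set S (which must contain city 0), and ends at city j. */
double held_karp(int n, double dist[][16])
{
    int full = 1 << n;
    double *dp = malloc((size_t)full * n * sizeof *dp);

    for (int s = 0; s < full; s++)
        for (int j = 0; j < n; j++)
            dp[s * n + j] = DBL_MAX;
    dp[1 * n + 0] = 0.0;                  /* visited {0}, standing at 0 */

    for (int s = 1; s < full; s++) {
        if (!(s & 1)) continue;           /* every tour starts at 0 */
        for (int j = 0; j < n; j++) {
            if (!(s & (1 << j)) || dp[s * n + j] == DBL_MAX)
                continue;
            for (int k = 0; k < n; k++) {  /* extend the path to k */
                if (s & (1 << k)) continue;
                int t = s | (1 << k);
                double cand = dp[s * n + j] + dist[j][k];
                if (cand < dp[t * n + k])
                    dp[t * n + k] = cand;
            }
        }
    }

    double tour = DBL_MAX;                /* close the tour back to 0 */
    for (int j = 1; j < n; j++)
        if (dp[(full - 1) * n + j] + dist[j][0] < tour)
            tour = dp[(full - 1) * n + j] + dist[j][0];
    free(dp);
    return tour;
}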

Anton Ertl

unread,
Jul 20, 2022, 8:23:53 AM7/20/22
to
Stefan Monnier <mon...@iro.umontreal.ca> writes:
>IIUC we now know that NP-complete problems can always be solved in
>polynomial time if you replace "optimal" with "within some ϵ of the
>optimal solution".

That assumes it is an optimization problem, i.e., there are many
solutions, some better, some worse. But there are NP-complete
problems that are not optimization problems. The prototypical
NP-complete problem is boolean satisfiability (for a given boolean
formula, set the variables such that the formula evaluates to true),
and this is not an optimization problem.

>I think in the case of branch-size optimization,
>this ϵ needs to be vanishingly small before the complexity starts
>creeping up, which is why the NP-complete result is so disconnected from
>the experience of practitioners.

If you have n things of the form label1-label2+constant, you may need
to try 2^n combinations to produce an optimal result. But typically people have only
label1-label2, and the problem is quadratic at worst (and typically
much faster).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Anton Ertl

unread,
Jul 27, 2022, 1:13:44 PM7/27/22
to
MitchAlsup <Mitch...@aol.com> writes:
>I think, technically, getting it perfect requires a pass for every resolved address*
>and this smells a bit like NP-complete to me.

That would make it quadratic.

Making it optimal is NP-complete, as you could have an arbitrary
number of nasty instructions that can become smaller when other
instructions grow. In practice there are few such instructions, so
even trying out all 2^n combinations of short and long for n nasty
instructions is feasible, and if you really want optimality, you can
use the typical approaches for solving NP-complete problems without
trying out all 2^n combinations (e.g., branch-and-bound), and that's
even more feasible. In practice, polynomial algorithms are good
enough.

Tim Rentsch

unread,
Jul 28, 2022, 9:09:41 PM7/28/22
to
Thomas Koenig <tko...@netcologne.de> writes:

> Tim Rentsch <tr.1...@z991.linuxsc.com> schrieb:
>
>> Perhaps some exponential function, but not /any/ exponential
>> function (assuming P != NP). All NP-complete problems are
>> equivalent in the sense that if any one of them can be done in
>> polynomial time then they all can. However they are not all
>> equivalent in terms of what their performance bounds are. For
>> example, the knapsack problem clearly can be done in O(2**n)
>> time, but the traveling salesman problem appears to need more,
>> namely O( n factorial ), which actually grows faster than any
>> purely exponential function.
>
> But only boringly so for very large numbers of n. [...]

Your math is bad. Try calculating some actual numbers
next time.

Terje Mathisen

unread,
Jul 29, 2022, 3:56:03 PM7/29/22
to
Since you can use the Gamma function to calculate factorials, and that
function satisfies the asymptotic (Stirling) form

    \Gamma(x+1) \sim \sqrt{2\pi x}\,\left(\frac{x}{e}\right)^{x}

I.e. sqrt(2*pi*x)*pow(x/e,x)

where the pow() function will dominate the result.

Tim Rentsch

unread,
Jul 30, 2022, 6:22:24 AM7/30/22
to
Terje Mathisen <terje.m...@tmsw.no> writes:

> Tim Rentsch wrote:
>
>> Thomas Koenig <tko...@netcologne.de> writes:
>>
>>> Tim Rentsch <tr.1...@z991.linuxsc.com> schrieb:
>>>
>>>> Perhaps some exponential function, but not /any/ exponential
>>>> function (assuming P != NP). All NP-complete problems are
>>>> equivalent in the sense that if any one of them can be done in
>>>> polynomial time then they all can. However they are not all
>>>> equivalent in terms of what their performance bounds are. For
>>>> example, the knapsack problem clearly can be done in O(2**n)
>>>> time, but the traveling salesman problem appears to need more,
>>>> namely O( n factorial ), which actually grows faster than any
>>>> purely exponential function.
>>>
>>> But only boringly so for very large numbers of n. [...]
>>
>> Your math is bad. Try calculating some actual numbers
>> next time.
>
> Since you can use the Gamma function to calculate factorials, and that
> function satisfies the asymptotic (Stirling) form
>
>     \Gamma(x+1) \sim \sqrt{2\pi x}\,\left(\frac{x}{e}\right)^{x}
>
> I.e. sqrt(2*pi*x)*pow(x/e,x)

I sort of assumed most people were familiar with this formula.
However it isn't necessary to know the formula to compute some
actual specific values.

David Brown

unread,
Jul 30, 2022, 8:31:35 AM7/30/22
to
The factor (x ^ x) will grow faster than (k ^ x) for any given fixed
"k", once "x" gets larger than "k". So (x ^ x), and therefore
factorial, grows faster than any purely exponential function. (Not that
this is relevant for the travelling salesman problem, which has known
exponential complexity algorithms as well as factorial complexity
algorithms.)
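
Spelling the comparison out (a quick derivation, not in the original
post): for fixed k and x > k,

    \frac{x^x}{k^x} = \left(\frac{x}{k}\right)^x \to \infty ,

and by Stirling, n! \sim \sqrt{2\pi n}\,(n/e)^n, so

    \frac{n!}{k^n} \gtrsim \sqrt{2\pi n}\,\left(\frac{n}{ek}\right)^n ,

which already diverges once n > ek.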


Terje Mathisen

unread,
Jul 30, 2022, 9:48:42 AM7/30/22
to
I do realize this point, but for any real algorithm with factorial
complexity, the n value is either quite small, or your process will take
longer to finish than the heat death of the universe. :-)

I.e. the difference between say 1000^n and n^n _really_ doesn't matter!

David Brown

unread,
Jul 31, 2022, 5:21:17 AM7/31/22
to
On 30/07/2022 15:48, Terje Mathisen wrote:
> David Brown wrote:
>> On 29/07/2022 21:55, Terje Mathisen wrote:
>>> Tim Rentsch wrote:
>>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>>
>>>>> Tim Rentsch <tr.1...@z991.linuxsc.com> schrieb:
>>>>>
>>>>>> Perhaps some exponential function, but not /any/ exponential
>>>>>> function (assuming P != NP).  All NP-complete problems are
>>>>>> equivalent in the sense that if any one of them can be done in
>>>>>> polynomial time then they all can.  However they are not all
>>>>>> equivalent in terms of what their performance bounds are.  For
>>>>>> example, the knapsack problem clearly can be done in O(2**n)
>>>>>> time, but the traveling salesman problem appears to need more,
>>>>>> namely O( n factorial ), which actually grows faster than any
>>>>>> purely exponential function.
>>>>>
>>>>> But only boringly so for very large numbers of n.  [...]
>>>>
>>>> Your math is bad.  Try calculating some actual numbers
>>>> next time.
>>>
>>> Since you can use the Gamma function to calculate factorials, and
>>> that function satisfies the asymptotic (Stirling) form
>>>
>>>     \Gamma(x+1) \sim \sqrt{2\pi x}\,\left(\frac{x}{e}\right)^{x}
>>>
>>> I.e. sqrt(2*pi*x)*pow(x/e,x)
>>>
>>> where the pow() function will dominate the result.
>>>
>>> Terje
>>>
>>
>> The factor (x ^ x) will grow faster than (k ^ x) for any given fixed
>> "k", once "x" gets larger than "k".  So (x ^ x), and therefore
>> factorial, grows faster than any purely exponential function.  (Not
>> that this is relevant for the travelling salesman problem, which has
>> known exponential complexity algorithms as well as factorial
>> complexity algorithms.)
>
> I do realize this point, but for any real algorithm with factorial
> complexity, the n value is either quite small, or your process will take
> longer to finish than the heat death of the universe. :-)
>
> I.e. the difference between say 1000^n and n^n _really_ doesn't matter!
>

Agreed. There's a difference between theory and practice. Big O
comparisons are often meaningless in practice, for sizes that can be
handled in real life. And big problems often end up as an exercise in
cache efficiency, rather than computational efficiency.


Terje Mathisen

unread,
Jul 31, 2022, 9:40:19 AM7/31/22
to
My favorite example of the latter is probably AMD's optimized array
math operation (something like A = B * x, with the A and B arrays too
large for both the L1 and L2 caches).

First, a tight loop loading the first element from each 64-byte cache
line, without doing anything else with it, over a 4 kB block, i.e. 64
load operations. This forces that 4 kB of source data to be loaded
into $L1.

Second, another loop which does Temp = B * x, with a fixed Temp
destination which also will stay in $L1. At this point all the source
data is already in $L1, so this loop runs at the full FPU FMUL bandwidth.

Finally, a third loop which moves the 4K of result data into the A
target array, using non-temporal (i.e. cache-bypassing) store operations.

On the target AMD CPU this version ran about 3 times faster than the
naive/obvious single A = B * x loop!
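
A reconstruction of those three loops in C with SSE2 intrinsics (the
_mm_stream_pd/_mm_sfence/_mm_load_pd calls are real SSE2; the block
size, types, and loop structure are my illustrative guesses, not
AMD's original code; `a` is assumed 16-byte aligned for the streaming
stores):

#include <emmintrin.h>
#include <stddef.h>

#define BLOCK_DOUBLES (4096 / sizeof(double))   /* one 4 kB block */

void scale_block(double *a, const double *b, double x)
{
    static _Alignas(16) double temp[BLOCK_DOUBLES]; /* stays hot in L1 */
    volatile double sink;

    /* 1: touch one element per 64-byte line, pulling B into L1
       (64 loads for the 4 kB block). */
    for (size_t i = 0; i < BLOCK_DOUBLES; i += 64 / sizeof(double))
        sink = b[i];

    /* 2: compute into the fixed, L1-resident temp; with all the
       source data in L1 this runs at full FMUL bandwidth. */
    for (size_t i = 0; i < BLOCK_DOUBLES; i++)
        temp[i] = b[i] * x;

    /* 3: move the results out with non-temporal (cache-bypassing)
       stores, so A never displaces anything useful. */
    for (size_t i = 0; i < BLOCK_DOUBLES; i += 2)
        _mm_stream_pd(&a[i], _mm_load_pd(&temp[i]));
    _mm_sfence();
    (void)sink;
}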

Michael S

unread,
Jul 31, 2022, 1:18:43 PM7/31/22
to
What about modern AMD (or non-AMD, but modern) CPUs?
I would expect the tricky version to be somewhere between
significantly slower and slightly slower than the naive variant.

Thomas Koenig

unread,
Jul 31, 2022, 3:34:59 PM7/31/22
to
Michael S <already...@yahoo.com> schrieb:
What Terje describes would still make a large difference today,
given that the matrix sizes are still larger than cache.
And yes, this makes a _huge_ difference: running almost completely
out of L1 cache, you can get within a reasonably large factor
of the theoretical peak performance.

https://netlib.org/blas/gemm_based/ssgemmbased.tgz has another
very cache-friendly matrix multiplication algorithm, which is the
basis of gfortran's matmul routines (translated to C, no less,
for the library of a Fortran compiler).

Michael S

unread,
Jul 31, 2022, 4:01:55 PM7/31/22
to
Are you sure you read what Terje wrote?
Or maybe I misread what Terje wrote?

According to my understanding, he was talking about
vector-by-scalar multiplication, rather than matrix-by-matrix
or vector-by-matrix.
The optimization by AMD was fighting against thrashing of a
direct-mapped cache; probably by blocking for a direct-mapped cache.
I fully expect this optimization to become a pessimization on
less crappy HW.

> And yes, this makes a _huge_ difference, running almost completely
> out of L1 cache you can get within a reasonably large factor
> of the theoretical peak performance.
>
> https://netlib.org/blas/gemm_based/ssgemmbased.tgz has another
> very cache-friendly matrix multiplicttion algorithm, which is the
> basis of gfortran's matmul routines (translated to C, no less,
> for the library of a Fortran compiler).

I probably know more than a typical Fortran compiler about the
trade-offs of SGEMM, especially about cases where the matrices are
not too big and non-trivially non-square.
But Terje, according to my understanding, was talking about much
simpler things.
